Welcome to the Olot News Crawler! This project scrapes news from the Olot town hall website (https://www.olot.cat/ajuntament/comunicacio/actualitat.htm) using TypeScript and Crawlee. It collects summaries from the news list pages and full details from individual articles, storing them in separate datasets for easy access.
This guide will help you set up and run the project step-by-step, even if you're new to coding!
Before you start, you’ll need to install a few tools on your computer:
- What it is: Node.js is a runtime that lets us run JavaScript (and TypeScript) outside a browser. This project requires Node.js version 22 or higher.
- How to install:
-
Go to the Node.js website.
-
Download the "LTS" version for your operating system (Windows, macOS, or Linux). As of March 2025, this should be v22.x.x or later.
-
Run the installer and follow the prompts (accept defaults if unsure).
-
Open a terminal (Command Prompt on Windows, Terminal on macOS/Linux) and check the version:
node --versionYou should see something like
v22.x.x. If not, reinstall Node.js.
-
- What it is: Yarn is a package manager that installs project dependencies. We’re using Yarn Berry (v4.6.0), which comes built into Node.js via Corepack.
- How to enable:
-
In your terminal, run:
corepack enableThis activates Corepack, which manages Yarn versions.
-
The project specifies Yarn 4.6.0 in
package.json, so you don’t need to install it manually—Corepack will handle it when you set up the project.
-
- What it is: Git is a version control system to clone this project from a repository.
- How to install:
-
Go to the Git website.
-
Download and install Git for your OS (accept defaults).
-
Verify installation:
git --versionYou should see
git version x.x.x.
-
Follow these steps to set up and run the project:
-
Open your terminal and navigate to a folder where you want the project:
cd path/to/your/folder -
Clone the repo (replace
<repo-url>with the actual URL, e.g., from GitHub):git clone <repo-url> olot-news-crawler cd olot-news-crawler
-
Run this command to install all project dependencies using Yarn 4.6.0:
yarn install -
What happens:
- Corepack detects
"packageManager": "yarn@4.6.0..."inpackage.jsonand downloads Yarn 4.6.0. - Yarn installs libraries like
crawleeandtypescript.
- Corepack detects
-
Verify Yarn version:
yarn --versionIt should say
4.6.0. If not, ensurecorepack enablewas run.
-
Start the crawler in development mode (it auto-reloads if you edit the code):
yarn dev -
What it does:
- Crawls
https://www.olot.cat/ajuntament/comunicacio/actualitat.htm. - Collects news summaries and article details.
- Saves data to
storage/datasets/.
- Crawls
-
To stop it, press
Ctrl + Cin the terminal.
-
If you want a compiled version:
yarn build -
This creates a
dist/folder with JavaScript files you can run withnode dist/index.js.
-
To check and fix code style:
yarn lint -
Uses Biome to keep the code consistent.
Here’s what’s in the project:
src/: Source code.index.ts: Entry point that runs the crawler.crawler.ts: Crawling logic using Crawlee.types.ts: Defines theNewsItemdata structure.
storage/: Where data is saved.datasets/news_summaries/: Summaries from list pages (title, date, summary).datasets/news_articles/: Full article details (title, date, summary, full text, image URL).request_queues/default/: Tracks URLs being crawled.
package.json: Lists dependencies and scripts..yarnrc.yml: Configures Yarn settings.tsconfig.json: TypeScript configuration..gitignore: Ignores files likenode_modules/from Git.
When you run yarn dev, the crawler:
- Visits the news list page and its paginated versions (e.g.,
?pg=2). - Extracts summaries from each news item on the list and saves them to
storage/datasets/news_summaries/. - Enqueues and visits each article URL, extracting full details and saving them to
storage/datasets/news_articles/.
-
Summary (
news_summaries/000000001.json):{ "url": "https://www.olot.cat/pl217/ajuntament/comunicacio/actualitat/id8690/nova-edicio-del-casal-esportiu-de-setmana-santa.htm", "title": "Nova edició del casal esportiu de Setmana Santa", "date": "2/3/2025", "summary": "L’Àrea d’Esports de l’Ajuntament d’Olot ha organitzat un any més el casal esportiu de Setmana Santa." } -
Article (
news_articles/000000001.json):{ "url": "https://www.olot.cat/pl217/ajuntament/comunicacio/actualitat/id8690/nova-edicio-del-casal-esportiu-de-setmana-santa.htm", "title": "Nova edició del casal esportiu de Setmana Santa", "date": "2 de març de 2025", "summary": "L’Àrea d’Esports de l’Ajuntament d’Olot ha organitzat un any més el casal esportiu de Setmana Santa.", "fullText": "L’Àrea d’Esports de l’Ajuntament d’Olot ha organitzat un any més el casal esportiu de Setmana Santa. Enguany, aquest casal multiesportiu dirigit a infants nascuts entre el 2013 i el 2020 porta per nom Casal Setmana Santa Esportiva 2025 i es realitzarà del 14 al 17 d’abril de 9 a 13 h.\n[...]", "imageUrl": "https://www.olot.cat/media/site1/cache/images/post-casal-setmana-santa-esportiva-2025.jpg" }
- Node.js Version Error:
- If you see
node: command not foundor a version < 22, reinstall Node.js.
- If you see
- Yarn Install Fails:
- Ensure
corepack enablewas run. Checknode --versionis 22+. - Run
yarn installagain.
- Ensure
- Crawler Stops Early:
- The
maxRequestsPerCrawlis set to 50 for testing. Editsrc/crawler.tsto increase it (e.g., 100) or remove it for a full crawl.
- The
- No Data in
storage/:- Check terminal logs. If it says "No summaries found" or "No data extracted," the website structure might have changed. Tell me, and we’ll fix it!
-
Change the Crawl Limit: In
src/crawler.ts, adjustmaxRequestsPerCrawl:maxRequestsPerCrawl: 100, // Crawl more pages
-
Add More Data: Edit the article handler in
crawler.tsto extract extra fields (e.g., related links<ul class="llista-links">). -
Run Without Watching: Use
tsx src/index.tsinstead ofyarn dev.
- Explore the data in
storage/datasets/using a text editor or JSON viewer. - Learn TypeScript basics to tweak the code: TypeScript Docs.
- Check out Crawlee for more features: Crawlee Docs.
-
File Creation: Save this as
README.mdin the project root and commit it:echo "# Olot News Crawler" > README.md # Paste the content above into README.md using your editor git add README.md git commit -m "Add README guide for setup and usage" git push -
Assumptions: Assumes your friend will get the repo URL from you (e.g., GitHub). Replace
<repo-url>with the actual link when sharing. -
Tone: Kept it friendly and simple, avoiding jargon where possible.