News Crawler

Welcome to the Olot News Crawler! This project scrapes news from the Olot town hall website (https://www.olot.cat/ajuntament/comunicacio/actualitat.htm) using TypeScript and Crawlee. It collects summaries from the news list pages and full details from individual articles, storing them in separate datasets for easy access.

This guide will help you set up and run the project step-by-step, even if you're new to coding!

Prerequisites

Before you start, you’ll need to install a few tools on your computer:

1. Install Node.js

What it is: Node.js is a runtime that lets us run JavaScript (and TypeScript) outside a browser. This project requires Node.js version 22 or higher.
How to install:
1. Go to the Node.js website.
2. Download the "LTS" version for your operating system (Windows, macOS, or Linux). As of March 2025, this should be v22.x.x or later.
3. Run the installer and follow the prompts (accept defaults if unsure).
4. Open a terminal (Command Prompt on Windows, Terminal on macOS/Linux) and check the version:
```
node --version
```
  You should see something like v22.x.x. If not, reinstall Node.js.

2. Enable Yarn Berry (via Corepack)

What it is: Yarn is a package manager that installs project dependencies. We’re using Yarn Berry (v4.6.0), which comes built into Node.js via Corepack.
How to enable:
1. In your terminal, run:
```
corepack enable
```
  This activates Corepack, which manages Yarn versions.
2. The project specifies Yarn 4.6.0 in package.json, so you don’t need to install it manually—Corepack will handle it when you set up the project.

3. Install Git

What it is: Git is a version control system to clone this project from a repository.
How to install:
1. Go to the Git website.
2. Download and install Git for your OS (accept defaults).
3. Verify installation:
```
git --version
```
  You should see git version x.x.x.

Getting Started

Follow these steps to set up and run the project:

1. Clone the Repository

Open your terminal and navigate to a folder where you want the project:
```
cd path/to/your/folder
```
Clone the repo (replace <repo-url> with the actual URL, e.g., from GitHub):
```
git clone <repo-url> olot-news-crawler
cd olot-news-crawler
```

2. Install Dependencies

Run this command to install all project dependencies using Yarn 4.6.0:
```
yarn install
```
What happens:
- Corepack detects "packageManager": "yarn@4.6.0..." in package.json and downloads Yarn 4.6.0.
- Yarn installs libraries like crawlee and typescript.
Verify Yarn version:
```
yarn --version
```
It should say 4.6.0. If not, ensure corepack enable was run.

3. Run the Crawler

Start the crawler in development mode (it auto-reloads if you edit the code):
```
yarn dev
```
What it does:
- Crawls https://www.olot.cat/ajuntament/comunicacio/actualitat.htm.
- Collects news summaries and article details.
- Saves data to storage/datasets/.
To stop it, press Ctrl + C in the terminal.

4. Build for Production (Optional)

If you want a compiled version:
```
yarn build
```
This creates a dist/ folder with JavaScript files you can run with node dist/index.js.

5. Lint and Format Code (Optional)

To check and fix code style:
```
yarn lint
```
Uses Biome to keep the code consistent.

Project Structure

Here’s what’s in the project:

src/: Source code.
- index.ts: Entry point that runs the crawler.
- crawler.ts: Crawling logic using Crawlee.
- types.ts: Defines the NewsItem data structure.
storage/: Where data is saved.
- datasets/news_summaries/: Summaries from list pages (title, date, summary).
- datasets/news_articles/: Full article details (title, date, summary, full text, image URL).
- request_queues/default/: Tracks URLs being crawled.
package.json: Lists dependencies and scripts.
.yarnrc.yml: Configures Yarn settings.
tsconfig.json: TypeScript configuration.
.gitignore: Ignores files like node_modules/ from Git.

Understanding the Output

When you run yarn dev, the crawler:

Visits the news list page and its paginated versions (e.g., ?pg=2).
Extracts summaries from each news item on the list and saves them to storage/datasets/news_summaries/.
Enqueues and visits each article URL, extracting full details and saving them to storage/datasets/news_articles/.

Sample Data

Summary (news_summaries/000000001.json):

{
  "url": "https://www.olot.cat/pl217/ajuntament/comunicacio/actualitat/id8690/nova-edicio-del-casal-esportiu-de-setmana-santa.htm",
  "title": "Nova edició del casal esportiu de Setmana Santa",
  "date": "2/3/2025",
  "summary": "L’Àrea d’Esports de l’Ajuntament d’Olot ha organitzat un any més el casal esportiu de Setmana Santa."
}

Article (news_articles/000000001.json):

{
  "url": "https://www.olot.cat/pl217/ajuntament/comunicacio/actualitat/id8690/nova-edicio-del-casal-esportiu-de-setmana-santa.htm",
  "title": "Nova edició del casal esportiu de Setmana Santa",
  "date": "2 de març de 2025",
  "summary": "L’Àrea d’Esports de l’Ajuntament d’Olot ha organitzat un any més el casal esportiu de Setmana Santa.",
  "fullText": "L’Àrea d’Esports de l’Ajuntament d’Olot ha organitzat un any més el casal esportiu de Setmana Santa. Enguany, aquest casal multiesportiu dirigit a infants nascuts entre el 2013 i el 2020 porta per nom Casal Setmana Santa Esportiva 2025 i es realitzarà del 14 al 17 d’abril de 9 a 13 h.\n[...]",
  "imageUrl": "https://www.olot.cat/media/site1/cache/images/post-casal-setmana-santa-esportiva-2025.jpg"
}

Troubleshooting

Node.js Version Error:
- If you see node: command not found or a version < 22, reinstall Node.js.
Yarn Install Fails:
- Ensure corepack enable was run. Check node --version is 22+.
- Run yarn install again.
Crawler Stops Early:
- The maxRequestsPerCrawl is set to 50 for testing. Edit src/crawler.ts to increase it (e.g., 100) or remove it for a full crawl.
No Data in storage/:
- Check terminal logs. If it says "No summaries found" or "No data extracted," the website structure might have changed. Tell me, and we’ll fix it!

Customizing the Project

Change the Crawl Limit: In src/crawler.ts, adjust maxRequestsPerCrawl:
```
maxRequestsPerCrawl: 100, // Crawl more pages
```
Add More Data: Edit the article handler in crawler.ts to extract extra fields (e.g., related links <ul class="llista-links">).
Run Without Watching: Use tsx src/index.ts instead of yarn dev.

Next Steps

Explore the data in storage/datasets/ using a text editor or JSON viewer.
Learn TypeScript basics to tweak the code: TypeScript Docs.
Check out Crawlee for more features: Crawlee Docs.

Notes for You

File Creation: Save this as README.md in the project root and commit it:

echo "# Olot News Crawler" > README.md
# Paste the content above into README.md using your editor
git add README.md
git commit -m "Add README guide for setup and usage"
git push

Assumptions: Assumes your friend will get the repo URL from you (e.g., GitHub). Replace <repo-url> with the actual link when sharing.
Tone: Kept it friendly and simple, avoiding jargon where possible.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.vscode		.vscode
src		src
storage		storage
.gitignore		.gitignore
.npmrc		.npmrc
.nvmrc		.nvmrc
.yarnrc.yml		.yarnrc.yml
README.md		README.md
biome.json		biome.json
check_package_manager.js		check_package_manager.js
package.json		package.json
tsconfig.json		tsconfig.json
yarn.lock		yarn.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

News Crawler

Prerequisites

1. Install Node.js

2. Enable Yarn Berry (via Corepack)

3. Install Git

Getting Started

1. Clone the Repository

2. Install Dependencies

3. Run the Crawler

4. Build for Production (Optional)

5. Lint and Format Code (Optional)

Project Structure

Understanding the Output

Sample Data

Troubleshooting

Customizing the Project

Next Steps

Notes for You

About

Uh oh!

Uh oh!

Languages

guillempuche/news_crawler

Folders and files

Latest commit

History

Repository files navigation

News Crawler

Prerequisites

1. Install Node.js

2. Enable Yarn Berry (via Corepack)

3. Install Git

Getting Started

1. Clone the Repository

2. Install Dependencies

3. Run the Crawler

4. Build for Production (Optional)

5. Lint and Format Code (Optional)

Project Structure

Understanding the Output

Sample Data

Troubleshooting

Customizing the Project

Next Steps

Notes for You

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Languages