Using cloudscraper in Python

This guide explains how to use the cloudscraper Python library to bypass Cloudflare protection and handle errors.

Install Prerequisites
Write Initial Scraping Code
Incorporate cloudscraper
Use Additional cloudscraper Features
Common cloudscraper Errors
- module not found
- cloudscraper can’t bypass the latest Cloudflare version
cloudscraper Alternatives
Conclusion

Install Prerequisites

Make sure Python 3 is installed, then install the required packages:

pip install tqdm==4.66.5 requests==2.32.3 beautifulsoup4==4.12.3

Write Initial Scraping Code

This guide assumes you want to scrape metadata from news articles published on a specific date on the ChannelsTV website. Here is the initial Python script:

import requests
from bs4 import BeautifulSoup
from datetime import datetime
from tqdm.auto import tqdm

def extract_article_data(article_source, headers):
    response = requests.get(article_source, headers=headers)
    if response.status_code != 200:
        return None

    soup = BeautifulSoup(response.content, 'html.parser')

    title = soup.find(class_="post-title display-3").text.strip()

    date = soup.find(class_="post-meta_time").text.strip()
    date_object = datetime.strptime(date, 'Updated %B %d, %Y').date()

    categories = [category.text.strip() for category in soup.find('nav', {"aria-label": "breadcrumb"}).find_all('li')]

    tags = [tag.text.strip() for tag in soup.find("div", class_="tags").find_all("a")]

    article_data = {
        'date': date_object,
        'title': title,
        'link': article_source,
        'tags': tags,
        'categories': categories
    }

    return article_data

def process_page(articles, headers):
    page_data = []
    for article in tqdm(articles):
        url = article.find('a', href=True).get('href')
        if "https://" not in url:
            continue
        article_data = extract_article_data(url, headers)
        if article_data:
            page_data.append(article_data)
    return page_data

def scrape_articles_per_day(base_url, headers):
    day_data = []
    page = 1

    while True:
        page_url = f"{base_url}/page/{page}"
        response = requests.get(page_url, headers=headers)

        if not response or response.status_code != 200:
            break

        soup = BeautifulSoup(response.content, 'html.parser')
        articles = soup.find_all('article')

        if not articles:
            break
        page_data = process_page(articles, headers)
        day_data.extend(page_data)

        page += 1

    return day_data

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}

URL = "https://www.channelstv.com/2024/08/01/"

scraped_articles = scrape_articles_per_day(URL, headers)
print(f"{len(scraped_articles)} articles were scraped.")
print("Samples:")
print(scraped_articles[:2])

The script defines three main scraping functions. The extract_article_data function fetches an article's web page and extracts metadata such as the title, publication date, tags, and categories into a dictionary.
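As a small illustration of the date handling inside extract_article_data, the sketch below isolates the parsing step. The helper name parse_post_date is hypothetical and not part of the original script. It assumes the page text keeps the 'Updated Month day, Year' shape that the script expects, and it returns None instead of raising when the text looks different.

```python
from datetime import date, datetime
from typing import Optional


def parse_post_date(text: str) -> Optional[date]:
    """Parse text such as 'Updated August 1, 2024' into a date.

    Hypothetical helper: returns None instead of raising when the
    text does not match the format the original script expects.
    """
    try:
        return datetime.strptime(text.strip(), "Updated %B %d, %Y").date()
    except ValueError:
        return None


print(parse_post_date("Updated August 1, 2024"))  # 2024-08-01
print(parse_post_date("Published yesterday"))     # None
```

Returning None here is a design choice for the sketch; a caller could then skip the article rather than stop the whole run.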

Next, the process_page function iterates over every article on a given page, extracts the metadata with extract_article_data, and collects the results in a list.

Finally, the scrape_articles_per_day function walks through the paginated results, incrementing the page number inside a while loop until no more pages are found.
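The pagination loop can be sketched on its own, as below. Because the script's URL ends with a slash, its f-string produces "//page/N"; the hypothetical build_page_url helper strips the trailing slash first. Whether the site treats both forms the same is an assumption. The iter_pages function and its max_pages cap are also illustrative names, and fetch stands for any callable that returns a page body or None on failure.

```python
from typing import Callable, Iterator, Optional


def build_page_url(base_url: str, page: int) -> str:
    """Join the base URL and page number without doubling the slash."""
    return f"{base_url.rstrip('/')}/page/{page}"


def iter_pages(
    base_url: str,
    fetch: Callable[[str], Optional[str]],
    max_pages: int = 100,
) -> Iterator[str]:
    """Yield page bodies until fetch returns None or max_pages is reached."""
    for page in range(1, max_pages + 1):
        body = fetch(build_page_url(base_url, page))
        if body is None:
            # A failed fetch is treated as the end of the results.
            break
        yield body


print(build_page_url("https://www.channelstv.com/2024/08/01/", 5))
# https://www.channelstv.com/2024/08/01/page/5
```

The max_pages cap is a safety net for the sketch so that a site which never returns an error cannot keep the loop running indefinitely.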

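To run the scraper, the script specifies a target URL that includes the filter date, August 1, 2024. It sets a User-Agent header and calls scrape_articles_per_day with that URL and the headers. It then prints the total number of scraped articles and a preview of the first two results.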

However, the script does not work as expected. The ChannelsTV website uses Cloudflare protection, which blocks the direct HTTP requests made by extract_article_data and scrape_articles_per_day.
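The script stops quietly on any non-200 status, so a block can look the same as a day with no articles. The sketch below is one heuristic way to tell them apart. The function name is hypothetical, and the checks rest on assumptions: that a blocked response carries a Server header containing "cloudflare" or a cf-mitigated header set to "challenge", and that blocks arrive with status 403, 429, or 503.

```python
from typing import Mapping


def looks_like_cloudflare_block(status_code: int, headers: Mapping[str, str]) -> bool:
    """Heuristic guess at whether a response is a Cloudflare block or challenge."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    if lowered.get("cf-mitigated") == "challenge":
        return True
    served_by_cloudflare = "cloudflare" in lowered.get("server", "")
    return served_by_cloudflare and status_code in (403, 429, 503)


print(looks_like_cloudflare_block(403, {"Server": "cloudflare"}))  # True
print(looks_like_cloudflare_block(200, {"Server": "cloudflare"}))  # False
```

With the Requests library, the function could be called as looks_like_cloudflare_block(response.status_code, response.headers).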

Running the script typically produces the following output:

0 articles were scraped.
Samples:
[]

Incorporate cloudscraper

Install cloudscraper to bypass Cloudflare:

pip install cloudscraper==1.2.71

Modify the script to use cloudscraper:

import cloudscraper

def fetch_html_content(url, headers):
    try:
        scraper = cloudscraper.create_scraper()
        response = scraper.get(url, headers=headers)

        if response.status_code == 200:
            return response
        else:
            print(f"Failed to fetch URL: {url}. Status code: {response.status_code}")
            return None
    except Exception as e:
        print(f"An error occurred while fetching URL: {url}. Error: {str(e)}")
        return None

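The fetch_html_content function takes a URL and request headers as input. It tries to fetch the web page with a scraper created by cloudscraper.create_scraper(). If the request succeeds (status code 200), it returns the response; otherwise, it prints an error message and returns None. If an exception occurs, it catches the error, prints it, and returns None.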
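fetch_html_content builds a new scraper on every call. One possible variation, sketched below, is to create the scraper once and pass it in, so the same session is reused for every URL. The name fetch_with_scraper is hypothetical; the function only assumes that the object it receives has a get method that accepts headers, as the scraper in the original code does.

```python
from typing import Any, Mapping, Optional


def fetch_with_scraper(scraper: Any, url: str, headers: Mapping[str, str]) -> Optional[Any]:
    """Fetch url with an existing scraper; return the response or None."""
    try:
        response = scraper.get(url, headers=headers)
    except Exception as error:  # broad on purpose, mirroring fetch_html_content
        print(f"An error occurred while fetching URL: {url}. Error: {error}")
        return None
    if response.status_code != 200:
        print(f"Failed to fetch URL: {url}. Status code: {response.status_code}")
        return None
    return response


# Intended usage (requires cloudscraper to be installed):
#   import cloudscraper
#   scraper = cloudscraper.create_scraper()               # create once
#   response = fetch_with_scraper(scraper, URL, headers)  # reuse for every URL
```

Passing the scraper in also makes the function easy to exercise with a stand-in object, without sending real requests.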

Next, replace every requests.get call in the scraping functions with fetch_html_content so that they work with Cloudflare-protected websites. Start with the extract_article_data function:

def extract_article_data(article_source, headers):
    response = fetch_html_content(article_source, headers)
    if response is None:  # fetch_html_content returns None on failure
        return None

Then replace the requests.get call in the scrape_articles_per_day function as follows:

def scrape_articles_per_day(base_url, headers):
    day_data = []
    page = 1

    while True:
        page_url = f"{base_url}/page/{page}" 
        response = fetch_html_content(page_url, headers)

With this function in place, the cloudscraper library helps the scraper get past Cloudflare's restrictions.

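Running the code now produces output like the following. The 404 message for page 5 is expected: it marks the end of the paginated results and stops the loop.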

Failed to fetch URL: https://www.channelstv.com/2024/08/01//page/5. Status code: 404
55 articles were scraped.
Samples:
[{'date': datetime.date(2024, 8, 1),
  'title': 'Resilience, Tear Gas, Looting, Curfew As #EndBadGovernance Protests Hold',
  'link': 'https://www.channelstv.com/2024/08/01/tear-gas-resilience-looting-curfew-as-endbadgovernance-protests-hold/',
  'tags': ['Eagle Square', 'Hunger', 'Looting', 'MKO Abiola Park', 'violence'],
  'categories': ['Headlines']},
 {'date': datetime.date(2024, 8, 1),
  'title': "Mother Of Russian Artist Freed In Prisoner Swap Waiting To 'Hug' Her",
  'link': 'https://www.channelstv.com/2024/08/01/mother-of-russian-artist-freed-in-prisoner-swap-waiting-to-hug-her/',
  'tags': ['Prisoner Swap', 'Russia'],
  'categories': ['World News']}]

Use Additional cloudscraper Features

Proxies

With cloudscraper, you can define proxies and pass them to an already created cloudscraper object as follows:

scraper = cloudscraper.create_scraper()
proxy = {
    'http': 'http://your-proxy-ip:port',
    'https': 'https://your-proxy-ip:port'
}
response = scraper.get(URL, proxies=proxy)

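Here, the scraper object is first created with default values. Next, a proxy dictionary is defined with http and https proxies. Finally, the proxy dictionary is passed to the scraper.get method as proxies, just as you would with the regular requests.get method.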

Change the User Agent and JavaScript Interpreter

The cloudscraper library can generate User-Agent strings automatically, and it lets you specify the JavaScript interpreter and engine the scraper uses. Here is some sample code:

scraper = cloudscraper.create_scraper(
    interpreter="nodejs",
    browser={
        "browser": "chrome",
        "platform": "ios",
        "desktop": False,
    }
)

The script above sets the interpreter to "nodejs" and passes a dictionary to the browser parameter. The browser is set to Chrome and the platform to "ios". The desktop parameter is set to False, which indicates that the browser runs on mobile.

Handling CAPTCHAs

The cloudscraper library supports third-party CAPTCHA solvers for bypassing reCAPTCHA, hCaptcha, and more. The following snippet shows how to modify the scraper to handle CAPTCHAs:

scraper = cloudscraper.create_scraper(
  captcha={
    'provider': 'capsolver',
    'api_key': 'your_capsolver_api_key'
  }
)

This code uses Capsolver as the CAPTCHA provider and specifies the Capsolver API key. Both values are stored in a dictionary and passed to the captcha parameter of the cloudscraper.create_scraper method.
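To keep the API key out of the source file, the captcha dictionary could be built from an environment variable, as in the sketch below. The helper name and the CAPSOLVER_API_KEY variable name are assumptions made for illustration; only the 'provider' and 'api_key' keys come from the snippet above.

```python
import os
from typing import Dict, Mapping, Optional


def captcha_config_from_env(
    env: Optional[Mapping[str, str]] = None,
    variable: str = "CAPSOLVER_API_KEY",  # hypothetical variable name
) -> Dict[str, str]:
    """Build the captcha dictionary from an environment variable."""
    env = os.environ if env is None else env
    api_key = env.get(variable, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {variable} environment variable first.")
    return {"provider": "capsolver", "api_key": api_key}


# Intended usage (requires cloudscraper and a real key):
#   import cloudscraper
#   scraper = cloudscraper.create_scraper(captcha=captcha_config_from_env())
```

Failing early with a clear message is a choice made for the sketch; it avoids sending requests with an empty key.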

Common cloudscraper Errors

`module not found`

Make sure cloudscraper is installed:

pip install cloudscraper

Next, check that your virtual environment is activated. On Windows:

.\<venv-name>\Scripts\activate.bat

On Linux or macOS:

source <venv-name>/bin/activate
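If the error persists, it may help to confirm which interpreter is running and whether it can see the package. The sketch below uses only the standard library; is_importable is a hypothetical helper name.

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if the current interpreter can find the top-level module."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter path shows whether the virtual environment is the one in use.
print(f"Interpreter: {sys.executable}")
print(f"cloudscraper importable: {is_importable('cloudscraper')}")
```

If the interpreter path points outside the virtual environment, the package was probably installed for a different Python.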

`cloudscraper can’t bypass the latest Cloudflare version`

Update the package:

pip install -U cloudscraper
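After upgrading, one way to confirm which version is actually installed is to read the package metadata, as sketched below. It assumes Python 3.8 or later, where importlib.metadata is part of the standard library; installed_version is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(f"cloudscraper version: {installed_version('cloudscraper')}")
```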

cloudscraper Alternatives

Bright Data provides a powerful proxy network for bypassing Cloudflare. Create and configure an account, then obtain your API credentials. Then use those credentials to access data at the target URL as follows:

import requests

host = 'brd.superproxy.io'
port = 22225

username = 'brd-customer-<customer_id>-zone-<zone_name>'
password = '<zone_password>'

proxy_url = f'http://{username}:{password}@{host}:{port}'

proxies = {
    'http': proxy_url,
    'https': proxy_url
}

response = requests.get(URL, proxies=proxies)

Here, a GET request is made with the Python Requests library, and the proxy is passed through the proxies parameter.
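If a username or password contains characters such as '@' or ':', placing it directly in the proxy URL can break the URL. The sketch below percent-encodes the credentials first. The build_proxies helper and the sample credentials are hypothetical; the host and port are the ones used in the snippet above.

```python
from typing import Dict
from urllib.parse import quote


def build_proxies(host: str, port: int, username: str, password: str) -> Dict[str, str]:
    """Build a requests-style proxies dict with URL-encoded credentials."""
    # quote(..., safe="") percent-encodes characters such as '@', ':' and '/'
    # that would otherwise be read as URL delimiters.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    proxy_url = f"http://{user}:{secret}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Placeholder credentials, for illustration only.
proxies = build_proxies("brd.superproxy.io", 22225, "example-user", "p@ss:word")
print(proxies["https"])
# http://example-user:p%40ss%3Aword@brd.superproxy.io:22225
```

The resulting dictionary can be passed to requests.get through the proxies parameter in the same way as before.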

Conclusion

cloudscraper is useful, but it has limitations. To access Cloudflare-protected sites, consider using Bright Data's proxy network and Web Unlocker.

Start with a free trial today!
