docs: Add Scrapling guide #938
Open: vdusek wants to merge 8 commits into master from docs/scrapling-guide
+254
−0
Commits (all by vdusek):
- 54e153d docs: Add Scrapling guide
- 29c4c8a docs: Split Scrapling guide example into modules and use code tabs
- 2a41a3f docs: use Request.crawl_depth for depth tracking in Scrapling example
- 910df14 docs: renumber Scrapling guide to 07 and switch to a single-file example
- 404bdfb chore: drop unused ruff ignore for the removed Scrapling project
- 55ad62a docs: reduce clause-gluing dashes in the Scrapling guide
- 440a30e docs: adjust wording style
- 2ced5c5 Merge branch 'master' into docs/scrapling-guide
@@ -0,0 +1,96 @@
+---
+id: scrapling
+title: Adaptive scraping with Scrapling
+description: Build an Apify Actor that scrapes web pages using the Scrapling adaptive web scraping library.
+---
+
+import CodeBlock from '@theme/CodeBlock';
+import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
+
+import ScraplingExample from '!!raw-loader!roa-loader!./code/07_scrapling.py';
+import ScraplingBrowserScraper from '!!raw-loader!./code/07_scrapling_browser.py';
+
+In this guide, you'll learn how to use the [Scrapling](https://scrapling.readthedocs.io/) library for adaptive web scraping in your Apify Actors.
+
+## Introduction
+
+[Scrapling](https://scrapling.readthedocs.io/) is an adaptive web scraping library for Python that combines fetching and parsing behind a single, high-level API. It can fetch a page with fast HTTP requests or with a real browser, parse the result with familiar CSS selectors and XPath, and even relocate your selectors automatically when a website's structure changes.
+
+Scrapling is a great fit for Apify Actors:
+
+- A single API exposes a fast HTTP client with browser TLS-fingerprint impersonation, as well as full browser automation for JavaScript-heavy or protected pages.
+- Scrapling can remember the elements you scraped and find them again after a website redesign. Your scrapers keep working with fewer manual fixes.
+- Built-in stealth features (browser impersonation, realistic headers, and automatic Cloudflare Turnstile solving with the browser fetchers) help you avoid being blocked.
+- Elements are selected with CSS selectors (including the `::text` and `::attr()` pseudo-elements) or XPath, with a Scrapy/Parsel-like `.get()` and `.getall()` interface.
+- Every fetcher has an asynchronous variant, which integrates naturally with the asyncio-based Apify SDK.
+
+Scrapling's parser works on its own. The fetchers are an optional extra. Install Scrapling with the `fetchers` extra to get the HTTP and browser fetchers:
+
+```bash
+pip install "scrapling[fetchers]"
+```
+
+## Choosing a fetcher
+
+All of Scrapling's fetchers are importable from `scrapling.fetchers`. Pick the one that matches the website you're scraping:
+
+- **`Fetcher` / `AsyncFetcher`** - Plain HTTP requests via `.get()`, `.post()`, `.put()`, and `.delete()`. Fast and lightweight, with optional browser TLS-fingerprint impersonation (`impersonate`) and realistic headers (`stealthy_headers`). This is the best choice for static pages and APIs, and it needs no browser binaries.
+- **`DynamicFetcher` / `DynamicSession`** - Full browser automation based on [Playwright](https://playwright.dev/), for pages that require JavaScript rendering or interaction. Fetch a page with `.fetch()` or its async variant `.async_fetch()`.
+- **`StealthyFetcher` / `StealthySession`** - A stealth-hardened browser fetcher that can automatically solve Cloudflare Turnstile challenges (`solve_cloudflare=True`). Use it for the most heavily protected websites.
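The selection rule in the list above can be written down as a small decision helper. This is only a sketch: `choose_fetcher` and its two flags are hypothetical names introduced for illustration, and the function returns the fetcher class names from the list as plain strings so that it runs without Scrapling installed.

```python
def choose_fetcher(*, needs_javascript: bool, heavily_protected: bool) -> str:
    """Return the name of the Scrapling fetcher that fits the target site.

    Hypothetical helper: both flags are assumptions you make about the site.
    """
    if heavily_protected:
        # Stealth-hardened browser, e.g. for Cloudflare Turnstile challenges.
        return 'StealthyFetcher'
    if needs_javascript:
        # Playwright-based browser for JavaScript rendering or interaction.
        return 'DynamicFetcher'
    # Plain HTTP: the lightest option, no browser binaries needed.
    return 'AsyncFetcher'
```

Checking protection first reflects the guide's ordering: the stealth fetcher is reserved for the most heavily protected sites, and plain HTTP is the default for everything else.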
+
+The returned `Response` object is also a Scrapling selector, so you can call `.css()`, `.xpath()`, `.find_all()`, and the other parsing methods on it directly.
+
+The HTTP fetchers work with just the `scrapling[fetchers]` extra. The browser-based fetchers (`DynamicFetcher` and `StealthyFetcher`) additionally need browser binaries, which you download with the `scrapling install` command. See [Running browser-based fetchers](#running-browser-based-fetchers) below.
+
+The example Actor in this guide uses the HTTP `AsyncFetcher`, which is the simplest to deploy and pairs well with Apify Proxy.
+
+## Example Actor
+
+The following Actor recursively scrapes data from linked pages on the same site, up to a user-defined maximum depth, starting from the URLs in the Actor input. It uses Scrapling's `AsyncFetcher` to fetch each page through [Apify Proxy](https://docs.apify.com/platform/proxy), and CSS selectors to extract the title, headings, and links.
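The depth-limited traversal described above can be sketched without any Apify or Scrapling dependency. Everything here is an illustrative assumption: `SITE` is a made-up in-memory link graph, `crawl` is a hypothetical function, and the `seen` set stands in for the request queue's deduplication.

```python
from collections import deque

# Hypothetical in-memory site: page URL -> links found on that page.
SITE = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/a/1'],
    'https://example.com/b': ['https://example.com/'],
    'https://example.com/a/1': [],
}


def crawl(
    start_urls: list[str],
    max_depth: int,
    max_requests: int = 50,
) -> list[tuple[str, int]]:
    """Visit pages breadth-first and return (url, depth) in visiting order."""
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)  # Stands in for the request queue's deduplication.
    visited: list[tuple[str, int]] = []

    while queue and len(visited) < max_requests:
        url, depth = queue.popleft()
        visited.append((url, depth))

        # Pages at max_depth are still scraped, but their links are not followed.
        if depth >= max_depth:
            continue
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

    return visited
```

With `max_depth=1`, the start page and the pages it links to are visited, and nothing deeper is enqueued.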
+
+The whole Actor fits in a single file. A `scrape_page` helper holds the Scrapling-specific fetching and parsing, while the `main` coroutine handles the [Actor](https://docs.apify.com/platform/actors) lifecycle, reads the input, sets up [Apify Proxy](https://docs.apify.com/platform/proxy) and the [request queue](https://docs.apify.com/platform/storage/request-queue), and drives the crawl:
+
+<RunnableCodeBlock className="language-python" language="python">
+{ScraplingExample}
+</RunnableCodeBlock>
+
+Note that:
+
+- Keeping the fetching and parsing in `scrape_page` separates the Scrapling-specific code from the Actor's orchestration logic. The function returns the extracted data together with the discovered links, so `main` decides what to store and what to enqueue.
+- The response of `AsyncFetcher.get` is a Scrapling selector, so `response.css('title::text').get()` reads the page title and `response.css('a::attr(href)').getall()` returns every link's `href` in one call.
+- `response.urljoin(link_href)` resolves relative links against the page URL, so you can enqueue them directly.
+- The `impersonate='chrome'` and `stealthy_headers=True` options make the request look like it comes from a real Chrome browser, which, combined with Apify Proxy, reduces the chance of being blocked.
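The link handling mentioned in these notes can be tried in isolation with the standard library. This sketch assumes that `response.urljoin` resolves links the same way `urllib.parse.urljoin` does against the page URL. `same_host_links` is a hypothetical helper name, and the URLs are made up.

```python
from urllib.parse import urljoin, urlsplit


def same_host_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against page_url and keep HTTP(S) links on the same host."""
    host = urlsplit(page_url).netloc
    links: list[str] = []
    for href in hrefs:
        # Relative links become absolute; absolute links pass through unchanged.
        link_url = urljoin(page_url, href)
        # Skip non-HTTP schemes such as mailto: or javascript:.
        if not link_url.startswith(('http://', 'https://')):
            continue
        if urlsplit(link_url).netloc == host:
            links.append(link_url)
    return links
```

Comparing `netloc` values treats `example.com` and `www.example.com` as different hosts, which is the same behavior as the filter in the example Actor.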
+
+## Using Apify Proxy
+
+Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. In the example above, `main` creates a proxy configuration with `Actor.create_proxy_configuration` and passes a fresh proxy URL to `scrape_page` for every request, which forwards it to Scrapling's `proxy` argument.
+
+Scrapling accepts the proxy as a URL string (for example `http://user:pass@proxy.apify.com:8000`), which is exactly what `ProxyConfiguration.new_url` returns. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management). The browser-based fetchers accept the same `proxy` argument.
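Because a proxy URL of this shape embeds credentials, it may be worth masking the password before the URL appears in logs. The following is a minimal standard-library sketch. `mask_proxy_url` is a hypothetical helper that is not part of Scrapling or the Apify SDK, and it does not handle IPv6 hosts.

```python
from urllib.parse import urlsplit, urlunsplit


def mask_proxy_url(proxy_url: str) -> str:
    """Return the proxy URL with the password replaced by '***'."""
    parts = urlsplit(proxy_url)
    if parts.password is None:
        # Nothing to hide: no credentials, or a username without a password.
        return proxy_url

    host = parts.hostname or ''
    if parts.port is not None:
        host = f'{host}:{parts.port}'
    netloc = f'{parts.username}:***@{host}'
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

The unmasked URL is still the one to pass as the `proxy` argument. The masked form is only for log messages.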
+
+## Running browser-based fetchers
+
+`DynamicFetcher` and `StealthyFetcher` drive a real browser, so they need the browser binaries installed with the `scrapling install` command. Locally, run it once after installing the `scrapling[fetchers]` extra:
+
+```bash
+scrapling install
+```
+
+Switching the example Actor from HTTP to a real browser takes only one code change. Swap the `AsyncFetcher.get` call in `scrape_page` for `DynamicFetcher.async_fetch`. The parsing API is identical, so the rest of the Actor stays the same:
+
+<CodeBlock className="language-python">
+{ScraplingBrowserScraper}
+</CodeBlock>
+
+To run this on the Apify platform, build on top of the [Apify Playwright base image](https://hub.docker.com/r/apify/actor-python-playwright), which already ships a browser together with all of its system-level dependencies, and run `scrapling install` during the Docker build to download the browser binaries that Scrapling expects.
+
+## Conclusion
+
+In this guide, you learned how to use Scrapling in your Apify Actors. You can now fetch pages with Scrapling's HTTP or browser-based fetchers, extract data with its CSS and XPath selectors, route requests through Apify Proxy, and run the whole thing on the Apify platform. To get started with your own scraping tasks, see the [Actor templates](https://apify.com/templates/categories/python). If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
+
+## Additional resources
+
+- [Scrapling: Official documentation](https://scrapling.readthedocs.io/)
+- [Scrapling: Fetchers](https://scrapling.readthedocs.io/en/latest/fetching/choosing/)
+- [Scrapling: Parsing and selecting elements](https://scrapling.readthedocs.io/en/latest/parsing/selection/)
+- [Scrapling: GitHub repository](https://github.com/D4Vinci/Scrapling)
+- [Apify: Proxy management](https://docs.apify.com/platform/proxy)
@@ -0,0 +1,122 @@
+import asyncio
+from typing import Any
+from urllib.parse import urlsplit
+
+from scrapling.fetchers import AsyncFetcher
+
+from apify import Actor, Request
+from apify.storages import RequestQueue
+
+
+async def scrape_page(
+    url: str,
+    *,
+    proxy_url: str | None = None,
+) -> tuple[dict[str, Any], list[str]]:
+    """Fetch a page with Scrapling's HTTP fetcher and return data and links."""
+    # `impersonate` and `stealthy_headers` make the request look like Chrome.
+    response = await AsyncFetcher.get(
+        url,
+        proxy=proxy_url,
+        impersonate='chrome',
+        stealthy_headers=True,
+        timeout=60,
+    )
+
+    data = {
+        'url': url,
+        'title': response.css('title::text').get(),
+        'h1s': response.css('h1::text').getall(),
+        'h2s': response.css('h2::text').getall(),
+        'h3s': response.css('h3::text').getall(),
+    }
+
+    # Keep only absolute links on the same host.
+    links: list[str] = []
+    host = urlsplit(url).netloc
+    for href in response.css('a::attr(href)').getall():
+        link_url = response.urljoin(href)
+        if not link_url.startswith(('http://', 'https://')):
+            continue
+        if urlsplit(link_url).netloc == host:
+            links.append(link_url)
+
+    return data, links
+
+
+async def enqueue_links(
+    request_queue: RequestQueue,
+    links: list[str],
+    *,
+    depth: int,
+    max_depth: int,
+) -> None:
+    """Enqueue the links one level deeper, unless max_depth was reached."""
+    if depth >= max_depth:
+        return
+
+    for link_url in links:
+        Actor.log.info(f'Enqueuing {link_url} ...')
+        request = Request.from_url(link_url)
+        request.crawl_depth = depth + 1
+        await request_queue.add_request(request)
+
+
+async def main() -> None:
+    async with Actor:
+        # Read the Actor input.
+        actor_input = await Actor.get_input() or {}
+        start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}])
+        max_depth = actor_input.get('maxDepth', 1)
+
+        if not start_urls:
+            Actor.log.info('No start URLs specified in Actor input, exiting...')
+            await Actor.exit()
+
+        # Set up Apify Proxy and the request queue.
+        proxy_configuration = await Actor.create_proxy_configuration()
+        request_queue = await Actor.open_request_queue()
+
+        # Enqueue the start URLs (crawl depth defaults to 0).
+        for start_url in start_urls:
+            url = start_url.get('url')
+            Actor.log.info(f'Enqueuing start URL: {url}')
+            await request_queue.add_request(Request.from_url(url))
+
+        # Cap the crawl; raise or remove to follow more pages.
+        max_requests = 50
+        handled_requests = 0
+
+        while handled_requests < max_requests and (
+            request := await request_queue.fetch_next_request()
+        ):
+            handled_requests += 1
+            url = request.url
+            depth = request.crawl_depth
+            Actor.log.info(f'Scraping {url} (depth={depth}) ...')
+
+            try:
+                # Fresh proxy URL per request (None if no proxy).
+                proxy_url = None
+                if proxy_configuration:
+                    proxy_url = await proxy_configuration.new_url()
+
+                data, links = await scrape_page(url, proxy_url=proxy_url)
+                await Actor.push_data(data)
+                Actor.log.info(
+                    f'Stored data from {url} '
+                    f'(title={data["title"]!r}, {len(links)} links found).'
+                )
+                await enqueue_links(
+                    request_queue, links, depth=depth, max_depth=max_depth
+                )
+
+            except Exception:
+                Actor.log.exception(f'Cannot extract data from {url}.')
+
+            finally:
+                await request_queue.mark_request_as_handled(request)
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
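In `main`, `start_url.get('url')` yields `None` for an input entry without a `url` key, and that value is then handed to `Request.from_url`. One possible guard is sketched below. `extract_start_urls` is a hypothetical helper that is not part of this PR or of the Apify SDK, and it assumes the `startUrls` input is a list of objects with a `url` field, as in the example.

```python
from typing import Any


def extract_start_urls(actor_input: dict[str, Any]) -> list[str]:
    """Return the non-empty 'url' values from the 'startUrls' input field."""
    urls: list[str] = []
    for entry in actor_input.get('startUrls', []):
        # Ignore entries that are not objects or that lack a usable URL.
        url = entry.get('url') if isinstance(entry, dict) else None
        if isinstance(url, str) and url.strip():
            urls.append(url.strip())
    return urls
```

The Actor could then log and skip malformed entries instead of failing while it enqueues the start URLs.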
@@ -0,0 +1,35 @@
+from typing import Any
+
+from scrapling.fetchers import DynamicFetcher
+
+
+async def scrape_page(
+    url: str,
+    *,
+    proxy_url: str | None = None,
+) -> tuple[dict[str, Any], list[str]]:
+    """Fetch a page in a real browser with Scrapling and return data and links."""
+    # `network_idle` waits until the page stops making network requests.
+    response = await DynamicFetcher.async_fetch(
+        url,
+        proxy=proxy_url,
+        headless=True,
+        network_idle=True,
+    )
+
+    data = {
+        'url': url,
+        'title': response.css('title::text').get(),
+        'h1s': response.css('h1::text').getall(),
+        'h2s': response.css('h2::text').getall(),
+        'h3s': response.css('h3::text').getall(),
+    }
+
+    # Collect absolute links from the page.
+    links: list[str] = []
+    for href in response.css('a::attr(href)').getall():
+        link_url = response.urljoin(href)
+        if link_url.startswith(('http://', 'https://')):
+            links.append(link_url)
+
+    return data, links