Headless Playwright automation with pagination, selector fallbacks, deduplication, and canary detection. Point it at a page, get structured data back.
Most scrapers treat "page loaded but nothing matched" as success. They return an empty array and move on. Your pipeline keeps running on stale data for days before someone checks.
If a page loads but yields zero product cards, the scraper throws a CanaryError instead of returning empty results. You find out immediately when a site change breaks extraction, not when a customer complains.
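The canary check can be sketched in a few lines. This is an illustrative version, not the shipped code: the CanaryError name comes from the description above, and assertCanary is a hypothetical helper that would run right after extraction.

```javascript
// Sketch of the canary check described above (assertCanary is hypothetical).
class CanaryError extends Error {
  constructor(url) {
    super(`Canary tripped: page loaded but zero product cards extracted at ${url}`);
    this.name = "CanaryError";
  }
}

// After a successful navigation, an empty extraction is treated as a
// failure signal rather than a valid (empty) result set.
function assertCanary(records, url) {
  if (records.length === 0) throw new CanaryError(url);
  return records;
}
```

The point of the design: an exception is loud and stops the pipeline, while an empty array looks identical to "the site genuinely has no products."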
Tries data-testid attributes first, then falls back to semantic HTML selectors automatically. Your scraper survives markup changes without code edits.
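The fallback logic amounts to "try extractors in priority order, keep the first one that matches." A minimal sketch, assuming each entry is an async function returning an array of matches (the function name and shape are illustrative, not the shipped API):

```javascript
// Illustrative fallback chain: each extractor is an async function that
// returns an array of matched elements; the first non-empty result wins.
async function firstNonEmpty(extractors) {
  for (const extract of extractors) {
    const matches = await extract();
    if (matches.length > 0) return matches; // stop at the first selector that hits
  }
  return []; // nothing matched at any tier
}
```

With Playwright, the entries might look like `() => page.locator('[data-testid="product-card"]').all()` followed by `() => page.locator("article.product").all()` — selector strings here are assumptions for illustration.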
Follows next-page links up to a configurable cap. Rate-limits between requests (default 300ms). No manual page loop needed.
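The pagination loop can be sketched as below. Everything here is an assumption for illustration: fetchPage is a hypothetical stand-in for "navigate, extract, find the next-page link," and the option names mirror the configurable cap and rate limit described above.

```javascript
// Pagination sketch: fetchPage(url) is assumed to return { records, nextUrl },
// with nextUrl null when no next-page link exists.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawl(startUrl, fetchPage, { maxPages = 10, rateLimitMs = 300 } = {}) {
  const all = [];
  let url = startUrl;
  for (let page = 0; url && page < maxPages; page++) {
    const { records, nextUrl } = await fetchPage(url);
    all.push(...records);
    url = nextUrl;
    if (url) await sleep(rateLimitMs); // rate-limit between requests, not after the last one
  }
  return all;
}
```

The cap bounds the run even if a site's pagination loops back on itself.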
Tracks detail URLs across all pages. Same product on page 2 and page 5? You get it once. Duplicates never reach your dataset.
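Cross-page deduplication is a Set of detail URLs that persists across the whole run. A minimal sketch — the field name detailUrl is an assumption, not the shipped schema:

```javascript
// Dedup sketch: `seen` persists across pages, so a record already collected
// on an earlier page is silently dropped on later ones.
function dedupeByUrl(records, seen = new Set()) {
  const fresh = [];
  for (const rec of records) {
    if (seen.has(rec.detailUrl)) continue;
    seen.add(rec.detailUrl);
    fresh.push(rec);
  }
  return fresh;
}
```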
Prices come back as numbers, not strings with currency symbols. Ratings are integers. Stock status is normalized. Clean data from the start.
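The normalization step might look like the sketch below. The raw field names and output shape are assumptions for illustration; the shipped schema may differ.

```javascript
// Normalization sketch: strip currency symbols and thousands separators from
// prices, coerce ratings to integers, and collapse stock text to a boolean.
function normalize(raw) {
  return {
    price: parseFloat(raw.price.replace(/[^0-9.]/g, "")), // "$1,299.99" -> 1299.99
    rating: parseInt(raw.rating, 10),
    inStock: /in\s*stock/i.test(raw.stock) && !/out\s*of/i.test(raw.stock),
  };
}
```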
Each page navigation gets one automatic retry on failure. Configurable timeouts. Transient network issues don't kill the run.
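The retry behavior amounts to a thin wrapper around the navigation call. A sketch, assuming attempt() stands in for something like a page.goto call:

```javascript
// Retry sketch: exactly one automatic re-attempt; a second failure
// propagates to the caller instead of looping forever.
async function withOneRetry(attempt) {
  try {
    return await attempt();
  } catch (firstError) {
    return await attempt(); // single retry
  }
}
```

One retry handles the common transient cases (DNS hiccup, dropped connection) without masking a site that is genuinely down.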
Playwright is the only dependency: npm install playwright and you're set. The scraper uses headless Chromium under the hood.
Call the scrape function with a target URL and optional config (max pages, rate limit, timeout). Works with live sites and file:// URLs for local testing.
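For a sense of the call shape, a config object might look like the fragment below. The option names (maxPages, rateLimitMs, timeoutMs) are assumptions chosen to match the knobs described above — check the shipped source for the exact keys.

```javascript
// Hypothetical config fragment; key names are illustrative, not the shipped API.
const config = {
  maxPages: 5,       // stop after five pagination hops
  rateLimitMs: 300,  // delay between page requests
  timeoutMs: 15000,  // per-navigation timeout
};
```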
Structured records with typed fields. Prices are numeric, ratings are integers, URLs are absolute. Pipe it straight into your database or API.
One-time purchase. Full source code.
Send us the target URL and the fields you need. We'll tell you what's feasible and how long it takes.
Get in Touch