I extract data
from the web.

10 years building scrapers that work, from simple data collection to bypassing sophisticated anti-bot systems. I specialize in large-scale data extraction from protected sites.

10 Years
50+ Clients

What I Do

01

Large-Scale Scraping

Extract millions of records from complex websites. Pagination, dynamic content, and varying structures, all handled reliably.

02

Anti-Bot Bypass

Residential proxies, browser fingerprinting, CAPTCHA solving, session management. I get data others can't.

03

Legal & Regulatory Data

US state government sites, court records, tax codes, Cornell Law. Consistent data from inconsistent sources.

04

Data Pipelines

Automated collection, processing, storage. Daily runs, change detection, error alerting. Set it and forget it.

05

Document Processing

HTML, PDF, EPUB to structured data. Tables, forms, complex layouts parsed into clean formats.

06

API Development

Turn scraped data into RESTful APIs. Real-time or cached endpoints for your applications.

Technologies

Languages

  • Python
  • JavaScript
  • SQL

Scraping & Automation

  • Playwright
  • Selenium
  • SeleniumBase
  • nodriver
  • undetected-chromedriver
  • curl_cffi
  • BeautifulSoup
  • lxml
  • Requests

Databases

  • PostgreSQL
  • MySQL
  • MongoDB
  • SQLite

Infrastructure

  • DigitalOcean
  • AWS (EC2, S3)
  • Docker
  • Ubuntu/Linux
  • Nginx
  • Cron
  • GitHub Actions

Anti-Detection

  • Residential Proxies
  • Datacenter Proxies
  • Rotating User Agents
  • Browser Fingerprinting
  • Session Management
  • CAPTCHA Solving (2Captcha, Anti-Captcha)
  • Stealth Plugins
  • Request Throttling

Document Parsing

  • PyPDF2 / pdfplumber
  • Tabula
  • Camelot
  • Apache Tika
  • Tesseract OCR
  • ebooklib

Output Formats

  • JSON / JSONL
  • CSV / Excel
  • Parquet
  • SQL Dumps
  • REST APIs
  • GraphQL
  • Webhooks

Selected Projects

E-Learning 2025

Udemy Instructor & Course Scraper

Large-scale extraction of instructor profiles and course metadata from Udemy. Bypassed bot detection using undetected-chromedriver. Multi-threaded architecture with resumable execution, proxy rotation, and automatic URL discovery via sitemaps.

Python Selenium undetected-chromedriver CSV

Problem

Client needed to extract comprehensive instructor profiles and course data from Udemy at scale. Udemy employs aggressive bot detection that blocks standard Selenium-based approaches.

Technical Challenges

  • Bypassed bot detection using undetected-chromedriver
  • Built multi-threaded architecture for parallel processing
  • Implemented resumable execution with URL caching
  • Automatic instructor URL discovery via Udemy sitemaps
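
The sitemap discovery and resumable execution points above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `/user/` path filter, the cache file, and the names `instructor_urls` and `UrlCache` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Standard sitemap namespace from sitemaps.org.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def instructor_urls(sitemap_xml: str, marker: str = "/user/") -> list[str]:
    """Return <loc> URLs that contain `marker` (an assumed path pattern)."""
    root = ET.fromstring(sitemap_xml)
    locs = (el.text.strip() for el in root.findall(".//sm:loc", SITEMAP_NS) if el.text)
    return [url for url in locs if marker in url]


class UrlCache:
    """Append-only record of finished URLs so an interrupted run can resume."""

    def __init__(self, path: Path):
        self.path = path
        self.done = set(path.read_text().split()) if path.exists() else set()

    def pending(self, urls):
        """URLs not yet processed, in their original order."""
        return [u for u in urls if u not in self.done]

    def mark_done(self, url: str) -> None:
        """Record the URL in memory and on disk right away."""
        self.done.add(url)
        with self.path.open("a") as fh:
            fh.write(url + "\n")
```

Writing each finished URL to disk immediately means a crash loses at most the page in flight; on restart, `pending()` skips everything already recorded.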

Data Extracted

Instructor name, bio, photo, total students, reviews, social links, course titles, lecture counts, ratings, pricing, and content duration.

View on GitHub

People Search 2025

People Search Data Extractor

Bypassed aggressive anti-bot protection on a major people search platform. Automated Cloudflare Turnstile CAPTCHA solving via CapSolver API. Async architecture extracts contact info, addresses, work history into structured JSON.

Python nodriver CapSolver asyncio

Problem

Client needed to extract people search data from a platform with extremely aggressive anti-bot protection, including Cloudflare Turnstile CAPTCHAs that block virtually all automation attempts.

Technical Challenges

  • Used nodriver (successor to undetected-chromedriver) for stealth browsing
  • Integrated CapSolver API for automated Turnstile CAPTCHA solving
  • Built async architecture for efficient resource usage
  • Graceful error handling with no crashes on missing elements
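
As an illustration of the async architecture and the "no crashes on missing elements" point, here is a small sketch using only `asyncio`. It does not use nodriver or CapSolver; the `fetch` callable, the `profile`/`contact` field names, and the helper names are hypothetical stand-ins.

```python
import asyncio


def safe_get(record, *keys, default=None):
    """Walk nested keys; return `default` instead of raising when one is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


async def extract_all(ids, fetch, max_concurrency: int = 5):
    """Run `fetch(id)` for every id with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(item_id):
        async with semaphore:
            try:
                raw = await fetch(item_id)
            except Exception as exc:  # one bad page should not stop the run
                return {"id": item_id, "error": str(exc)}
        # Missing fields become None rather than raising KeyError.
        return {
            "id": item_id,
            "name": safe_get(raw, "profile", "name"),
            "phone": safe_get(raw, "contact", "phone"),
        }

    # gather() returns results in the same order as the input ids.
    return await asyncio.gather(*(worker(i) for i in ids))
```

The semaphore caps how many pages are open at once, and failures are returned as records rather than raised, so they can be retried later.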

Data Extracted

Names, ages, current/past addresses, phone numbers, emails, work history, education records, and possible business ownership.

View on GitHub

Vacation Rentals 2025

Vacation Rental Sync System

Production-grade system scraping 365-day availability and pricing from a major vacation rental platform with enterprise anti-bot protection (Akamai/DataDome). TLS fingerprint spoofing, rotation across 20+ proxies, self-healing cookies. Flask dashboard with Guesty API integration for automated calendar sync.

Python Flask curl_cffi Guesty API

Problem

Client needed automated competitive intelligence: scraping vacation rental availability and pricing, then syncing it to their Guesty property management platform. The target site uses enterprise-level Akamai/DataDome protection.

Technical Challenges

  • Bypassed Akamai/DataDome using TLS fingerprint spoofing via curl_cffi
  • Built self-healing cookie management with success/failure tracking
  • Implemented rotation across 20+ proxies with sticky session binding
  • Created Flask dashboard for property management and CSV exports
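
Sticky session binding with success/failure tracking might look something like the sketch below. The codebase is private, so this is an assumption-laden illustration rather than the real implementation: `ProxyPool`, the `max_failures` threshold, and the hash-based assignment are all hypothetical, and the example makes no curl_cffi calls.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ProxyPool:
    """Bind each session key to one proxy; skip proxies that keep failing."""

    proxies: list[str]
    max_failures: int = 3
    failures: dict[str, int] = field(default_factory=dict)

    def healthy(self) -> list[str]:
        """Proxies whose consecutive-failure count is under the threshold."""
        return [p for p in self.proxies if self.failures.get(p, 0) < self.max_failures]

    def proxy_for(self, session_key: str) -> str:
        """Same key maps to the same proxy while the healthy set is unchanged."""
        pool = self.healthy()
        if not pool:
            raise RuntimeError("no healthy proxies left")
        digest = hashlib.sha256(session_key.encode()).digest()
        return pool[int.from_bytes(digest[:8], "big") % len(pool)]

    def report(self, proxy: str, ok: bool) -> None:
        """A success resets the failure count; a failure increments it."""
        self.failures[proxy] = 0 if ok else self.failures.get(proxy, 0) + 1
```

Keying the choice on a stable hash keeps cookies and IP address paired across requests for the same listing, which is the usual reason for sticky sessions. When a proxy is retired, the affected keys are remapped to a remaining healthy proxy.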

Results

Fully automated pipeline running 3x daily. 365 days of pricing data per property. Zero manual intervention. Seamless Guesty calendar sync with rate limit handling.

Private Codebase

Client Reviews

WordPress Product Feeds
★★★★★ 5.0

"Michal is easy to work with and delivers perfect code. Pleasant communication, too. No complaints. Thank you Michal!"

Nov 2022 - Sep 2025

Reddit Scraper
★★★★★ 5.0

"Michal worked very diligently and was able to complete the task precisely as asked. He even had the finished products submitted a day before the deadline. I would highly recommend as his knowledge with machine learning is incomparable. I look forward to working with him again!"

Apr 2021 - May 2021

Social Media & News Scraper
★★★★★ 5.0

"Had a great experience. Timely and cost-effective delivery of a web-scraper. Kept it running without glitches for the time-period we needed. Handled feedback and changes. Would definitely work with Michael again."

Sep 2020 - Dec 2020

View All Reviews on Upwork

Frequently Asked Questions

How much does a web scraping project cost?

It depends on complexity, scale, and anti-bot protection. Here are real project examples:

  • Single-site product scraper (e-commerce, 10k products): $400-800
  • Social media / news scraper (multiple sources): $800-1,500
  • Anti-bot bypass project (Cloudflare/Akamai protected): $1,500-4,000
  • Multi-state government data (50 states, varying structures): $3,000-6,000
  • Ongoing data pipeline (daily runs, monitoring, maintenance): $500-1,500/month

Most projects fall in the $1,000-2,000 range. Complex anti-bot or enterprise-scale projects can exceed $10,000.

Can you scrape [specific website]?

Probably. I specialize in sites that block standard scraping tools, including sites protected by Cloudflare, Akamai, DataDome, PerimeterX, and custom anti-bot systems. If you've tried other scrapers and they got blocked, that's exactly the kind of project I take on.

Send me the URL and I'll give you an honest assessment of feasibility and approach.

Is web scraping legal?

Generally yes, for publicly available data. In hiQ Labs v. LinkedIn (2022), the Ninth Circuit held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act, although other claims, such as breach of a site's terms of service, can still apply. I don't scrape data behind logins without authorization, personal data for prohibited purposes, or content protected by clear legal restrictions.

I'm happy to discuss the specifics of your project to ensure we're on solid legal ground.

How long does a project take?

Simple scrapers: 2-5 days. Complex projects with anti-bot bypass: 1-3 weeks. Multi-state or large-scale systems: 3-6 weeks. I'll give you a realistic timeline after understanding your requirements. I don't pad estimates, but I also don't promise miracles.

What if the website changes and the scraper breaks?

Websites change. It happens. For one-time projects, I include a 30-day bug fix period after delivery. For ongoing data needs, I offer maintenance retainers that include monitoring, updates when sites change, and priority support. Most clients with critical data pipelines opt for this.
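
One simple way monitoring can catch a site change is to track how often each expected field comes back filled. The sketch below is a generic illustration under that assumption, not a description of any specific client setup; the function names and the 0.9 threshold are hypothetical.

```python
def field_fill_rates(records: list[dict], required: list[str]) -> dict[str, float]:
    """Fraction of records with a non-empty value for each required field."""
    total = len(records)
    if total == 0:
        # An empty run counts as fully broken for every field.
        return {name: 0.0 for name in required}
    return {
        name: sum(1 for r in records if r.get(name) not in (None, "", [])) / total
        for name in required
    }


def broken_fields(records, required, min_rate: float = 0.9) -> list[str]:
    """Fields whose fill rate fell below `min_rate`, a hint that a selector broke."""
    rates = field_fill_rates(records, required)
    return [name for name, rate in rates.items() if rate < min_rate]
```

A daily run could call `broken_fields` on its output and send an alert when the list is non-empty, so a layout change surfaces the same day instead of as weeks of silently missing data.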

What data formats do you deliver?

Whatever you need: JSON, CSV, Excel, direct database insertion (PostgreSQL, MySQL, MongoDB), API endpoints, or webhook delivery. I'll match your existing systems. If you're not sure what format works best, I'll recommend based on your use case.
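
For a sense of what two of these deliverables involve, here is a short standard-library sketch that serializes the same records as JSONL and as CSV. The sample field names are made up for the example.

```python
import csv
import io
import json


def to_jsonl(records: list[dict]) -> str:
    """One JSON object per line; non-ASCII text is kept readable."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def to_csv(records: list[dict]) -> str:
    """CSV with a header built from every key seen, in first-seen order."""
    columns = list(dict.fromkeys(key for r in records for key in r))
    buffer = io.StringIO()
    # restval fills in cells for keys a given record does not have.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

JSONL tolerates records with differing fields, which suits scraped data with varying structure. CSV needs a fixed header, so the sketch builds one from all keys and leaves missing cells empty.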

Let's Work Together

Have a data extraction project? Tell me about it and I'll get back to you within 24 hours.