01 About

I extract data
from the web.

10 years building scrapers that work, from simple data collection to large-scale extraction from sites protected by sophisticated anti-bot systems.

10 Years

50+ Clients

100% Success

02 Services

What I Do

01

Large-Scale Scraping

Extract millions of records from complex websites. Pagination, dynamic content, and varying structures, all handled reliably.

02

Anti-Bot Bypass

Residential proxies, browser fingerprinting, CAPTCHA solving, session management. I get data others can't.

03

Legal & Regulatory Data

US state government sites, court records, tax codes, Cornell Law. Consistent data from inconsistent sources.

04

Data Pipelines

Automated collection, processing, storage. Daily runs, change detection, error alerting. Set it and forget it.
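Change detection between daily runs can be sketched with content hashing. This is a minimal illustration, not code from a client pipeline: the function names `record_fingerprint` and `detect_changes` and the shape of the inputs (records keyed by an ID) are assumptions made for the example.

```python
import hashlib
import json


def record_fingerprint(record: dict) -> str:
    """Return a stable SHA-256 hash of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_changes(previous: dict, current: dict) -> dict:
    """Compare two runs keyed by record ID.

    `previous` maps record ID -> fingerprint stored by the last run.
    `current` maps record ID -> freshly scraped record (a dict).
    """
    fingerprints = {key: record_fingerprint(rec) for key, rec in current.items()}
    return {
        "added": sorted(k for k in fingerprints if k not in previous),
        "removed": sorted(k for k in previous if k not in fingerprints),
        "changed": sorted(
            k for k, fp in fingerprints.items()
            if k in previous and previous[k] != fp
        ),
        # Persist this mapping so the next run has something to compare against.
        "fingerprints": fingerprints,
    }
```

Storing only fingerprints keeps the comparison cheap even for large datasets; an alerting step could fire when the "removed" list is unexpectedly long, which often signals a broken selector rather than real deletions.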

05

Document Processing

HTML, PDF, EPUB to structured data. Tables, forms, complex layouts parsed into clean formats.

06

API Development

Turn scraped data into RESTful APIs. Real-time or cached endpoints for your applications.

03 Stack

Technologies

Languages

Python
JavaScript
SQL

Scraping & Automation

Playwright
Selenium
SeleniumBase
nodriver
undetected-chromedriver
curl_cffi
BeautifulSoup
lxml
Requests

Databases

PostgreSQL
MySQL
MongoDB
SQLite

Infrastructure

DigitalOcean
AWS (EC2, S3)
Docker
Ubuntu/Linux
Nginx
Cron
GitHub Actions

Anti-Detection

Residential Proxies
Datacenter Proxies
Rotating User Agents
Browser Fingerprinting
Session Management
CAPTCHA Solving (2Captcha, Anti-Captcha)
Stealth Plugins
Request Throttling

Document Parsing

PyPDF2 / pdfplumber
Tabula
Camelot
Apache Tika
Tesseract OCR
ebooklib

Output Formats

JSON / JSONL
CSV / Excel
Parquet
SQL Dumps
REST APIs
GraphQL
Webhooks

04 Work

Selected Projects

E-Learning 2025

Udemy Instructor & Course Scraper

Large-scale extraction of instructor profiles and course metadata from Udemy. Bypassed bot detection using undetected-chromedriver. Multi-threaded architecture with resumable execution, proxy rotation, and automatic URL discovery via sitemaps.

Python, Selenium, undetected-chromedriver, CSV

Problem

Client needed to extract comprehensive instructor profiles and course data from Udemy at scale. Udemy employs aggressive bot detection that blocks standard Selenium-based approaches.

Technical Challenges

Bypassed bot detection using undetected-chromedriver
Built multi-threaded architecture for parallel processing
Implemented resumable execution with URL caching
Automatic instructor URL discovery via Udemy sitemaps
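
Sitemap-based discovery and resumable execution, as listed above, might look roughly like this. It is a sketch under stated assumptions rather than the project's code: the `/user/` URL filter, the `UrlCheckpoint` class, and the checkpoint file format are hypothetical. Only the sitemap XML namespace comes from the public sitemaps.org protocol.

```python
import threading
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str, must_contain: str = "/user/") -> list:
    """Return <loc> URLs from a sitemap that contain `must_contain`.

    The default filter is an assumed pattern for instructor pages.
    """
    root = ET.fromstring(xml_text)
    locs = [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]
    return [url for url in locs if must_contain in url]


class UrlCheckpoint:
    """Append-only file of finished URLs, so an interrupted run can resume."""

    def __init__(self, path):
        self.path = Path(path)
        self._lock = threading.Lock()  # worker threads share one checkpoint
        self.done = set()
        if self.path.exists():
            self.done = set(self.path.read_text(encoding="utf-8").split())

    def pending(self, urls):
        """Return the URLs that have not been completed yet, in order."""
        return [url for url in urls if url not in self.done]

    def mark_done(self, url):
        """Record a finished URL in memory and on disk."""
        with self._lock:
            if url not in self.done:
                self.done.add(url)
                with self.path.open("a", encoding="utf-8") as fh:
                    fh.write(url + "\n")
```

On restart, a worker pool would call `pending()` on the discovered URLs and skip everything already written to the checkpoint file.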

Data Extracted

Instructor name, bio, photo, total students, reviews, social links, course titles, lecture counts, ratings, pricing, and content duration.

View on GitHub

People Search 2025

People Search Data Extractor

Bypassed aggressive anti-bot protection on a major people search platform. Automated Cloudflare Turnstile CAPTCHA solving via CapSolver API. Async architecture extracts contact info, addresses, and work history into structured JSON.

Python, nodriver, CapSolver, asyncio

Problem

Client needed to extract people search data from a platform with extremely aggressive anti-bot protection, including Cloudflare Turnstile CAPTCHAs that block virtually all automation attempts.

Technical Challenges

Used nodriver (successor to undetected-chromedriver) for stealth browsing
Integrated CapSolver API for automated Turnstile CAPTCHA solving
Built async architecture for efficient resource usage
Graceful error handling with no crashes on missing elements
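
The async design and the "no crashes on missing elements" behaviour above can be illustrated with plain asyncio. This sketch assumes a hypothetical `fetch` coroutine that returns an already parsed page as a dict; it does not use nodriver or CapSolver calls, and the field names are invented for the example.

```python
import asyncio


def safe_get(data, *keys, default=None):
    """Walk nested dicts/lists, returning `default` instead of raising."""
    current = data
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


async def extract_profile(fetch, url, limiter):
    """Fetch one profile; failures become an error record, not an exception."""
    async with limiter:  # cap the number of pages in flight
        try:
            raw = await fetch(url)
        except Exception as exc:
            return {"url": url, "error": repr(exc)}
    return {
        "url": url,
        "name": safe_get(raw, "name"),
        "phone": safe_get(raw, "phones", 0),
        "employer": safe_get(raw, "work", 0, "employer"),
    }


async def extract_all(fetch, urls, max_concurrency=5):
    """Run all extractions concurrently, bounded by `max_concurrency`."""
    limiter = asyncio.Semaphore(max_concurrency)
    tasks = (extract_profile(fetch, url, limiter) for url in urls)
    return await asyncio.gather(*tasks)
```

Missing fields come back as `None` in the output JSON, so one incomplete profile does not abort the batch.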

Data Extracted

Names, ages, current/past addresses, phone numbers, emails, work history, education records, and possible business ownership.

View on GitHub

Vacation Rentals 2025

Vacation Rental Sync System

Production-grade system scraping 365-day availability and pricing from a major vacation rental platform with enterprise anti-bot protection (Akamai/DataDome). TLS fingerprint spoofing, rotation across 20+ proxies, self-healing cookies. Flask dashboard with Guesty API integration for automated calendar sync.

Python, Flask, curl_cffi, Guesty API

Problem

Client needed automated competitive intelligence: scraping vacation rental availability and pricing, then syncing it to their Guesty property management platform. The target site uses enterprise-level Akamai/DataDome protection.

Technical Challenges

Bypassed Akamai/DataDome using TLS fingerprint spoofing via curl_cffi
Built self-healing cookie management with success/failure tracking
Implemented rotation across 20+ proxies with sticky session binding
Created Flask dashboard for property management and CSV exports
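
Proxy rotation with sticky session binding and success/failure tracking, as described above, could be sketched like this. The `StickyProxyPool` class, its method names, and the `max_failures` threshold are hypothetical; the production system is private and may work differently.

```python
import itertools


class StickyProxyPool:
    """Round-robin proxy pool that keeps each session on one proxy.

    A session stays bound to its proxy until that proxy accumulates
    `max_failures` consecutive failures, then it is rebound to the
    next healthy proxy.
    """

    def __init__(self, proxies, max_failures=3):
        unique = list(dict.fromkeys(proxies))  # de-duplicate, keep order
        if not unique:
            raise ValueError("at least one proxy is required")
        self.max_failures = max_failures
        self.failures = {proxy: 0 for proxy in unique}
        self._cycle = itertools.cycle(unique)
        self._bindings = {}

    def _is_healthy(self, proxy):
        return self.failures[proxy] < self.max_failures

    def _next_healthy(self):
        # Try each proxy at most once per call.
        for _ in range(len(self.failures)):
            proxy = next(self._cycle)
            if self._is_healthy(proxy):
                return proxy
        raise RuntimeError("no healthy proxies left")

    def proxy_for(self, session_id):
        """Return the proxy bound to this session, rebinding if it went bad."""
        proxy = self._bindings.get(session_id)
        if proxy is None or not self._is_healthy(proxy):
            proxy = self._next_healthy()
            self._bindings[session_id] = proxy
        return proxy

    def report(self, session_id, ok):
        """Record a request outcome; a success resets the failure streak."""
        proxy = self._bindings[session_id]
        self.failures[proxy] = 0 if ok else self.failures[proxy] + 1
```

Binding cookies to the same `session_id` keeps a cookie jar and its exit IP together. The same counters could drive the cookie refresh: when a session's proxy is replaced, its cookies would be discarded and re-acquired.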

Results

Fully automated pipeline running 3x daily. 365 days of pricing data per property. Zero manual intervention. Seamless Guesty calendar sync with rate limit handling.

Private Codebase

05 Contact

Let's Work Together

Have a data extraction project? Tell me about it and I'll get back to you within 24 hours.

Or reach out directly:

Email: hello@michaldata.dev
Upwork: View Profile
