GitHub - D4Vinci/Scrapling: 🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

D4Vinci
2026.03.01
·GitHub·by 이호민
#AI#Automation#Framework#Python#Web Scraping

Key Points

  1. Scrapling is an adaptive web scraping framework that handles everything from a single request to a full-scale crawl.
  2. Its parser adapts to website changes by automatically relocating elements, its fetchers bypass anti-bot systems such as Cloudflare Turnstile, and its Spider API supports multi-session crawling, proxy rotation, and pause/resume.
  3. Scrapling offers better performance and memory efficiency than existing libraries, and simplifies web scraping work through developer-friendly features, including a CLI and an interactive shell, and through AI integration (MCP server).

Scrapling is an adaptive web scraping framework designed to handle everything from a single request to a full-scale crawl. Its parser learns from website changes and automatically relocates elements when pages are updated, and it provides fetchers that can bypass anti-bot systems such as Cloudflare Turnstile out of the box. Its Spider framework supports concurrent, multi-session crawling, with built-in pause/resume and automatic proxy rotation.

The main features and core methodology are as follows:

  1. Spider (crawling framework)
    • Scrapy-like Spider API: Provides a Scrapy-like API, including start_urls, async parse callbacks, and Request/Response objects, which makes spiders easy to define.
    • Concurrent Crawling: Supports efficient concurrent crawling through configurable concurrency limits, per-domain throttling, and download delays.
    • Multi-Session Support: Manages HTTP requests and stealthy headless browser requests within a single spider, and routes requests to different sessions by sid (session ID).
    • Pause & Resume: Provides checkpoint-based crawl persistence. Progress is saved when the crawl is stopped with Ctrl+C, and the crawl can be resumed later.
    • Streaming Mode: Streams scraped items as they arrive, along with real-time statistics, via async for item in spider.stream().
    • Blocked Request Detection: Automatically detects blocked requests and retries them with customizable logic.
    • Built-in Export: Exports results as JSON or JSONL via result.items.to_json() or result.items.to_jsonl().
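
The concurrency limit and blocked-request retry described above can be illustrated with a small standard-library sketch. It does not use Scrapling's Spider API: `fake_fetch`, `is_blocked`, `fetch_with_retry`, `crawl`, the scripted responses, and the choice of 403/429 as "blocked" statuses are all assumptions made for illustration, and the network call is simulated.

```python
import asyncio

BLOCKED_STATUSES = {403, 429}  # assumption: statuses treated as "blocked"

# Simulated responses per URL, consumed in order (stand-in for real network I/O).
SCRIPTED = {
    "https://example.com/a": [200],
    "https://example.com/b": [429, 200],       # blocked once, then succeeds
    "https://example.com/c": [403, 403, 403],  # always blocked
}

async def fake_fetch(url: str, attempts: dict) -> int:
    """Hypothetical fetcher: returns the next scripted status for this URL."""
    n = attempts.get(url, 0)
    attempts[url] = n + 1
    await asyncio.sleep(0)  # yield control the way real I/O would
    statuses = SCRIPTED[url]
    return statuses[min(n, len(statuses) - 1)]

def is_blocked(status: int) -> bool:
    """Customizable rule deciding whether a response counts as blocked."""
    return status in BLOCKED_STATUSES

async def fetch_with_retry(url, sem, attempts, max_retries=2):
    """Fetch under a concurrency limit, retrying while the response looks blocked."""
    async with sem:
        for _ in range(max_retries + 1):
            status = await fake_fetch(url, attempts)
            if not is_blocked(status):
                return url, status
        return url, None  # still blocked after max_retries retries

async def crawl(urls, concurrency=2):
    sem = asyncio.Semaphore(concurrency)  # at most `concurrency` fetches in flight
    attempts = {}
    pairs = await asyncio.gather(*(fetch_with_retry(u, sem, attempts) for u in urls))
    return dict(pairs), attempts

results, attempts = asyncio.run(crawl(list(SCRIPTED)))
print(results)
```

In this sketch the semaphore caps the number of in-flight requests, and swapping `is_blocked` for another predicate changes what counts as a block without touching the retry loop.
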
  2. Advanced Website Fetching (with session support)
    • HTTP Requests (Fetcher): Performs fast, stealthy HTTP requests by impersonating browsers' TLS fingerprints and headers, with HTTP/3 support.
    • Dynamic Loading (DynamicFetcher): Fetches dynamic websites with full browser automation through Playwright's Chromium and Google Chrome.
    • Anti-bot Bypass (StealthyFetcher): Bypasses various anti-bot systems, such as Cloudflare Turnstile/Interstitial, using advanced stealth capabilities and fingerprint spoofing.
    • Session Management: Provides persistent sessions for cookie and state management through the FetcherSession, StealthySession, and DynamicSession classes.
    • Proxy Rotation: Includes a built-in ProxyRotator that supports cyclic or custom rotation strategies, plus per-request proxy overrides.
    • Async Support: Offers full async support across all fetchers and dedicated async session classes.
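
To make the rotation behavior concrete, here is a minimal sketch of cyclic rotation with an optional custom strategy and a per-request override. The class `CyclicProxyRotator`, its `next_proxy` method, and the proxy URLs are hypothetical; this is not Scrapling's ProxyRotator, whose actual interface is not described in this summary.

```python
from itertools import cycle
from typing import Callable, Optional, Sequence

class CyclicProxyRotator:
    """Hypothetical rotator: hands out proxies in round-robin order by default."""

    def __init__(
        self,
        proxies: Sequence[str],
        strategy: Optional[Callable[[Sequence[str], int], str]] = None,
    ):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._cycle = cycle(self._proxies)
        self._strategy = strategy  # optional custom rule: (proxies, request_no) -> proxy
        self._count = 0

    def next_proxy(self, override: Optional[str] = None) -> str:
        """Return the proxy for the next request; a per-request override wins."""
        if override is not None:
            return override  # does not advance the rotation
        self._count += 1
        if self._strategy is not None:
            return self._strategy(self._proxies, self._count)
        return next(self._cycle)

rotator = CyclicProxyRotator(["http://p1:8080", "http://p2:8080", "http://p3:8080"])
picked = [rotator.next_proxy() for _ in range(4)]
pinned = rotator.next_proxy(override="http://special:3128")
after = rotator.next_proxy()

# A custom strategy (here: always use the last proxy) replaces the cyclic default.
sticky = CyclicProxyRotator(["http://p1:8080", "http://p2:8080"],
                            strategy=lambda proxies, n: proxies[-1])
sticky_picks = [sticky.next_proxy() for _ in range(3)]
print(picked, pinned, after, sticky_picks)
```

Keeping the override outside the rotation state is a design choice of this sketch: a one-off proxy for a single request leaves the round-robin order for later requests unchanged.
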
  3. Adaptive Scraping & AI Integration (core methodology)
    • Smart Element Tracking: This is Scrapling's core adaptive mechanism. After a website redesign, it uses "intelligent similarity algorithms" to relocate previously identified elements. When a CSS selector or XPath is invalidated by a change in page structure, the target element is found again based on the similarity of its structural and content features. For example, an element's text content, its relationships to surrounding elements, and its attributes are analyzed together to identify the most similar candidate in the changed page.
    • Smart Flexible Selection: Supports CSS selectors, XPath selectors, filter-based search, text search, regex search, and more. Methods such as find_all and find_by_text allow flexible element lookup.
    • Find Similar Elements: Automatically locates other elements similar to one already found, which is very useful when scraping repeated structures such as product listings.
    • MCP Server to be used with AI: The built-in MCP (Model Context Protocol) server integrates with AI tools such as Claude and Cursor for AI-assisted web scraping and data extraction. The server uses Scrapling to extract the targeted content before passing it to the AI, which minimizes token usage, reduces cost, and speeds up operations. In effect, it preprocesses scraped data into a form suited to the AI model.
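
The idea behind similarity-based relocation can be sketched in a few lines. The summary does not describe Scrapling's actual algorithm, so the feature set (tag, text, attributes, ancestor path), the equal weighting, and the names `similarity` and `relocate` below are assumptions chosen for illustration; elements are represented as plain dictionaries rather than parsed HTML.

```python
from difflib import SequenceMatcher

def _ratio(a: str, b: str) -> float:
    """String similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()

def similarity(saved: dict, candidate: dict) -> float:
    """Average of four feature scores, each in [0, 1] (equal weights are an assumption)."""
    tag = 1.0 if saved["tag"] == candidate["tag"] else 0.0
    text = _ratio(saved["text"], candidate["text"])
    path = _ratio("/".join(saved["path"]), "/".join(candidate["path"]))
    a, b = set(saved["attrs"].items()), set(candidate["attrs"].items())
    attrs = len(a & b) / len(a | b) if (a | b) else 1.0  # Jaccard overlap of attributes
    return (tag + text + path + attrs) / 4

def relocate(saved: dict, candidates: list) -> dict:
    """Pick the candidate on the changed page that best matches the saved element."""
    return max(candidates, key=lambda c: similarity(saved, c))

# Features recorded when the element was first found with its original selector.
saved = {
    "tag": "span",
    "text": "$19.99",
    "attrs": {"class": "price", "itemprop": "price"},
    "path": ["html", "body", "div", "span"],
}

# Elements on the redesigned page, where the old selector no longer matches.
candidates = [
    {"tag": "h1", "text": "Blue Widget", "attrs": {"class": "title"},
     "path": ["html", "body", "main", "h1"]},
    {"tag": "span", "text": "$21.49",
     "attrs": {"class": "product-price", "itemprop": "price"},
     "path": ["html", "body", "main", "section", "span"]},
    {"tag": "span", "text": "In stock", "attrs": {"class": "stock"},
     "path": ["html", "body", "main", "section", "span"]},
]

best = relocate(saved, candidates)
print(best["text"], best["attrs"])
```

Here the renamed price element wins because it still shares the tag, one attribute, and some text with the saved element, even though its class and position changed.
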
  4. High-Performance & Battle-tested Architecture
    • Offers optimized performance and efficient memory usage, and is presented as faster than other Python scraping libraries. In particular, JSON serialization is stated to be 10x faster than the standard library. 92% test coverage and full type-hint coverage support its reliability.
  5. Developer/Web Scraper Friendly Experience
    • Provides developer conveniences such as an interactive web scraping shell, conversion of curl requests into Scrapling requests, and code-free scraping with the extract command. It includes DOM traversal methods such as parent, sibling, and child navigation, along with enhanced text processing. It also offers a familiar API similar to Scrapy/BeautifulSoup.
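
As a rough illustration of what converting a curl command into request parameters involves, the sketch below parses a small subset of curl flags with the standard library. The function `parse_curl`, the returned dictionary shape, and the handled flags are assumptions for this example; it is not Scrapling's converter and ignores most of curl's options.

```python
import shlex

def parse_curl(command: str) -> dict:
    """Convert a simple curl command line into request parameters (subset of flags only)."""
    tokens = shlex.split(command)  # respects shell-style quoting
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")

    request = {"method": None, "url": None, "headers": {}, "data": None}
    it = iter(tokens[1:])
    for tok in it:
        if tok in ("-X", "--request"):
            request["method"] = next(it).upper()
        elif tok in ("-H", "--header"):
            name, _, value = next(it).partition(":")
            request["headers"][name.strip()] = value.strip()
        elif tok in ("-d", "--data", "--data-raw"):
            request["data"] = next(it)
        elif tok.startswith("-"):
            continue  # unsupported flag; assumed here to take no argument
        else:
            request["url"] = tok

    if request["method"] is None:
        # curl sends POST when a body is given and no method is set, otherwise GET
        request["method"] = "POST" if request["data"] is not None else "GET"
    return request

parsed = parse_curl(
    "curl -H 'Accept: application/json' -d 'q=scrapling' https://example.com/search"
)
print(parsed)
```

The resulting dictionary could then be passed to whichever HTTP client is in use; mapping it onto a specific fetcher's arguments is left out because those signatures are not given in this summary.
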

Scrapling can be installed with pip install scrapling. Additional dependencies for the fetcher, AI, and shell features are available as extras: scrapling[fetchers], scrapling[ai], scrapling[shell], or scrapling[all]. A Docker image is also provided for a ready-to-use environment. Scrapling is provided for educational and research purposes, and users must comply with data scraping and privacy laws.