scrapy - Scraping large amounts of heterogeneous data into structured datasets
I have been evaluating the science of web scraping. The framework I use is Python/Scrapy; I am sure there are many more. My question is more about the basics. Suppose I have to scrape news content. I crawl a page, then write selectors to extract the content, images, author, published date, sub-description, comments, etc. Writing that code is no big deal.
The question is how I can make this scale to a large number of data sources. For instance, there may be thousands of news sites, each with its own HTML/page structure, so I will inevitably need to write scraping logic for each one of them. Although that is possible, it would require a big team working for a long time to create and update these crawlers/scrapers.
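To make the per-site cost concrete, here is a minimal sketch of what I mean by writing scraping logic for each source. It is not Scrapy code: it uses only the standard library's `xml.etree.ElementTree`, so it assumes well-formed markup, which real news pages usually are not. The site names, the selector table `SITE_SELECTORS`, the `extract` helper, and both sample pages are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site selector tables: every new source needs its own entry,
# and every redesign of a site means updating that entry.
SITE_SELECTORS = {
    "site-a.example": {
        "title": ".//h1[@class='headline']",
        "author": ".//span[@class='byline']",
        "published": ".//time",
    },
    "site-b.example": {
        "title": ".//div[@id='story']/h2",
        "author": ".//p[@class='writer']",
        "published": ".//span[@class='date']",
    },
}


def extract(site, markup):
    """Apply the selectors registered for `site` to a well-formed page."""
    root = ET.fromstring(markup)
    record = {}
    for field, path in SITE_SELECTORS[site].items():
        node = root.find(path)
        # A missing node becomes None instead of raising, since layouts drift.
        record[field] = "".join(node.itertext()).strip() if node is not None else None
    return record


# Two made-up pages carrying the same article in different layouts.
PAGE_A = """<html><body>
<h1 class="headline">Markets rally</h1>
<span class="byline">Jane Doe</span>
<time>2014-03-01</time>
</body></html>"""

PAGE_B = """<html><body><div id="story">
<h2>Markets rally</h2>
<p class="writer">Jane Doe</p>
<span class="date">2014-03-01</span>
</div></body></html>"""

print(extract("site-a.example", PAGE_A))
print(extract("site-b.example", PAGE_B))
```

Both calls produce the same structured record, but only because someone wrote and maintains a selector table for each layout. With thousands of sources, that table is the part that does not scale.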
Is there an easy way to do this? Can I somehow ease the process of creating a different scraper for each and every data source (website)?
How do sites like Recorded Future do it? Do they have a big team working around the clock, given that they claim to extract data from 250,000+ distinct sources?
Looking forward to enlightening answers.
Thanks, Abhi