web crawler - Scraping websites for research projects: some issues


I am scraping websites for a research project and running into issues that I think might be relevant to many users. Given a well-defined topic (e.g. bird-watching or astrophysics), my objectives are:

  • identify the most important websites that spread these ideas
  • crawl a representative sample of these websites
  • perform network analysis and thematic analysis of the data (e.g. topic models)
  • publish the results in an academic venue, without publishing the crawled data itself

In trying to achieve this goal, I am finding the following obstacles:

  • Sampling method: obviously, it is impossible to establish the boundaries of the set of sites of interest. Since I cannot know the size of the full dataset, how can I establish the representativeness of a sample? I could crawl 10K, 1M, or 10M pages without knowing when I should stop.
  • Detection/banning issue: my crawler, based on Scrapy, follows robots.txt and tries not to hammer servers by introducing increasing delays between requests, starting at 25 ms. However, a lot of servers still detect the crawler and block it. In that sense, the sampling process is biased by which servers cut me off.
  • Legal issues: this is a gray area, but I feel that if I don't publish the actual pages I should be on the safe side. What precautions can I take to avoid upsetting people, especially if the results of the study annoy the website owners?
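On the stopping question, one heuristic (an illustrative sketch, not an established standard) is to track how quickly the crawl keeps discovering *new* domains: when the fraction of previously unseen domains per batch of fetched URLs drops below some threshold, the crawl may be approaching saturation of the topical community. The function name and threshold here are hypothetical:

```python
from urllib.parse import urlparse

def discovery_rate(seen_domains: set, batch_urls: list) -> float:
    """Fraction of URLs in this batch whose domain has not been seen before.

    Mutates seen_domains so it can be called batch after batch.
    A sustained low value suggests the crawl is saturating.
    """
    new = 0
    for url in batch_urls:
        domain = urlparse(url).netloc
        if domain not in seen_domains:
            seen_domains.add(domain)
            new += 1
    return new / len(batch_urls) if batch_urls else 0.0

# Hypothetical usage: stop when the rate stays below 1% for several batches.
seen = set()
rate = discovery_rate(seen, ["http://a.example/p1", "http://b.example/p2"])
```

This does not solve representativeness (the frontier is still biased by the seed set and link structure), but it at least turns "when should I stop?" into a measurable, reportable criterion.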
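On the banning issue, a 25 ms starting delay is aggressive by most sites' standards, and bare `DOWNLOAD_DELAY` tuning is often less effective than Scrapy's built-in AutoThrottle extension, which adapts delays to observed server latency. A minimal `settings.py` sketch along those lines (the bot name and contact address are placeholders you would replace):

```python
# settings.py sketch for a polite research crawler (values are suggestions,
# not the poster's actual configuration).
BOT_NAME = "research_crawler"

# Honor robots.txt, as the poster already does.
ROBOTSTXT_OBEY = True

# Identify the crawler and give site owners a way to reach you.
USER_AGENT = "research-crawler (+mailto:contact@example.org)"  # placeholder contact

# Base politeness: one request per domain at a time, with a real delay.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1.0  # seconds, far gentler than 25 ms

# Let Scrapy adapt the delay to each server's responsiveness.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

An identifiable User-Agent with a contact address also helps with the legal/ethical concern: site owners can opt out by emailing you instead of silently banning the crawler, which in turn reduces the sampling bias from bans.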

It would be nice to outline a methodology for researchers, because I am sure many people run into these problems once they start crawling a non-trivial number of pages.

Thanks for any suggestions!

