python - Indexing steps in a web crawler
I'm writing a web crawler (a focused web crawler) where:

input: seed URLs
output: a bigger set of seed URLs

    def crawl(seedurl, pageslimit):
        # crawling code ... returns the list of URLs crawled

Now I need an indexing step to store the data and facilitate fast, accurate information retrieval (a search engine).

- My crawler returns a list of URLs. How can I pass them on to the indexing phase? Should I download the content of each page into a text file?
- Are there tools or libraries for the indexing step, or does it have to be done by hand?
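For concreteness, the crawl function above might look something like this minimal breadth-first sketch. Everything here (requests, BeautifulSoup, the queue logic, and all names in the body) is an illustrative assumption, not the asker's actual code:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def crawl(seedurl, pageslimit):
        queue = list(seedurl)       # assumes seedurl is a list of URLs
        crawled = []
        while queue and len(crawled) < pageslimit:
            url = queue.pop(0)
            if url in crawled:
                continue
            try:
                response = requests.get(url, timeout=5)
            except requests.RequestException:
                continue            # skip unreachable pages
            crawled.append(url)
            # queue the outgoing links for later visits
            soup = BeautifulSoup(response.text, "html.parser")
            for link in soup.find_all("a", href=True):
                queue.append(urljoin(url, link["href"]))
        return crawled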
You should use Scrapy for the web-crawling job. I'm going to give an example of how it can be used and what a web index should look like. As for your other question, go check the site out!
Using the XPath expressions provided by Scrapy, you can extract any resource you want, including the whole file.
For example: <h1>Darwin - evolution of exhibition</h1>

The XPath expression: //h1/text()
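Here is a minimal sketch of a Scrapy spider that applies that expression; the spider name and start URL are placeholders, not anything from the original post:

    import scrapy

    class HeadingSpider(scrapy.Spider):
        name = "heading_spider"                 # placeholder name
        start_urls = ["https://example.com/"]   # placeholder seed URL

        def parse(self, response):
            # Apply the XPath expression above to the fetched page.
            heading = response.xpath("//h1/text()").get()
            if heading:
                yield {"heading": heading, "url": response.url}

Running it with scrapy runspider fetches each page in start_urls and yields one heading/URL pair per page.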
Why the h1 text? It can serve as a key in a dictionary, and with a dictionary you can look the files up much more easily. So:
web_index = { 'darwin': 'example.html', 'evolution': 'example.html' }
It's best to keep the web index as a dictionary of key-value pairs you can 'search' by key, not as a list where you have to rely on positional indexes.
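To make that concrete, here is a minimal sketch that builds such a dictionary from (heading, filename) pairs; build_index and the sample data are assumptions for illustration, not part of the original answer:

    from collections import defaultdict

    def build_index(pages):
        # pages: iterable of (heading, filename) pairs,
        # e.g. as collected by the spider above
        web_index = defaultdict(set)
        for heading, filename in pages:
            for word in heading.lower().split():
                if word.isalpha():   # skip punctuation tokens like "-"
                    web_index[word].add(filename)
        return web_index

    pages = [("Darwin - evolution of exhibition", "example.html")]
    web_index = build_index(pages)
    print(web_index["darwin"])      # {'example.html'}
    print(web_index["evolution"])   # {'example.html'}

Mapping each word to a set of filenames (rather than a single one) also handles the case where the same word appears in the headings of several pages.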