python - Indexing steps in a web crawler


I am writing a web crawler (a focused web crawler) where:
input: seed URLs
output: a bigger list of seed URLs

    def crawl(seedurl, pageslimit):
        crawled_urls = []
        # crawling code ...
        return crawled_urls  # list of URLs crawled

Now I need to index and store the data to facilitate fast and accurate information retrieval (a search engine).

  1. My crawler returns a list of URLs. How can I pass them to the indexing phase? Should I download the content of each page into a text file?
  2. Are there tools or libraries for the indexing step, or does it have to be done manually?

You should use Scrapy for the job of web crawling. I'm going to give an example of how it can be used and what the web index should look like. For your other question, go check the site out!

Using the XPath expressions provided by Scrapy, you can extract the resources you want, including the whole file.

For example: <h1>darwin - evolution of exhibition</h1>

The XPath expression: //h1/text()
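
To make the extraction step concrete, here is a minimal sketch that pulls the h1 text out of the example markup. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without Scrapy installed; that substitution and the sample document are assumptions, not part of the original answer. ElementTree only supports a subset of XPath, so `.//h1` plus `.text` plays the role of `//h1/text()`, and it only works on well-formed markup. Inside a Scrapy callback the equivalent would be `response.xpath('//h1/text()').get()`.

```python
import xml.etree.ElementTree as ET

# Sample page (assumed for illustration); it must be well-formed for ElementTree.
html = "<html><body><h1>darwin - evolution of exhibition</h1></body></html>"

root = ET.fromstring(html)

# './/h1' selects every h1 below the root; .text is the element's text node.
headings = [h1.text for h1 in root.findall(".//h1")]

print(headings)
```
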

Why this? The text of the h1 tag is something you can turn into keys in a dictionary. Once you have a dictionary, you can access the files more easily. So:

    web_index = {
        'darwin': 'example.html',
        'evolution': 'example.html',
    }
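
A dictionary like this could be built up from the extracted headings along the following lines. This is only a sketch under some assumptions: the helper name `add_to_index`, the stop-word list, and the second page `bio.html` are all hypothetical, and each keyword maps to a set of files rather than a single file name so that several pages can share a keyword.

```python
import re

# Hypothetical stop-word list; which words to skip is an assumption.
STOP_WORDS = {"of", "the", "a", "an", "and"}

def add_to_index(web_index, heading, filename):
    """Point every keyword found in `heading` at `filename`."""
    for word in re.findall(r"[a-z0-9]+", heading.lower()):
        if word not in STOP_WORDS:
            # setdefault creates the set the first time a keyword is seen.
            web_index.setdefault(word, set()).add(filename)
    return web_index

web_index = {}
add_to_index(web_index, "darwin - evolution of exhibition", "example.html")
add_to_index(web_index, "darwin biography", "bio.html")  # hypothetical page
```
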

It's best to keep the web index in a dictionary, as key-value pairs that you can 'search' from, not in a list where you have to rely on a positional index.
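
As a rough illustration of why the dictionary helps, a lookup by keyword needs no scan over a list. The sketch below assumes an index whose values are sets of file names (a variation on the single-file example above), and the `search` function with its "every word must match" behaviour is a hypothetical design choice, not something prescribed by Scrapy.

```python
# Assumed index shape: keyword -> set of files containing it.
web_index = {
    "darwin": {"example.html", "bio.html"},
    "evolution": {"example.html"},
}

def search(index, query):
    """Return the files that contain every word of `query`."""
    words = query.lower().split()
    if not words:
        return set()
    results = None
    for word in words:
        pages = index.get(word, set())  # unknown words match nothing
        results = set(pages) if results is None else results & pages
    return results

print(search(web_index, "Darwin evolution"))
```
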

