Nutch - how to crawl in small batches?
I can't get Nutch to crawl in small batches. I start it with the bin/nutch crawl command, passing -depth 7 and -topN 10000, and it never finishes on its own: it only stops when my hard drive runs out of space. What I need to do:
1. Start crawling my seeds, with the ability to follow outlinks further.
2. Crawl 20000 pages, then index them.
3. Crawl another 20000 pages, index them, and merge the result with the first index.
4. Repeat step 3 n times.
I also tried the scripts found on the wiki, but none of them resume a crawl. If I run them again, they start over from the beginning, and at the end of the script I have the same index I had when I started. I need to continue my crawl instead.
You have to understand the Nutch generate/fetch/update cycles.
The generate step of the cycle takes URLs from the crawl db (you can cap how many with the topN parameter) and creates a new segment. Initially, the crawl db contains only the seed URLs.
The fetch step does the actual crawling; the content of the fetched pages is stored in the segment.
Finally, the update step updates the crawl db with the results of the fetch (adding newly discovered URLs, and recording the last fetch time and the HTTP status code for each fetched URL, etc.).
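The three steps above map directly to individual Nutch commands. Here is a minimal sketch of one cycle, assuming a Nutch 1.x layout; the crawl/ paths and the bin/nutch location are assumptions, so adjust them to your install:

```shell
# One explicit generate/fetch/update cycle. NUTCH and the crawl/ paths are
# illustrative assumptions; override NUTCH to dry-run the commands.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB=crawl/crawldb      # db of all known urls and their fetch status
SEGMENTS=crawl/segments    # each generate call creates one new segment here

one_cycle() {
  # generate: select up to topN urls from the crawl db into a new segment
  "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN 20000
  # the newest directory under segments/ is the segment just generated
  segment=$(ls -d "$SEGMENTS"/* 2>/dev/null | tail -1)
  # fetch: download the pages listed in that segment
  "$NUTCH" fetch "$segment"
  # updatedb: fold fetch results (new outlinks, statuses) back into the crawl db
  "$NUTCH" updatedb "$CRAWLDB" "$segment"
}
```

Running one_cycle once per batch gives you exactly one generate/fetch/update round, instead of letting the crawl tool loop on its own.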
The crawl tool will run this cycle n times, configurable with the depth parameter.
After all cycles are complete, the crawl tool deletes all indexes in the folder from which it was launched and creates a new one from all the segments and the crawl db.
So in order to do what you are asking, you should probably not use the crawl tool but instead call the individual Nutch commands, which is what the crawl tool does behind the scenes. That way you can control how many cycles you run and make sure the indexes are merged rather than deleted at each iteration.
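Concretely, the batch loop might look like the following sketch. The command names follow the Lucene-era Nutch CLI (invertlinks, index, dedup, merge), but exact signatures vary between versions, so treat every path and flag here as an assumption to check against your install:

```shell
# Hypothetical batch driver: run n generate/fetch/update cycles, index each
# new segment into its own per-batch index, and merge all per-batch indexes
# into a cumulative one instead of deleting anything. All paths are assumptions.
NUTCH="${NUTCH:-bin/nutch}"

crawl_batches() {
  n=$1
  i=0
  while [ "$i" -lt "$n" ]; do
    "$NUTCH" generate crawl/crawldb crawl/segments -topN 20000
    segment=$(ls -d crawl/segments/* 2>/dev/null | tail -1)
    "$NUTCH" fetch "$segment"
    "$NUTCH" updatedb crawl/crawldb "$segment"
    # invert links for the new segment, then index just that segment
    "$NUTCH" invertlinks crawl/linkdb "$segment"
    "$NUTCH" index "crawl/indexes-$i" crawl/crawldb crawl/linkdb "$segment"
    # drop duplicate documents across the per-batch indexes
    "$NUTCH" dedup crawl/indexes-*
    # merge everything indexed so far into a fresh cumulative index
    "$NUTCH" merge "crawl/index-$i" crawl/indexes-*
    i=$((i + 1))
  done
}
```

Because the crawl db persists between calls, each new batch continues from where the previous one stopped; after batch i, the cumulative index lives in crawl/index-$i.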
I suggest you start with the script defined here and adapt it to your needs.