Nutch - how to crawl in small batches?

I can't get Nutch to crawl in small batches. I start it with the bin/nutch crawl command, using the parameters -depth 7 and -topN 10000, and it never finishes. It only stops when my HDD runs out of space. What I need to do:

  1. Start crawling from my seeds, with the possibility of following outlinks further.
  2. Crawl 20000 pages, then index them.
  3. Crawl another 20000 pages, index them and merge the result with the first index.
  4. Repeat step 3 n times.

I also tried the scripts found on the wiki, but none of them go any further. If I run them again, they do everything from the beginning, and at the end of the script I have the same index I had when I started the crawl. What I need is to continue my crawl.

Answers


You have to understand the Nutch generate/fetch/update cycles.

The generate step of the cycle takes URLs from the crawl db (you can set a maximum number with the topN parameter) and generates a new segment. Initially, the crawl db contains only the seed URLs.

The fetch step does the actual crawling. The content of the fetched pages is stored in the segment.

Finally, the update step updates the crawl db with the results of the fetch (adds new URLs, sets the last fetch time for a URL, sets the HTTP status code of the fetch for a URL, etc).
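To make the three steps concrete, here is a minimal sketch of one cycle as a shell function. It assumes a Nutch 1.x style layout with crawldb/ and segments/ under one crawl directory. The names NUTCH and run_cycle are made up for illustration, and depending on your Nutch version and fetcher settings you may also need a separate parse step between fetch and updatedb.

```shell
#!/bin/sh
# Sketch of one generate/fetch/update cycle (assumed Nutch 1.x layout).
# NUTCH and run_cycle are illustrative names, not part of Nutch itself.
NUTCH="${NUTCH:-bin/nutch}"

run_cycle() {
    crawl_dir="$1"   # directory holding crawldb/ and segments/
    top_n="$2"       # maximum number of URLs to select in this cycle

    # generate: select up to top_n URLs from the crawl db into a new segment
    $NUTCH generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN "$top_n" || return 1

    # segment names are timestamps, so the last one listed is the newest
    segment=$(ls -d "$crawl_dir"/segments/* | tail -1)

    # fetch: download the selected pages into that segment
    $NUTCH fetch "$segment" || return 1

    # update: write fetch results and newly found URLs back to the crawl db
    $NUTCH updatedb "$crawl_dir/crawldb" "$segment"
}
```

Calling run_cycle crawl 20000 would then run a single cycle limited to 20000 URLs, provided the crawl db has already been filled with the seeds.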

The crawl tool runs this cycle n times, where n is set with the depth parameter. With -depth 7 and -topN 10000, a single run can therefore fetch up to 70000 pages.

After all cycles are complete, the crawl tool deletes all indexes in the folder from which it was launched and creates a new one from all the segments and the crawl db.

So in order to do what you are asking, you should probably not use the crawl tool but instead call the individual Nutch commands, which is what the crawl tool does behind the scenes. That way, you can control how many times you crawl and also make sure that the indexes are always merged rather than deleted at each iteration.
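As a rough sketch of that approach, the function below injects the seeds only when no crawl db exists yet, so a second run continues the crawl instead of starting over. The names NUTCH, crawl_in_batches and index_batch are made up for illustration. The indexing step is left as a placeholder hook because the indexing commands differ between Nutch versions (older 1.x releases have invertlinks, index and merge commands, but check the usage output of your own version).

```shell
#!/bin/sh
# Sketch: crawl in batches while keeping the crawl db and segments.
# NUTCH, crawl_in_batches and index_batch are illustrative names.
NUTCH="${NUTCH:-bin/nutch}"

# Hypothetical hook: replace the body with the indexing and index-merging
# commands of your Nutch version. It only prints a message here.
index_batch() {
    echo "index segment $2 of $1 here"
}

crawl_in_batches() {
    crawl_dir="$1"    # keeps crawldb/ and segments/ between runs
    seed_dir="$2"     # directory containing the seed URL list
    batches="$3"      # number of generate/fetch/update cycles to run
    batch_size="$4"   # pages per batch, passed to -topN

    # inject the seeds only once; later runs reuse the existing crawl db
    if [ ! -d "$crawl_dir/crawldb" ]; then
        $NUTCH inject "$crawl_dir/crawldb" "$seed_dir" || return 1
    fi

    i=1
    while [ "$i" -le "$batches" ]; do
        # stop early if generate fails, e.g. because nothing is left to fetch
        $NUTCH generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN "$batch_size" || break
        segment=$(ls -d "$crawl_dir"/segments/* | tail -1)
        $NUTCH fetch "$segment" || return 1
        $NUTCH updatedb "$crawl_dir/crawldb" "$segment" || return 1
        index_batch "$crawl_dir" "$segment" || return 1
        i=$((i + 1))
    done
}
```

For the numbers in the question, crawl_in_batches crawl urls 5 20000 would run five batches of at most 20000 pages each, where crawl and urls are example directory names. Running it again later picks up from the existing crawl db.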

I suggest you start with the script defined here and adapt it to your needs.

