How to design a web crawler in Java?

I'm working on a project that needs a web crawler in Java: it should take a user query about a particular news topic, visit different news websites, extract the news content from those pages, and store it in files or a database. I then need to produce a summary of the stored content. I'm new to this field, so I'd appreciate help from anyone with experience in how to do it.

Right now I have code that extracts news content from a single page passed to it manually, but I have no idea how to integrate it into a web crawler so that it can extract content from different pages.

Can anyone point me to good tutorials or Java implementations that I can use or adapt to my needs?

Answers


http://jsoup.org/

Document doc = Jsoup.connect("http://en.wikipedia.org/").get(); // fetch and parse the page
Elements newsHeadlines = doc.select("#mp-itn b a");              // CSS selector for the "In the news" links
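
Since the question is how to go from a single-page extractor to a crawler, here is a minimal breadth-first crawl loop built on the same jsoup calls. This is only a sketch: the seed URL, the MAX_PAGES limit, and the extractNewsContent placeholder are assumptions standing in for your own extraction code, and a real crawler should also respect robots.txt, limit itself to the target news domains, and add delays between requests.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {

    // Hypothetical limit so the example terminates; tune it for your project.
    private static final int MAX_PAGES = 50;

    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();

        // Assumed seed page; in your project this would come from the user's query,
        // e.g. a news site's search or section URL.
        frontier.add("http://en.wikipedia.org/");

        while (!frontier.isEmpty() && visited.size() < MAX_PAGES) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already crawled this URL
            }
            try {
                Document doc = Jsoup.connect(url).get();

                // Reuse your existing single-page extraction code here.
                extractNewsContent(doc);

                // Enqueue outgoing links for later visits (breadth-first order).
                for (Element link : doc.select("a[href]")) {
                    String next = link.absUrl("href"); // resolves relative links
                    if (!next.isEmpty() && !visited.contains(next)) {
                        frontier.add(next);
                    }
                }
            } catch (Exception e) {
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
        }
    }

    // Placeholder for the extraction code the question says already exists;
    // store its output to your files/database and summarize from there.
    private static void extractNewsContent(Document doc) {
        System.out.println(doc.title());
    }
}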
