How to design a web crawler in Java?
I'm working on a project that requires a web crawler in Java. It should take a user query about a particular news subject, visit different news websites, extract the news content from those pages, and store it in files or a database. I need the stored content to produce an overall summary. I'm new to this field, so I'd appreciate help from anyone with experience in this.
Right now I have code that extracts news content from a single page, but the page URL has to be supplied manually. I have no idea how to integrate it into a web crawler so that it extracts content from many different pages.
Can anyone give some good links to tutorials or implementations in Java which I can use or modify according to my needs?
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
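For context, here is a minimal sketch of the kind of crawl loop the single-page extraction would plug into. It is only an illustration under some assumptions: the class name `CrawlerSketch`, the `crawl` method, and the `linkExtractor` hook are hypothetical names, not part of Jsoup or any existing library. The hook stands in for the per-page work; in a real crawler it would fetch the page (for example with `Jsoup.connect(url).get()`), store the extracted news content, and return the absolute URLs of the links found on the page. The sketch uses an in-memory map instead of real websites so that it runs without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

public class CrawlerSketch {

    /**
     * Breadth-first crawl starting at seedUrl, visiting at most maxPages pages.
     * linkExtractor is a hypothetical hook: given a URL it processes the page
     * and returns the outgoing links. Returns the URLs in visit order.
     */
    public static List<String> crawl(String seedUrl,
                                     Function<String, List<String>> linkExtractor,
                                     int maxPages) {
        Queue<String> frontier = new ArrayDeque<>(); // URLs waiting to be visited
        Set<String> seen = new HashSet<>();          // URLs already queued, to avoid loops
        List<String> visited = new ArrayList<>();

        frontier.add(seedUrl);
        seen.add(seedUrl);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url);
            for (String link : linkExtractor.apply(url)) {
                // Set.add returns false if the link was already seen
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" standing in for real pages (assumption for the demo).
        Map<String, List<String>> fakeWeb = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a", "d"),
                "c", List.of("d"),
                "d", List.of());

        List<String> order = crawl("a", url -> fakeWeb.getOrDefault(url, List.of()), 10);
        System.out.println(order); // prints [a, b, c, d]
    }
}
```

A real crawler would also need things this sketch leaves out, such as politeness delays, robots.txt handling, restricting the crawl to the chosen news domains, and filtering pages by the user's query before storing them.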