Quickest/easiest way to parse HTML of a website?

I need to parse the contents of this website and store it in a MySQL database. I'm making a competitor site to that one as the creator never completely finished his, but he has newer game data than I do and won't release it, so I need to collect it manually. Here is an example of the specific type of page I need to parse.

I've done HTML parsing before with PHP and regex, but it was painfully tedious and I would much rather not go through the hassle of that again. I've been procrastinating on finishing my database for months because of this issue. Is there a faster and/or easier way of going about this? Most C-style languages are fine for me (C, C++, Perl, PHP, Python, etc., are all fine, but not C#, Java, or Objective-C).

P.S.: I don't care how dirty the script/program turns out or anything like that, so long as it gets the job done.

Answers


Any of the languages you mentioned can do that, as long as you use the correct third-party libraries to help you.

You'll need something that crawls the site. Actually, this could be a completely separate program that just downloads the .html files to your computer, and you then run the parser on those local copies. Such tools already exist: consider wget, which has recursive download and mirroring options, or curl for fetching individual pages.
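If you would rather keep the download step in a script than use a command-line tool, a minimal sketch using only the Python standard library could look like this. The function names, the `page_NNNN.html` file naming, and the one-second delay are illustrative assumptions, not something the site requires.

```python
import time
import urllib.request
from pathlib import Path


def fetch_bytes(url):
    """Download one URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def save_pages(urls, out_dir, fetch=fetch_bytes, delay=1.0):
    """Save each URL as out_dir/page_NNNN.html and return the written paths.

    `fetch` is injectable so the function can be tried out offline;
    the file naming scheme is hypothetical.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, url in enumerate(urls, start=1):
        path = out / f"page_{i:04d}.html"
        path.write_bytes(fetch(url))
        paths.append(path)
        time.sleep(delay)  # pause between requests to go easy on the server
    return paths
```

You would call `save_pages(list_of_urls, "pages")` once, then point the parser at the files in `pages/` as often as you like without hitting the site again.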

You'll need a parser for the site. Don't use regular expressions to parse HTML; use an HTML or XML parser (like Perl's HTML::Parser). Then you'll have to convert the resulting data structure into usable data (for example, the first table>tr>td is the monster name, the second td is the race, etc.).
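To make the "first td is the name, second td is the race" mapping concrete, here is a rough sketch using Python's standard-library `html.parser` rather than the Perl module mentioned above. The class name, the `parse_monsters` helper, and the two-column layout are assumptions for illustration; the real page will need its own mapping, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the <tr> currently open, if any
        self._cell = None  # text fragments of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_monsters(html):
    """Map each table row to a dict; the column order is an assumption."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    fields = ("name", "race")
    # Rows with too few <td> cells (e.g. header rows made of <th>) are skipped.
    return [dict(zip(fields, row)) for row in parser.rows if len(row) >= len(fields)]
```

Keeping the column-to-field mapping in one tuple means that when the page layout turns out to be different, only `fields` has to change.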

Finally, you'll need to store the data in your database in a way that lets you retrieve it later to serve your site.
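The storage step might look roughly like the sketch below. Note the loud assumption: it uses Python's built-in `sqlite3` as a stand-in so the example runs without a server, whereas the question asks for MySQL. The table name, the columns, and the input format (a list of dicts with `name` and `race` keys) are hypothetical.

```python
import sqlite3


def store_monsters(conn, monsters):
    """Insert or overwrite one row per monster dict (keys: name, race)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monsters ("
        "name TEXT PRIMARY KEY, race TEXT NOT NULL)"
    )
    # Named placeholders keep the values out of the SQL string itself.
    conn.executemany(
        "INSERT OR REPLACE INTO monsters (name, race) VALUES (:name, :race)",
        monsters,
    )
    conn.commit()
```

With a MySQL driver the shape stays the same, but the details differ: the common Python drivers use `%s`-style placeholders, and the upsert would be written as `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE` instead of SQLite's `INSERT OR REPLACE`. Re-running the script after a fresh crawl then updates existing rows rather than duplicating them.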

Actually, writing the code won't be the hardest part; working out the mapping ("which item on the page means what, and where and how should it be stored") will be.


I did this a few months ago, and after some investigation I decided to go with the lxml Python library. See parsing tutorial here. And yes, it's not only for XML parsing; it handles HTML as well.

I like it because it's powerful and easy to use.


You can use PHP with simpleHtmlDom to parse HTML; simpleHtmlDom is very easy to use.

http://simplehtmldom.sourceforge.net/manual.htm


I used http://htmlagilitypack.codeplex.com/ and http://code.google.com/p/fizzler/ to parse HTML and grab necessary information. It works very well.


Just use MySQL's built-in string functions. There's no need to write code that runs on your computer; make your MySQL server do all the work.

SUBSTRING(page, INSTR(page, '<title>')+7,(INSTR( page, '</title>'))-(INSTR( page, '<title>')+7) )

Example:

UPDATE url2 SET title = SUBSTRING(page, INSTR(page, '<title>')+7,(INSTR( page, '</title>'))-(INSTR( page, '<title>')+7) )

Or test it with:

SELECT SUBSTRING(page, INSTR(page, '<title>')+7,(INSTR( page, '</title>'))-(INSTR( page, '<title>')+7) ) ,page FROM url2 WHERE url = 'http://en.wikipedia.org/wiki/File:Nag_Nathaiya_festival_in_Varanasi.jpg';
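For anyone puzzling over the arithmetic in that expression: the 7 is the length of the string '<title>', so INSTR(...)+7 is the position of the first character after the opening tag, and the third argument is the number of characters up to the closing tag. The same logic, written as a small Python sketch (the function name is made up, and unlike the SQL it returns None when a tag is missing), would be:

```python
def extract_between(page, open_tag="<title>", close_tag="</title>"):
    """Return the text between the first open_tag and the next close_tag."""
    start = page.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)  # skip past the opening tag, like the +7 in the SQL
    end = page.find(close_tag, start)
    if end == -1:
        return None
    return page[start:end]
```

This only works for a tag that appears once with no attributes, which is why it suits `<title>` but not repeated elements such as table cells.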

