Quickest/easiest way to parse HTML of a website?
I need to parse the contents of this website and store it in a MySQL database. I'm making a competitor site to that one as the creator never completely finished his, but he has newer game data than I do and won't release it, so I need to collect it manually. Here is an example of the specific type of page I need to parse.
I've done HTML parsing before with PHP and regex, but it was painfully tedious and I would much rather not go through the hassle of that again. I've been procrastinating on finishing my database for months because of this issue. Is there a faster and/or easier way of going about this? Most C-style languages are fine for me (C, C++, Perl, PHP, Python, etc., are all fine, but not C#, Java, or Objective-C).
P.S.: I don't care how dirty the script/program turns out or anything like that, so long as it gets the job done.
Any of the languages you mentioned can do that, as long as you use the correct third-party libraries to help you.
You'll need something that crawls the site. This can be a completely separate program that just downloads the .html files to your computer, after which you run the parser on the local copies. Such tools already exist: wget can download a site recursively (-r or --mirror), while curl fetches individual URLs that you list yourself.
You'll need a parser for the pages. Don't use regular expressions to parse HTML; use an HTML or XML parser (like Perl's HTML::Parser). Then you'll have to convert the resulting data structure into usable data (for example, the first table>tr>td is the monster name, the second td is the race, etc.).
Finally, you'll need to store that data in your database in a way that lets you retrieve it later to serve on your site.
Actually, writing the code won't be the hardest thing, but the mapping on "which item on the page means what and should be stored where and how" will be.
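To make the parse, map, and store steps above a bit more concrete, here is a rough Python sketch using only the standard library. Everything specific in it is an assumption for illustration: the sample markup, the column order (name, race, level), the `monsters` table, and the helper names are all made up, and SQLite stands in for MySQL so the sketch runs without a server. The real page will need its own mapping.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page fragment; the real site's markup will differ.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Race</th><th>Level</th></tr>
  <tr><td>Slime</td><td>Plant</td><td>1</td></tr>
  <tr><td>Goblin</td><td>Demi-Human</td><td>24</td></tr>
</table>
"""


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows only hold <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_monsters(page_source):
    """Map table cells to named fields (assumed order: name, race, level)."""
    parser = TableCellParser()
    parser.feed(page_source)
    parser.close()
    return [
        {"name": r[0], "race": r[1], "level": int(r[2])}
        for r in parser.rows
        if len(r) >= 3
    ]


def store_monsters(conn, monsters):
    """Insert the parsed records; SQLite is a stand-in for MySQL here."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monsters (name TEXT, race TEXT, level INTEGER)"
    )
    conn.executemany(
        "INSERT INTO monsters (name, race, level) VALUES (:name, :race, :level)",
        monsters,
    )
    conn.commit()


monsters = parse_monsters(SAMPLE_HTML)
conn = sqlite3.connect(":memory:")
store_monsters(conn, monsters)
```

The only part that really depends on the target site is `parse_monsters`; the crawling and the database insert stay much the same whatever the pages look like.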
I did that a few months ago, and after some investigation I decided to go with the lxml Python library. See parsing tutorial here. And yes, it's not only for XML parsing; it handles HTML as well.
I like it because it's powerful and easy to use.
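For a feel of what that looks like, here is a small sketch with lxml (a third-party package, so it has to be installed first). The sample markup and the `extract_rows` helper are invented for illustration; the XPath would need adjusting to the real page.

```python
from lxml import html

# Hypothetical page; the real site's markup will differ.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Name</th><th>Race</th><th>Level</th></tr>
  <tr><td>Slime</td><td>Plant</td><td>1</td></tr>
  <tr><td>Goblin</td><td>Demi-Human</td><td>24</td></tr>
</table>
</body></html>
"""


def extract_rows(page_source):
    """Return the text of each <td> cell, one list per table row."""
    tree = html.fromstring(page_source)
    rows = []
    # tr[td] keeps only rows with at least one <td>, skipping header rows.
    for tr in tree.xpath("//table//tr[td]"):
        rows.append([td.text_content().strip() for td in tr.xpath("./td")])
    return rows


rows = extract_rows(SAMPLE_HTML)
```

XPath keeps the "which cell means what" mapping in one short expression, which helps when the page layout is irregular.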
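You can use PHP with simpleHtmlDom to parse HTML; simpleHtmlDom is very easy to use.
Just use MySQL's built-in string functions. There's no need to write code that runs on your computer; make your MySQL server do all the work. For example, this expression extracts the text between the title tags of a page stored in a column named page: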
SUBSTRING(page, INSTR(page, '<title>')+7,(INSTR( page, '</title>'))-(INSTR( page, '<title>')+7) )
UPDATE url2 SET title = SUBSTRING(page, INSTR(page, '<title>')+7,(INSTR( page, '</title>'))-(INSTR( page, '<title>')+7) )
or test it with
SELECT SUBSTRING(page, INSTR(page, '<title>')+7,(INSTR( page, '</title>'))-(INSTR( page, '<title>')+7) ) ,page FROM url2 WHERE url = 'http://en.wikipedia.org/wiki/File:Nag_Nathaiya_festival_in_Varanasi.jpg';