How to Store Entire WebPages for Later Parsing?
I've been doing a lot of parsing of webpages lately and my process usually looks something like this:
- Obtain list of links to Parse
- Import list into database
- Download Entire Webpage for each link and store into mysql
- Add Index for each scraping session
- Scrape relevant sections (content, metas, whatever)
- Steps 4,5 -- Rinse/Repeat -- as it is common to want to scrape diff. content from the same page later on or modify your xpath or scrub said content or whatever.
- Export Scraping Database to Real Database and Remove Webpage column and scraping indexes
Now, the easiest answer is of course do the scraping at the same time that you are downloading the webpage but I don't think this lends itself to modular design very well as I'd like to be able to grow this process a bit more.
Let me give you some examples of the problems I keep running into: For 50k pages (rows) I have around a 6gig database. Remember, we are storing the ENTIRE webpage into one column and extracting relevant data from it and storing that into a different column.
Throwing an index on the table can take 7-10 minutes on a quad core with 6 gig of ram. God forbid you screw up on something and watch mysqld jump to 70% cpu and ALL of your ram. This is wnhy I have step 4 -- every operation I do I'll throw an index on the column before I do it -- so if I want to grab metas I'd throw an index on say the title column and then update each row where title is not null.
I should state that I do NOT do all rows in one go -- that tends to really screw me over bad -- as it should -- you are loading 6gig into memory. ;)
What I suppose the solution to this problem is -- grab a total count and then iterate through an offset of a 100 or so at a time.
Still -- I think there are some storage problems here as well. Should I be storing the original webpages on the filesystem instead? I have thought about storing the pages into a document oriented database such as mongo or couch.
EDIT Just to be clear here -- any solution presented should take into account the fact that 50k pages is just ONE BATCH by ONE USER. I'm not trying to have multiple users quite yet but I would like the ability to store more than a couple of batches at a time.
Why don't you add the index to the table BEFORE inserting your data? This way the index is built as the rows are added to the table.
If you have more hardware to throw at the problem, you can start distributing your database over multiple servers using via sharding.
You could use an existing web crawler such as wget or one of the many others. This can download the files to the hard disk and then you can parse the files afterwards and store information about them in the database.
Thanks for helping me think this out everyone!
I'm going to try a hybrid approach here:
1) Pull down pages to a tree structure on the filesystem.
2) Put content into a generic content table that does not contain any full webpage (this means that our average 63k column is now maybe a 1/10th of a k.
1) My tree structure for housing the webpages will look like this:
-- usr_id1k | |-- user1 | | |-- job1 | | | |-- pg_id1k | | | | |-- p1 | | | | |-- p2 | | | | `-- p3 | | | |-- pg_id2k | | | `-- pg_id3k | | |-- job2 | | `-- job3 | |-- user2 | `-- user3 |-- usr_id2k `-- usr_id3k
2) Instead of creating a table for each 'job' and then exporting it we'll have a couple different tables -- the primary one being a 'content' table.
content_type, Integer # fkey to content_types table user_id, Integer # fkey to users table content, Text # actual content, no full webpages
.... other stuff like created_at, updated_at, perms, etc...