How to store data crawled from a website

I want to crawl a website and store the content on my computer for later analysis. However, my OS file system has a limit on the number of subdirectories, so mirroring the site's original folder structure is not going to work.


Should I map each URL to a file name so the pages can be stored flat? Or just shove everything into a database like SQLite to avoid the file system limitations?
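To illustrate the first option, here is a minimal sketch of one way to map a URL to a flat file name: hash the URL and use the hex digest as the name. The function names `url_to_filename` and `url_to_sharded_path` are made up for this example, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def url_to_filename(url: str, suffix: str = ".html") -> str:
    """Map a URL to a fixed-length, filesystem-safe file name."""
    # The digest only contains [0-9a-f], so no escaping of '/', '?' etc. is needed.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return digest + suffix


def url_to_sharded_path(url: str, levels: int = 2, width: int = 2) -> str:
    """Optionally spread files over a shallow, bounded set of directories."""
    name = url_to_filename(url)
    # e.g. "ab/cd/abcd...html": at most 256 entries per directory level.
    parts = [name[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(parts + [name])
```

The mapping is one-way, so the original URL still has to be recorded somewhere (for example in an index file or a database table) if it is needed later.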


It all depends on the amount of text and/or the number of web pages you intend to crawl. A generic solution is probably to:

  • use an RDBMS (an SQL server of some sort) to store the metadata associated with the pages. This information goes in a simple table (perhaps with a few supporting/related tables) containing fields such as Url, FileName (where you'll be saving the page), Offset (the position in the file where the page is stored; the idea is to keep several pages in the same file), date of crawl, size, and a few other fields.
  • use flat file storage for the text proper. The file name and path matter little (i.e. the path may be shallow and the name cryptic/automatically generated), since the name/path is stored in the metadata. Several crawled pages are stored in the same flat file, to reduce the overhead the OS incurs when managing too many files. The text itself may be compressed (ZIP etc.) on a per-page basis (there's little extra compression gain to be had by compressing bigger chunks), which allows per-page handling (no need to decompress all the text stored before it!). The decision to use compression depends on various factors; the compression/decompression overhead is typically minimal, CPU-wise, and it offers a nice saving on disk space and, generally, on disk I/O.
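The two bullets above could be sketched roughly as follows. This is only an illustration under some assumptions: SQLite stands in for the RDBMS, zlib stands in for the per-page compression, and the table layout and the names `open_index`, `store_page` and `load_page` are invented for the example.

```python
import os
import sqlite3
import tempfile
import time
import zlib

SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    url         TEXT PRIMARY KEY,
    file_name   TEXT NOT NULL,     -- flat file holding the page
    byte_offset INTEGER NOT NULL,  -- where the page starts in that file
    size        INTEGER NOT NULL,  -- compressed size in bytes
    crawled_at  REAL NOT NULL      -- Unix timestamp of the crawl
)
"""


def open_index(db_path):
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def store_page(conn, data_file, url, html):
    """Append one compressed page to the flat file and index it."""
    blob = zlib.compress(html.encode("utf-8"))  # compressed per page
    with open(data_file, "ab") as f:
        f.seek(0, os.SEEK_END)
        offset = f.tell()
        f.write(blob)
    conn.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?, ?)",
        (url, data_file, offset, len(blob), time.time()),
    )
    conn.commit()


def load_page(conn, url):
    """Read back a single page without touching the pages stored before it."""
    row = conn.execute(
        "SELECT file_name, byte_offset, size FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return None
    file_name, offset, size = row
    with open(file_name, "rb") as f:
        f.seek(offset)
        return zlib.decompress(f.read(size)).decode("utf-8")


# Small demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    conn = open_index(os.path.join(tmp, "index.db"))
    data_file = os.path.join(tmp, "pages-0001.dat")
    store_page(conn, data_file, "http://example.com/a", "<html>A</html>")
    store_page(conn, data_file, "http://example.com/b", "<html>B</html>")
    print(load_page(conn, "http://example.com/b"))
    conn.close()
```

In a real crawl you would probably start a new data file once the current one reaches some size limit; re-crawling a URL here simply appends a new copy and repoints the index, leaving the old bytes in place.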

The advantage of this approach is that the database remains small, but is available for SQL-driven queries (of an ad-hoc or programmed nature) to search on various criteria. There is typically little gain (and a lot of headache) associated with storing many/big files within the SQL server itself. Furthermore, as each page gets processed/analyzed, additional metadata (such as, say, title, language, the 5 most repeated words, whatever) can be added to the database.
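As a rough illustration of adding metadata after the fact and then querying on it, the sketch below uses SQLite with an in-memory database purely so the demo is self-contained. The `pages` table, its columns, the sample rows and the helper `add_column_if_missing` are all assumptions made for this example.

```python
import sqlite3


def add_column_if_missing(conn, table, column, decl):
    """Add a column to an existing table unless it is already there."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, size INTEGER)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [("http://example.com/", 5120), ("http://example.com/fr/", 2048)],
)

# Later, once pages have been analyzed, extend the table with the new fields.
add_column_if_missing(conn, "pages", "title", "TEXT")
add_column_if_missing(conn, "pages", "language", "TEXT")
conn.execute(
    "UPDATE pages SET title = ?, language = ? WHERE url = ?",
    ("Example Domain", "en", "http://example.com/"),
)
conn.execute(
    "UPDATE pages SET title = ?, language = ? WHERE url = ?",
    ("Exemple", "fr", "http://example.com/fr/"),
)

# Ad-hoc query on the derived metadata.
english = conn.execute(
    "SELECT url, title FROM pages WHERE language = ? ORDER BY size DESC",
    ("en",),
).fetchall()
print(english)
```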

Having it in a database will help you search through the content and the page metadata. You can also try in-memory databases or "memcached"-like storage to speed it up.

Depending on the processing power of the PC that will do the data mining, you could add the scraped data to a compressed archive such as a 7z file, a zip file, or a compressed tarball. You'll be able to keep the directory structure intact and may end up saving a great deal of disk space, if that happens to be a concern.
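A minimal sketch of the archive idea, using Python's standard `zipfile` module: each page becomes an archive member whose name mirrors the URL path, so the site's hierarchy lives inside the archive rather than on the file system. The helper names and the convention of mapping a trailing slash to `index.html` are assumptions for this example, and query strings are simply ignored.

```python
import zipfile
from urllib.parse import urlsplit


def url_to_member_name(url):
    """Turn a URL into an archive member name that mirrors the site layout."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # assumed convention for directory-style URLs
    return parts.netloc + path


def add_page(archive_path, url, html):
    # Mode "a" appends to an existing archive or creates a new one.
    with zipfile.ZipFile(archive_path, "a", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(url_to_member_name(url), html)


def read_page(archive_path, url):
    with zipfile.ZipFile(archive_path) as zf:
        return zf.read(url_to_member_name(url)).decode("utf-8")
```

Note that adding the same URL twice would create a duplicate member in the archive, so a real crawler would want to check for that first.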

On the other hand, an RDBMS like SQLite will balloon out really fast but won't mind ridiculously long directory hierarchies.
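For comparison, storing the pages directly in SQLite might look something like the sketch below, where the full URL is the key and so the depth of the original hierarchy does not matter. The table layout and the function names are invented for this example; compressing each body with zlib is an addition here, meant only to slow down the growth of the database file.

```python
import sqlite3
import zlib


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB NOT NULL)"
    )
    return conn


def put_page(conn, url, html):
    """Insert or overwrite the stored copy of a page."""
    body = zlib.compress(html.encode("utf-8"))
    conn.execute("INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, body))
    conn.commit()


def get_page(conn, url):
    row = conn.execute("SELECT body FROM pages WHERE url = ?", (url,)).fetchone()
    return None if row is None else zlib.decompress(row[0]).decode("utf-8")
```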
