How can I tell when my dataset in R is going to be too large?
I am going to be undertaking some logfile analyses in R (unless I can't do it in R), and I understand that my data needs to fit in RAM (unless I use some kind of fix like an interface to a keyval store, maybe?). So I am wondering how to tell ahead of time how much room my data is going to take up in RAM, and whether I will have enough. I know how much RAM I have (not a huge amount - 3GB under XP), and I know how many rows and cols my logfile will end up as and what data types the col entries ought to be (which presumably I need to check as it reads).
How do I put this together into a go/nogo decision for undertaking the analysis in R? (Presumably R needs to be able to have some RAM to do operations, as well as holding the data!) My immediate required output is a bunch of simple summary stats, frequencies, contingencies, etc, and so I could probably write some kind of parser/tabulator that will give me the output I need short term, but I also want to play around with lots of different approaches to this data as a next step, so am looking at feasibility of using R.
I have seen lots of useful advice about large datasets in R here, which I have read and will reread, but for now I would like to understand better how to figure out whether I should (a) go there at all, (b) go there but expect to have to do some extra stuff to make it manageable, or (c) run away before it's too late and do something in some other language/environment (suggestions welcome...!). thanks!
R is well suited for big datasets, either using out-of-the-box solutions like bigmemory or the ff package (especially read.csv.ffdf) or processing your stuff in chunks using your own scripts. In almost all cases a little programming makes processing large datasets (>> memory, say 100 Gb) very possible. Doing this kind of programming yourself takes some time to learn (I don't know your level), but makes you really flexible. If this is your cup of tea, or if you need to run depends on the time you want to invest in learning these skills. But once you have them, they will make your life as a data analyst much easier.
In regard to analyzing logfiles, I know that stats pages generated from Call of Duty 4 (computer multiplayer game) work by parsing the log file iteratively into a database, and then retrieving the statsistics per user from the database. See here for an example of the interface. The iterative (in chunks) approach means that logfile size is (almost) unlimited. However, getting good performance is not trivial.
A lot of the stuff you can do in R, you can do in Python or Matlab, even C++ or Fortran. But only if that tool has out-of-the-box support for what you want, I could see a distinct advantage of that tool over R. For processing large data see the HPC Task view. See also an earlier answer of min for reading a very large text file in chunks. Other related links that might be interesting for you:
- Quickly reading very large tables as dataframes in R
- https://stackoverflow.com/questions/1257021/suitable-functional-language-for-scientific-statistical-computing (discussion includes that to use for large data processing).
- Trimming a huge (3.5 GB) csv file to read into R
- A blog post of mine showing how to estimate the RAM usage of a dataset. Note that this assumes that the data will be stored in a matrix or array, and is just one datatype.
- Log file processing with R
In regard to choosing R or some other tool, I'd say if it's good enough for Google it is good enough for me ;).