Pandas equivalent of Python's readlines function

With Python's readlines() function I can retrieve a list of each line in a file:

with open('dat.csv', 'r') as dat:
    lines = dat.readlines()

I am working with a very large file, and this method raises a memory error. Is there a pandas equivalent to Python's readlines() function? The chunksize option of pd.read_csv() seems to add numbers to my lines, which is far from ideal.

Minimal example:

In [1]: lines = []

In [2]: for df in pd.read_csv('s.csv', chunksize=100):
   ...:     lines.append(df)
   ...:     

In [3]: lines
Out[3]: 
[   hello here is a line
 0  here is another line
 1  here is my last line]

In [4]: with open('s.csv', 'r') as dat:
   ...:     lines = dat.readlines()
   ...:     

In [5]: lines
Out[5]: ['hello here is a line\n', 'here is another line\n', 'here is my last line\n']

In [6]: cat s.csv
hello here is a line
here is another line
here is my last line

Answers


You should try to use the chunksize option of pd.read_csv(), as mentioned in some of the comments.

This makes pd.read_csv() read a fixed number of lines at a time instead of trying to read the entire file in one go. It would look like this:

>>> reader = pd.read_csv(filepath, chunksize=1, header=None, encoding='utf-8')

With chunksize=1 the file is read one line at a time as the result is iterated. Passing header=None stops pandas from treating the first line as column names.

Note that, according to the documentation of pandas.read_csv, this call does not return a pandas.DataFrame but a TextFileReader object:

  • chunksize : int, default None

Return TextFileReader object for iteration. See IO Tools docs for more information on iterator and chunksize.
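
As a quick check of what the documentation describes, here is a minimal sketch that builds a reader from an in-memory sample and records the shape of each chunk. The sample text and chunksize=2 are made up for illustration; each item the reader yields is an ordinary DataFrame holding at most chunksize rows.

```python
import io

import pandas as pd

# Hypothetical three-line sample standing in for a real file.
sample = io.StringIO("first line\nsecond line\nthird line\n")

# With chunksize set, read_csv returns an iterable reader, not a DataFrame.
reader = pd.read_csv(sample, header=None, chunksize=2)
reader_type_name = type(reader).__name__

chunk_shapes = []
for chunk in reader:
    # Every chunk is a regular DataFrame with up to `chunksize` rows.
    assert isinstance(chunk, pd.DataFrame)
    chunk_shapes.append(chunk.shape)

print(reader_type_name)
print(chunk_shapes)
```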

Therefore, to get the lines themselves, you need to iterate over the reader in a loop like this:

In [385]: cat data_sample.tsv
This is a new line
This is another line of text
And this is the last line of text in this file

In [386]: lines = []

In [387]: for line in pd.read_csv('./data_sample.tsv', encoding='utf-8', header=None, chunksize=1):
   .....:     lines.append(line.iloc[0, 0])
   .....:     

In [388]: print(lines)
['This is a new line', 'This is another line of text', 'And this is the last line of text in this file']
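
Reading one row per chunk works, but it builds one DataFrame per line, which can be slow on a very large file. A possible variation, sketched here under a few assumptions, reads larger chunks and yields the lines one by one. The helper name iter_lines, the default chunk size of 10,000 and the separator "\x01" are illustrative choices, not part of the example above. The separator is assumed never to occur in the data, so that a line containing commas still ends up in a single column.

```python
import csv

import pandas as pd


def iter_lines(source, chunksize=10_000, sep="\x01"):
    """Yield the lines of `source` (a path or file-like object) one at a time.

    Assumption: `sep` never appears in the data, so every line is parsed
    as a single field. Blank lines are skipped by pandas by default.
    """
    reader = pd.read_csv(
        source,
        header=None,             # do not treat the first line as column names
        sep=sep,                 # separator assumed absent from the data
        dtype=str,               # keep numeric-looking lines as text
        quoting=csv.QUOTE_NONE,  # treat quote characters literally
        na_filter=False,         # do not turn "NA", "null", etc. into NaN
        chunksize=chunksize,
    )
    for chunk in reader:
        # Each chunk has a single column; yield its values in order.
        yield from chunk.iloc[:, 0]
```

Looping with `for line in iter_lines('dat.csv'):` and handling each line as it arrives keeps only one chunk in memory at a time. Appending every line to a list, as in the loop above, would bring back the original memory problem on a file that does not fit in memory.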

I hope this helps!

