pandas distinction between str and object types

Numpy seems to make a distinction between str and object types. For instance I can do ::

>>> import pandas as pd
>>> import numpy as np
>>> np.dtype(str)
dtype('S')
>>> np.dtype(object)
dtype('O')

Where dtype('S') and dtype('O') corresponds to str and object respectively.

However pandas seem to lack that distinction and coerce str to object. ::

>>> df = pd.DataFrame({'a': np.arange(5)})
>>> df.a.dtype
dtype('int64')
>>> df.a.astype(str).dtype
dtype('O')
>>> df.a.astype(object).dtype
dtype('O')

Forcing the type to dtype('S') does not help either. ::

>>> df.a.astype(np.dtype(str)).dtype
dtype('O')
>>> df.a.astype(np.dtype('S')).dtype
dtype('O')

Is there any explanation for this behavior?

Answers


Numpy's string dtypes aren't python strings.

Therefore, pandas deliberately uses native python strings, which require an object dtype.

First off, let me demonstrate a bit of what I mean by numpy's strings being different:

In [1]: import numpy as np
In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)

Now, 'x' is a numpy string dtype (fixed-width, c-like string) and y is an array of native python strings.

If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype versions will be truncated:

In [4]: x[1] = 'a really really really long'
In [5]: x
Out[5]:
array(['Testing', 'a reall', 'string'],
      dtype='|S7')

While the object dtype versions can be arbitrary length:

In [6]: y[1] = 'a really really really long'

In [7]: y
Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)

Next, the |S dtype strings can't hold unicode properly, though there is a unicode fixed-length string dtype, as well. I'll skip an example, for the moment.

Finally, numpy's strings are actually mutable, while Python strings are not. For example:

In [8]: z = x.view(np.uint8)
In [9]: z += 1
In [10]: x
Out[10]:
array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
      dtype='|S7')

For all of these reasons, pandas chose not to ever allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a python string into a fixed-with numpy string won't work in pandas. Instead, it always uses native python strings, which behave in a more intuitive way for most users.


Need Your Help

Javascript: Calling object methods within that object

javascript oop methods

What is the best design pattern for achieving the following (which doesn't work)?

Including non-Java sources in a Maven project

maven maven-2

I'm starting on a project that I expect will include a substantial amount of non-Java code (mostly shell and SQL scripts).