Storing and retrieving datasets within Treants¶
The functionality of a Treant can be expanded to conveniently store numpy and pandas objects in a couple of different ways. If we have an existing Treant:
>>> import datreant.core as dtr
>>> s = dtr.Treant('sequoia')
>>> s
<Treant: 'sequoia'>
We can attach the Data limb to only this instance with:
>>> import datreant.data
>>> s.attach('data')
>>> s.data
<Data([])>
Alternatively, we can attach the Data and AggData limbs to every object they apply to by doing:
>>> import datreant.data.attach
If you want explicit control over which objects have this limb, use the first approach; the second is convenient for interactive work.
Storing and retrieving numpy arrays¶
Perhaps we have generated a numpy array of shape (10^6, 3) that we wish to have easy access to later:
>>> import numpy as np
>>> a = np.random.randn(1000000, 3)
>>> a.shape
(1000000, 3)
We can store this easily:
>>> s.data['something wicked'] = a
>>> s.data
<Data(['something wicked'])>
Looking at the contents of the directory sequoia, we see it has a new subdirectory corresponding to the name of our stored dataset:
>>> s.draw()
sequoia/
+-- something wicked/
| +-- npData.h5
+-- Treant.608f7463-5063-450a-96eb-c5c93f16dc32.json
and inside of this is a new HDF5 file (npData.h5). Our numpy array is stored inside, and we can recall it just as easily as we stored it:
>>> s.data['something wicked']
array([[ 0.49884872, -0.30062622, 0.64513512],
[-0.12839311, 0.68467086, -0.96125085],
[ 0.36655902, -0.13178154, -0.58137863],
...,
[-0.20229488, -0.30303892, 1.44345568],
[ 0.10119334, -0.50691484, 0.05854653],
[-2.0551924 , 0.80378532, -0.28869459]])
Storing and retrieving pandas objects¶
pandas is the de facto standard for working with tabular data in Python. Its most-used objects, the Series and DataFrame, are just as easily stored as numpy arrays. If we have a DataFrame we wish to store:
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'])
>>> df.head()
A B C
0 -0.474337 -1.257253 0.497824
1 -1.057806 -1.393081 0.628394
2 0.063369 -1.820173 -1.178128
3 -0.747949 0.607452 -1.509302
4 -0.031547 -0.680997 1.127573
then as you can expect, we can store it with:
>>> s.data['something terrible'] = df
and recall it with:
>>> s.data['something terrible'].head()
A B C
0 -0.474337 -1.257253 0.497824
1 -1.057806 -1.393081 0.628394
2 0.063369 -1.820173 -1.178128
3 -0.747949 0.607452 -1.509302
4 -0.031547 -0.680997 1.127573
Our data is stored in its own HDF5 file (pdData.h5) in the subdirectory we specified, so now our Treant looks like this:
>>> s.draw()
sequoia/
+-- something wicked/
| +-- npData.h5
+-- Treant.608f7463-5063-450a-96eb-c5c93f16dc32.json
+-- something terrible/
+-- pdData.h5
Alternatively, we can use the add() method to store datasets:
>>> s.data.add('something terrible', df)
but the effect is the same. Since this internally uses the pandas.HDFStore class for storing pandas objects, all of its limitations on the types of indexes and objects it can store apply.
Appending to existing data¶
Sometimes we may have code that will generate a Series or DataFrame that is rather large, perhaps larger than our machine’s memory. In these cases we can append() to an existing store instead of writing out a single, huge DataFrame all at once:
>>> s.data['something terrible'].shape # before
(1000, 3)
>>> df2 = pd.DataFrame(np.random.randn(2000, 3), columns=['A', 'B', 'C'])
>>> s.data.append('something terrible', df2)
>>> s.data['something terrible'].shape # after
(3000, 3)
Have code that will generate a DataFrame with 10^8 rows? No problem:
>>> for i in range(10**2):
... a_piece = pd.DataFrame(np.random.randn(10**6, 3),
... columns=['A', 'B', 'C'],
... index=pd.Int64Index(np.arange(10**6) + i*10**6))
...
... s.data.append('something enormous', a_piece)
Note that the DataFrame appended must have the same column names and dtypes as that already stored, and that only rows can be appended, not columns. For pandas.Series objects the dtype must match.
Appending of pandas.Panel
objects also works, but the limitations are
more stringent. See the pandas HDFStore documentation for more details on
what is technically possible.
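The column and dtype constraints above can be checked up front before attempting an append. A minimal sketch using plain pandas (the helper name can_append is ours for illustration, not part of datreant):

```python
import numpy as np
import pandas as pd

def can_append(existing: pd.DataFrame, new: pd.DataFrame) -> bool:
    """Check the constraints an HDFStore-backed append imposes:
    identical column names (in order) and identical dtypes."""
    return (list(existing.columns) == list(new.columns)
            and bool((existing.dtypes == new.dtypes).all()))

df = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'])
ok = pd.DataFrame(np.random.randn(500, 3), columns=['A', 'B', 'C'])
bad_cols = ok.rename(columns={'C': 'D'})    # different column name
bad_dtype = ok.astype({'C': 'float32'})     # different dtype for 'C'

print(can_append(df, ok))         # True
print(can_append(df, bad_cols))   # False
print(can_append(df, bad_dtype))  # False
```

Rows with matching structure append cleanly; anything else is better caught before the store raises deep inside HDFStore.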
Retrieving subselections¶
For pandas stores that are very large, we may not want or be able to pull the full object into memory. For these cases we can use retrieve() to get subselections of our data.
Taking our large 10^8-row DataFrame, we can get at rows 1000000 to 2000000 with something like:
>>> s.data.retrieve('something enormous', start=1000000, stop=2000000).shape
(1000000, 3)
If we only wanted columns ‘B’ and ‘C’, we could get only those, too:
>>> s.data.retrieve('something enormous', start=1000000, stop=2000000,
...                 columns=['B', 'C']).shape
(1000000, 2)
(1000000, 2)
These operations are performed “out-of-core”, meaning that the full dataset is never read entirely into memory to get back the result of our subselection.
Retrieving from a query¶
For large datasets it can also be useful to retrieve only rows that match some set of conditions. We can do this with the where keyword, for example getting all rows for which column ‘A’ is less than -2:
>>> s.data.retrieve('something enormous', where="A < -2").head()
A B C
131 -2.177729 -0.797003 0.401288
134 -2.017321 0.750593 -1.366106
198 -2.203170 -0.670188 0.494191
246 -2.156695 1.107288 -0.065875
309 -2.334792 0.984636 0.006232
321 -3.784861 -1.222399 0.038717
346 -2.057103 -0.230953 0.732774
364 -2.418875 0.250880 -0.850418
413 -2.528563 -0.261624 1.233367
480 -2.205484 0.036570 0.501868
Note
Since our data is randomly generated in this example, the rows you get running the same example will be different.
Or perhaps we want rows where both column ‘A’ is less than -2 and column ‘C’ is greater than 2:
>>> s.data.retrieve('something enormous', where="A < -2 & C > 2").head()
A B C
1790 -3.103821 -0.616780 2.714530
5635 -2.431589 -0.580400 3.163408
7664 -2.364559 0.304764 2.884965
9208 -2.569256 1.105211 2.008396
9487 -2.028096 0.146484 2.234081
9968 -2.362063 0.544276 2.469602
11503 -2.494900 -0.005465 2.487311
12725 -2.353478 -0.001569 2.274861
14991 -2.129492 -1.889708 2.324640
15178 -2.327528 1.852786 2.425977
See the documentation for querying with pandas.HDFStore.select() for more information on the range of possibilities for the where keyword.
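The comparison syntax in these where strings mirrors what plain pandas accepts for in-memory filtering. As a rough in-memory analogue (not the out-of-core path the store takes), DataFrame.query applies the same kind of predicate:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.standard_normal((100000, 3)), columns=['A', 'B', 'C'])

# The same predicate as the HDFStore `where` string, evaluated in memory.
subset = df.query("A < -2 & C > 2")

# Every surviving row satisfies both conditions.
print(bool((subset['A'] < -2).all() and (subset['C'] > 2).all()))  # True
```

The difference is that the store evaluates the condition chunk by chunk on disk, so only the matching rows are ever loaded.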
Bonus: storing anything pickleable¶
As a bit of a bonus, we can use the same basic storage and retrieval mechanisms that work for numpy and pandas objects to store any Python object that is pickleable. For example, doing:
>>> s.data['a grocery list'] = ['ham', 'eggs', 'spam']
will store this list as a pickle:
>>> s.draw()
sequoia/
+-- a grocery list/
| +-- pyData.pkl
+-- something wicked/
| +-- npData.h5
+-- Treant.608f7463-5063-450a-96eb-c5c93f16dc32.json
+-- something enormous/
| +-- pdData.h5
+-- something terrible/
+-- pdData.h5
And we can get it back:
>>> s.data['a grocery list']
['ham', 'eggs', 'spam']
In this way we don’t have to care too much about what type of object we are trying to store; the Data limb will try to pickle anything that isn’t a numpy or pandas object.
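Under the hood, this fallback amounts to an ordinary pickle round-trip to a file. A minimal sketch of the idea using only the standard library (the file name pyData.pkl matches the tree listing above; the directory layout here is illustrative):

```python
import pickle
import tempfile
from pathlib import Path

grocery = ['ham', 'eggs', 'spam']

with tempfile.TemporaryDirectory() as tmp:
    # Mimic the on-disk layout: <dataset name>/pyData.pkl
    dataset_dir = Path(tmp) / 'a grocery list'
    dataset_dir.mkdir()
    target = dataset_dir / 'pyData.pkl'

    with target.open('wb') as f:
        pickle.dump(grocery, f)

    with target.open('rb') as f:
        restored = pickle.load(f)

print(restored == grocery)  # True
```

Anything pickle.dump can serialize survives the round-trip unchanged, which is why arbitrary Python objects work here.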
Deleting datasets¶
We can delete stored datasets with the remove() method:
>>> s.data.remove('something terrible')
>>> s.draw()
sequoia/
+-- a grocery list/
| +-- pyData.pkl
+-- Treant.608f7463-5063-450a-96eb-c5c93f16dc32.json
+-- something enormous/
| +-- pdData.h5
+-- something wicked/
+-- npData.h5
This will remove not only the file in which the data is actually stored, but also the directory if there are no other files present inside of it. If there are other files present, the data file will be deleted but the directory will not.
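That removal behaviour, delete the data file, then drop the directory only if nothing else remains in it, can be sketched with the standard library (an illustration of the behaviour described above, not datreant’s actual implementation):

```python
import tempfile
from pathlib import Path

def remove_dataset(treant_dir: Path, handle: str,
                   filename: str = 'pdData.h5') -> None:
    """Delete the dataset file, and its directory too if now empty."""
    dataset_dir = treant_dir / handle
    data_file = dataset_dir / filename
    if data_file.exists():
        data_file.unlink()
    if dataset_dir.exists() and not any(dataset_dir.iterdir()):
        dataset_dir.rmdir()

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Dataset directory holding only the data file: both are removed.
    (root / 'something terrible').mkdir()
    (root / 'something terrible' / 'pdData.h5').touch()
    remove_dataset(root, 'something terrible')
    dir_removed = not (root / 'something terrible').exists()

    # Dataset directory with an extra file: only the data file goes.
    (root / 'something wicked').mkdir()
    (root / 'something wicked' / 'pdData.h5').touch()
    (root / 'something wicked' / 'notes.txt').touch()
    remove_dataset(root, 'something wicked')
    dir_kept = (root / 'something wicked').exists()

print(dir_removed, dir_kept)  # True True
```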
But since datasets live in the filesystem, we can also delete them directly, e.g. through a shell:
> rm -r sequoia/"something terrible"
and it will work just as well.