datreant.data: convenient data storage and retrieval for Treants¶
This datreant submodule adds a convenience interface for numpy and
pandas data storage and retrieval using HDF5 within a Treant’s directory
structure. It provides the data
limb for Treants, Trees, Bundles, and Views
from datreant.core.
Warning
This module is experimental. It is not API stable, and has many rough edges and limitations. It is, however, usable.
Getting datreant.data¶
See the installation instructions for installation details. The package itself is pure Python, but it is dependent on HDF5 libraries and the Python interfaces to these.
If you want to work on the code, either for yourself or to contribute back to the project, clone the repository to your local machine with:
git clone https://github.com/datreant/datreant.git
Contributing¶
This project is still under heavy development, and there are certainly rough edges and bugs. Issues and pull requests welcome! Check out our contributor’s guide to learn how to get started with contributing back.
Installing datreant.data¶
Since datreant.data
uses HDF5 as the file format of choice for persistence,
you will first need to install the HDF5 libraries either using your package
manager or manually.
On Ubuntu 14.04 this will be
apt-get install libhdf5-serial-1.8.4 libhdf5-serial-dev
and on Arch Linux
pacman -S hdf5
You can then install datreant.data
from PyPI
using pip:
pip install datreant.data
It is also possible to use --user
to install into your user’s site-packages
directory:
pip install --user datreant.data
Dependencies¶
The dependencies of datreant.data
are:
These are installed automatically when installing with pip.
Installing from source¶
To install from source, clone the repository and switch to the master branch
git clone git@github.com:datreant/datreant.data.git
cd datreant.data
git checkout master
Installation of the packages is as simple as
pip install .
This installs datreant.data
in the system wide python directory; this may
require administrative privileges.
It is also possible to use --user
to install into your user’s site-packages
directory:
pip install --user .
Storing and retrieving datasets within Treants¶
The functionality of a Treant
can be expanded to
conveniently store numpy
and pandas
objects in a couple different ways.
If we have an existing Treant:
>>> import datreant.core as dtr
>>> s = dtr.Treant('sequoia')
>>> s
<Treant: 'sequoia'>
We can attach the Data
limb to only this instance
with:
>>> import datreant.data
>>> s.attach('data')
>>> s.data
<Data([])>
Alternatively, we could attach the Data
and
AggData
limbs to every object they apply for
by doing:
>>> import datreant.data.attach
If you want explicit control of which objects have this limb, the first approach is the one to use, but the second one is useful for interactive work.
Storing and retrieving numpy
arrays¶
Perhaps we have generated a numpy array of dimension (10^6, 3) that we wish to have easy access to later
>>> import numpy as np
>>> a = np.random.randn(1000000, 3)
>>> a.shape
(1000000, 3)
We can store this easily
>>> s.data['something wicked'] = a
>>> s.data
<Data(['something wicked'])>
Looking at the contents of the directory sequoia
, we see it has a new
subdirectory corresponding to the name of our stored dataset
>>> s.draw()
sequoia/
+-- something wicked/
| +-- npData.h5
+-- Treant.608f7463-5063-450a-96eb-c5c93f16dc32.json
and inside of this is a new HDF5 file (npData.h5
). Our numpy
array is
stored inside, and we can recall it just as easily as we stored it:
>>> s.data['something wicked']
array([[ 0.49884872, -0.30062622, 0.64513512],
[-0.12839311, 0.68467086, -0.96125085],
[ 0.36655902, -0.13178154, -0.58137863],
...,
[-0.20229488, -0.30303892, 1.44345568],
[ 0.10119334, -0.50691484, 0.05854653],
[-2.0551924 , 0.80378532, -0.28869459]])
Storing and retrieving pandas
objects¶
pandas is the de facto standard for working
with tabular data in Python. It’s most-used objects, the
Series
and DataFrame
are just as easily
stored as numpy
arrays. If we have a DataFrame we wish to store:
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'])
>>> df.head()
A B C
0 -0.474337 -1.257253 0.497824
1 -1.057806 -1.393081 0.628394
2 0.063369 -1.820173 -1.178128
3 -0.747949 0.607452 -1.509302
4 -0.031547 -0.680997 1.127573
then as you can expect, we can store it with:
>>> s.data['something terrible'] = df
and recall it with:
>>> s.data['something terrible'].head()
A B C
0 -0.474337 -1.257253 0.497824
1 -1.057806 -1.393081 0.628394
2 0.063369 -1.820173 -1.178128
3 -0.747949 0.607452 -1.509302
4 -0.031547 -0.680997 1.127573
Our data is stored in its own HDF5 file (pdData.h5
) in the subdirectory we
specified, so now our Treant looks like this:
s.draw()
sequoia/
+-- something wicked/
| +-- npData.h5
+-- Treant.608f7463-5063-450a-96eb-c5c93f16dc32.json
+-- something terrible/
+-- pdData.h5
Alternatively, we can use the add()
method to
store datasets:
>>> s.data.add('something terrible')
but the effect is the same. Since internally this uses the pandas.HDFStore class for storing pandas objects, all limitations for the types of indexes and objects it can store apply.
Appending to existing data¶
Sometimes we may have code that will generate a Series
or
DataFrame
that is rather large, perhaps larger than our
machine’s memory. In these cases we can
append()
to an existing store instead of writing
out a single, huge DataFrame all at once:
>>> s.data['something terrible'].shape # before
(1000, 3)
>>> df2 = pd.DataFrame(np.random.randn(2000, 3), columns=['A', 'B', 'C'])
>>> s.data.append('something terrible', df2)
>>> s.data['something terrible'].shape # after
(3000, 3)
Have code that will generate a DataFrame with 10^8 rows? No problem:
>>> for i in range(10**2):
... a_piece = pd.DataFrame(np.random.randn(10**6, 3),
... columns=['A', 'B', 'C'],
... index=pd.Int64Index(np.arange(10**6) + i*10**6))
...
... s.data.append('something enormous', a_piece)
Note that the DataFrame
appended must have the same column
names and dtypes as that already stored, and that only rows can be appended,
not columns. For pandas.Series
objects the dtype must match.
Appending of pandas.Panel
objects also works, but the limitations are
more stringent. See the pandas HDFStore documentation for more details on
what is technically possible.
Retrieving subselections¶
For pandas stores that are very large, we may not want or be able to pull the
full object into memory. For these cases we can use
retrieve()
to get subselections of our data.
Taking our large 10^8 row DataFrame, we can get at rows 1000000 to 2000000
with something like:
>>> s.data.retrieve('something enormous', start=10000000, stop=2000000).shape
(1000000, 3)
If we only wanted columns ‘B’ and ‘C’, we could get only those, too:
>>> s.data.retrieve('something enormous', start=10000000, stop=2000000,
... columns=['B', 'C']).shape
(1000000, 2)
These operations are performed “out-of-core”, meaning that the full dataset is never read entirely into memory to get back the result of our subselection.
Retrieving from a query¶
For large datasets it can also be useful to retrieve only rows that match some
set of conditions. We can do this with the where
keyword, for example
getting all rows for which column ‘A’ is less than -2:
>>> s.data.retrieve('something enormous', where="A < -2").head()
A B C
131 -2.177729 -0.797003 0.401288
134 -2.017321 0.750593 -1.366106
198 -2.203170 -0.670188 0.494191
246 -2.156695 1.107288 -0.065875
309 -2.334792 0.984636 0.006232
321 -3.784861 -1.222399 0.038717
346 -2.057103 -0.230953 0.732774
364 -2.418875 0.250880 -0.850418
413 -2.528563 -0.261624 1.233367
480 -2.205484 0.036570 0.501868
Note
Since our data is randomly generated in this example, the rows you get running the same example will be different.
Or perhaps when both column ‘A’ is less than -2 and column ‘C’ is greater than 2:
>>> s.data.retrieve('something enormous', where="A < -2 & C > 2").head()
A B C
1790 -3.103821 -0.616780 2.714530
5635 -2.431589 -0.580400 3.163408
7664 -2.364559 0.304764 2.884965
9208 -2.569256 1.105211 2.008396
9487 -2.028096 0.146484 2.234081
9968 -2.362063 0.544276 2.469602
11503 -2.494900 -0.005465 2.487311
12725 -2.353478 -0.001569 2.274861
14991 -2.129492 -1.889708 2.324640
15178 -2.327528 1.852786 2.425977
See the documentation for querying with pandas.HDFStore.select()
for
more information on the range of possibilities for the where
keyword.
Bonus: storing anything pickleable¶
As a bit of a bonus, we can use the same basic storage and retrieval mechanisms
that work for numpy
and pandas
objects to store Python object
that is pickleable. For example, doing:
>>> s.data['a grocery list'] = ['ham', 'eggs', 'spam']
will store this list as a pickle:
>>> s.draw()
sequoia/
+-- a grocery list/
| +-- pyData.pkl
+-- something wicked/
| +-- npData.h5
+-- Treant.608f7463-5063-450a-96eb-c5c93f16dc32.json
+-- something enormous/
| +-- pdData.h5
+-- something terrible/
+-- pdData.h5
And we can get it back:
>>> s.data['a grocery list']
['ham', 'eggs', 'spam']
In this way we don’t have to care too much about what type of object we are
trying to store; the Data
limb will try to pickle
anything that isn’t a numpy
or pandas
object.
Deleting datasets¶
We can delete stored datasets with the remove()
method:
>>> s.data.remove('something terrible')
>>> s.draw()
sequoia/
+-- a grocery list/
| +-- pyData.pkl
+-- Treant.608f7463-5063-450a-96eb-c5c93f16dc32.json
+-- something enormous/
| +-- pdData.h5
+-- something wicked/
+-- npData.h5
This will remove not only the file in which the data is actually stored, but also the directory if there are no other files present inside of it. If there are other files present, the data file will be deleted but the directory will not.
But since datasets live in the filesystem, we can also delete datasets by deleting them more directly, e.g. through a shell:
> rm -r sequoia/"something terrible"
and it will work just as well.
Using Trees to subselect datasets¶
The Data
limb isn’t just for Treants; it works
for Tree
objects as well. So we could use our
Treant ‘sequoia’ directly as a Tree instead of a Treant if we wanted:
>>> import datreant.core as dtr
>>> import datreant.data
>>> t = dtr.Tree('sequoia/')
>>> t.attach('data')
>>> t.data
<Data(['a grocery list', 'something enormous', 'something wicked'])>
and it would work all the same. This behavior is most useful, however, when nesting datasets.
Nesting within a tree¶
Dataset names are their paths downward relative to the Tree/Treant they are called from, so we can store a dataset like:
>>> t.data['a/better/grocery/list'] = ['ham', 'eggs', 'steak']
>>> t.data
<Data(['a grocery list', 'a/better/grocery/list', 'something enormous', 'something wicked'])>
and this creates the directory structure you might expect:
>>> t.draw()
sequoia/
+-- a grocery list/
| +-- pyData.pkl
+-- Treant.608f7463-5063-450a-96eb-c5c93f16dc32.json
+-- something enormous/
| +-- pdData.h5
+-- a/
| +-- better/
| +-- grocery/
| +-- list/
| +-- pyData.pkl
+-- something wicked/
+-- npData.h5
This allows us to group together related datasets in a natural way, as we would
probably do even if we weren’t using datreant
objects. So if we had several
shopping lists, we might put them under a directory of their own:
>>> t.data['shopping lists/food'] = ['milk', 'ham', 'eggs', 'steak']
>>> t.data['shopping lists/clothes'] = ['shirts', 'pants', 'shoes']
>>> t.data['shopping lists/misc'] = ['dish soap']
which would give us:
>>> t['shopping lists'].draw()
shopping lists/
+-- misc/
| +-- pyData.pkl
+-- food/
| +-- pyData.pkl
+-- clothes/
+-- pyData.pkl
and we could always get them back easily enough:
>>> t.data['shopping lists/food']
['milk', 'ham', 'eggs', 'steak']
Trees as subselections¶
But since Trees can access datasets inside them, we could work more directly with our shopping lists by using the ‘shopping lists’ Tree
>>> lets_go_shopping = t['shopping lists']
>>> lets_go_shopping.data
<Data(['clothes', 'food', 'misc'])>
and now selecting is a bit less verbose:
>>> lets_go_shopping['food']
['milk', 'ham', 'eggs', 'steak']
Aggregating datasets with Views and Bundles¶
Just as Treants and Trees have the Data
limb
for storing and retrieving datasets in their filesystem trees, the
View
and Bundle
objects also
have the AggData
limb for accessing these
datasets in aggregate.
Given a directory with four Treants
> ls
elm/ maple/ oak/ sequoia/
we’ll gather these up in a Bundle
>>> import datreant.core as dtr
>>> import glob
>>> b = dtr.Bundle(glob.glob('*'))
>>> b
<Bundle([<Treant: 'sequoia'>, <Treant: 'maple'>, <Treant: 'oak'>, <Treant: 'elm'>])>
and then attach the AggData
limb to only this
Bundle instance with:
>>> import datreant.data
>>> b.attach('data')
Note
Attaching a limb like AggData
to a
Bundle or View with the attach()
method
will attach the required limb to each member instance. In this case,
each member gets a Data
limb.
and so we can now do:
>>> b.data
<AggData([])>
This tells us that there are no datasets with the same key within every member of the Bundle. So, let’s make something that does. Let’s build a “dataset” that gives us a sinusoid based on a characteristic of each Treant in the Bundle:
>>> import numpy as np
>>> b.categories['frequency'] = [1, 2, 3, 4]
>>> for member in b:
... member.data['sinusoid/array'] = np.sin(
... member.categories['frequency'] * np.linspace(0, 8*np.pi,
... num=200))
So now if we do:
>>> b.data
<AggData(['sinusoid/array'])>
we see we now have a dataset name in common among all members. If we recall it
>>> sines = b.data['sinusoid/array']
>>> type(sines)
dict
we get back a dictionary with the full path to each member as keys:
>>> sines.keys()
['/home/bob/research/arborea/sequoia/',
'/home/bob/research/arborea/oak/',
'/home/bob/research/arborea/elm/',
'/home/bob/research/arborea/maple/']
and the values are the :numpy arrays we stored for each member. If we’d
rather get back a dictionary with names instead of paths, we could do that
with the retrieve()
method:
>>> b.data.retrieve('sinusoid/array', by='name').keys()
['sequoia', 'oak', 'maple', 'elm']
Getting uuids as the keys is also possible, and is often useful since these will be unique among Treants, while names (and in some cases, paths) are generally not.
Aggregating datasets not represented among all members¶
We can still aggregate over datasets even if their keys are not present among all members. We can see what keys are available among at least one member in the Bundle with:
>>> b.data.any
['a grocery list',
'a/better/grocery/list',
'shopping lists/clothes',
'shopping lists/food',
'shopping lists/misc',
'sinusoid/array',
'something enormous',
'something wicked']
and we see the datasets we stored using the single Treant earlier. If we recall one of these, we get an aggregation
>>> b.data['shopping lists/clothes']
{'/home/bob/research/arborea/sequoia/': ['shirts', 'pants', 'shoes']}
with only the datasets present for that key. Since it’s only the one Treant that has a dataset with this name, we get a dictionary with one key-value pair.
MultiIndex aggregation for pandas
objects¶
numpy
arrays or pickled datasets are always retrieved in aggregate as
dictionaries, since this is the simplest way of aggregating these objects while
retaining the ability to identify datasets from individual members. Aggregation
is most useful, however, for pandas
objects, since for these we can
naturally build versions of the same data structure with an additional index
for data membership.
We’ll make a pandas.Series
version of the same dataset we stored
before:
>>> import pandas as pd
>>> for member in b:
... member.data['sinusoid/series'] = pd.Series(member.data['sinusoid/array'])
So now when we retrieve this aggregated dataset by name, we get a series with an outermost index of member names:
>>> sines = b.data.retrieve('sinusoid/series', by='name')
>>> sines.groupby(level=0).head()
sequoia 0 0.000000
1 0.125960
2 0.249913
3 0.369885
4 0.483966
oak 0 0.000000
1 0.369885
2 0.687304
3 0.907232
4 0.998474
maple 0 0.000000
1 0.249913
2 0.483966
3 0.687304
4 0.847024
elm 0 0.000000
1 0.483966
2 0.847024
3 0.998474
4 0.900479
dtype: float64
So we can immediately use this for aggregated analysis, or perhaps just pretty plots:
>>> for name, group in sines.groupby(level=0):
... group.reset_index(level=0, drop=True).plot(legend=True, label=name)
Subselection with Views¶
Just as we can subselect datasets with Trees, we
can use View
objects to work with subselections in
aggregate. Using our Bundle from above, we can construct a View:
>>> sinusoids = dtr.View(b).trees['sinusoid']
>>> sinusoids
<View([<Tree: 'sequoia/sinusoid/'>, <Tree: 'maple/sinusoid/'>, <Tree: 'oak/sinusoid/'>, <Tree: 'elm/sinusoid/'>])>
And just like a Tree
can access datasets with the
Data
limb in the same way a
Treant
can, a View
can access
datasets in aggregate in the same way as a Bundle
:
>>> sinusoids.attach('data')
>>> sinusoids.data
<AggData(['array', 'series'])>
These are the datasets common to all the Trees in this View. We can retrieve an aggregation as before:
>>> sinusoids.data['series'].groupby(level=0).head()
/home/bob/research/arborea/sequoia/sinusoid/ 0 0.000000
1 0.031569
2 0.063106
3 0.094580
4 0.125960
/home/bob/research/arborea/maple/sinusoid/ 0 0.000000
1 0.063106
2 0.125960
3 0.188312
4 0.249913
/home/bob/research/arborea/oak/sinusoid/ 0 0.000000
1 0.094580
2 0.188312
3 0.280355
4 0.369885
/home/bob/research/arborea/elm/sinusoid/ 0 0.000000
1 0.125960
2 0.249913
3 0.369885
4 0.483966
dtype: float64
Note
For aggregations from a View, it is not possible to aggregate by
uuid because Trees do not have them. Also, in many cases, as here,
aggregating by name will not give unique keys. When the aggregation
keys are not unique, a KeyError
is raised.
API Reference¶
This is an overview of the datreant.data
API components.
Individual datasets¶
These are the API components of datreant.data
for storing and retrieving
datasets from individual Treants and Trees.
Data¶
The class datreant.data.limbs.Data
is the interface used by Treants to
access their stored datasets.
-
class
datreant.data.limbs.
Data
(tree)¶ Interface to stored data.
-
add
(handle, *args, **kwargs)¶ Store data in Treant.
A data instance can be a pandas object (Series, DataFrame, Panel), a numpy array, or a pickleable python object. If the dataset doesn’t exist, it is added. If a dataset already exists for the given handle, it is replaced.
Arguments: - handle
name given to data; needed for retrieval
- data
data structure to store
-
append
(handle, *args, **kwargs)¶ Append rows to an existing dataset.
The object must be of the same pandas class (Series, DataFrame, Panel) as the existing dataset, and it must have exactly the same columns (names included).
Arguments: - handle
name of data to append to
- data
data to append
-
keys
()¶ List available datasets.
Returns: - handles
list of handles to available datasets
-
remove
(handle, **kwargs)¶ Remove a dataset, or some subset of a dataset.
Note: in the case the whole dataset is removed, the directory containing the dataset file (
Data.h5
) will NOT be removed if it still contains file(s) after the removal of the dataset file.For pandas objects (Series, DataFrame, or Panel) subsets of the whole dataset can be removed using keywords such as start and stop for ranges of rows, and columns for selected columns.
Arguments: - handle
name of dataset to delete
Keywords: - where
conditions for what rows/columns to remove
- start
row number to start selection
- stop
row number to stop selection
- columns
columns to remove
-
retrieve
(handle, *args, **kwargs)¶ Retrieve stored data.
The stored data structure is read from disk and returned.
If dataset doesn’t exist,
None
is returned.For pandas objects (Series, DataFrame, or Panel) subsets of the whole dataset can be returned using keywords such as start and stop for ranges of rows, and columns for selected columns.
Also for pandas objects, the where keyword takes a string as input and can be used to filter out rows and columns without loading the full object into memory. For example, given a DataFrame with handle ‘mydata’ with columns (A, B, C, D), one could return all rows for columns A and C for which column D is greater than .3 with:
retrieve('mydata', where='columns=[A,C] & D > .3')
Or, if we wanted all rows with index = 3 (there could be more than one):
retrieve('mydata', where='index = 3')
See
pandas.HDFStore.select()
for more information.Arguments: - handle
name of data to retrieve
Keywords: - where
conditions for what rows/columns to return
- start
row number to start selection
- stop
row number to stop selection
- columns
list of columns to return; all columns returned by default
- iterator
if True, return an iterator [
False
]- chunksize
number of rows to include in iteration; implies
iterator=True
Returns: - data
stored data;
None
if nonexistent
-
Aggregated data¶
These are the API components of datreant.data
for working with datasets
from multiple Treants at once, and treating them in aggregate.
AggData¶
The class datreant.data.agglimbs.AggData
is the interface used by
Bundles and Views to access their members’ datasets in aggregate.
-
class
datreant.data.agglimbs.
AggData
(collection)¶ Manipulators for collection data.
-
keys
(scope='all')¶ List available datasets.
Parameters: scope ({'all', 'any'}) – Keys to list. ‘all’ returns only handles that are present in all members. ‘any’ returns a list of all handles present in at least one member. Returns: handles – list of handles to available datasets Return type: list
-
retrieve
(handle, by='path', **kwargs)¶ Retrieve aggregated dataset from all members.
This is a convenience method. The stored data structure for each member is read from disk and aggregated. The aggregation scheme is dependent on the form of the data structures pulled from each member:
- pandas DataFrames or Series
- the structures are appended together, with a new level added to the index giving the member (see by) each set of rows came from
- pandas Panel or Panel4D, numpy arrays, pickled python objects
- the structures are returned as a dictionary, with keys giving the member (see by) and each value giving the corresponding data structure
This method tries to do smart things with the data it reads from each member. In particular:
- members for which there is no data with the given handle are skipped
- the lowest-common-denominator data structure is output; this means that if all data structures read are pandas DataFrames, then a multi-index DataFrame is returned; if some structures are pandas DataFrames, while some are anything else, a dictionary is returned
Arguments: - handle
name of data to retrieve
Keywords: - by
top-level index of output data structure; ‘path’ uses member path, ‘name’ uses member names, ‘uuid’ uses member uuids; if names are not unique, it is better to go with ‘path’ or ‘uuid’ [‘path’]
See
datreant.data.limbs.Data.retrieve()
for more information on keyword usage.Keywords for pandas data structures: - where
conditions for what rows/columns to return
- start
row number to start selection
- stop
row number to stop selection
- columns
list of columns to return; all columns returned by default
- iterator
if True, return an iterator [
False
]- chunksize
number of rows to include in iteration; implies
iterator=True
Returns: - data
aggregated data structure
-
\ Sort by:\ best rated\ newest\ oldest\
\\
Add a comment\ (markup):
\``code``
, \ code blocks:::
and an indented block after blank line