Aggregating datasets with Views and Bundles

Just as Treants and Trees have the Data limb for storing and retrieving datasets in their filesystem trees, the View and Bundle objects also have the AggData limb for accessing these datasets in aggregate.

Given a directory with four Treants

> ls
elm/  maple/  oak/  sequoia/

we’ll gather these up in a Bundle

>>> import datreant.core as dtr
>>> import glob
>>> b = dtr.Bundle(glob.glob('*'))
>>> b
<Bundle([<Treant: 'sequoia'>, <Treant: 'maple'>, <Treant: 'oak'>, <Treant: 'elm'>])>

and then attach the AggData limb to only this Bundle instance with:

>>> import datreant.data
>>> b.attach('data')

Note

Attaching a limb like AggData to a Bundle or View with the attach() method will attach the required limb to each member instance. In this case, each member gets a Data limb.

and so we can now do:

>>> b.data
<AggData([])>

This tells us that there are no datasets with the same key within every member of the Bundle. So, let’s make something that does. Let’s build a “dataset” that gives us a sinusoid based on a characteristic of each Treant in the Bundle:

>>> import numpy as np
>>> b.categories['frequency'] = [1, 2, 3, 4]
>>> for member in b:
...     member.data['sinusoid/array'] = np.sin(
...         member.categories['frequency'] * np.linspace(0, 8*np.pi,
...                                                      num=200))

So now if we do:

>>> b.data
<AggData(['sinusoid/array'])>

we see we now have a dataset name in common among all members. If we recall it

>>> sines = b.data['sinusoid/array']
>>> type(sines)
dict

we get back a dictionary with the full path to each member as keys:

>>> sines.keys()
['/home/bob/research/arborea/sequoia/',
 '/home/bob/research/arborea/oak/',
 '/home/bob/research/arborea/elm/',
 '/home/bob/research/arborea/maple/']

and the values are the :numpy arrays we stored for each member. If we’d rather get back a dictionary with names instead of paths, we could do that with the retrieve() method:

>>> b.data.retrieve('sinusoid/array', by='name').keys()
['sequoia', 'oak', 'maple', 'elm']

Getting uuids as the keys is also possible, and is often useful since these will be unique among Treants, while names (and in some cases, paths) are generally not.

Aggregating datasets not represented among all members

We can still aggregate over datasets even if their keys are not present among all members. We can see what keys are available among at least one member in the Bundle with:

>>> b.data.any
['a grocery list',
'a/better/grocery/list',
'shopping lists/clothes',
'shopping lists/food',
'shopping lists/misc',
'sinusoid/array',
'something enormous',
'something wicked']

and we see the datasets we stored using the single Treant earlier. If we recall one of these, we get an aggregation

>>> b.data['shopping lists/clothes']
{'/home/bob/research/arborea/sequoia/': ['shirts', 'pants', 'shoes']}

with only the datasets present for that key. Since it’s only the one Treant that has a dataset with this name, we get a dictionary with one key-value pair.

MultiIndex aggregation for pandas objects

numpy arrays or pickled datasets are always retrieved in aggregate as dictionaries, since this is the simplest way of aggregating these objects while retaining the ability to identify datasets from individual members. Aggregation is most useful, however, for pandas objects, since for these we can naturally build versions of the same data structure with an additional index for data membership.

We’ll make a pandas.Series version of the same dataset we stored before:

>>> import pandas as pd
>>> for member in b:
...     member.data['sinusoid/series'] = pd.Series(member.data['sinusoid/array'])

So now when we retrieve this aggregated dataset by name, we get a series with an outermost index of member names:

>>> sines = b.data.retrieve('sinusoid/series', by='name')
>>> sines.groupby(level=0).head()
sequoia  0    0.000000
         1    0.125960
         2    0.249913
         3    0.369885
         4    0.483966
oak      0    0.000000
         1    0.369885
         2    0.687304
         3    0.907232
         4    0.998474
maple    0    0.000000
         1    0.249913
         2    0.483966
         3    0.687304
         4    0.847024
elm      0    0.000000
         1    0.483966
         2    0.847024
         3    0.998474
         4    0.900479
dtype: float64

So we can immediately use this for aggregated analysis, or perhaps just pretty plots:

>>> for name, group in sines.groupby(level=0):
...     group.reset_index(level=0, drop=True).plot(legend=True, label=name)
_images/sines.png

Subselection with Views

Just as we can subselect datasets with Trees, we can use View objects to work with subselections in aggregate. Using our Bundle from above, we can construct a View:

>>> sinusoids = dtr.View(b).trees['sinusoid']
>>> sinusoids
<View([<Tree: 'sequoia/sinusoid/'>, <Tree: 'maple/sinusoid/'>, <Tree: 'oak/sinusoid/'>, <Tree: 'elm/sinusoid/'>])>

And just like a Tree can access datasets with the Data limb in the same way a Treant can, a View can access datasets in aggregate in the same way as a Bundle:

>>> sinusoids.attach('data')
>>> sinusoids.data
<AggData(['array', 'series'])>

These are the datasets common to all the Trees in this View. We can retrieve an aggregation as before:

>>> sinusoids.data['series'].groupby(level=0).head()
/home/bob/research/arborea/sequoia/sinusoid/  0    0.000000
                                              1    0.031569
                                              2    0.063106
                                              3    0.094580
                                              4    0.125960
/home/bob/research/arborea/maple/sinusoid/    0    0.000000
                                              1    0.063106
                                              2    0.125960
                                              3    0.188312
                                              4    0.249913
/home/bob/research/arborea/oak/sinusoid/      0    0.000000
                                              1    0.094580
                                              2    0.188312
                                              3    0.280355
                                              4    0.369885
/home/bob/research/arborea/elm/sinusoid/      0    0.000000
                                              1    0.125960
                                              2    0.249913
                                              3    0.369885
                                              4    0.483966
dtype: float64

Note

For aggregations from a View, it is not possible to aggregate by uuid because Trees do not have them. Also, in many cases, as here, aggregating by name will not give unique keys. When the aggregation keys are not unique, a KeyError is raised.

API reference: AggData

See the AggData API reference for more details.