Wrapper to access a dataframe with a multi-level column index from Parquet.
This subclass of `ParquetTable` is necessary to handle multi-level column
indexes, because pyarrow offers no convenient way to request specific table
subsets by level from Parquet, as there is with a `pandas.DataFrame`.
Additionally, pyarrow stores multilevel index information in an unusual
way. Pandas stores each column key as a tuple, so that one can access a
single column from a pandas dataframe as `df[('ref', 'HSC-G', 'coord_ra')]`.
However, pyarrow saves these keys as "stringified" tuples, such that
in order to read this same column from a table written to Parquet, you would
have to do the following:
    import pyarrow.parquet as pq
    pf = pq.ParquetFile(filename)
    df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"]).to_pandas()
See also https://github.com/apache/arrow/issues/1771, where we've raised
this issue.
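To make the stringified form concrete, here is a minimal sketch using only the Python standard library. It assumes that the stored column name matches Python's `str()` of the tuple, which is consistent with the string shown above but is not taken from pyarrow's documentation.

```python
import ast

# A column key as pandas holds it in a three-level column MultiIndex.
column_key = ('ref', 'HSC-G', 'coord_ra')

# Assumption: the name stored in the Parquet file equals str() of the tuple.
stored_name = str(column_key)
print(stored_name)

# ast.literal_eval safely parses the stored string back into the tuple.
recovered_key = ast.literal_eval(stored_name)
print(recovered_key == column_key)
```

The round trip only works when every level value is a plain literal such as a string or a number.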
Multilevel-indexed dataframes are useful for storing things like multiple
filters' worth of data in the same table, so this case deserves a wrapper
that makes access easier. That is what this object is for. For example,
    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G',
                  'column': ['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)
will return just the coordinate columns. This is the equivalent of calling
`df['meas']['HSC-G'][['coord_ra', 'coord_dec']]` on the total dataframe,
but without loading the whole frame into memory: only those columns are
read from disk. You can also request a sub-table; e.g.,
    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)
and this will be the equivalent of `df['meas']['HSC-G']` on the total dataframe.
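One way to picture how a level dictionary could be turned into the stringified names that pyarrow expects is sketched below. This is not the class's actual implementation: `expand_column_dict`, `column_levels`, and `level_values` are hypothetical names introduced for illustration, and the sketch assumes every combination of level values exists in the file.

```python
from itertools import product


def expand_column_dict(column_dict, column_levels, level_values):
    """Hypothetical helper: build stringified column names from a level dict.

    column_dict   -- requested values per level name (a str or a list of str)
    column_levels -- ordered level names of the column multi-index
    level_values  -- all values known for each level (assumed to be available)
    """
    per_level = []
    for level in column_levels:
        if level in column_dict:
            requested = column_dict[level]
            # Accept a single value as shorthand for a one-element list.
            if isinstance(requested, str):
                requested = [requested]
        else:
            # A level that is left out implicitly includes all of its entries.
            requested = level_values[level]
        per_level.append(requested)
    # The Cartesian product gives one tuple per requested column.
    return [str(combo) for combo in product(*per_level)]


column_levels = ['dataset', 'filter', 'column']
level_values = {
    'dataset': ['meas', 'ref'],
    'filter': ['HSC-G', 'HSC-R'],
    'column': ['coord_ra', 'coord_dec'],
}

names = expand_column_dict(
    {'dataset': 'meas', 'filter': 'HSC-G', 'column': ['coord_ra', 'coord_dec']},
    column_levels, level_values)

# Leaving out the 'column' level requests the whole sub-table.
subtable_names = expand_column_dict(
    {'dataset': 'meas', 'filter': 'HSC-G'}, column_levels, level_values)
```

In this toy setup both calls produce the same two names, because the `column` level holds only the two coordinate columns.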
Parameters
----------
filename : str, optional
    Path to Parquet file.
dataFrame : pandas.DataFrame, optional
Definition at line 148 of file parquetTable.py.
def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.toDataFrame(self, columns=None, droplevels=True)
Get table (or specified columns) as a pandas DataFrame.
To get specific columns in specified sub-levels:
    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G',
                  'column': ['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)
Or, to get an entire subtable, leave out one level name:
    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)
Parameters
----------
columns : list or dict, optional
    Desired columns. If `None`, then all columns will be
    returned. If a list, then the names of the columns must
    be *exactly* as stored by pyarrow; that is, stringified tuples.
    If a dictionary, then the entries of the dictionary must
    correspond to the level names of the column multi-index
    (that is, the `columnLevels` attribute). Not every level
    needs to be passed; if any level is left out, then all entries
    in that level will be implicitly included.
droplevels : bool
    If True, drop levels of the column index that have just one entry.
Definition at line 235 of file parquetTable.py.
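The effect of `droplevels=True` can be illustrated on plain tuples, without pandas. The sketch below is an assumption about the intended behavior based on the parameter description above, not the method's actual code; `drop_single_entry_levels` is a hypothetical name.

```python
def drop_single_entry_levels(column_keys, level_names):
    """Hypothetical stand-in for droplevels=True, working on plain tuples.

    Removes every level whose value is identical across all column keys.
    """
    # Keep only the level positions that hold more than one distinct value.
    keep = [i for i in range(len(level_names))
            if len({key[i] for key in column_keys}) > 1]
    kept_names = [level_names[i] for i in keep]
    reduced = [tuple(key[i] for i in keep) for key in column_keys]
    # A single remaining level is represented by plain names, not 1-tuples.
    if len(keep) == 1:
        reduced = [key[0] for key in reduced]
    return reduced, kept_names


level_names = ['dataset', 'filter', 'column']
column_keys = [('meas', 'HSC-G', 'coord_ra'),
               ('meas', 'HSC-G', 'coord_dec')]

reduced_keys, reduced_levels = drop_single_entry_levels(column_keys, level_names)
```

Here the `dataset` and `filter` levels each hold a single entry, so only the `column` level survives and the keys reduce to the two coordinate names.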