lsst.pipe.tasks g68a3911fdd+c3f02514e0
lsst.pipe.tasks.parquetTable.MultilevelParquetTable Class Reference
Inheritance diagram for lsst.pipe.tasks.parquetTable.MultilevelParquetTable:
lsst.pipe.tasks.parquetTable.ParquetTable

Public Member Functions

 __init__ (self, *args, **kwargs)
 
 columnLevelNames (self)
 
 columnLevels (self)
 
 toDataFrame (self, columns=None, droplevels=True)
 

Public Attributes

 columns
 
 columnLevels
 

Protected Member Functions

 _getColumnIndex (self)
 
 _getColumns (self)
 
 _colsFromDict (self, colDict)
 
 _stringify (self, cols)
 

Protected Attributes

 _columnLevelNames
 

Detailed Description

Wrapper to access dataframe with multi-level column index from Parquet

This subclass of `ParquetTable` is necessary to handle multi-level column
indices, because there is no convenient way to request specific table subsets
by level via Parquet through pyarrow, as there is with a `pandas.DataFrame`.

Additionally, pyarrow stores multilevel index information in a very strange
way. Pandas stores it as a tuple, so that one can access a single column
from a pandas dataframe as `df[('ref', 'HSC-G', 'coord_ra')]`.  However, for
some reason pyarrow saves these indices as "stringified" tuples, such that
in order to read this same column from a table written to Parquet, you would
have to do the following:

    import pyarrow.parquet as pq
    pf = pq.ParquetFile(filename)
    df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"]).to_pandas()

See also https://github.com/apache/arrow/issues/1771, where we've raised
this issue.
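
The stringified name appears to be just Python's `str()` of the tuple (an
inference from the example above, not a documented pyarrow guarantee), so a
minimal sketch of building such names when reading the file directly with
pyarrow is:

    import pyarrow.parquet as pq

    # Hypothetical multi-level column; pandas addresses it with the tuple
    # ('ref', 'HSC-G', 'coord_ra'), but pyarrow stores the stringified form.
    col = str(('ref', 'HSC-G', 'coord_ra'))  # "('ref', 'HSC-G', 'coord_ra')"

    pf = pq.ParquetFile(filename)
    df = pf.read(columns=[col]).to_pandas()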

Because multilevel-indexed dataframes are very useful for storing, e.g.,
multiple filters' worth of data in the same table, this case deserves a
wrapper that enables easier access; that is what this object provides.
For example,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G',
                  'column':['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

will return just the coordinate columns; the equivalent of calling
`df['meas']['HSC-G'][['coord_ra', 'coord_dec']]` on the total dataframe,
but without having to load the whole frame into memory---this reads just
those columns from disk.  You can also request a sub-table; e.g.,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

and this will be the equivalent of `df['meas']['HSC-G']` on the total dataframe.
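
The `columns` argument can also be a plain list, in which case the names must
be the stringified tuples exactly as pyarrow stores them; a minimal sketch
(with a hypothetical file name):

    parq = MultilevelParquetTable("objectTable.parq")  # hypothetical path
    # List form: names must match pyarrow's stringified tuples exactly;
    # the dictionary form above is usually more convenient.
    cols = ["('meas', 'HSC-G', 'coord_ra')",
            "('meas', 'HSC-G', 'coord_dec')"]
    df = parq.toDataFrame(columns=cols)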

Parameters
----------
filename : str, optional
    Path to Parquet file.
dataFrame : pandas.DataFrame, optional
    DataFrame to wrap.

Definition at line 157 of file parquetTable.py.

Constructor & Destructor Documentation

◆ __init__()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.__init__(self, *args, **kwargs)

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 207 of file parquetTable.py.

Member Function Documentation

◆ _colsFromDict()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._colsFromDict(self, colDict)
protected

Definition at line 317 of file parquetTable.py.

◆ _getColumnIndex()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._getColumnIndex(self)
protected

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 227 of file parquetTable.py.

◆ _getColumns()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._getColumns(self)
protected

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 234 of file parquetTable.py.

◆ _stringify()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._stringify(self, cols)
protected

Definition at line 332 of file parquetTable.py.

◆ columnLevelNames()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevelNames(self)

Definition at line 213 of file parquetTable.py.

◆ columnLevels()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevels(self)
Names of levels in column index

Definition at line 222 of file parquetTable.py.
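
For the tables described above this would be something like the following
(illustrative values; the actual level names depend on how the table was
written):

    parq = MultilevelParquetTable("objectTable.parq")  # hypothetical path
    parq.columnLevels  # e.g. ['dataset', 'filter', 'column']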

◆ toDataFrame()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.toDataFrame(self, columns=None, droplevels=True)
Get table (or specified columns) as a pandas DataFrame

To get specific columns in specified sub-levels:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G',
                  'column':['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

Or, to get an entire subtable, leave out one level name:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

Parameters
----------
columns : list or dict, optional
    Desired columns.  If `None`, then all columns will be
    returned.  If a list, then the names of the columns must
    be *exactly* as stored by pyarrow; that is, stringified tuples.
    If a dictionary, then the entries of the dictionary must
    correspond to the level names of the column multi-index
    (that is, the `columnLevels` attribute).  Not every level
    must be passed; if any level is left out, then all entries
    in that level will be implicitly included.
droplevels : bool, optional
    If True, drop levels of the column index that have just one entry.
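
As a sketch of the `droplevels` behavior (hypothetical file name; this assumes
that `droplevels=False` simply preserves all requested levels):

    parq = MultilevelParquetTable("objectTable.parq")  # hypothetical path

    # droplevels=True (default): the single-valued 'dataset' and 'filter'
    # levels are dropped, so columns are plain names like 'coord_ra'.
    df = parq.toDataFrame(columns={'dataset':'meas', 'filter':'HSC-G'})

    # droplevels=False: the full column multi-index is kept, so the same
    # column is addressed as ('meas', 'HSC-G', 'coord_ra').
    df3 = parq.toDataFrame(columns={'dataset':'meas', 'filter':'HSC-G'},
                           droplevels=False)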

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 244 of file parquetTable.py.

Member Data Documentation

◆ _columnLevelNames

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._columnLevelNames
protected

Definition at line 210 of file parquetTable.py.

◆ columnLevels

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevels

Definition at line 217 of file parquetTable.py.

◆ columns

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columns

Definition at line 216 of file parquetTable.py.


The documentation for this class was generated from the following file:
parquetTable.py