lsst.pipe.tasks gc750389ca5+5b520963fe
lsst.pipe.tasks.parquetTable.MultilevelParquetTable Class Reference
Inheritance diagram for lsst.pipe.tasks.parquetTable.MultilevelParquetTable:
lsst.pipe.tasks.parquetTable.ParquetTable

Public Member Functions

def __init__ (self, *args, **kwargs)
 
def columnLevelNames (self)
 
def columnLevels (self)
 
def toDataFrame (self, columns=None, droplevels=True)
 

Detailed Description

Wrapper to access dataframe with multi-level column index from Parquet

This subclass of `ParquetTable` is needed to handle multi-level column
indexes, because pyarrow offers no convenient way to request specific table
subsets by level from a Parquet file, as there is with a `pandas.DataFrame`.

Additionally, pyarrow stores multilevel index information in an unusual
way. Pandas stores each column key as a tuple, so that one can access a
single column from a pandas dataframe as `df[('ref', 'HSC-G', 'coord_ra')]`.
However, pyarrow saves these indices as "stringified" tuples, so that
to read this same column from a table written to Parquet, you would
have to do the following:

    import pyarrow.parquet as pq
    pf = pq.ParquetFile(filename)
    df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"]).to_pandas()

See also https://github.com/apache/arrow/issues/1771, where we've raised
this issue.
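
To make the stringified form concrete, here is a minimal sketch using only the
Python standard library (no pyarrow call is made). It assumes the stored name
matches Python's `str()` of the column tuple, as the example above suggests.
The helper names `stringify_column` and `parse_column` are hypothetical and are
not part of `lsst.pipe.tasks`.

```python
import ast


def stringify_column(column):
    """Return the string form of a multi-level column key (assumed format)."""
    return str(tuple(column))


def parse_column(name):
    """Recover the column tuple from its stringified form."""
    return ast.literal_eval(name)


name = stringify_column(('ref', 'HSC-G', 'coord_ra'))
print(name)                # the string one would pass in a `columns` list
print(parse_column(name))  # back to the tuple that pandas uses
```

`ast.literal_eval` is used rather than `eval` so that only literal tuples of
strings are accepted when parsing a column name.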

As multilevel-indexed dataframes can be very useful for storing data such as
multiple filters' worth of measurements in the same table, this case deserves
a wrapper to enable easier access; that is what this object is for.  For example,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G',
                  'column':['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

will return just the coordinate columns; this is the equivalent of calling
`df['meas']['HSC-G'][['coord_ra', 'coord_dec']]` on the total dataframe,
but without having to load the whole frame into memory, since it reads just
those columns from disk.  You can also request a sub-table; e.g.,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

and this will be the equivalent of `df['meas']['HSC-G']` on the total dataframe.

Parameters
----------
filename : str, optional
    Path to Parquet file.
dataFrame : pandas.DataFrame, optional

Definition at line 151 of file parquetTable.py.

Constructor & Destructor Documentation

◆ __init__()

def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.__init__ (self, *args, **kwargs)

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 201 of file parquetTable.py.

Member Function Documentation

◆ columnLevelNames()

def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevelNames (self)

Definition at line 207 of file parquetTable.py.

◆ columnLevels()

def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevels (self)
Names of levels in column index

Definition at line 216 of file parquetTable.py.

◆ toDataFrame()

def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.toDataFrame (self, columns=None, droplevels=True)
Get table (or specified columns) as a pandas DataFrame

To get specific columns in specified sub-levels:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G',
                  'column':['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

Or, to get an entire subtable, leave out one level name:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

Parameters
----------
columns : list or dict, optional
    Desired columns.  If `None`, then all columns will be
    returned.  If a list, then the names of the columns must
    be *exactly* as stored by pyarrow; that is, stringified tuples.
    If a dictionary, then the entries of the dictionary must
    correspond to the level names of the column multi-index
    (that is, the `columnLevels` attribute).  Not every level
    must be passed; if any level is left out, then all entries
    in that level will be implicitly included.
droplevels : bool, optional
    If True, drop levels of the column index that have just one entry.
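
The dictionary form of `columns` described above can be pictured as expanding a
per-level selection into the stringified tuples that pyarrow stores. The
following is a minimal sketch of that idea in plain Python, not the actual
implementation; `expand_column_dict`, `level_names`, and `level_entries` are
hypothetical names introduced only for illustration.

```python
from itertools import product


def expand_column_dict(column_dict, level_names, level_entries):
    """Expand a per-level selection into stringified column names.

    level_names   : ordered level names, e.g. ['dataset', 'filter', 'column']
    level_entries : dict mapping each level name to all entries in that level
    """
    choices = []
    for level in level_names:
        # A level left out of the dictionary implicitly selects every entry.
        selected = column_dict.get(level, level_entries[level])
        if isinstance(selected, str):
            selected = [selected]
        choices.append(list(selected))
    # One stringified tuple per combination of the selected entries.
    return [str(combo) for combo in product(*choices)]


level_names = ['dataset', 'filter', 'column']
level_entries = {'dataset': ['meas', 'ref'],
                 'filter': ['HSC-G', 'HSC-R'],
                 'column': ['coord_ra', 'coord_dec']}
cols = expand_column_dict({'dataset': 'meas', 'filter': 'HSC-G'},
                          level_names, level_entries)
print(cols)
```

A real table need not contain every combination of level entries, so a fuller
version would likely filter the result against the columns actually present in
the file.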

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 238 of file parquetTable.py.


The documentation for this class was generated from the following file: