lsst.pipe.tasks g5580bffe14+01995c1c9a
Public Member Functions

def __init__ (self, *args, **kwargs)
def columnLevelNames (self)
def columnLevels (self)
def toDataFrame (self, columns=None, droplevels=True)
Wrapper to access a DataFrame with a multi-level column index stored in Parquet.

This subclass of `ParquetTable` is necessary to handle multi-level column indexes, because there is no convenient way to request specific table subsets by level via Parquet through pyarrow, as there is with a `pandas.DataFrame`. Additionally, pyarrow stores multi-level index information in an awkward way. Pandas stores it as a tuple, so that one can access a single column from a pandas DataFrame as `df[('ref', 'HSC-G', 'coord_ra')]`. However, pyarrow saves these indices as "stringified" tuples, such that in order to read this same column from a table written to Parquet, you would have to do the following:

    pf = pyarrow.ParquetFile(filename)
    df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"])

See also https://github.com/apache/arrow/issues/1771, where we've raised this issue.

As multilevel-indexed dataframes can be very useful to store data like multiple filters' worth of data in the same table, this case deserves a wrapper to enable easier access; that's what this object is for. For example,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G',
                  'column': ['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

will return just the coordinate columns; the equivalent of calling `df['meas']['HSC-G'][['coord_ra', 'coord_dec']]` on the total dataframe, but without having to load the whole frame into memory---this reads just those columns from disk. You can also request a sub-table; e.g.,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas', 'filter': 'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

and this will be the equivalent of `df['meas']['HSC-G']` on the total dataframe.

Parameters
----------
filename : str, optional
    Path to Parquet file.
dataFrame : dataFrame, optional
Definition at line 148 of file parquetTable.py.
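For concreteness, the following is a minimal, self-contained sketch of the pyarrow behavior described above: a pandas DataFrame with a multi-level column index is written to Parquet, then a single column is read back using its stringified tuple name. The file name and level names here are illustrative, and the exact flattening behavior may vary with the pyarrow version.

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small DataFrame with a three-level column index, mirroring
    # the (dataset, filter, column) structure from the class docstring.
    columns = pd.MultiIndex.from_tuples(
        [("meas", "HSC-G", "coord_ra"), ("meas", "HSC-G", "coord_dec")],
        names=["dataset", "filter", "column"],
    )
    df = pd.DataFrame(np.random.rand(3, 2), columns=columns)

    # Write it to Parquet via pyarrow; the multi-level column names are
    # flattened to stringified tuples in the Parquet schema.
    pq.write_table(pa.Table.from_pandas(df), "multilevel.parq")

    # Reading a single column back therefore requires the stringified
    # tuple name -- the inconvenience MultilevelParquetTable wraps away.
    pf = pq.ParquetFile("multilevel.parq")
    table = pf.read(columns=["('meas', 'HSC-G', 'coord_ra')"])
    print(table.to_pandas())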
def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.__init__ (self, *args, **kwargs)
Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.
Definition at line 198 of file parquetTable.py.
def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevelNames (self)
Definition at line 204 of file parquetTable.py.
def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevels (self)
Names of levels in column index
Definition at line 213 of file parquetTable.py.
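As a hedged illustration of these two accessors (assuming the file from the earlier sketch, and that `columnLevels` and `columnLevelNames` are accessed as properties, as they are in the LSST source), they let you inspect the column index before deciding what to load:

    from lsst.pipe.tasks.parquetTable import MultilevelParquetTable

    parq = MultilevelParquetTable("multilevel.parq")

    # Names of the levels in the column index.
    print(parq.columnLevels)      # e.g. ['dataset', 'filter', 'column']

    # Entries available within each level.
    print(parq.columnLevelNames)  # e.g. {'dataset': ['meas'],
                                  #       'filter': ['HSC-G'], ...}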
def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.toDataFrame (self, columns=None, droplevels=True)
Get the table (or specified columns) as a pandas DataFrame.

To get specific columns in specified sub-levels:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G',
                  'column': ['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

Or, to get an entire subtable, leave out one level name:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas', 'filter': 'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

Parameters
----------
columns : list or dict, optional
    Desired columns. If `None`, then all columns will be returned. If a
    list, then the names of the columns must be *exactly* as stored by
    pyarrow; that is, stringified tuples. If a dictionary, then the
    entries of the dictionary must correspond to the level names of the
    column multi-index (that is, the `columnLevels` attribute). Not every
    level must be passed; if any level is left out, then all entries in
    that level will be implicitly included.
droplevels : bool
    If True, drop levels of the column index that have just one entry.
Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.
Definition at line 235 of file parquetTable.py.
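To round out the docstring examples, here is a short hypothetical usage sketch showing the effect of `droplevels`; the file name continues the earlier sketch, and the resulting column layouts follow from the docstring's description rather than from running this exact code.

    from lsst.pipe.tasks.parquetTable import MultilevelParquetTable

    parq = MultilevelParquetTable("multilevel.parq")

    # With droplevels=True (the default), the 'dataset' and 'filter'
    # levels each have a single selected entry and are dropped, leaving
    # a plain single-level column index: ['coord_ra', 'coord_dec'].
    df = parq.toDataFrame(
        columns={"dataset": "meas",
                 "filter": "HSC-G",
                 "column": ["coord_ra", "coord_dec"]}
    )

    # With droplevels=False, the full three-level column index is kept,
    # matching what df['meas']['HSC-G'] would preserve levels for.
    df_full = parq.toDataFrame(
        columns={"dataset": "meas", "filter": "HSC-G"},
        droplevels=False,
    )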