Coverage for python/lsst/daf/butler/registries/sqlPreFlight.py: 95%

# This file is part of daf_butler.
#
# Developed for the LSST Data Management System.
# This product includes software developed by the LSST Project
# (http://www.lsst.org).
# See the COPYRIGHT file at the top-level directory of this distribution
# for details of code ownership.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
"""Recursively scan units and their optional dependencies, return their names"""
"""Filter out DataUnitJoins that summarize other DataUnitJoins.
Parameters ---------- dataUnitJoins : iterable of `DataUnitJoin`
Yields ------ dataUnitJoin : `DataUnitJoin` DataUnitJoin which do not summarize any of the DataUnitJoins in the input set. """ # If it summarizes some other joins and all those joins are in the # set of joins then we do not need it.
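# A minimal sketch of this filter; the ``summarizes`` attribute and
# the ``name`` attribute are assumptions:
def _filterSummarizes(dataUnitJoins):
    dataUnitJoins = list(dataUnitJoins)
    allNames = set(join.name for join in dataUnitJoins)
    for join in dataUnitJoins:
        summarizes = set(join.summarizes or ())
        # skip joins whose summarized joins are all present in the set
        if summarizes and summarizes.issubset(allNames):
            continue
        yield join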
"""Return topologically sorted DataUnits.
Ordering is based on dependencies, units with no dependencies on other units are returned first.
Parameters ---------- dataUnits : iterable of `DataUnit` """
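# A minimal sketch of the dependency-based ordering; the
# ``dependencies`` attribute is an assumption:
def _unitsTopologicalSort(dataUnits):
    units = list(dataUnits)
    done = set()
    while units:
        # emit every unit whose dependencies are all sorted already
        ready = [u for u in units
                 if {d.name for d in u.dependencies} <= done]
        if not ready:
            raise ValueError("dependency cycle among DataUnits")
        for unit in ready:
            done.add(unit.name)
            yield unit
        units = [u for u in units if u.name not in done]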
"""Class implementing part of preflight solver which extracts units data from registry.
This is an implementation detail only to be used by SqlRegistry class, not supposed to be used anywhere else.
Parameters ---------- schema : `Schema` Schema instance dataUnits : `DataUnitRegistry` Description of DataUnit dimensions and joins. connection : `sqlalchmey.Connection` Connection to use for database access. """
"""Add new table for join clause.
Assumption here is that this unit table has a foreign key to all other tables and names of columns are the same in both tables, so we just get primary key columns from other tables and join on them.
Parameters ---------- fromClause : `sqlalchemy.FromClause` May be `None`, in that case ``otherDataUnits`` is expected to be empty and is ignored. dataUnit : `DataUnit` DataUnit to join with ``fromClause``. otherDataUnits : iterable of `DataUnit` DataUnits whose tables have PKs for ``dataUnit`` table's FK. They all have to be in ``fromClause`` already.
Returns ------- fromClause : `sqlalchemy.FromClause` SQLAlchemy FROM clause extended with new join. """ # starting point, first table in JOIN else: for name in otherUnit.primaryKey} otherUnit.name, list(primaryKeyColumns.keys())) else: # Completely unrelated tables, e.g. joining SkyMap and Camera. # We need a cross join here but SQLAlchemy does not have specific # method for that. Using join() without `onclause` will try to # join on FK and will raise an exception for unrelated tables, # so we have to use `onclause` which is always true.
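# A minimal sketch of the join logic just described; attribute names
# such as ``primaryKey`` and the ``self._schema.tables`` lookup are
# assumptions:
import sqlalchemy

def _joinOnForeignKey(self, fromClause, dataUnit, otherDataUnits):
    unitTable = self._schema.tables[dataUnit.name]
    if fromClause is None:
        # starting point, first table in JOIN
        return unitTable
    joinOn = []
    for otherUnit in otherDataUnits:
        otherTable = self._schema.tables[otherUnit.name]
        # PK column names are assumed identical in both tables
        joinOn += [unitTable.c[name] == otherTable.c[name]
                   for name in otherUnit.primaryKey]
    if joinOn:
        return fromClause.join(unitTable, sqlalchemy.and_(*joinOn))
    # cross join of unrelated tables: use an always-true onclause
    return fromClause.join(unitTable, sqlalchemy.literal(True))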
"""Evaluate a filter expression and lists of `DatasetTypes <DatasetType>` and return a set of data unit values.
Returned set consists of combinations of units participating in data transformation from ``neededDatasetTypes`` to ``futureDatasetTypes``, restricted by existing data and filter expression.
Parameters ---------- originInfo : `DatasetOriginInfo` Object which provides names of the input/output collections. expression : `str` An expression that limits the `DataUnits <DataUnit>` and (indirectly) the Datasets returned. neededDatasetTypes : `list` of `DatasetType` The `list` of `DatasetTypes <DatasetType>` whose DataUnits will be included in the returned column set. Output is limited to the the Datasets of these DatasetTypes which already exist in the registry. futureDatasetTypes : `list` of `DatasetType` The `list` of `DatasetTypes <DatasetType>` whose DataUnits will be included in the returned column set. It is expected that Datasets for these DatasetTypes do not exist in the registry, but presently this is not checked.
Yields ------ row : `PreFlightUnitsRow` Single row is a unique combination of units in a transform. """
# Brief overview of the code below:
#  - extract all DataUnits used by all input/output dataset types
#  - build a complex SQL query to run against the registry database:
#    - first do a (natural) join of all tables for all DataUnits
#      involved, based on their foreign keys
#    - then add Join tables to the mix; only use Join tables which
#      have their lhs/rhs links in the above DataUnits set, and
#      ignore Joins which summarize other Joins
#    - next join with Dataset for each input dataset type; this
#      limits the result to existing input datasets only
#    - also do an outer join with Dataset for each output dataset
#      type to see which output datasets are already there
#    - append the user filter expression
#    - the query returns all DataUnit values, regions for
#      region-based joins, and dataset IDs for all existing datasets
#  - run this query
#  - filter out records whose regions do not overlap
#  - return the result as an iterator of records containing DataUnit
#    values
# Collect unit names in both input and output dataset types
# Build the select column list
# take link column names, usually there is one
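# A minimal sketch of collecting the select columns while remembering
# each link's column index; the ``unit.link`` attribute, the
# ``allDataUnits`` list, and the schema table lookup are assumptions:
selectColumns = []
unitLinkColumns = {}
for unit in allDataUnits:
    for link in unit.link:
        # remember the position of this link in the result row
        unitLinkColumns[link] = len(selectColumns)
        selectColumns.append(self._schema.tables[unit.name].c[link])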
# Extend the units set with the "optional" superset from the schema,
# so that joins work correctly. This may bring more tables into the
# query than are really needed; a potential optimization.
# All DataUnit instances in a subset that we need
# joins for all unit tables
# joins between skymap and camera units: only keep DataUnitJoins for
# which both sides are present, i.e.
#     dataUnitJoin.lhs.issubset(allUnitNames) and
#     dataUnitJoin.rhs.issubset(allUnitNames)
# only use most specific joins
# Some `DataUnitJoin`s have an associated region (e.g. they are
# spatial); in that case they shouldn't be joined separately in the
# region lookup.
# TODO: we do not know yet how to handle MultiCameraExposureJoin;
# skip it for now.
# Look at each side of the DataUnitJoin and join it with the
# corresponding DataUnit tables, including making all necessary
# joins for special multi-DataUnit region table(s).
# For DataUnits like Patch we need to extend the list with their
# required units which are also spatial.
# If one of the joins is with Visit/Sensor then also bring the
# VisitSensorRegion table in and join it with the units.
# TODO: need a better way to recognize this special case
_LOG.debug("region table already joined with units: %s", regionHolder.name)
# add to the list of tables that we need to join with
# We also have to include regions from each side of the join
# in the result set so that we can filter out non-overlapping
# regions.
# join with input datasets to restrict the result to existing inputs;
# output dataset types are tagged the same way with isOutput=True:
#     [(dsType, True) for dsType in futureDatasetTypes]
_LOG.debug("processing %s dataset type: %s",
           "output" if isOutput else "input", dsType.name)
# Build a sub-query.
# If there is nothing to join (e.g. we know that the output
# collection is empty) then just pass None as the column
# index for this dataset type to the code below.
# Join the sub-query with all units on their link names;
# an OUTER JOIN is used for output datasets (they don't usually exist)
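# A minimal sketch of that join, reusing the ``unitLinkColumns``
# bookkeeping from the earlier sketch (``subquery``, ``links``, and
# ``isOutput`` are assumptions for a single dataset type):
import sqlalchemy

joinOn = [subquery.c[link] == selectColumns[unitLinkColumns[link]]
          for link in links]
fromJoin = fromJoin.join(subquery, sqlalchemy.and_(*joinOn),
                         isouter=isOutput)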
# remember dataset_id column index for this dataset
# build the full query
# TODO: potentially transform the query from a user-friendly expression
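# A minimal sketch of assembling and running the full query
# (SQLAlchemy 1.x style); passing the user expression through
# sqlalchemy.text() unparsed is an assumption:
import sqlalchemy

query = sqlalchemy.select(selectColumns).select_from(fromJoin)
if expression:
    query = query.where(sqlalchemy.text(expression))
rows = self._connection.execute(query).fetchall()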
# execute and return result iterator
"""Build a sub-query for a dataset type to be joined with "big join".
If there is only one collection then there is a guarantee that DataIds are all unique (by DataId I mean combination of all link values relevant for this dataset), in that case subquery can be written as:
    SELECT Dataset.dataset_id AS dataset_id, Dataset.link1 AS link1 ...
        FROM Dataset JOIN DatasetCollection
            ON Dataset.dataset_id = DatasetCollection.dataset_id
        WHERE Dataset.dataset_type_name = :dsType_name
            AND DatasetCollection.collection = :collection_name
We only have a single collection for output DatasetTypes, so for them
the sub-queries always look like the above.
If there are multiple collections, then there can be multiple matching
Datasets for the same DataId. In that case we need only one Dataset
record, which comes from the earliest collection (in the user-provided
order). Here things become complicated; we have to:

- replace collection names with their order in the input list
- select all combinations of rows from Dataset and DatasetCollection
  which match the collection names and the dataset type name
- from those, select only the rows with the lowest collection position
  if there are multiple collections for the same DataId
Replacing collection names with positions is easy:
    SELECT dataset_id,
        CASE collection
            WHEN 'collection1' THEN 0
            WHEN 'collection2' THEN 1
            ...
        END AS collorder
        FROM DatasetCollection
The combined query will look like this (CASE ... END is as above):
    SELECT Dataset.dataset_id AS dataset_id,
        CASE DatasetCollection.collection ... END AS collorder,
        Dataset.DataId
        FROM Dataset JOIN DatasetCollection
            ON Dataset.dataset_id = DatasetCollection.dataset_id
        WHERE Dataset.dataset_type_name = <dsType.name>
            AND DatasetCollection.collection IN (<collections>)
(here ``Dataset.DataId`` means ``Dataset.link1, Dataset.link2, etc.``)
Filtering is complicated; it is simpler to use a Common Table
Expression (WITH clause), but not all databases support CTEs, so we
will have to make do with repeating sub-queries. Use GROUP BY for the
DataId and MIN(collorder) to find the ``collorder`` for a given
DataId, then join it with the previous combined selection:
    SELECT DS.dataset_id AS dataset_id, DS.link1 AS link1 ...
        FROM (SELECT Dataset.dataset_id AS dataset_id,
                CASE ... END AS collorder,
                Dataset.DataId
            FROM Dataset JOIN DatasetCollection
                ON Dataset.dataset_id = DatasetCollection.dataset_id
            WHERE Dataset.dataset_type_name = <dsType.name>
                AND DatasetCollection.collection IN (<collections>)) DS
        INNER JOIN
            (SELECT MIN(CASE ... END) AS collorder, Dataset.DataId
            FROM Dataset JOIN DatasetCollection
                ON Dataset.dataset_id = DatasetCollection.dataset_id
            WHERE Dataset.dataset_type_name = <dsType.name>
                AND DatasetCollection.collection IN (<collections>)
            GROUP BY Dataset.DataId) DSG
            ON DS.collorder = DSG.collorder AND DS.DataId = DSG.DataId
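# A rough SQLAlchemy (1.x style) rendition of the combined query
# above; the table objects ``dsTable``/``dsCollTable`` and the
# ``links`` list of link column names are assumptions:
import sqlalchemy

def _datasetSubquery(dsTable, dsCollTable, dsType, dsCollections, links):
    collorder = sqlalchemy.case(
        [(dsCollTable.c.collection == coll, pos)
         for pos, coll in enumerate(dsCollections)])
    where = sqlalchemy.and_(
        dsTable.c.dataset_type_name == dsType.name,
        dsCollTable.c.collection.in_(dsCollections))
    dataId = [dsTable.c[link] for link in links]
    joined = dsTable.join(
        dsCollTable, dsTable.c.dataset_id == dsCollTable.c.dataset_id)
    # combined sub-query: dataset_id, collorder, and DataId columns
    combined = sqlalchemy.select(
        [dsTable.c.dataset_id, collorder.label("collorder")] + dataId
    ).select_from(joined).where(where).alias("DS")
    # GROUP BY sub-query: minimum collorder for each DataId
    grouped = sqlalchemy.select(
        [sqlalchemy.func.min(collorder).label("collorder")] + dataId
    ).select_from(joined).where(where).group_by(*dataId).alias("DSG")
    onClause = sqlalchemy.and_(
        combined.c.collorder == grouped.c.collorder,
        *[combined.c[link] == grouped.c[link] for link in links])
    return sqlalchemy.select(
        [combined.c.dataset_id] + [combined.c[link] for link in links]
    ).select_from(combined.join(grouped, onClause)).alias(dsType.name)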
Parameters
----------
dsType : `DatasetType`
originInfo : `DatasetOriginInfo`
    Object which provides names of the input/output collections.
isOutput : `bool`
    `True` for output datasets.

Returns
-------
subquery : `sqlalchemy.FromClause` or `None`
"""
# helper method
"""Return a list of columns for the given column names."""
# No output collection means that no output datasets exist; we do
# not need to do any joins here.
else:
    # full set of link names for this DatasetType
    if len(dsCollections) == 1:
        # single collection, easy-peasy
        # (the ``where`` list is a reconstruction of elided code)
        where.append(dsCollTable.c.collection == dsCollections[0])
    else:
        # multiple collections
        where.append(dsCollTable.c.collection.in_(dsCollections))
# CASE clause
collorder = sqlalchemy.case([
    (dsCollTable.c.collection == coll, pos)
    for pos, coll in enumerate(dsCollections)
])
# first GROUP BY sub-query, find minimum `collorder` for each DataId
# next combined sub-query
# now join these two
joinOn = [groupSubq.c[colName] == combined.c[colName]
          for colName in links]
# need a unique alias name for it, otherwise we'll see name conflicts
"""Convert query result rows into `PreFlightUnitsRow` instances.
Parameters ---------- rowIter : iterable Iterator for rows returned by the query on registry unitLinkColumns : `dict` Dictionary of (unit link name, column index), column contains DataUnit value regionColumns : `dict` Dictionary of (DataUnit name, column index), column contains encoded region data dsIdColumns : `dict` Dictionary of (DatasetType, column index), column contains dataset Id, or None if dataset does not exist
Yields ------ row : `PreFlightUnitsRow` """
# Filter out result rows that have non-overlapping regions.
# The result set generated by the query in the selectDataUnits()
# method can include a set of regions in each row (encoded as bytes).
# Due to pixel-based matching some regions may not overlap; this
# generator method filters rows that have disjoint regions. If a
# result row contains more than two regions (this should not happen
# with our current schema) then the row is filtered if any two
# regions are disjoint.
if reg1.relate(reg2) == DISJOINT:
    disjoint = True
    break
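# A minimal sketch of the region filter described above, assuming the
# regions are lsst.sphgeom objects decoded from bytes (the helper
# function name is hypothetical):
import itertools
from lsst.sphgeom import Region, DISJOINT

def _filterDisjointRows(rows, regionColumns):
    for row in rows:
        regions = [Region.decode(row[col])
                   for col in regionColumns.values()]
        disjoint = False
        # check every pair of regions in the row for overlap
        for reg1, reg2 in itertools.combinations(regions, 2):
            if reg1.relate(reg2) == DISJOINT:
                disjoint = True
                break
        if disjoint:
            continue
        yield row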
# for each dataset type, get the dataset id (None if the dataset does
# not exist) for the returned DataRef