1# This file is part of daf_butler. 

2# 

3# Developed for the LSST Data Management System. 

4# This product includes software developed by the LSST Project 

5# (http://www.lsst.org). 

6# See the COPYRIGHT file at the top-level directory of this distribution 

7# for details of code ownership. 

8# 

9# This program is free software: you can redistribute it and/or modify 

10# it under the terms of the GNU General Public License as published by 

11# the Free Software Foundation, either version 3 of the License, or 

12# (at your option) any later version. 

13# 

14# This program is distributed in the hope that it will be useful, 

15# but WITHOUT ANY WARRANTY; without even the implied warranty of 

16# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 

17# GNU General Public License for more details. 

18# 

19# You should have received a copy of the GNU General Public License 

20# along with this program. If not, see <http://www.gnu.org/licenses/>. 

21 

22""" 

23Butler top level classes. 

24""" 

25from __future__ import annotations 

26 

27__all__ = ("Butler", "ButlerValidationError") 

28 

29import os 

30from collections import defaultdict 

31import contextlib 

32import logging 

33from typing import ( 

34 Any, 

35 ClassVar, 

36 ContextManager, 

37 Dict, 

38 Iterable, 

39 List, 

40 Mapping, 

41 MutableMapping, 

42 Optional, 

43 Tuple, 

44 Union, 

45) 

46 

47try: 

48 import boto3 

49except ImportError: 

50 boto3 = None 

51 

52from lsst.utils import doImport 

53from .core import ( 

54 ButlerURI, 

55 CompositesMap, 

56 Config, 

57 ConfigSubset, 

58 DataCoordinate, 

59 DataId, 

60 DatasetRef, 

61 DatasetType, 

62 Datastore, 

63 FileDataset, 

64 Quantum, 

65 RepoExport, 

66 StorageClassFactory, 

67 ValidationError, 

68) 

69from .core.repoRelocation import BUTLER_ROOT_TAG 

70from .core.safeFileIo import safeMakeDir 

71from .core.utils import transactional, getClassOf 

72from ._deferredDatasetHandle import DeferredDatasetHandle 

73from ._butlerConfig import ButlerConfig 

74from .registry import Registry, RegistryConfig, CollectionType 

75from .registry.wildcards import CollectionSearch 

76 

77log = logging.getLogger(__name__) 

78 

79 

80class ButlerValidationError(ValidationError): 

81 """There is a problem with the Butler configuration.""" 

82 pass 

83 

84 

85class Butler: 

86 """Main entry point for the data access system. 

87 

88 Parameters 

89 ---------- 

90 config : `ButlerConfig`, `Config` or `str`, optional 

91 Configuration. Anything acceptable to the 

92 `ButlerConfig` constructor. If a directory path 

93 is given the configuration will be read from a ``butler.yaml`` file in 

94 that location. If `None` is given default values will be used. 

95 butler : `Butler`, optional 

96 If provided, construct a new Butler that uses the same registry and 

97 datastore as the given one, but with the given collection and run. 

98 Incompatible with the ``config``, ``searchPaths``, and ``writeable`` 

99 arguments. 

100 collections : `Any`, optional 

101 An expression specifying the collections to be searched (in order) when 

102 reading datasets, and optionally dataset type restrictions on them. 

103 This may be: 

104 - a `str` collection name; 

105 - a tuple of (collection name, *dataset type restriction*); 

106 - an iterable of either of the above; 

107 - a mapping from `str` to *dataset type restriction*. 

108 

109 See :ref:`daf_butler_collection_expressions` for more information, 

110 including the definition of a *dataset type restriction*. All 

111 collections must either already exist or be specified to be created 

112 by other arguments. 

113 run : `str`, optional 

114 Name of the run datasets should be output to. If the run 

115 does not exist, it will be created. If ``collections`` is `None`, it 

116 will be set to ``[run]``. If this is not set (and ``writeable`` is 

117 not set either), a read-only butler will be created. 

118 tags : `Iterable` [ `str` ], optional 

119 A list of `~CollectionType.TAGGED` collections that datasets should be 

120 associated with in `put` or `ingest` and disassociated from in 

121 `pruneDatasets`. If any of these collections does not exist, it will 

122 be created. 

123 chains : `Mapping` [ `str`, `Iterable` [ `str` ] ], optional 

124 A mapping from the names of new `~CollectionType.CHAINED` collections 

125 to an expression identifying their child collections (which takes the 

126 same form as the ``collections`` argument). Chains may be nested only 

127 if children precede their parents in this mapping. 

128 searchPaths : `list` of `str`, optional 

129 Directory paths to search when calculating the full Butler 

130 configuration. Not used if the supplied config is already a 

131 `ButlerConfig`. 

132 writeable : `bool`, optional 

133 Explicitly sets whether the butler supports write operations. If not 

134 provided, a read-write butler is created if any of ``run``, ``tags``, 

135 or ``chains`` is non-empty. 

136 

137 Examples 

138 -------- 

139 While there are many ways to control exactly how a `Butler` interacts with 

140 the collections in its `Registry`, the most common cases are still simple. 

141 

142 For a read-only `Butler` that searches one collection, do:: 

143 

144 butler = Butler("/path/to/repo", collections=["u/alice/DM-50000"]) 

145 

146 For a read-write `Butler` that writes to and reads from a 

147 `~CollectionType.RUN` collection:: 

148 

149 butler = Butler("/path/to/repo", run="u/alice/DM-50000/a") 

150 

151 The `Butler` passed to a ``PipelineTask`` is often much more complex, 

152 because we want to write to one `~CollectionType.RUN` collection but read 

153 from several others (as well), while defining a new 

154 `~CollectionType.CHAINED` collection that combines them all:: 

155 

156 butler = Butler("/path/to/repo", run="u/alice/DM-50000/a", 

157 collections=["u/alice/DM-50000"], 

158 chains={ 

159 "u/alice/DM-50000": ["u/alice/DM-50000/a", 

160 "u/bob/DM-49998", 

161 "raw/hsc"] 

162 }) 

163 

164 This butler will `put` new datasets to the run ``u/alice/DM-50000/a``, but 

165 they'll also be available from the chained collection ``u/alice/DM-50000``. 

166 Datasets will be read first from that run (since it appears first in the 

167 chain), and then from ``u/bob/DM-49998`` and finally ``raw/hsc``. 

168 If ``u/alice/DM-50000`` had already been defined, the ``chains`` argument 

169 would be unnecessary. We could also construct a butler that performs 

170 exactly the same `put` and `get` operations without actually creating a 

171 chained collection, just by passing multiple items in ``collections``:: 

172 

173 butler = Butler("/path/to/repo", run="u/alice/DM-50000/a", 

174 collections=["u/alice/DM-50000/a", 

175 "u/bob/DM-49998", 

176 "raw/hsc"]) 

177 

178 Finally, one can always create a `Butler` with no collections:: 

179 

180 butler = Butler("/path/to/repo", writeable=True) 

181 

182 This can be extremely useful when you just want to use ``butler.registry``, 

183 e.g. for inserting dimension data or managing collections, or when the 

184 collections you want to use with the butler are not consistent. 

185 Passing ``writeable`` explicitly here is only necessary if you want to be 

186 able to make changes to the repo; usually the value for ``writeable`` 

187 can be guessed from the collection arguments provided, but it defaults to 

188 `False` when there are no collection arguments. 

189 """ 

190 def __init__(self, config: Union[Config, str, None] = None, *, 

191 butler: Optional[Butler] = None, 

192 collections: Any = None, 

193 run: Optional[str] = None, 

194 tags: Iterable[str] = (), 

195 chains: Optional[Mapping[str, Any]] = None, 

196 searchPaths: Optional[List[str]] = None, 

197 writeable: Optional[bool] = None): 

198 # Transform any single-pass iterator into an actual sequence so we 

199 # can see if it is empty. 

200 self.tags = tuple(tags) 

201 # Load registry, datastore, etc. from config or existing butler. 

202 if butler is not None: 

203 if config is not None or searchPaths is not None or writeable is not None: 

204 raise TypeError("Cannot pass 'config', 'searchPaths', or 'writeable' " 

205 "arguments with 'butler' argument.") 

206 self.registry = butler.registry 

207 self.datastore = butler.datastore 

208 self.storageClasses = butler.storageClasses 

209 self._composites = butler._composites 

210 self._config = butler._config 

211 else: 

212 self._config = ButlerConfig(config, searchPaths=searchPaths) 

213 if "root" in self._config: 

214 butlerRoot = self._config["root"] 

215 else: 

216 butlerRoot = self._config.configDir 

217 if writeable is None: 

218 writeable = run is not None or chains is not None or bool(self.tags) 

219 self.registry = Registry.fromConfig(self._config, butlerRoot=butlerRoot, writeable=writeable) 

220 self.datastore = Datastore.fromConfig(self._config, self.registry, butlerRoot=butlerRoot) 

221 self.storageClasses = StorageClassFactory() 

222 self.storageClasses.addFromConfig(self._config) 

223 self._composites = CompositesMap(self._config, universe=self.registry.dimensions) 

224 # Check the many collection arguments for consistency and create any 

225 # needed collections that don't exist. 

226 if collections is None: 

227 if run is not None: 

228 collections = (run,) 

229 else: 

230 collections = () 

231 self.collections = CollectionSearch.fromExpression(collections) 

232 if chains is None: 

233 chains = {} 

234 self.run = run 

235 if "run" in self._config or "collection" in self._config: 

236 raise ValueError("Passing a run or collection via configuration is no longer supported.") 

237 if self.run is not None: 

238 self.registry.registerCollection(self.run, type=CollectionType.RUN) 

239 for tag in self.tags: 

240 self.registry.registerCollection(tag, type=CollectionType.TAGGED) 

241 for parent, children in chains.items(): 

242 self.registry.registerCollection(parent, type=CollectionType.CHAINED) 

243 self.registry.setCollectionChain(parent, children) 

244 

245 GENERATION: ClassVar[int] = 3 

246 """This is a Generation 3 Butler. 

247 

248 This attribute may be removed in the future, once the Generation 2 Butler 

249 interface has been fully retired; it should only be used in transitional 

250 code. 

251 """ 

252 

253 @staticmethod 

254 def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: bool = False, 

255 createRegistry: bool = True, searchPaths: Optional[List[str]] = None, 

256 forceConfigRoot: bool = True, outfile: Optional[str] = None, 

257 overwrite: bool = False) -> Config: 

258 """Create an empty data repository by adding a butler.yaml config 

259 to a repository root directory. 

260 

261 Parameters 

262 ---------- 

263 root : `str` 

264 Filesystem path to the root of the new repository. Will be created 

265 if it does not exist. 

266 config : `Config` or `str`, optional 

267 Configuration to write to the repository, after setting any 

268 root-dependent Registry or Datastore config options. Can not 

269 be a `ButlerConfig` or a `ConfigSubset`. If `None`, default 

270 configuration will be used. Root-dependent config options 

271 specified in this config are overwritten if ``forceConfigRoot`` 

272 is `True`. 

273 standalone : `bool` 

274 If `True`, write all expanded defaults, not just customized or 

275 repository-specific settings. 

276 This (mostly) decouples the repository from the default 

277 configuration, insulating it from changes to the defaults (which 

278 may be good or bad, depending on the nature of the changes). 

279 Future *additions* to the defaults will still be picked up when 

280 initializing a `Butler` against repos created with ``standalone=True``. 

281 createRegistry : `bool`, optional 

282 If `True` create a new Registry. 

283 searchPaths : `list` of `str`, optional 

284 Directory paths to search when calculating the full butler 

285 configuration. 

286 forceConfigRoot : `bool`, optional 

287 If `False`, any values present in the supplied ``config`` that 

288 would normally be reset are not overridden and will appear 

289 directly in the output config. This allows non-standard overrides 

290 of the root directory for a datastore or registry to be given. 

291 If this parameter is `True` the values for ``root`` will be 

292 forced into the resulting config if appropriate. 

293 outfile : `str`, optional 

294 If not-`None`, the output configuration will be written to this 

295 location rather than into the repository itself. Can be a URI 

296 string. Can refer to a directory that will be used to write 

297 ``butler.yaml``. 

298 overwrite : `bool`, optional 

299 Create a new configuration file even if one already exists 

300 in the specified output location. Default is to raise 

301 an exception. 

302 

303 Returns 

304 ------- 

305 config : `Config` 

306 The updated `Config` instance written to the repo. 

307 

308 Raises 

309 ------ 

310 ValueError 

311 Raised if a ButlerConfig or ConfigSubset is passed instead of a 

312 regular Config (as these subclasses would make it impossible to 

313 support ``standalone=False``). 

314 FileExistsError 

315 Raised if the output config file already exists. 

316 os.error 

317 Raised if the directory does not exist, exists but is not a 

318 directory, or cannot be created. 

319 

320 Notes 

321 ----- 

322 Note that when ``standalone=False`` (the default), the configuration 

323 search path (see `ConfigSubset.defaultSearchPaths`) that was used to 

324 construct the repository should also be used to construct any Butlers 

325 to avoid configuration inconsistencies. 

326 """ 

327 if isinstance(config, (ButlerConfig, ConfigSubset)): 

328 raise ValueError("makeRepo must be passed a regular Config without defaults applied.") 

329 

330 # for "file" schemes we are assuming POSIX semantics for paths, for 

331 # schemeless URIs we are assuming os.path semantics. 

332 uri = ButlerURI(root) 

333 if uri.scheme == "file" or not uri.scheme: 

334 if not os.path.isdir(uri.ospath): 

335 safeMakeDir(uri.ospath) 

336 elif uri.scheme == "s3": 

337 s3 = boto3.resource("s3") 

338 # This assumes the bucket already exists; otherwise another check is needed. 

339 bucket = s3.Bucket(uri.netloc) 

340 bucket.put_object(Bucket=uri.netloc, Key=uri.relativeToPathRoot) 

341 else: 

342 raise ValueError(f"Unrecognized scheme: {uri.scheme}") 

343 config = Config(config) 

344 

345 # If we are creating a new repo from scratch with relative roots, 

346 # do not propagate an explicit root from the config file 

347 if "root" in config: 

348 del config["root"] 

349 

350 full = ButlerConfig(config, searchPaths=searchPaths) # this applies defaults 

351 datastoreClass = doImport(full["datastore", "cls"]) 

352 datastoreClass.setConfigRoot(BUTLER_ROOT_TAG, config, full, overwrite=forceConfigRoot) 

353 

354 # if key exists in given config, parse it, otherwise parse the defaults 

355 # in the expanded config 

356 if config.get(("registry", "db")): 

357 registryConfig = RegistryConfig(config) 

358 else: 

359 registryConfig = RegistryConfig(full) 

360 defaultDatabaseUri = registryConfig.makeDefaultDatabaseUri(BUTLER_ROOT_TAG) 

361 if defaultDatabaseUri is not None: 

362 Config.updateParameters(RegistryConfig, config, full, 

363 toUpdate={"db": defaultDatabaseUri}, 

364 overwrite=forceConfigRoot) 

365 else: 

366 Config.updateParameters(RegistryConfig, config, full, toCopy=("db",), 

367 overwrite=forceConfigRoot) 

368 

369 if standalone: 

370 config.merge(full) 

371 if outfile is not None: 

372 # When writing to a separate location we must include 

373 # the root of the butler repo in the config else it won't know 

374 # where to look. 

375 config["root"] = uri.geturl() 

376 configURI = outfile 

377 else: 

378 configURI = uri 

379 config.dumpToUri(configURI, overwrite=overwrite) 

380 

381 # Create Registry and populate tables 

382 Registry.fromConfig(config, create=createRegistry, butlerRoot=root) 

383 return config 

384 
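# Illustrative sketch (not part of the original source): creating a new
# repository with ``makeRepo`` and then constructing a writeable `Butler`
# against it. The filesystem path and run name below are hypothetical.
#
#     from lsst.daf.butler import Butler
#
#     Butler.makeRepo("/tmp/demo_repo")   # writes butler.yaml and creates the registry
#     butler = Butler("/tmp/demo_repo", run="u/demo/run")   # read-write butler for that run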

385 @classmethod 

386 def _unpickle(cls, config: ButlerConfig, collections: Optional[CollectionSearch], run: Optional[str], 

387 tags: Tuple[str, ...], writeable: bool) -> Butler: 

388 """Callable used to unpickle a Butler. 

389 

390 We prefer not to use ``Butler.__init__`` directly so we can force some 

391 of its many arguments to be keyword-only (note that ``__reduce__`` 

392 can only invoke callables with positional arguments). 

393 

394 Parameters 

395 ---------- 

396 config : `ButlerConfig` 

397 Butler configuration, already coerced into a true `ButlerConfig` 

398 instance (and hence after any search paths for overrides have been 

399 utilized). 

400 collections : `CollectionSearch` 

401 Names of collections to read from. 

402 run : `str`, optional 

403 Name of `~CollectionType.RUN` collection to write to. 

404 tags : `tuple` [`str`] 

405 Names of `~CollectionType.TAGGED` collections to associate with. 

406 writeable : `bool` 

407 Whether the Butler should support write operations. 

408 

409 Returns 

410 ------- 

411 butler : `Butler` 

412 A new `Butler` instance. 

413 """ 

414 return cls(config=config, collections=collections, run=run, tags=tags, writeable=writeable) 

415 

416 def __reduce__(self): 

417 """Support pickling. 

418 """ 

419 return (Butler._unpickle, (self._config, self.collections, self.run, self.tags, 

420 self.registry.isWriteable())) 

421 

422 def __str__(self): 

423 return "Butler(collections={}, run={}, tags={}, datastore='{}', registry='{}')".format( 

424 self.collections, self.run, self.tags, self.datastore, self.registry) 

425 

426 def isWriteable(self) -> bool: 

427 """Return `True` if this `Butler` supports write operations. 

428 """ 

429 return self.registry.isWriteable() 

430 

431 @contextlib.contextmanager 

432 def transaction(self): 

433 """Context manager supporting `Butler` transactions. 

434 

435 Transactions can be nested. 

436 """ 

437 with self.registry.transaction(): 

438 with self.datastore.transaction(): 

439 yield 

440 
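# Illustrative sketch (not part of the original source): grouping several
# writes so that registry and datastore changes are committed or rolled back
# together. The dataset type names and data ID keys are hypothetical.
#
#     with butler.transaction():
#         butler.put(catalog, "src", visit=42, detector=10)
#         butler.put(image, "calexp", visit=42, detector=10)
#     # If either put raises, neither dataset is left behind.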

441 def _standardizeArgs(self, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

442 dataId: Optional[DataId] = None, **kwds: Any) -> Tuple[DatasetType, DataId]: 

443 """Standardize the arguments passed to several Butler APIs. 

444 

445 Parameters 

446 ---------- 

447 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

448 When `DatasetRef` the `dataId` should be `None`. 

449 Otherwise the `DatasetType` or name thereof. 

450 dataId : `dict` or `DataCoordinate` 

451 A `dict` of `Dimension` link name, value pairs that label the 

452 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

453 should be provided as the first argument. 

454 kwds 

455 Additional keyword arguments used to augment or construct a 

456 `DataCoordinate`. See `DataCoordinate.standardize` 

457 parameters. 

458 

459 Returns 

460 ------- 

461 datasetType : `DatasetType` 

462 A `DatasetType` instance extracted from ``datasetRefOrType``. 

463 dataId : `dict` or `DataId`, optional 

464 Argument that can be used (along with ``kwds``) to construct a 

465 `DataId`. 

466 

467 Notes 

468 ----- 

469 Butler APIs that conceptually need a DatasetRef also allow passing a 

470 `DatasetType` (or the name of one) and a `DataId` (or a dict and 

471 keyword arguments that can be used to construct one) separately. This 

472 method accepts those arguments and always returns a true `DatasetType` 

473 and a `DataId` or `dict`. 

474 

475 Standardization of `dict` vs `DataId` is best handled by passing the 

476 returned ``dataId`` (and ``kwds``) to `Registry` APIs, which are 

477 generally similarly flexible. 

478 """ 

479 externalDatasetType = None 

480 internalDatasetType = None 

481 if isinstance(datasetRefOrType, DatasetRef): 

482 if dataId is not None or kwds: 

483 raise ValueError("DatasetRef given, cannot use dataId as well") 

484 externalDatasetType = datasetRefOrType.datasetType 

485 dataId = datasetRefOrType.dataId 

486 else: 

487 # Don't check whether DataId is provided, because Registry APIs 

488 # can usually construct a better error message when it wasn't. 

489 if isinstance(datasetRefOrType, DatasetType): 

490 externalDatasetType = datasetRefOrType 

491 else: 

492 internalDatasetType = self.registry.getDatasetType(datasetRefOrType) 

493 

494 # Check that they are self-consistent 

495 if externalDatasetType is not None: 

496 internalDatasetType = self.registry.getDatasetType(externalDatasetType.name) 

497 if externalDatasetType != internalDatasetType: 

498 raise ValueError(f"Supplied dataset type ({externalDatasetType}) inconsistent with " 

499 f"registry definition ({internalDatasetType})") 

500 

501 return internalDatasetType, dataId 

502 

503 def _findDatasetRef(self, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

504 dataId: Optional[DataId] = None, *, 

505 collections: Any = None, 

506 allowUnresolved: bool = False, 

507 **kwds: Any) -> DatasetRef: 

508 """Shared logic for methods that start with a search for a dataset in 

509 the registry. 

510 

511 Parameters 

512 ---------- 

513 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

514 When `DatasetRef` the `dataId` should be `None`. 

515 Otherwise the `DatasetType` or name thereof. 

516 dataId : `dict` or `DataCoordinate`, optional 

517 A `dict` of `Dimension` link name, value pairs that label the 

518 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

519 should be provided as the first argument. 

520 collections : Any, optional 

521 Collections to be searched, overriding ``self.collections``. 

522 Can be any of the types supported by the ``collections`` argument 

523 to butler construction. 

524 allowUnresolved : `bool`, optional 

525 If `True`, return an unresolved `DatasetRef` if finding a resolved 

526 one in the `Registry` fails. Defaults to `False`. 

527 kwds 

528 Additional keyword arguments used to augment or construct a 

529 `DataId`. See `DataId` parameters. 

530 

531 Returns 

532 ------- 

533 ref : `DatasetRef` 

534 A reference to the dataset identified by the given arguments. 

535 

536 Raises 

537 ------ 

538 LookupError 

539 Raised if no matching dataset exists in the `Registry` (and 

540 ``allowUnresolved is False``). 

541 ValueError 

542 Raised if a resolved `DatasetRef` was passed as an input, but it 

543 differs from the one found in the registry. 

544 TypeError 

545 Raised if no collections were provided. 

546 """ 

547 datasetType, dataId = self._standardizeArgs(datasetRefOrType, dataId, **kwds) 

548 if isinstance(datasetRefOrType, DatasetRef): 

549 idNumber = datasetRefOrType.id 

550 else: 

551 idNumber = None 

552 # Expand the data ID first instead of letting registry.findDataset do 

553 # it, so we get the result even if it returns None. 

554 dataId = self.registry.expandDataId(dataId, graph=datasetType.dimensions, **kwds) 

555 if collections is None: 

556 collections = self.collections 

557 if not collections: 

558 raise TypeError("No input collections provided.") 

559 else: 

560 collections = CollectionSearch.fromExpression(collections) 

561 # Always lookup the DatasetRef, even if one is given, to ensure it is 

562 # present in the current collection. 

563 ref = self.registry.findDataset(datasetType, dataId, collections=collections) 

564 if ref is None: 

565 if allowUnresolved: 

566 return DatasetRef(datasetType, dataId) 

567 else: 

568 raise LookupError(f"Dataset {datasetType.name} with data ID {dataId} " 

569 f"could not be found in collections {collections}.") 

570 if idNumber is not None and idNumber != ref.id: 

571 raise ValueError(f"DatasetRef.id provided ({idNumber}) does not match " 

572 f"id ({ref.id}) in registry in collections {collections}.") 

573 return ref 

574 

575 @transactional 

576 def put(self, obj: Any, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

577 dataId: Optional[DataId] = None, *, 

578 producer: Optional[Quantum] = None, 

579 run: Optional[str] = None, 

580 tags: Optional[Iterable[str]] = None, 

581 **kwds: Any) -> DatasetRef: 

582 """Store and register a dataset. 

583 

584 Parameters 

585 ---------- 

586 obj : `object` 

587 The dataset. 

588 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

589 When `DatasetRef` is provided, ``dataId`` should be `None`. 

590 Otherwise the `DatasetType` or name thereof. 

591 dataId : `dict` or `DataCoordinate` 

592 A `dict` of `Dimension` link name, value pairs that label the 

593 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

594 should be provided as the second argument. 

595 producer : `Quantum`, optional 

596 The producer. 

597 run : `str`, optional 

598 The name of the run the dataset should be added to, overriding 

599 ``self.run``. 

600 tags : `Iterable` [ `str` ], optional 

601 The names of `~CollectionType.TAGGED` collections to associate 

602 the dataset with, overriding ``self.tags``. These collections 

603 must have already been added to the `Registry`. 

604 kwds 

605 Additional keyword arguments used to augment or construct a 

606 `DataCoordinate`. See `DataCoordinate.standardize` 

607 parameters. 

608 

609 Returns 

610 ------- 

611 ref : `DatasetRef` 

612 A reference to the stored dataset, updated with the correct id if 

613 given. 

614 

615 Raises 

616 ------ 

617 TypeError 

618 Raised if the butler is read-only or if no run has been provided. 

619 """ 

620 log.debug("Butler put: %s, dataId=%s, producer=%s, run=%s", datasetRefOrType, dataId, producer, run) 

621 if not self.isWriteable(): 

622 raise TypeError("Butler is read-only.") 

623 datasetType, dataId = self._standardizeArgs(datasetRefOrType, dataId, **kwds) 

624 if isinstance(datasetRefOrType, DatasetRef) and datasetRefOrType.id is not None: 

625 raise ValueError("DatasetRef must not be in registry, must have None id") 

626 

627 if run is None: 

628 if self.run is None: 

629 raise TypeError("No run provided.") 

630 run = self.run 

631 # No need to check type for run; first thing we do is 

632 # insertDatasets, and that will check for us. 

633 

634 if tags is None: 

635 tags = self.tags 

636 else: 

637 tags = tuple(tags) 

638 for tag in tags: 

639 # Check that these are tagged collections up front, because we want 

640 # to avoid relying on Datastore transactionality to avoid modifying 

641 # the repo if there's an error later. 

642 collectionType = self.registry.getCollectionType(tag) 

643 if collectionType is not CollectionType.TAGGED: 

644 raise TypeError(f"Cannot associate into collection '{tag}' of non-TAGGED type " 

645 f"{collectionType.name}.") 

646 

647 # Disable all disassembly at the registry level for now 

648 isVirtualComposite = False 

649 

650 # Add Registry Dataset entry. If not a virtual composite, add 

651 # and attach components at the same time. 

652 dataId = self.registry.expandDataId(dataId, graph=datasetType.dimensions, **kwds) 

653 ref, = self.registry.insertDatasets(datasetType, run=run, dataIds=[dataId], 

654 producer=producer, 

655 # Never write components into 

656 # registry 

657 recursive=False) 

658 

659 # Check to see if this datasetType requires disassembly 

660 if isVirtualComposite: 

661 components = datasetType.storageClass.assembler().disassemble(obj) 

662 componentRefs = {} 

663 for component, info in components.items(): 

664 compTypeName = datasetType.componentTypeName(component) 

665 compRef = self.put(info.component, compTypeName, dataId, producer=producer, run=run, 

666 tags=()) # Components do not need to be associated with tags. 

667 componentRefs[component] = compRef 

668 ref = self.registry.attachComponents(ref, componentRefs) 

669 else: 

670 # This is an entity without a disassembler. 

671 self.datastore.put(obj, ref) 

672 

673 for tag in tags: 

674 self.registry.associate(tag, [ref]) # this is already recursive by default 

675 

676 return ref 

677 
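# Illustrative sketch (not part of the original source): a basic ``put`` call
# using keyword arguments to build the data ID. The dataset type name and
# data ID keys are hypothetical and must already be defined in the registry.
#
#     ref = butler.put(exposure, "calexp", instrument="HSC", visit=903334,
#                      detector=20)
#     # ``ref`` is a resolved DatasetRef with its ``id`` filled in by the
#     # registry; the object itself now lives in the datastore.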

678 def getDirect(self, ref: DatasetRef, *, parameters: Optional[Dict[str, Any]] = None): 

679 """Retrieve a stored dataset. 

680 

681 Unlike `Butler.get`, this method allows datasets outside the Butler's 

682 collection to be read as long as the `DatasetRef` that identifies them 

683 can be obtained separately. 

684 

685 Parameters 

686 ---------- 

687 ref : `DatasetRef` 

688 Reference to an already stored dataset. 

689 parameters : `dict` 

690 Additional StorageClass-defined options to control reading, 

691 typically used to efficiently read only a subset of the dataset. 

692 

693 Returns 

694 ------- 

695 obj : `object` 

696 The dataset. 

697 """ 

698 # if the ref exists in the store we return it directly 

699 if self.datastore.exists(ref): 

700 return self.datastore.get(ref, parameters=parameters) 

701 elif ref.isComposite() and ref.components: 

702 # The presence of components indicates that this dataset 

703 # was disassembled at the registry level. 

704 # Check that we haven't got any unknown parameters 

705 ref.datasetType.storageClass.validateParameters(parameters) 

706 # Reconstruct the composite 

707 usedParams = set() 

708 components = {} 

709 for compName, compRef in ref.components.items(): 

710 # make a dictionary of parameters containing only the subset 

711 # supported by the StorageClass of the components 

712 compParams = compRef.datasetType.storageClass.filterParameters(parameters) 

713 usedParams.update(set(compParams)) 

714 components[compName] = self.datastore.get(compRef, parameters=compParams) 

715 

716 # Any unused parameters will have to be passed to the assembler 

717 if parameters: 

718 unusedParams = {k: v for k, v in parameters.items() if k not in usedParams} 

719 else: 

720 unusedParams = {} 

721 

722 # Assemble the components 

723 inMemoryDataset = ref.datasetType.storageClass.assembler().assemble(components) 

724 return ref.datasetType.storageClass.assembler().handleParameters(inMemoryDataset, 

725 parameters=unusedParams) 

726 else: 

727 # single entity in datastore 

728 raise FileNotFoundError(f"Unable to locate dataset '{ref}' in datastore {self.datastore.name}") 

729 

730 def getDeferred(self, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

731 dataId: Optional[DataId] = None, *, 

732 parameters: Union[dict, None] = None, 

733 collections: Any = None, 

734 **kwds: Any) -> DeferredDatasetHandle: 

735 """Create a `DeferredDatasetHandle` which can later retrieve a dataset 

736 

737 Parameters 

738 ---------- 

739 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

740 When `DatasetRef` the `dataId` should be `None`. 

741 Otherwise the `DatasetType` or name thereof. 

742 dataId : `dict` or `DataCoordinate`, optional 

743 A `dict` of `Dimension` link name, value pairs that label the 

744 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

745 should be provided as the first argument. 

746 parameters : `dict` 

747 Additional StorageClass-defined options to control reading, 

748 typically used to efficiently read only a subset of the dataset. 

749 collections : Any, optional 

750 Collections to be searched, overriding ``self.collections``. 

751 Can be any of the types supported by the ``collections`` argument 

752 to butler construction. 

753 kwds 

754 Additional keyword arguments used to augment or construct a 

755 `DataId`. See `DataId` parameters. 

756 

757 Returns 

758 ------- 

759 obj : `DeferredDatasetHandle` 

760 A handle which can be used to retrieve a dataset at a later time. 

761 

762 Raises 

763 ------ 

764 LookupError 

765 Raised if no matching dataset exists in the `Registry` (and 

766 ``allowUnresolved is False``). 

767 ValueError 

768 Raised if a resolved `DatasetRef` was passed as an input, but it 

769 differs from the one found in the registry. 

770 TypeError 

771 Raised if no collections were provided. 

772 """ 

773 ref = self._findDatasetRef(datasetRefOrType, dataId, collections=collections, **kwds) 

774 return DeferredDatasetHandle(butler=self, ref=ref, parameters=parameters) 

775 
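# Illustrative sketch (not part of the original source): deferring the actual
# read until later, assuming the handle's ``get`` method performs the
# datastore I/O. The dataset type and data ID are hypothetical.
#
#     handle = butler.getDeferred("calexp", instrument="HSC", visit=903334,
#                                 detector=20)
#     ...   # decide later whether the pixels are actually needed
#     exposure = handle.get()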

776 def get(self, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

777 dataId: Optional[DataId] = None, *, 

778 parameters: Optional[Dict[str, Any]] = None, 

779 collections: Any = None, 

780 **kwds: Any) -> Any: 

781 """Retrieve a stored dataset. 

782 

783 Parameters 

784 ---------- 

785 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

786 When `DatasetRef` the `dataId` should be `None`. 

787 Otherwise the `DatasetType` or name thereof. 

788 dataId : `dict` or `DataCoordinate` 

789 A `dict` of `Dimension` link name, value pairs that label the 

790 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

791 should be provided as the first argument. 

792 parameters : `dict` 

793 Additional StorageClass-defined options to control reading, 

794 typically used to efficiently read only a subset of the dataset. 

795 collections : Any, optional 

796 Collections to be searched, overriding ``self.collections``. 

797 Can be any of the types supported by the ``collections`` argument 

798 to butler construction. 

799 kwds 

800 Additional keyword arguments used to augment or construct a 

801 `DataCoordinate`. See `DataCoordinate.standardize` 

802 parameters. 

803 

804 Returns 

805 ------- 

806 obj : `object` 

807 The dataset. 

808 

809 Raises 

810 ------ 

811 ValueError 

812 Raised if a resolved `DatasetRef` was passed as an input, but it 

813 differs from the one found in the registry. 

814 LookupError 

815 Raised if no matching dataset exists in the `Registry`. 

816 TypeError 

817 Raised if no collections were provided. 

818 """ 

819 log.debug("Butler get: %s, dataId=%s, parameters=%s", datasetRefOrType, dataId, parameters) 

820 ref = self._findDatasetRef(datasetRefOrType, dataId, collections=collections, **kwds) 

821 return self.getDirect(ref, parameters=parameters) 

822 
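# Illustrative sketch (not part of the original source): reading a dataset,
# optionally passing StorageClass parameters to read only a subset. The
# dataset type, data ID keys, and the ``bbox`` parameter are hypothetical.
#
#     calexp = butler.get("calexp", instrument="HSC", visit=903334, detector=20)
#     cutout = butler.get("calexp", instrument="HSC", visit=903334, detector=20,
#                         parameters={"bbox": bbox})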

823 def getUri(self, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

824 dataId: Optional[DataId] = None, *, 

825 predict: bool = False, 

826 collections: Any = None, 

827 run: Optional[str] = None, 

828 **kwds: Any) -> str: 

829 """Return the URI to the Dataset. 

830 

831 Parameters 

832 ---------- 

833 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

834 When `DatasetRef` the `dataId` should be `None`. 

835 Otherwise the `DatasetType` or name thereof. 

836 dataId : `dict` or `DataCoordinate` 

837 A `dict` of `Dimension` link name, value pairs that label the 

838 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

839 should be provided as the first argument. 

840 predict : `bool` 

841 If `True`, allow URIs to be returned of datasets that have not 

842 been written. 

843 collections : Any, optional 

844 Collections to be searched, overriding ``self.collections``. 

845 Can be any of the types supported by the ``collections`` argument 

846 to butler construction. 

847 run : `str`, optional 

848 Run to use for predictions, overriding ``self.run``. 

849 kwds 

850 Additional keyword arguments used to augment or construct a 

851 `DataCoordinate`. See `DataCoordinate.standardize` 

852 parameters. 

853 

854 Returns 

855 ------- 

856 uri : `str` 

857 URI string pointing to the Dataset within the datastore. If the 

858 Dataset does not exist in the datastore, and if ``predict`` is 

859 `True`, the URI will be a prediction and will include a URI 

860 fragment "#predicted". 

861 If the datastore does not have entities that relate well 

862 to the concept of a URI the returned URI string will be 

863 descriptive. The returned URI is not guaranteed to be obtainable. 

864 

865 Raises 

866 ------ 

867 LookupError 

868 A URI has been requested for a dataset that does not exist and 

869 guessing is not allowed. 

870 ValueError 

871 Raised if a resolved `DatasetRef` was passed as an input, but it 

872 differs from the one found in the registry. 

873 TypeError 

874 Raised if no collections were provided. 

875 """ 

876 ref = self._findDatasetRef(datasetRefOrType, dataId, allowUnresolved=predict, 

877 collections=collections, **kwds) 

878 if ref.id is None: # only possible if predict is True 

879 if run is None: 

880 run = self.run 

881 if run is None: 

882 raise TypeError("Cannot predict location with run=None.") 

883 # Lie about ID, because we can't guess it, and only 

884 # Datastore.getUri() will ever see it (and it doesn't use it). 

885 ref = ref.resolved(id=0, run=run) 

886 return self.datastore.getUri(ref, predict) 

887 
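# Illustrative sketch (not part of the original source): asking for the URI of
# a dataset that has not been written yet by passing ``predict=True``; the
# returned string then carries a "#predicted" fragment. Names are hypothetical.
#
#     uri = butler.getUri("calexp", instrument="HSC", visit=903334, detector=20,
#                         predict=True, run="u/demo/run")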

888 def datasetExists(self, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

889 dataId: Optional[DataId] = None, *, 

890 collections: Any = None, 

891 **kwds: Any) -> bool: 

892 """Return True if the Dataset is actually present in the Datastore. 

893 

894 Parameters 

895 ---------- 

896 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

897 When `DatasetRef` the `dataId` should be `None`. 

898 Otherwise the `DatasetType` or name thereof. 

899 dataId : `dict` or `DataCoordinate` 

900 A `dict` of `Dimension` link name, value pairs that label the 

901 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

902 should be provided as the first argument. 

903 collections : Any, optional 

904 Collections to be searched, overriding ``self.collections``. 

905 Can be any of the types supported by the ``collections`` argument 

906 to butler construction. 

907 kwds 

908 Additional keyword arguments used to augment or construct a 

909 `DataCoordinate`. See `DataCoordinate.standardize` 

910 parameters. 

911 

912 Raises 

913 ------ 

914 LookupError 

915 Raised if the dataset is not even present in the Registry. 

916 ValueError 

917 Raised if a resolved `DatasetRef` was passed as an input, but it 

918 differs from the one found in the registry. 

919 TypeError 

920 Raised if no collections were provided. 

921 """ 

922 ref = self._findDatasetRef(datasetRefOrType, dataId, collections=collections, **kwds) 

923 return self.datastore.exists(ref) 

924 

925 def pruneCollection(self, name: str, purge: bool = False, unstore: bool = False): 

926 """Remove a collection and possibly prune datasets within it. 

927 

928 Parameters 

929 ---------- 

930 name : `str` 

931 Name of the collection to remove. If this is a 

932 `~CollectionType.TAGGED` or `~CollectionType.CHAINED` collection, 

933 datasets within the collection are not modified unless ``unstore`` 

934 is `True`. If this is a `~CollectionType.RUN` collection, 

935 ``purge`` and ``unstore`` must be `True`, and all datasets in it 

936 are fully removed from the data repository. 

937 purge : `bool`, optional 

938 If `True`, permit `~CollectionType.RUN` collections to be removed, 

939 fully removing datasets within them. Requires ``unstore=True`` as 

940 well as an added precaution against accidental deletion. Must be 

941 `False` (default) if the collection is not a ``RUN``. 

942 unstore : `bool`, optional 

943 If `True`, remove all datasets in the collection from all 

944 datastores in which they appear. 

945 

946 Raises 

947 ------ 

948 TypeError 

949 Raised if the butler is read-only or arguments are mutually 

950 inconsistent. 

951 """ 

952 # See pruneDatasets comments for more information about the logic here; 

953 # the cases are almost the same, but here we can rely on Registry to 

954 # take care of everything but Datastore deletion when we remove the 

955 # collection. 

956 if not self.isWriteable(): 

957 raise TypeError("Butler is read-only.") 

958 if purge and not unstore: 

959 raise TypeError("Cannot pass purge=True without unstore=True.") 

960 collectionType = self.registry.getCollectionType(name) 

961 if collectionType is CollectionType.RUN and not purge: 

962 raise TypeError(f"Cannot prune RUN collection {name} without purge=True.") 

963 if collectionType is not CollectionType.RUN and purge: 

964 raise TypeError(f"Cannot prune {collectionType.name} collection {name} with purge=True.") 

965 with self.registry.transaction(): 

966 if unstore: 

967 for ref in self.registry.queryDatasets(..., collections=name, deduplicate=True): 

968 if self.datastore.exists(ref): 

969 self.datastore.trash(ref) 

970 self.registry.removeCollection(name) 

971 if unstore: 

972 # Point of no return for removing artifacts 

973 self.datastore.emptyTrash() 

974 
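# Illustrative sketch (not part of the original source): removing a RUN
# collection together with all of its datasets requires both flags, as
# described above. The collection name is hypothetical.
#
#     butler.pruneCollection("u/demo/run", purge=True, unstore=True)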

975 def pruneDatasets(self, refs: Iterable[DatasetRef], *, 

976 disassociate: bool = True, 

977 unstore: bool = False, 

978 tags: Optional[Iterable[str]] = None, 

979 purge: bool = False, 

980 run: Optional[str] = None, 

981 recursive: bool = True): 

982 """Remove one or more datasets from a collection and/or storage. 

983 

984 Parameters 

985 ---------- 

986 refs : `~collections.abc.Iterable` of `DatasetRef` 

987 Datasets to prune. These must be "resolved" references (not just 

988 a `DatasetType` and data ID). 

989 disassociate : `bool`, optional 

990 Disassociate pruned datasets from ``self.tags`` (or the collections 

991 given via the ``tags`` argument). Ignored if ``refs`` is ``...``. 

992 unstore : `bool`, optional 

993 If `True` (`False` is default) remove these datasets from all 

994 datastores known to this butler. Note that this will make it 

995 impossible to retrieve these datasets even via other collections. 

996 Datasets that are already not stored are ignored by this option. 

997 tags : `Iterable` [ `str` ], optional 

998 `~CollectionType.TAGGED` collections to disassociate the datasets 

999 from, overriding ``self.tags``. Ignored if ``disassociate`` is 

1000 `False` or ``purge`` is `True`. 

1001 purge : `bool`, optional 

1002 If `True` (`False` is default), completely remove the dataset from 

1003 the `Registry`. To prevent accidental deletions, ``purge`` may 

1004 only be `True` if all of the following conditions are met: 

1005 

1006 - All given datasets are in the given run. 

1007 - ``disassociate`` is `True`; 

1008 - ``unstore`` is `True`. 

1009 

1010 This mode may remove provenance information from datasets other 

1011 than those provided, and should be used with extreme care. 

1012 run : `str`, optional 

1013 `~CollectionType.RUN` collection to purge from, overriding 

1014 ``self.run``. Ignored unless ``purge`` is `True`. 

1015 recursive : `bool`, optional 

1016 If `True` (default) also prune component datasets of any given 

1017 composite datasets. This will only prune components that are 

1018 actually attached to the given `DatasetRef` objects, which may 

1019 not reflect what is in the database (especially if they were 

1020 obtained from `Registry.queryDatasets`, which does not include 

1021 components in its results). 

1022 

1023 Raises 

1024 ------ 

1025 TypeError 

1026 Raised if the butler is read-only, if no collection was provided, 

1027 or the conditions for ``purge=True`` were not met. 

1028 """ 

1029 if not self.isWriteable(): 

1030 raise TypeError("Butler is read-only.") 

1031 if purge: 

1032 if not disassociate: 

1033 raise TypeError("Cannot pass purge=True without disassociate=True.") 

1034 if not unstore: 

1035 raise TypeError("Cannot pass purge=True without unstore=True.") 

1036 if run is None: 

1037 run = self.run 

1038 if run is None: 

1039 raise TypeError("No run provided but purge=True.") 

1040 collectionType = self.registry.getCollectionType(run) 

1041 if collectionType is not CollectionType.RUN: 

1042 raise TypeError(f"Cannot purge from collection '{run}' " 

1043 f"of non-RUN type {collectionType.name}.") 

1044 elif disassociate: 

1045 if tags is None: 

1046 tags = self.tags 

1047 else: 

1048 tags = tuple(tags) 

1049 if not tags: 

1050 raise TypeError("No tags provided but disassociate=True.") 

1051 for tag in tags: 

1052 collectionType = self.registry.getCollectionType(tag) 

1053 if collectionType is not CollectionType.TAGGED: 

1054 raise TypeError(f"Cannot disassociate from collection '{tag}' " 

1055 f"of non-TAGGED type {collectionType.name}.") 

1056 # Pruning a component of a DatasetRef makes no sense since registry 

1057 # doesn't always know about components and datastore might not store 

1058 # components in a separate file 

1059 for ref in refs: 

1060 if ref.datasetType.component(): 

1061 raise ValueError(f"Can not prune a component of a dataset (ref={ref})") 

1062 

1063 if recursive: 

1064 refs = list(DatasetRef.flatten(refs)) 

1065 # We don't need an unreliable Datastore transaction for this, because 

1066 # we've been extra careful to ensure that Datastore.trash only involves 

1067 # mutating the Registry (it can _look_ at Datastore-specific things, 

1068 # but shouldn't change them), and hence all operations here are 

1069 # Registry operations. 

1070 with self.registry.transaction(): 

1071 if unstore: 

1072 for ref in refs: 

1073 # There is a difference between a concrete composite 

1074 # and virtual composite. In a virtual composite the 

1075 # datastore is never given the top level DatasetRef. In 

1076 # the concrete composite the datastore knows all the 

1077 # refs and will clean up itself if asked to remove the 

1078 # parent ref. We can not check configuration for this 

1079 # since we can not trust that the configuration is the 

1080 # same. We therefore have to ask if the ref exists or 

1081 # not. This is consistent with the fact that we want 

1082 # to ignore already-removed-from-datastore datasets 

1083 # anyway. 

1084 if self.datastore.exists(ref): 

1085 self.datastore.trash(ref) 

1086 if purge: 

1087 self.registry.removeDatasets(refs, recursive=False) # refs is already recursively expanded 

1088 elif disassociate: 

1089 for tag in tags: 

1090 # recursive=False here because refs is already recursive 

1091 # if we want it to be. 

1092 self.registry.disassociate(tag, refs, recursive=False) 

1093 # We've exited the Registry transaction, and apparently committed. 

1094 # (if there was an exception, everything rolled back, and it's as if 

1095 # nothing happened - and we never get here). 

1096 # Datastore artifacts are not yet gone, but they're clearly marked 

1097 # as trash, so if we fail to delete now because of (e.g.) filesystem 

1098 # problems we can try again later, and if manual administrative 

1099 # intervention is required, it's pretty clear what that should entail: 

1100 # deleting everything on disk and in private Datastore tables that is 

1101 # in the dataset_location_trash table. 

1102 if unstore: 

1103 # Point of no return for removing artifacts 

1104 self.datastore.emptyTrash() 

1105 
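# Illustrative sketch (not part of the original source): fully deleting the
# results of a registry query from both the registry and the datastores.
# The dataset type and run names are hypothetical.
#
#     refs = butler.registry.queryDatasets("calexp", collections="u/demo/run")
#     butler.pruneDatasets(refs, disassociate=True, unstore=True, purge=True,
#                          run="u/demo/run")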

1106 @transactional 

1107 def ingest(self, *datasets: FileDataset, transfer: Optional[str] = None, run: Optional[str] = None, 

1108 tags: Optional[Iterable[str]] = None): 

1109 """Store and register one or more datasets that already exist on disk. 

1110 

1111 Parameters 

1112 ---------- 

1113 datasets : `FileDataset` 

1114 Each positional argument is a struct containing information about 

1115 a file to be ingested, including its path (either absolute or 

1116 relative to the datastore root, if applicable), a `DatasetRef`, 

1117 and optionally a formatter class or its fully-qualified string 

1118 name. If a formatter is not provided, the formatter that would be 

1119 used for `put` is assumed. On successful return, all 

1120 `FileDataset.ref` attributes will have their `DatasetRef.id` 

1121 attribute populated and all `FileDataset.formatter` attributes will 

1122 be set to the formatter class used. `FileDataset.path` attributes 

1123 may be modified to put paths in whatever the datastore considers a 

1124 standardized form. 

1125 transfer : `str`, optional 

1126 If not `None`, must be one of 'auto', 'move', 'copy', 'hardlink', 

1127 'relsymlink' or 'symlink', indicating how to transfer the file. 

1128 run : `str`, optional 

1129 The name of the run ingested datasets should be added to, 

1130 overriding ``self.run``. 

1131 tags : `Iterable` [ `str` ], optional 

1132 The names of `~CollectionType.TAGGED` collections to associate 

1133 the dataset with, overriding ``self.tags``. These collections 

1134 must have already been added to the `Registry`. 

1135 

1136 Raises 

1137 ------ 

1138 TypeError 

1139 Raised if the butler is read-only or if no run was provided. 

1140 NotImplementedError 

1141 Raised if the `Datastore` does not support the given transfer mode. 

1142 DatasetTypeNotSupportedError 

1143 Raised if one or more files to be ingested have a dataset type that 

1144 is not supported by the `Datastore`. 

1145 FileNotFoundError 

1146 Raised if one of the given files does not exist. 

1147 FileExistsError 

1148 Raised if transfer is not `None` but the (internal) location the 

1149 file would be moved to is already occupied. 

1150 

1151 Notes 

1152 ----- 

1153 This operation is not fully exception safe: if a database operation 

1154 fails, the given `FileDataset` instances may be only partially updated. 

1155 

1156 It is atomic in terms of database operations (they will either all 

1157 succeed or all fail) providing the database engine implements 

1158 transactions correctly. It will attempt to be atomic in terms of 

1159 filesystem operations as well, but this cannot be implemented 

1160 rigorously for most datastores. 

1161 """ 

1162 if not self.isWriteable(): 

1163 raise TypeError("Butler is read-only.") 

1164 if run is None: 

1165 if self.run is None: 

1166 raise TypeError("No run provided.") 

1167 run = self.run 

1168 # No need to check run type, since insertDatasets will do that 

1169 # (safely) for us. 

1170 if tags is None: 

1171 tags = self.tags 

1172 else: 

1173 tags = tuple(tags) 

1174 for tag in tags: 

1175 # Check that these are tagged collections up front, because we want 

1176 # to avoid relying on Datastore transactionality to avoid modifying 

1177 # the repo if there's an error later. 

1178 collectionType = self.registry.getCollectionType(tag) 

1179 if collectionType is not CollectionType.TAGGED: 

1180 raise TypeError(f"Cannot associate into collection '{tag}' of non-TAGGED type " 

1181 f"{collectionType.name}.") 

1182 # Reorganize the inputs so they're grouped by DatasetType and then 

1183 # data ID. We also include a list of DatasetRefs for each FileDataset 

1184 # to hold the resolved DatasetRefs returned by the Registry, before 

1185 # it's safe to swap them into FileDataset.refs. 

1186 # Some type annotation aliases to make that clearer: 

1187 GroupForType = Dict[DataCoordinate, Tuple[FileDataset, List[DatasetRef]]] 

1188 GroupedData = MutableMapping[DatasetType, GroupForType] 

1189 # The actual data structure: 

1190 groupedData: GroupedData = defaultdict(dict) 

1191 # And the nested loop that populates it: 

1192 for dataset in datasets: 

1193 # This list is intentionally shared across the inner loop, since it's 

1194 # associated with `dataset`. 

1195 resolvedRefs = [] 

1196 for ref in dataset.refs: 

1197 groupedData[ref.datasetType][ref.dataId] = (dataset, resolvedRefs) 

1198 

1199 # Now we can bulk-insert into Registry for each DatasetType. 

1200 allResolvedRefs = [] 

1201 for datasetType, groupForType in groupedData.items(): 

1202 refs = self.registry.insertDatasets(datasetType, 

1203 dataIds=groupForType.keys(), 

1204 run=run, 

1205 recursive=True) 

1206 # Append those resolved DatasetRefs to the new lists we set up for 

1207 # them. 

1208 for ref, (_, resolvedRefs) in zip(refs, groupForType.values()): 

1209 resolvedRefs.append(ref) 

1210 

1211 # Go back to the original FileDatasets to replace their refs with the 

1212 # new resolved ones, and also build a big list of all refs. 

1213 allResolvedRefs = [] 

1214 for groupForType in groupedData.values(): 

1215 for dataset, resolvedRefs in groupForType.values(): 

1216 dataset.refs = resolvedRefs 

1217 allResolvedRefs.extend(resolvedRefs) 

1218 

1219 # Bulk-associate everything with any tagged collections. 

1220 for tag in tags: 

1221 self.registry.associate(tag, allResolvedRefs) 

1222 

1223 # Bulk-insert everything into Datastore. 

1224 self.datastore.ingest(*datasets, transfer=transfer) 

1225 
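# Illustrative sketch (not part of the original source): ingesting a file that
# already exists on disk by wrapping it in a `FileDataset` with an unresolved
# `DatasetRef`. The file path, dataset type, data ID, and run are hypothetical.
#
#     rawType = butler.registry.getDatasetType("raw")
#     ref = DatasetRef(rawType, {"instrument": "HSC", "exposure": 903334,
#                                "detector": 20})
#     butler.ingest(FileDataset(path="/data/raw/HSC-903334-20.fits", refs=[ref]),
#                   transfer="symlink", run="HSC/raw")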

1226 @contextlib.contextmanager 

1227 def export(self, *, directory: Optional[str] = None, 

1228 filename: Optional[str] = None, 

1229 format: Optional[str] = None, 

1230 transfer: Optional[str] = None) -> ContextManager[RepoExport]: 

1231 """Export datasets from the repository represented by this `Butler`. 

1232 

1233 This method is a context manager that returns a helper object 

1234 (`RepoExport`) that is used to indicate what information from the 

1235 repository should be exported. 

1236 

1237 Parameters 

1238 ---------- 

1239 directory : `str`, optional 

1240 Directory dataset files should be written to if ``transfer`` is not 

1241 `None`. 

1242 filename : `str`, optional 

1243 Name for the file that will include database information associated 

1244 with the exported datasets. If this is not an absolute path and 

1245 ``directory`` is not `None`, it will be written to ``directory`` 

1246 instead of the current working directory. Defaults to 

1247 "export.{format}". 

1248 format : `str`, optional 

1249 File format for the database information file. If `None`, the 

1250 extension of ``filename`` will be used. 

1251 transfer : `str`, optional 

1252 Transfer mode passed to `Datastore.export`. 

1253 

1254 Raises 

1255 ------ 

1256 TypeError 

1257 Raised if the set of arguments passed is inconsistent. 

1258 

1259 Examples 

1260 -------- 

1261 Typically the `Registry.queryDimensions` and `Registry.queryDatasets` 

1262 methods are used to provide the iterables over data IDs and/or datasets 

1263 to be exported:: 

1264 

1265 with butler.export(filename="exports.yaml") as export: 

1266 # Export all flats, and the calibration_label dimensions 

1267 # associated with them. 

1268 export.saveDatasets(butler.registry.queryDatasets("flat"), 

1269 elements=[butler.registry.dimensions["calibration_label"]]) 

1270 # Export all datasets that start with "deepCoadd_" and all of 

1271 # their associated data ID information. 

1272 export.saveDatasets(butler.registry.queryDatasets("deepCoadd_*")) 

1273 """ 

1274 if directory is None and transfer is not None: 

1275 raise TypeError("Cannot transfer without providing a directory.") 

1276 if transfer == "move": 

1277 raise TypeError("Transfer may not be 'move': export is read-only") 

1278 if format is None: 

1279 if filename is None: 

1280 raise TypeError("At least one of 'filename' or 'format' must be provided.") 

1281 else: 

1282 format = os.path.splitext(filename)[1].lstrip(".") 

1283 elif filename is None: 

1284 filename = f"export.{format}" 

1285 if directory is not None: 

1286 filename = os.path.join(directory, filename) 

1287 BackendClass = getClassOf(self._config["repo_transfer_formats"][format]["export"]) 

1288 with open(filename, 'w') as stream: 

1289 backend = BackendClass(stream) 

1290 try: 

1291 helper = RepoExport(self.registry, self.datastore, backend=backend, 

1292 directory=directory, transfer=transfer) 

1293 yield helper 

1294 except BaseException: 

1295 raise 

1296 else: 

1297 helper._finish() 

1298 

1299 def import_(self, *, directory: Optional[str] = None, 

1300 filename: Optional[str] = None, 

1301 format: Optional[str] = None, 

1302 transfer: Optional[str] = None): 

1303 """Import datasets exported from a different butler repository. 

1304 

1305 Parameters 

1306 ---------- 

1307 directory : `str`, optional 

1308 Directory containing dataset files. If `None`, all file paths 

1309 must be absolute. 

1310 filename : `str`, optional 

1311 Name for the file containing database information associated 

1312 with the exported datasets. If this is not an absolute path, does 

1313 not exist in the current working directory, and ``directory`` is 

1314 not `None`, it is assumed to be in ``directory``. Defaults to 

1315 "export.{format}". 

1316 format : `str`, optional 

1317 File format for the database information file. If `None`, the 

1318 extension of ``filename`` will be used. 

1319 transfer : `str`, optional 

1320 Transfer mode used when ingesting the exported files into the datastore. 

1321 

1322 Raises 

1323 ------ 

1324 TypeError 

1325 Raised if the set of arguments passed is inconsistent, or if the 

1326 butler is read-only. 

1327 """ 

1328 if not self.isWriteable(): 

1329 raise TypeError("Butler is read-only.") 

1330 if format is None: 

1331 if filename is None: 

1332 raise TypeError("At least one of 'filename' or 'format' must be provided.") 

1333 else: 

1334 format = os.path.splitext(filename)[1].lstrip(".") 

1335 elif filename is None: 

1336 filename = f"export.{format}" 

1337 if directory is not None and not os.path.exists(filename): 

1338 filename = os.path.join(directory, filename) 

1339 BackendClass = getClassOf(self._config["repo_transfer_formats"][format]["import"]) 

1340 with open(filename, 'r') as stream: 

1341 backend = BackendClass(stream, self.registry) 

1342 backend.register() 

1343 with self.transaction(): 

1344 backend.load(self.datastore, directory=directory, transfer=transfer) 

1345 
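# Illustrative sketch (not part of the original source): transferring datasets
# between two repositories by pairing ``export`` on the source butler with
# ``import_`` on the destination. The paths, dataset type, and butler names
# are hypothetical.
#
#     with sourceButler.export(directory="/tmp/transfer", filename="export.yaml",
#                              transfer="copy") as export:
#         export.saveDatasets(sourceButler.registry.queryDatasets("flat"))
#     destButler.import_(directory="/tmp/transfer", filename="export.yaml",
#                        transfer="symlink")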

1346 def validateConfiguration(self, logFailures: bool = False, 

1347 datasetTypeNames: Optional[Iterable[str]] = None, 

1348 ignore: Optional[Iterable[str]] = None): 

1349 """Validate butler configuration. 

1350 

1351 Checks that each `DatasetType` can be stored in the `Datastore`. 

1352 

1353 Parameters 

1354 ---------- 

1355 logFailures : `bool`, optional 

1356 If `True`, output a log message for every validation error 

1357 detected. 

1358 datasetTypeNames : iterable of `str`, optional 

1359 The `DatasetType` names that should be checked. This allows 

1360 only a subset to be selected. 

1361 ignore : iterable of `str`, optional 

1362 Names of DatasetTypes to skip over. This can be used to skip 

1363 known problems. If a named `DatasetType` corresponds to a 

1364 composite, all components of that `DatasetType` will also be 

1365 ignored. 

1366 

1367 Raises 

1368 ------ 

1369 ButlerValidationError 

1370 Raised if there is some inconsistency with how this Butler 

1371 is configured. 

1372 """ 

1373 if datasetTypeNames: 

1374 entities = [self.registry.getDatasetType(name) for name in datasetTypeNames] 

1375 else: 

1376 entities = list(self.registry.queryDatasetTypes()) 

1377 

1378 # filter out anything from the ignore list 

1379 if ignore: 

1380 ignore = set(ignore) 

1381 entities = [e for e in entities if e.name not in ignore and e.nameAndComponent()[0] not in ignore] 

1382 else: 

1383 ignore = set() 

1384 

1385 # Find all the registered instruments 

1386 instruments = set( 

1387 dataId["instrument"] for dataId in self.registry.queryDimensions(["instrument"]) 

1388 ) 

1389 

1390 # For each datasetType that has an instrument dimension, create 

1391 # a DatasetRef for each defined instrument 

1392 datasetRefs = [] 

1393 

1394 for datasetType in entities: 

1395 if "instrument" in datasetType.dimensions: 

1396 for instrument in instruments: 

1397 datasetRef = DatasetRef(datasetType, {"instrument": instrument}, conform=False) 

1398 datasetRefs.append(datasetRef) 

1399 

1400 entities.extend(datasetRefs) 

1401 

1402 datastoreErrorStr = None 

1403 try: 

1404 self.datastore.validateConfiguration(entities, logFailures=logFailures) 

1405 except ValidationError as e: 

1406 datastoreErrorStr = str(e) 

1407 

1408 # Also check that the LookupKeys used by the datastores match 

1409 # registry and storage class definitions 

1410 keys = self.datastore.getLookupKeys() 

1411 

1412 failedNames = set() 

1413 failedDataId = set() 

1414 for key in keys: 

1415 datasetType = None 

1416 if key.name is not None: 

1417 if key.name in ignore: 

1418 continue 

1419 

1420 # skip if specific datasetType names were requested and this 

1421 # name does not match 

1422 if datasetTypeNames and key.name not in datasetTypeNames: 

1423 continue 

1424 

1425 # See if it is a StorageClass or a DatasetType 

1426 if key.name in self.storageClasses: 

1427 pass 

1428 else: 

1429 try: 

1430 self.registry.getDatasetType(key.name) 

1431 except KeyError: 

1432 if logFailures: 

1433 log.fatal("Key '%s' does not correspond to a DatasetType or StorageClass", key) 

1434 failedNames.add(key) 

1435 else: 

1436 # Dimensions are checked for consistency when the Butler 

1437 # is created and rendezvoused with a universe. 

1438 pass 

1439 

1440 # Check that the instrument is a valid instrument 

1441 # Currently only support instrument so check for that 

1442 if key.dataId: 

1443 dataIdKeys = set(key.dataId) 

1444 if set(["instrument"]) != dataIdKeys: 

1445 if logFailures: 

1446 log.fatal("Key '%s' has unsupported DataId override", key) 

1447 failedDataId.add(key) 

1448 elif key.dataId["instrument"] not in instruments: 

1449 if logFailures: 

1450 log.fatal("Key '%s' has unknown instrument", key) 

1451 failedDataId.add(key) 

1452 

1453 messages = [] 

1454 

1455 if datastoreErrorStr: 

1456 messages.append(datastoreErrorStr) 

1457 

1458 for failed, msg in ((failedNames, "Keys without corresponding DatasetType or StorageClass entry: "), 

1459 (failedDataId, "Keys with bad DataId entries: ")): 

1460 if failed: 

1461 msg += ", ".join(str(k) for k in failed) 

1462 messages.append(msg) 

1463 

1464 if messages: 

1465 raise ValidationError(";\n".join(messages)) 

1466 

1467 registry: Registry 

1468 """The object that manages dataset metadata and relationships (`Registry`). 

1469 

1470 Most operations that don't involve reading or writing butler datasets are 

1471 accessible only via `Registry` methods. 

1472 """ 

1473 

1474 datastore: Datastore 

1475 """The object that manages actual dataset storage (`Datastore`). 

1476 

1477 Direct user access to the datastore should rarely be necessary; the primary 

1478 exception is the case where a `Datastore` implementation provides extra 

1479 functionality beyond what the base class defines. 

1480 """ 

1481 

1482 storageClasses: StorageClassFactory 

1483 """An object that maps known storage class names to objects that fully 

1484 describe them (`StorageClassFactory`). 

1485 """ 

1486 

1487 collections: Optional[CollectionSearch] 

1488 """The collections to search and any restrictions on the dataset types to 

1489 search for within them, in order (`CollectionSearch`). 

1490 """ 

1491 

1492 run: Optional[str] 

1493 """Name of the run this butler writes outputs to (`str` or `None`). 

1494 """ 

1495 

1496 tags: Tuple[str, ...] 

1497 """Names of `~CollectionType.TAGGED` collections this butler associates 

1498 with in `put` and `ingest`, and disassociates from in `pruneDatasets` 

1499 (`tuple` [ `str` ]). 

1500 """