1# This file is part of daf_butler. 

2# 

3# Developed for the LSST Data Management System. 

4# This product includes software developed by the LSST Project 

5# (http://www.lsst.org). 

6# See the COPYRIGHT file at the top-level directory of this distribution 

7# for details of code ownership. 

8# 

9# This program is free software: you can redistribute it and/or modify 

10# it under the terms of the GNU General Public License as published by 

11# the Free Software Foundation, either version 3 of the License, or 

12# (at your option) any later version. 

13# 

14# This program is distributed in the hope that it will be useful, 

15# but WITHOUT ANY WARRANTY; without even the implied warranty of 

16# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 

17# GNU General Public License for more details. 

18# 

19# You should have received a copy of the GNU General Public License 

20# along with this program. If not, see <http://www.gnu.org/licenses/>. 

21 

22""" 

23Butler top level classes. 

24""" 

25from __future__ import annotations 

26 

27__all__ = ("Butler", "ButlerValidationError") 

28 

29import os 

30from collections import defaultdict 

31import contextlib 

32import logging 

33from typing import ( 

34 Any, 

35 ClassVar, 

36 ContextManager, 

37 Dict, 

38 Iterable, 

39 List, 

40 Mapping, 

41 MutableMapping, 

42 Optional, 

43 Tuple, 

44 Union, 

45) 

46 

47try: 

48 import boto3 

49except ImportError: 

50 boto3 = None 

51 

52from lsst.utils import doImport 

53from .core import ( 

54 ButlerURI, 

55 CompositesMap, 

56 Config, 

57 ConfigSubset, 

58 DataCoordinate, 

59 DataId, 

60 DatasetRef, 

61 DatasetType, 

62 Datastore, 

63 FileDataset, 

64 Quantum, 

65 RepoExport, 

66 StorageClassFactory, 

67 ValidationError, 

68) 

69from .core.repoRelocation import BUTLER_ROOT_TAG 

70from .core.safeFileIo import safeMakeDir 

71from .core.utils import transactional, getClassOf 

72from .core.s3utils import bucketExists 

73from ._deferredDatasetHandle import DeferredDatasetHandle 

74from ._butlerConfig import ButlerConfig 

75from .registry import Registry, RegistryConfig, CollectionType 

76from .registry.wildcards import CollectionSearch 

77 

78log = logging.getLogger(__name__) 

79 

80 

81class ButlerValidationError(ValidationError): 

82 """There is a problem with the Butler configuration.""" 

83 pass 

84 

85 

86class Butler: 

87 """Main entry point for the data access system. 

88 

89 Parameters 

90 ---------- 

91 config : `ButlerConfig`, `Config` or `str`, optional. 

92 Configuration. Anything acceptable to the 

93 `ButlerConfig` constructor. If a directory path 

94 is given the configuration will be read from a ``butler.yaml`` file in 

95 that location. If `None` is given default values will be used. 

96 butler : `Butler`, optional. 

97 If provided, construct a new Butler that uses the same registry and 

98 datastore as the given one, but with the given collection and run. 

99 Incompatible with the ``config``, ``searchPaths``, and ``writeable`` 

100 arguments. 

101 collections : `Any`, optional 

102 An expression specifying the collections to be searched (in order) when 

103 reading datasets, and optionally dataset type restrictions on them. 

104 This may be: 

105 - a `str` collection name; 

106 - a tuple of (collection name, *dataset type restriction*); 

107 - an iterable of either of the above; 

108 - a mapping from `str` to *dataset type restriction*. 

109 

110 See :ref:`daf_butler_collection_expressions` for more information, 

111 including the definition of a *dataset type restriction*. All 

112 collections must either already exist or be specified to be created 

113 by other arguments. 

114 run : `str`, optional 

115 Name of the run datasets should be output to. If the run 

116 does not exist, it will be created. If ``collections`` is `None`, it 

117 will be set to ``[run]``. If this is not set (and ``writeable`` is 

118 not set either), a read-only butler will be created. 

119 tags : `Iterable` [ `str` ], optional 

120 A list of `~CollectionType.TAGGED` collections that datasets should be 

121 associated with in `put` or `ingest` and disassociated from in 

122 `pruneDatasets`. If any of these collections does not exist, it will 

123 be created. 

124 chains : `Mapping` [ `str`, `Iterable` [ `str` ] ], optional 

125 A mapping from the names of new `~CollectionType.CHAINED` collections 

126 to an expression identifying their child collections (which takes the 

127 same form as the ``collections`` argument). Chains may be nested only 

128 if children precede their parents in this mapping. 

129 searchPaths : `list` of `str`, optional 

130 Directory paths to search when calculating the full Butler 

131 configuration. Not used if the supplied config is already a 

132 `ButlerConfig`. 

133 writeable : `bool`, optional 

134 Explicitly sets whether the butler supports write operations. If not 

135 provided, a read-write butler is created if any of ``run``, ``tags``, 

136 or ``chains`` is non-empty. 

137 

138 Examples 

139 -------- 

140 While there are many ways to control exactly how a `Butler` interacts with 

141 the collections in its `Registry`, the most common cases are still simple. 

142 

143 For a read-only `Butler` that searches one collection, do:: 

144 

145 butler = Butler("/path/to/repo", collections=["u/alice/DM-50000"]) 

146 

147 For a read-write `Butler` that writes to and reads from a 

148 `~CollectionType.RUN` collection:: 

149 

150 butler = Butler("/path/to/repo", run="u/alice/DM-50000/a") 

151 

152 The `Butler` passed to a ``PipelineTask`` is often much more complex, 

153 because we want to write to one `~CollectionType.RUN` collection but read 

154 from several others (as well), while defining a new 

155 `~CollectionType.CHAINED` collection that combines them all:: 

156 

157 butler = Butler("/path/to/repo", run="u/alice/DM-50000/a", 

158 collections=["u/alice/DM-50000"], 

159 chains={ 

160 "u/alice/DM-50000": ["u/alice/DM-50000/a", 

161 "u/bob/DM-49998", 

162 "raw/hsc"] 

163 }) 

164 

165 This butler will `put` new datasets to the run ``u/alice/DM-50000/a``, but 

166 they'll also be available from the chained collection ``u/alice/DM-50000``. 

167 Datasets will be read first from that run (since it appears first in the 

168 chain), and then from ``u/bob/DM-49998`` and finally ``raw/hsc``. 

169 If ``u/alice/DM-50000`` had already been defined, the ``chains`` argument 

170 would be unnecessary. We could also construct a butler that performs 

171 exactly the same `put` and `get` operations without actually creating a 

172 chained collection, just by passing multiple items in ``collections``:: 

173 

174 butler = Butler("/path/to/repo", run="u/alice/DM-50000/a", 

175 collections=["u/alice/DM-50000/a", 

176 "u/bob/DM-49998", 

177 "raw/hsc"]) 

178 

179 Finally, one can always create a `Butler` with no collections:: 

180 

181 butler = Butler("/path/to/repo", writeable=True) 

182 

183 This can be extremely useful when you just want to use ``butler.registry``, 

184 e.g. for inserting dimension data or managing collections, or when the 

185 collections you want to use with the butler are not consistent. 

186 Passing ``writeable`` explicitly here is only necessary if you want to be 

187 able to make changes to the repo; usually the value for ``writeable`` 

188 can be guessed from the collection arguments provided, but it defaults to 

189 `False` when no collection arguments are given. 

190 """ 

191 def __init__(self, config: Union[Config, str, None] = None, *, 

192 butler: Optional[Butler] = None, 

193 collections: Any = None, 

194 run: Optional[str] = None, 

195 tags: Iterable[str] = (), 

196 chains: Optional[Mapping[str, Any]] = None, 

197 searchPaths: Optional[List[str]] = None, 

198 writeable: Optional[bool] = None): 

199 # Transform any single-pass iterator into an actual sequence so we 

200 # can see if it's empty 

201 self.tags = tuple(tags) 

202 # Load registry, datastore, etc. from config or existing butler. 

203 if butler is not None: 

204 if config is not None or searchPaths is not None or writeable is not None: 

205 raise TypeError("Cannot pass 'config', 'searchPaths', or 'writeable' " 

206 "arguments with 'butler' argument.") 

207 self.registry = butler.registry 

208 self.datastore = butler.datastore 

209 self.storageClasses = butler.storageClasses 

210 self._composites = butler._composites 

211 self._config = butler._config 

212 else: 

213 self._config = ButlerConfig(config, searchPaths=searchPaths) 

214 if "root" in self._config: 

215 butlerRoot = self._config["root"] 

216 else: 

217 butlerRoot = self._config.configDir 

218 if writeable is None: 

219 writeable = run is not None or chains is not None or bool(self.tags) 

220 self.registry = Registry.fromConfig(self._config, butlerRoot=butlerRoot, writeable=writeable) 

221 self.datastore = Datastore.fromConfig(self._config, self.registry, butlerRoot=butlerRoot) 

222 self.storageClasses = StorageClassFactory() 

223 self.storageClasses.addFromConfig(self._config) 

224 self._composites = CompositesMap(self._config, universe=self.registry.dimensions) 

225 # Check the many collection arguments for consistency and create any 

226 # needed collections that don't exist. 

227 if collections is None: 

228 if run is not None: 

229 collections = (run,) 

230 else: 

231 collections = () 

232 self.collections = CollectionSearch.fromExpression(collections) 

233 if chains is None: 

234 chains = {} 

235 self.run = run 

236 if "run" in self._config or "collection" in self._config: 

237 raise ValueError("Passing a run or collection via configuration is no longer supported.") 

238 if self.run is not None: 

239 self.registry.registerCollection(self.run, type=CollectionType.RUN) 

240 for tag in self.tags: 

241 self.registry.registerCollection(tag, type=CollectionType.TAGGED) 

242 for parent, children in chains.items(): 

243 self.registry.registerCollection(parent, type=CollectionType.CHAINED) 

244 self.registry.setCollectionChain(parent, children) 

245 

246 GENERATION: ClassVar[int] = 3 

247 """This is a Generation 3 Butler. 

248 

249 This attribute may be removed in the future, once the Generation 2 Butler 

250 interface has been fully retired; it should only be used in transitional 

251 code. 

252 """ 

253 

254 @staticmethod 

255 def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: bool = False, 

256 createRegistry: bool = True, searchPaths: Optional[List[str]] = None, 

257 forceConfigRoot: bool = True, outfile: Optional[str] = None, 

258 overwrite: bool = False) -> Config: 

259 """Create an empty data repository by adding a butler.yaml config 

260 to a repository root directory. 

261 

262 Parameters 

263 ---------- 

264 root : `str` or `ButlerURI` 

265 Path or URI to the root location of the new repository. Will be 

266 created if it does not exist. 

267 config : `Config` or `str`, optional 

268 Configuration to write to the repository, after setting any 

269 root-dependent Registry or Datastore config options. Can not 

270 be a `ButlerConfig` or a `ConfigSubset`. If `None`, default 

271 configuration will be used. Root-dependent config options 

272 specified in this config are overwritten if ``forceConfigRoot`` 

273 is `True`. 

274 standalone : `bool` 

275 If True, write all expanded defaults, not just customized or 

276 repository-specific settings. 

277 This (mostly) decouples the repository from the default 

278 configuration, insulating it from changes to the defaults (which 

279 may be good or bad, depending on the nature of the changes). 

280 Future *additions* to the defaults will still be picked up when 

281 initializing `Butlers` to repos created with ``standalone=True``. 

282 createRegistry : `bool`, optional 

283 If `True` create a new Registry. 

284 searchPaths : `list` of `str`, optional 

285 Directory paths to search when calculating the full butler 

286 configuration. 

287 forceConfigRoot : `bool`, optional 

288 If `False`, any values present in the supplied ``config`` that 

289 would normally be reset are not overridden and will appear 

290 directly in the output config. This allows non-standard overrides 

291 of the root directory for a datastore or registry to be given. 

292 If this parameter is `True` the values for ``root`` will be 

293 forced into the resulting config if appropriate. 

294 outfile : `str`, optional 

295 If not-`None`, the output configuration will be written to this 

296 location rather than into the repository itself. Can be a URI 

297 string. Can refer to a directory that will be used to write 

298 ``butler.yaml``. 

299 overwrite : `bool`, optional 

300 Create a new configuration file even if one already exists 

301 in the specified output location. Default is to raise 

302 an exception. 

303 

304 Returns 

305 ------- 

306 config : `Config` 

307 The updated `Config` instance written to the repo. 

308 

309 Raises 

310 ------ 

311 ValueError 

312 Raised if a ButlerConfig or ConfigSubset is passed instead of a 

313 regular Config (as these subclasses would make it impossible to 

314 support ``standalone=False``). 

315 FileExistsError 

316 Raised if the output config file already exists. 

317 os.error 

318 Raised if the directory does not exist, exists but is not a 

319 directory, or cannot be created. 

320 

321 Notes 

322 ----- 

323 Note that when ``standalone=False`` (the default), the configuration 

324 search path (see `ConfigSubset.defaultSearchPaths`) that was used to 

325 construct the repository should also be used to construct any Butlers 

326 to avoid configuration inconsistencies. 
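
Examples
--------
A minimal sketch of creating a new local repository and then opening it
(the path and run name below are illustrative only)::

    Butler.makeRepo("/path/to/newrepo")
    butler = Butler("/path/to/newrepo", run="u/alice/ingest")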

327 """ 

328 if isinstance(config, (ButlerConfig, ConfigSubset)): 

329 raise ValueError("makeRepo must be passed a regular Config without defaults applied.") 

330 

331 # for "file" schemes we are assuming POSIX semantics for paths, for 

332 # schemeless URIs we are assuming os.path semantics. 

333 uri = ButlerURI(root, forceDirectory=True) 

334 if uri.scheme == "file" or not uri.scheme: 

335 if not os.path.isdir(uri.ospath): 

336 safeMakeDir(uri.ospath) 

337 elif uri.scheme == "s3": 

338 # bucket must already exist 

339 if not bucketExists(uri.netloc): 

340 raise ValueError(f"Bucket {uri.netloc} does not exist!") 

341 s3 = boto3.client("s3") 

342 # don't create S3 key when root is at the top-level of a bucket 

343 if not uri.path == "/": 

344 s3.put_object(Bucket=uri.netloc, Key=uri.relativeToPathRoot) 

345 else: 

346 raise ValueError(f"Unrecognized scheme: {uri.scheme}") 

347 config = Config(config) 

348 

349 # If we are creating a new repo from scratch with relative roots, 

350 # do not propagate an explicit root from the config file 

351 if "root" in config: 

352 del config["root"] 

353 

354 full = ButlerConfig(config, searchPaths=searchPaths) # this applies defaults 

355 datastoreClass = doImport(full["datastore", "cls"]) 

356 datastoreClass.setConfigRoot(BUTLER_ROOT_TAG, config, full, overwrite=forceConfigRoot) 

357 

358 # if key exists in given config, parse it, otherwise parse the defaults 

359 # in the expanded config 

360 if config.get(("registry", "db")): 

361 registryConfig = RegistryConfig(config) 

362 else: 

363 registryConfig = RegistryConfig(full) 

364 defaultDatabaseUri = registryConfig.makeDefaultDatabaseUri(BUTLER_ROOT_TAG) 

365 if defaultDatabaseUri is not None: 

366 Config.updateParameters(RegistryConfig, config, full, 

367 toUpdate={"db": defaultDatabaseUri}, 

368 overwrite=forceConfigRoot) 

369 else: 

370 Config.updateParameters(RegistryConfig, config, full, toCopy=("db",), 

371 overwrite=forceConfigRoot) 

372 

373 if standalone: 

374 config.merge(full) 

375 if outfile is not None: 

376 # When writing to a separate location we must include 

377 # the root of the butler repo in the config else it won't know 

378 # where to look. 

379 config["root"] = uri.geturl() 

380 configURI = outfile 

381 else: 

382 configURI = uri 

383 config.dumpToUri(configURI, overwrite=overwrite) 

384 

385 # Create Registry and populate tables 

386 Registry.fromConfig(config, create=createRegistry, butlerRoot=root) 

387 return config 

388 

389 @classmethod 

390 def _unpickle(cls, config: ButlerConfig, collections: Optional[CollectionSearch], run: Optional[str], 

391 tags: Tuple[str, ...], writeable: bool) -> Butler: 

392 """Callable used to unpickle a Butler. 

393 

394 We prefer not to use ``Butler.__init__`` directly so we can force some 

395 of its many arguments to be keyword-only (note that ``__reduce__`` 

396 can only invoke callables with positional arguments). 

397 

398 Parameters 

399 ---------- 

400 config : `ButlerConfig` 

401 Butler configuration, already coerced into a true `ButlerConfig` 

402 instance (and hence after any search paths for overrides have been 

403 utilized). 

404 collections : `CollectionSearch` 

405 Names of collections to read from. 

406 run : `str`, optional 

407 Name of `~CollectionType.RUN` collection to write to. 

408 tags : `tuple` [`str`] 

409 Names of `~CollectionType.TAGGED` collections to associate with. 

410 writeable : `bool` 

411 Whether the Butler should support write operations. 

412 

413 Returns 

414 ------- 

415 butler : `Butler` 

416 A new `Butler` instance. 

417 """ 

418 return cls(config=config, collections=collections, run=run, tags=tags, writeable=writeable) 

419 

420 def __reduce__(self): 

421 """Support pickling. 

422 """ 

423 return (Butler._unpickle, (self._config, self.collections, self.run, self.tags, 

424 self.registry.isWriteable())) 

425 

426 def __str__(self): 

427 return "Butler(collections={}, run={}, tags={}, datastore='{}', registry='{}')".format( 

428 self.collections, self.run, self.tags, self.datastore, self.registry) 

429 

430 def isWriteable(self) -> bool: 

431 """Return `True` if this `Butler` supports write operations. 

432 """ 

433 return self.registry.isWriteable() 

434 

435 @contextlib.contextmanager 

436 def transaction(self): 

437 """Context manager supporting `Butler` transactions. 

438 

439 Transactions can be nested. 
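
For example, to group several writes so that they are committed or rolled
back together (the dataset type and data IDs below are placeholders)::

    with butler.transaction():
        butler.put(obj1, "example_type", dataId1)
        butler.put(obj2, "example_type", dataId2)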

440 """ 

441 with self.registry.transaction(): 

442 with self.datastore.transaction(): 

443 yield 

444 

445 def _standardizeArgs(self, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

446 dataId: Optional[DataId] = None, **kwds: Any) -> Tuple[DatasetType, DataId]: 

447 """Standardize the arguments passed to several Butler APIs. 

448 

449 Parameters 

450 ---------- 

451 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

452 When `DatasetRef` the `dataId` should be `None`. 

453 Otherwise the `DatasetType` or name thereof. 

454 dataId : `dict` or `DataCoordinate` 

455 A `dict` of `Dimension` link name, value pairs that label the 

456 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

457 should be provided as the first argument. 

458 kwds 

459 Additional keyword arguments used to augment or construct a 

460 `DataCoordinate`. See `DataCoordinate.standardize` 

461 parameters. 

462 

463 Returns 

464 ------- 

465 datasetType : `DatasetType` 

466 A `DatasetType` instance extracted from ``datasetRefOrType``. 

467 dataId : `dict` or `DataId`, optional 

468 Argument that can be used (along with ``kwds``) to construct a 

469 `DataId`. 

470 

471 Notes 

472 ----- 

473 Butler APIs that conceptually need a DatasetRef also allow passing a 

474 `DatasetType` (or the name of one) and a `DataId` (or a dict and 

475 keyword arguments that can be used to construct one) separately. This 

476 method accepts those arguments and always returns a true `DatasetType` 

477 and a `DataId` or `dict`. 

478 

479 Standardization of `dict` vs `DataId` is best handled by passing the 

480 returned ``dataId`` (and ``kwds``) to `Registry` APIs, which are 

481 generally similarly flexible. 

482 """ 

483 externalDatasetType = None 

484 internalDatasetType = None 

485 if isinstance(datasetRefOrType, DatasetRef): 

486 if dataId is not None or kwds: 

487 raise ValueError("DatasetRef given, cannot use dataId as well") 

488 externalDatasetType = datasetRefOrType.datasetType 

489 dataId = datasetRefOrType.dataId 

490 else: 

491 # Don't check whether DataId is provided, because Registry APIs 

492 # can usually construct a better error message when it wasn't. 

493 if isinstance(datasetRefOrType, DatasetType): 

494 externalDatasetType = datasetRefOrType 

495 else: 

496 internalDatasetType = self.registry.getDatasetType(datasetRefOrType) 

497 

498 # Check that they are self-consistent 

499 if externalDatasetType is not None: 

500 internalDatasetType = self.registry.getDatasetType(externalDatasetType.name) 

501 if externalDatasetType != internalDatasetType: 

502 raise ValueError(f"Supplied dataset type ({externalDatasetType}) inconsistent with " 

503 f"registry definition ({internalDatasetType})") 

504 

505 return internalDatasetType, dataId 

506 

507 def _findDatasetRef(self, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

508 dataId: Optional[DataId] = None, *, 

509 collections: Any = None, 

510 allowUnresolved: bool = False, 

511 **kwds: Any) -> DatasetRef: 

512 """Shared logic for methods that start with a search for a dataset in 

513 the registry. 

514 

515 Parameters 

516 ---------- 

517 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

518 When `DatasetRef` the `dataId` should be `None`. 

519 Otherwise the `DatasetType` or name thereof. 

520 dataId : `dict` or `DataCoordinate`, optional 

521 A `dict` of `Dimension` link name, value pairs that label the 

522 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

523 should be provided as the first argument. 

524 collections : Any, optional 

525 Collections to be searched, overriding ``self.collections``. 

526 Can be any of the types supported by the ``collections`` argument 

527 to butler construction. 

528 allowUnresolved : `bool`, optional 

529 If `True`, return an unresolved `DatasetRef` if finding a resolved 

530 one in the `Registry` fails. Defaults to `False`. 

531 kwds 

532 Additional keyword arguments used to augment or construct a 

533 `DataId`. See `DataId` parameters. 

534 

535 Returns 

536 ------- 

537 ref : `DatasetRef` 

538 A reference to the dataset identified by the given arguments. 

539 

540 Raises 

541 ------ 

542 LookupError 

543 Raised if no matching dataset exists in the `Registry` (and 

544 ``allowUnresolved is False``). 

545 ValueError 

546 Raised if a resolved `DatasetRef` was passed as an input, but it 

547 differs from the one found in the registry. 

548 TypeError 

549 Raised if no collections were provided. 

550 """ 

551 datasetType, dataId = self._standardizeArgs(datasetRefOrType, dataId, **kwds) 

552 if isinstance(datasetRefOrType, DatasetRef): 

553 idNumber = datasetRefOrType.id 

554 else: 

555 idNumber = None 

556 # Expand the data ID first instead of letting registry.findDataset do 

557 # it, so we get the result even if it returns None. 

558 dataId = self.registry.expandDataId(dataId, graph=datasetType.dimensions, **kwds) 

559 if collections is None: 

560 collections = self.collections 

561 if not collections: 

562 raise TypeError("No input collections provided.") 

563 else: 

564 collections = CollectionSearch.fromExpression(collections) 

565 # Always lookup the DatasetRef, even if one is given, to ensure it is 

566 # present in the current collection. 

567 ref = self.registry.findDataset(datasetType, dataId, collections=collections) 

568 if ref is None: 

569 if allowUnresolved: 

570 return DatasetRef(datasetType, dataId) 

571 else: 

572 raise LookupError(f"Dataset {datasetType.name} with data ID {dataId} " 

573 f"could not be found in collections {collections}.") 

574 if idNumber is not None and idNumber != ref.id: 

575 raise ValueError(f"DatasetRef.id provided ({idNumber}) does not match " 

576 f"id ({ref.id}) in registry in collections {collections}.") 

577 return ref 

578 

579 @transactional 

580 def put(self, obj: Any, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

581 dataId: Optional[DataId] = None, *, 

582 producer: Optional[Quantum] = None, 

583 run: Optional[str] = None, 

584 tags: Optional[Iterable[str]] = None, 

585 **kwds: Any) -> DatasetRef: 

586 """Store and register a dataset. 

587 

588 Parameters 

589 ---------- 

590 obj : `object` 

591 The dataset. 

592 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

593 When `DatasetRef` is provided, ``dataId`` should be `None`. 

594 Otherwise the `DatasetType` or name thereof. 

595 dataId : `dict` or `DataCoordinate` 

596 A `dict` of `Dimension` link name, value pairs that label the 

597 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

598 should be provided as the second argument. 

599 producer : `Quantum`, optional 

600 The producer. 

601 run : `str`, optional 

602 The name of the run the dataset should be added to, overriding 

603 ``self.run``. 

604 tags : `Iterable` [ `str` ], optional 

605 The names of `~CollectionType.TAGGED` collections to associate 

606 the dataset with, overriding ``self.tags``. These collections 

607 must have already been added to the `Registry`. 

608 kwds 

609 Additional keyword arguments used to augment or construct a 

610 `DataCoordinate`. See `DataCoordinate.standardize` 

611 parameters. 

612 

613 Returns 

614 ------- 

615 ref : `DatasetRef` 

616 A reference to the stored dataset, updated with the correct id if 

617 given. 

618 

619 Raises 

620 ------ 

621 TypeError 

622 Raised if the butler is read-only or if no run has been provided. 
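
Examples
--------
For example, storing an object under a dataset type and data ID passed as
keyword arguments (the names and values below are illustrative)::

    ref = butler.put(exposure, "calexp",
                     instrument="ExampleCam", visit=42, detector=0)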

623 """ 

624 log.debug("Butler put: %s, dataId=%s, producer=%s, run=%s", datasetRefOrType, dataId, producer, run) 

625 if not self.isWriteable(): 

626 raise TypeError("Butler is read-only.") 

627 datasetType, dataId = self._standardizeArgs(datasetRefOrType, dataId, **kwds) 

628 if isinstance(datasetRefOrType, DatasetRef) and datasetRefOrType.id is not None: 

629 raise ValueError("DatasetRef must not be in registry, must have None id") 

630 

631 if run is None: 

632 if self.run is None: 

633 raise TypeError("No run provided.") 

634 run = self.run 

635 # No need to check type for run; first thing we do is 

636 # insertDatasets, and that will check for us. 

637 

638 if tags is None: 

639 tags = self.tags 

640 else: 

641 tags = tuple(tags) 

642 for tag in tags: 

643 # Check that these are tagged collections up front, because we want 

644 # to avoid relying on Datastore transactionality to keep the repo 

645 # unmodified if there's an error later. 

646 collectionType = self.registry.getCollectionType(tag) 

647 if collectionType is not CollectionType.TAGGED: 

648 raise TypeError(f"Cannot associate into collection '{tag}' of non-TAGGED type " 

649 f"{collectionType.name}.") 

650 

651 # Disable all disassembly at the registry level for now 

652 isVirtualComposite = False 

653 

654 # Add Registry Dataset entry. If not a virtual composite, add 

655 # and attach components at the same time. 

656 dataId = self.registry.expandDataId(dataId, graph=datasetType.dimensions, **kwds) 

657 ref, = self.registry.insertDatasets(datasetType, run=run, dataIds=[dataId], 

658 producer=producer, 

659 # Never write components into 

660 # registry 

661 recursive=False) 

662 

663 # Check to see if this datasetType requires disassembly 

664 if isVirtualComposite: 

665 components = datasetType.storageClass.assembler().disassemble(obj) 

666 componentRefs = {} 

667 for component, info in components.items(): 

668 compTypeName = datasetType.componentTypeName(component) 

669 compRef = self.put(info.component, compTypeName, dataId, producer=producer, run=run, 

670 collection=False) # We don't need to recursively associate. 

671 componentRefs[component] = compRef 

672 ref = self.registry.attachComponents(ref, componentRefs) 

673 else: 

674 # This is an entity without a disassembler. 

675 self.datastore.put(obj, ref) 

676 

677 for tag in tags: 

678 self.registry.associate(tag, [ref]) # this is already recursive by default 

679 

680 return ref 

681 

682 def getDirect(self, ref: DatasetRef, *, parameters: Optional[Dict[str, Any]] = None): 

683 """Retrieve a stored dataset. 

684 

685 Unlike `Butler.get`, this method allows datasets outside the Butler's 

686 collection to be read as long as the `DatasetRef` that identifies them 

687 can be obtained separately. 

688 

689 Parameters 

690 ---------- 

691 ref : `DatasetRef` 

692 Reference to an already stored dataset. 

693 parameters : `dict` 

694 Additional StorageClass-defined options to control reading, 

695 typically used to efficiently read only a subset of the dataset. 

696 

697 Returns 

698 ------- 

699 obj : `object` 

700 The dataset. 
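
Examples
--------
A sketch of reading a dataset through a resolved reference obtained from
the registry (the dataset type name and ``dataId`` are illustrative)::

    ref = butler.registry.findDataset("calexp", dataId,
                                      collections=butler.collections)
    obj = butler.getDirect(ref)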

701 """ 

702 # if the ref exists in the store we return it directly 

703 if self.datastore.exists(ref): 

704 return self.datastore.get(ref, parameters=parameters) 

705 elif ref.isComposite() and ref.components: 

706 # The presence of components indicates that this dataset 

707 # was disassembled at the registry level. 

708 # Check that we haven't got any unknown parameters 

709 ref.datasetType.storageClass.validateParameters(parameters) 

710 # Reconstruct the composite 

711 usedParams = set() 

712 components = {} 

713 for compName, compRef in ref.components.items(): 

714 # make a dictionary of parameters containing only the subset 

715 # supported by the StorageClass of the components 

716 compParams = compRef.datasetType.storageClass.filterParameters(parameters) 

717 usedParams.update(set(compParams)) 

718 components[compName] = self.datastore.get(compRef, parameters=compParams) 

719 

720 # Any unused parameters will have to be passed to the assembler 

721 if parameters: 

722 unusedParams = {k: v for k, v in parameters.items() if k not in usedParams} 

723 else: 

724 unusedParams = {} 

725 

726 # Assemble the components 

727 inMemoryDataset = ref.datasetType.storageClass.assembler().assemble(components) 

728 return ref.datasetType.storageClass.assembler().handleParameters(inMemoryDataset, 

729 parameters=unusedParams) 

730 else: 

731 # single entity in datastore 

732 raise FileNotFoundError(f"Unable to locate dataset '{ref}' in datastore {self.datastore.name}") 

733 

734 def getDeferred(self, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

735 dataId: Optional[DataId] = None, *, 

736 parameters: Union[dict, None] = None, 

737 collections: Any = None, 

738 **kwds: Any) -> DeferredDatasetHandle: 

739 """Create a `DeferredDatasetHandle` which can later retrieve a dataset 

740 

741 Parameters 

742 ---------- 

743 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

744 When `DatasetRef` the `dataId` should be `None`. 

745 Otherwise the `DatasetType` or name thereof. 

746 dataId : `dict` or `DataCoordinate`, optional 

747 A `dict` of `Dimension` link name, value pairs that label the 

748 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

749 should be provided as the first argument. 

750 parameters : `dict` 

751 Additional StorageClass-defined options to control reading, 

752 typically used to efficiently read only a subset of the dataset. 

753 collections : Any, optional 

754 Collections to be searched, overriding ``self.collections``. 

755 Can be any of the types supported by the ``collections`` argument 

756 to butler construction. 

757 kwds 

758 Additional keyword arguments used to augment or construct a 

759 `DataId`. See `DataId` parameters. 

760 

761 Returns 

762 ------- 

763 obj : `DeferredDatasetHandle` 

764 A handle which can be used to retrieve a dataset at a later time. 

765 

766 Raises 

767 ------ 

768 LookupError 

769 Raised if no matching dataset exists in the `Registry` (and 

770 ``allowUnresolved is False``). 

771 ValueError 

772 Raised if a resolved `DatasetRef` was passed as an input, but it 

773 differs from the one found in the registry. 

774 TypeError 

775 Raised if no collections were provided. 
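
Examples
--------
For example, deferring a read until the dataset is actually needed (the
dataset type and ``dataId`` below are illustrative)::

    handle = butler.getDeferred("calexp", dataId)
    # ... later ...
    exposure = handle.get()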

776 """ 

777 ref = self._findDatasetRef(datasetRefOrType, dataId, collections=collections, **kwds) 

778 return DeferredDatasetHandle(butler=self, ref=ref, parameters=parameters) 

779 

780 def get(self, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

781 dataId: Optional[DataId] = None, *, 

782 parameters: Optional[Dict[str, Any]] = None, 

783 collections: Any = None, 

784 **kwds: Any) -> Any: 

785 """Retrieve a stored dataset. 

786 

787 Parameters 

788 ---------- 

789 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

790 When `DatasetRef` the `dataId` should be `None`. 

791 Otherwise the `DatasetType` or name thereof. 

792 dataId : `dict` or `DataCoordinate` 

793 A `dict` of `Dimension` link name, value pairs that label the 

794 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

795 should be provided as the first argument. 

796 parameters : `dict` 

797 Additional StorageClass-defined options to control reading, 

798 typically used to efficiently read only a subset of the dataset. 

799 collections : Any, optional 

800 Collections to be searched, overriding ``self.collections``. 

801 Can be any of the types supported by the ``collections`` argument 

802 to butler construction. 

803 kwds 

804 Additional keyword arguments used to augment or construct a 

805 `DataCoordinate`. See `DataCoordinate.standardize` 

806 parameters. 

807 

808 Returns 

809 ------- 

810 obj : `object` 

811 The dataset. 

812 

813 Raises 

814 ------ 

815 ValueError 

816 Raised if a resolved `DatasetRef` was passed as an input, but it 

817 differs from the one found in the registry. 

818 LookupError 

819 Raised if no matching dataset exists in the `Registry`. 

820 TypeError 

821 Raised if no collections were provided. 
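
Examples
--------
For example, retrieving a dataset by dataset type name and data ID keyword
arguments (all names and values below are illustrative)::

    exposure = butler.get("calexp",
                          instrument="ExampleCam", visit=42, detector=0)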

822 """ 

823 log.debug("Butler get: %s, dataId=%s, parameters=%s", datasetRefOrType, dataId, parameters) 

824 ref = self._findDatasetRef(datasetRefOrType, dataId, collections=collections, **kwds) 

825 return self.getDirect(ref, parameters=parameters) 

826 

827 def getUri(self, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

828 dataId: Optional[DataId] = None, *, 

829 predict: bool = False, 

830 collections: Any = None, 

831 run: Optional[str] = None, 

832 **kwds: Any) -> str: 

833 """Return the URI to the Dataset. 

834 

835 Parameters 

836 ---------- 

837 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

838 When `DatasetRef` the `dataId` should be `None`. 

839 Otherwise the `DatasetType` or name thereof. 

840 dataId : `dict` or `DataCoordinate` 

841 A `dict` of `Dimension` link name, value pairs that label the 

842 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

843 should be provided as the first argument. 

844 predict : `bool` 

845 If `True`, allow URIs to be returned of datasets that have not 

846 been written. 

847 collections : Any, optional 

848 Collections to be searched, overriding ``self.collections``. 

849 Can be any of the types supported by the ``collections`` argument 

850 to butler construction. 

851 run : `str`, optional 

852 Run to use for predictions, overriding ``self.run``. 

853 kwds 

854 Additional keyword arguments used to augment or construct a 

855 `DataCoordinate`. See `DataCoordinate.standardize` 

856 parameters. 

857 

858 Returns 

859 ------- 

860 uri : `str` 

861 URI string pointing to the Dataset within the datastore. If the 

862 Dataset does not exist in the datastore, and if ``predict`` is 

863 `True`, the URI will be a prediction and will include a URI 

864 fragment "#predicted". 

865 If the datastore does not have entities that relate well 

866 to the concept of a URI the returned URI string will be 

867 descriptive. The returned URI is not guaranteed to be obtainable. 

868 

869 Raises 

870 ------ 

871 LookupError 

872 A URI has been requested for a dataset that does not exist and 

873 guessing is not allowed. 

874 ValueError 

875 Raised if a resolved `DatasetRef` was passed as an input, but it 

876 differs from the one found in the registry. 

877 TypeError 

878 Raised if no collections were provided. 
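
Examples
--------
For example, predicting where a dataset that has not yet been written
would be stored (the dataset type and ``dataId`` are illustrative)::

    uri = butler.getUri("calexp", dataId, predict=True)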

879 """ 

880 ref = self._findDatasetRef(datasetRefOrType, dataId, allowUnresolved=predict, 

881 collections=collections, **kwds) 

882 if ref.id is None: # only possible if predict is True 

883 if run is None: 

884 run = self.run 

885 if run is None: 

886 raise TypeError("Cannot predict location with run=None.") 

887 # Lie about ID, because we can't guess it, and only 

888 # Datastore.getUri() will ever see it (and it doesn't use it). 

889 ref = ref.resolved(id=0, run=run) 

890 return self.datastore.getUri(ref, predict) 

891 

892 def datasetExists(self, datasetRefOrType: Union[DatasetRef, DatasetType, str], 

893 dataId: Optional[DataId] = None, *, 

894 collections: Any = None, 

895 **kwds: Any) -> bool: 

896 """Return True if the Dataset is actually present in the Datastore. 

897 

898 Parameters 

899 ---------- 

900 datasetRefOrType : `DatasetRef`, `DatasetType`, or `str` 

901 When `DatasetRef` the `dataId` should be `None`. 

902 Otherwise the `DatasetType` or name thereof. 

903 dataId : `dict` or `DataCoordinate` 

904 A `dict` of `Dimension` link name, value pairs that label the 

905 `DatasetRef` within a Collection. When `None`, a `DatasetRef` 

906 should be provided as the first argument. 

907 collections : Any, optional 

908 Collections to be searched, overriding ``self.collections``. 

909 Can be any of the types supported by the ``collections`` argument 

910 to butler construction. 

911 kwds 

912 Additional keyword arguments used to augment or construct a 

913 `DataCoordinate`. See `DataCoordinate.standardize` 

914 parameters. 

915 

916 Raises 

917 ------ 

918 LookupError 

919 Raised if the dataset is not even present in the Registry. 

920 ValueError 

921 Raised if a resolved `DatasetRef` was passed as an input, but it 

922 differs from the one found in the registry. 

923 TypeError 

924 Raised if no collections were provided. 
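
Examples
--------
For example, checking that a registered dataset has actually been stored
before reading it (the names below are illustrative)::

    if butler.datasetExists("calexp", dataId):
        exposure = butler.get("calexp", dataId)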

925 """ 

926 ref = self._findDatasetRef(datasetRefOrType, dataId, collections=collections, **kwds) 

927 return self.datastore.exists(ref) 

928 

929 def pruneCollection(self, name: str, purge: bool = False, unstore: bool = False): 

930 """Remove a collection and possibly prune datasets within it. 

931 

932 Parameters 

933 ---------- 

934 name : `str` 

935 Name of the collection to remove. If this is a 

936 `~CollectionType.TAGGED` or `~CollectionType.CHAINED` collection, 

937 datasets within the collection are not modified unless ``unstore`` 

938 is `True`. If this is a `~CollectionType.RUN` collection, 

939 ``purge`` and ``unstore`` must be `True`, and all datasets in it 

940 are fully removed from the data repository. 

941 purge : `bool`, optional 

942 If `True`, permit `~CollectionType.RUN` collections to be removed, 

943 fully removing datasets within them. Requires ``unstore=True`` as 

944 well as an added precaution against accidental deletion. Must be 

945 `False` (default) if the collection is not a ``RUN``. 

946 unstore : `bool`, optional 

947 If `True`, remove all datasets in the collection from all 

948 datastores in which they appear. 

949 

950 Raises 

951 ------ 

952 TypeError 

953 Raised if the butler is read-only or arguments are mutually 

954 inconsistent. 
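
Examples
--------
For example, completely removing a RUN collection together with its stored
datasets (the collection name is illustrative)::

    butler.pruneCollection("u/alice/scratch", purge=True, unstore=True)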

955 """ 

956 # See pruneDatasets comments for more information about the logic here; 

957 # the cases are almost the same, but here we can rely on Registry to 

958 # take care of everything but Datastore deletion when we remove the 

959 # collection. 

960 if not self.isWriteable(): 

961 raise TypeError("Butler is read-only.") 

962 if purge and not unstore: 

963 raise TypeError("Cannot pass purge=True without unstore=True.") 

964 collectionType = self.registry.getCollectionType(name) 

965 if collectionType is CollectionType.RUN and not purge: 

966 raise TypeError(f"Cannot prune RUN collection {name} without purge=True.") 

967 if collectionType is not CollectionType.RUN and purge: 

968 raise TypeError(f"Cannot prune {collectionType.name} collection {name} with purge=True.") 

969 with self.registry.transaction(): 

970 if unstore: 

971 for ref in self.registry.queryDatasets(..., collections=name, deduplicate=True): 

972 if self.datastore.exists(ref): 

973 self.datastore.trash(ref) 

974 self.registry.removeCollection(name) 

975 if unstore: 

976 # Point of no return for removing artifacts 

977 self.datastore.emptyTrash() 

978 

979 def pruneDatasets(self, refs: Iterable[DatasetRef], *, 

980 disassociate: bool = True, 

981 unstore: bool = False, 

982 tags: Optional[Iterable[str]] = None, 

983 purge: bool = False, 

984 run: Optional[str] = None, 

985 recursive: bool = True): 

986 """Remove one or more datasets from a collection and/or storage. 

987 

988 Parameters 

989 ---------- 

990 refs : `~collections.abc.Iterable` of `DatasetRef` 

991 Datasets to prune. These must be "resolved" references (not just 

992 a `DatasetType` and data ID). 

993 disassociate : `bool`, optional 

994 Disassociate pruned datasets from ``self.tags`` (or the collections 

995 given via the ``tags`` argument). Ignored if ``refs`` is ``...``. 

996 unstore : `bool`, optional 

997 If `True` (`False` is default) remove these datasets from all 

998 datastores known to this butler. Note that this will make it 

999 impossible to retrieve these datasets even via other collections. 

1000 Datasets that are already not stored are ignored by this option. 

1001 tags : `Iterable` [ `str` ], optional 

1002 `~CollectionType.TAGGED` collections to disassociate the datasets 

1003 from, overriding ``self.tags``. Ignored if ``disassociate`` is 

1004 `False` or ``purge`` is `True`. 

1005 purge : `bool`, optional 

1006 If `True` (`False` is default), completely remove the dataset from 

1007 the `Registry`. To prevent accidental deletions, ``purge`` may 

1008 only be `True` if all of the following conditions are met: 

1009 

1010 - All given datasets are in the given run. 

1011 - ``disassociate`` is `True`; 

1012 - ``unstore`` is `True`. 

1013 

1014 This mode may remove provenance information from datasets other 

1015 than those provided, and should be used with extreme care. 

1016 run : `str`, optional 

1017 `~CollectionType.RUN` collection to purge from, overriding 

1018 ``self.run``. Ignored unless ``purge`` is `True`. 

1019 recursive : `bool`, optional 

1020 If `True` (default) also prune component datasets of any given 

1021 composite datasets. This will only prune components that are 

1022 actually attached to the given `DatasetRef` objects, which may 

1023 not reflect what is in the database (especially if they were 

1024 obtained from `Registry.queryDatasets`, which does not include 

1025 components in its results). 

1026 

1027 Raises 

1028 ------ 

1029 TypeError 

1030 Raised if the butler is read-only, if no collection was provided, 

1031 or the conditions for ``purge=True`` were not met. 
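
Examples
--------
For example, removing the stored artifacts for a query result while keeping
the registry entries (the dataset type and collection are illustrative)::

    refs = butler.registry.queryDatasets("calexp",
                                         collections="u/alice/DM-50000/a")
    butler.pruneDatasets(refs, disassociate=False, unstore=True)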

1032 """ 

1033 if not self.isWriteable(): 

1034 raise TypeError("Butler is read-only.") 

1035 if purge: 

1036 if not disassociate: 

1037 raise TypeError("Cannot pass purge=True without disassociate=True.") 

1038 if not unstore: 

1039 raise TypeError("Cannot pass purge=True without unstore=True.") 

1040 if run is None: 

1041 run = self.run 

1042 if run is None: 

1043 raise TypeError("No run provided but purge=True.") 

1044 collectionType = self.registry.getCollectionType(run) 

1045 if collectionType is not CollectionType.RUN: 

1046 raise TypeError(f"Cannot purge from collection '{run}' " 

1047 f"of non-RUN type {collectionType.name}.") 

1048 elif disassociate: 

1049 if tags is None: 

1050 tags = self.tags 

1051 else: 

1052 tags = tuple(tags) 

1053 if not tags: 

1054 raise TypeError("No tags provided but disassociate=True.") 

1055 for tag in tags: 

1056 collectionType = self.registry.getCollectionType(tag) 

1057 if collectionType is not CollectionType.TAGGED: 

1058 raise TypeError(f"Cannot disassociate from collection '{tag}' " 

1059 f"of non-TAGGED type {collectionType.name}.") 

1060 # Pruning a component of a DatasetRef makes no sense since registry 

1061 # doesn't always know about components and datastore might not store 

1062 # components in a separate file 

1063 for ref in refs: 

1064 if ref.datasetType.component(): 

1065 raise ValueError(f"Can not prune a component of a dataset (ref={ref})") 

1066 

1067 if recursive: 

1068 refs = list(DatasetRef.flatten(refs)) 

1069 # We don't need an unreliable Datastore transaction for this, because 

1070 # we've been extra careful to ensure that Datastore.trash only involves 

1071 # mutating the Registry (it can _look_ at Datastore-specific things, 

1072 # but shouldn't change them), and hence all operations here are 

1073 # Registry operations. 

1074 with self.registry.transaction(): 

1075 if unstore: 

1076 for ref in refs: 

1077 # There is a difference between a concrete composite 

1078 # and virtual composite. In a virtual composite the 

1079 # datastore is never given the top level DatasetRef. In 

1080 # the concrete composite the datastore knows all the 

1081 # refs and will clean up itself if asked to remove the 

1082 # parent ref. We can not check configuration for this 

1083 # since we can not trust that the configuration is the 

1084 # same. We therefore have to ask if the ref exists or 

1085 # not. This is consistent with the fact that we want 

1086 # to ignore already-removed-from-datastore datasets 

1087 # anyway. 

1088 if self.datastore.exists(ref): 

1089 self.datastore.trash(ref) 

1090 if purge: 

1091 self.registry.removeDatasets(refs, recursive=False)  # refs is already recursively expanded 

1092 elif disassociate: 

1093 for tag in tags: 

1094 # recursive=False here because refs is already recursive 

1095 # if we want it to be. 

1096 self.registry.disassociate(tag, refs, recursive=False) 

1097 # We've exited the Registry transaction, and apparently committed. 

1098 # (if there was an exception, everything rolled back, and it's as if 

1099 # nothing happened - and we never get here). 

1100 # Datastore artifacts are not yet gone, but they're clearly marked 

1101 # as trash, so if we fail to delete now because of (e.g.) filesystem 

1102 # problems we can try again later, and if manual administrative 

1103 # intervention is required, it's pretty clear what that should entail: 

1104 # deleting everything on disk and in private Datastore tables that is 

1105 # in the dataset_location_trash table. 

1106 if unstore: 

1107 # Point of no return for removing artifacts 

1108 self.datastore.emptyTrash() 

1109 

1110 @transactional 

1111 def ingest(self, *datasets: FileDataset, transfer: Optional[str] = None, run: Optional[str] = None, 

1112 tags: Optional[Iterable[str]] = None): 

1113 """Store and register one or more datasets that already exist on disk. 

1114 

1115 Parameters 

1116 ---------- 

1117 datasets : `FileDataset` 

1118 Each positional argument is a struct containing information about 

1119 a file to be ingested, including its path (either absolute or 

1120 relative to the datastore root, if applicable), a `DatasetRef`, 

1121 and optionally a formatter class or its fully-qualified string 

1122 name. If a formatter is not provided, the formatter that would be 

1123 used for `put` is assumed. On successful return, all 

1124 `FileDataset.ref` attributes will have their `DatasetRef.id` 

1125 attribute populated and all `FileDataset.formatter` attributes will 

1126 be set to the formatter class used. `FileDataset.path` attributes 

1127 may be modified to put paths in whatever the datastore considers a 

1128 standardized form. 

1129 transfer : `str`, optional 

1130 If not `None`, must be one of 'auto', 'move', 'copy', 'hardlink', 

1131 'relsymlink' or 'symlink', indicating how to transfer the file. 

1132 run : `str`, optional 

1133 The name of the run ingested datasets should be added to, 

1134 overriding ``self.run``. 

1135 tags : `Iterable` [ `str` ], optional 

1136 The names of `~CollectionType.TAGGED` collections to associate 

1137 the dataset with, overriding ``self.tags``. These collections 

1138 must have already been added to the `Registry`. 

1139 

1140 Raises 

1141 ------ 

1142 TypeError 

1143 Raised if the butler is read-only or if no run was provided. 

1144 NotImplementedError 

1145 Raised if the `Datastore` does not support the given transfer mode. 

1146 DatasetTypeNotSupportedError 

1147 Raised if one or more files to be ingested have a dataset type that 

1148 is not supported by the `Datastore`. 

1149 FileNotFoundError 

1150 Raised if one of the given files does not exist. 

1151 FileExistsError 

1152 Raised if transfer is not `None` but the (internal) location the 

1153 file would be moved to is already occupied. 

1154 

1155 Notes 

1156 ----- 

1157 This operation is not fully exception safe: if a database operation 

1158 fails, the given `FileDataset` instances may be only partially updated. 

1159 

1160 It is atomic in terms of database operations (they will either all 

1161 succeed or all fail) providing the database engine implements 

1162 transactions correctly. It will attempt to be atomic in terms of 

1163 filesystem operations as well, but this cannot be implemented 

1164 rigorously for most datastores. 
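
Examples
--------
A sketch of ingesting a single existing file by copying it into the
datastore (the path, dataset type name, and ``dataId`` are illustrative)::

    datasetType = butler.registry.getDatasetType("raw")
    ref = DatasetRef(datasetType, dataId)
    butler.ingest(FileDataset(path="/data/file.fits", refs=[ref]),
                  transfer="copy")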

1165 """ 

1166 if not self.isWriteable(): 

1167 raise TypeError("Butler is read-only.") 

1168 if run is None: 

1169 if self.run is None: 

1170 raise TypeError("No run provided.") 

1171 run = self.run 

1172 # No need to check run type, since insertDatasets will do that 

1173 # (safely) for us. 

1174 if tags is None: 

1175 tags = self.tags 

1176 else: 

1177 tags = tuple(tags) 

1178 for tag in tags: 

1179 # Check that these are tagged collections up front, because we want 

1180 # to avoid relying on Datastore transactionality to keep the repo 

1181 # unmodified if there's an error later. 

1182 collectionType = self.registry.getCollectionType(tag) 

1183 if collectionType is not CollectionType.TAGGED: 

1184 raise TypeError(f"Cannot associate into collection '{tag}' of non-TAGGED type " 

1185 f"{collectionType.name}.") 

1186 # Reorganize the inputs so they're grouped by DatasetType and then 

1187 # data ID. We also include a list of DatasetRefs for each FileDataset 

1188 # to hold the resolved DatasetRefs returned by the Registry, before 

1189 # it's safe to swap them into FileDataset.refs. 

1190 # Some type annotation aliases to make that clearer: 

1191 GroupForType = Dict[DataCoordinate, Tuple[FileDataset, List[DatasetRef]]] 

1192 GroupedData = MutableMapping[DatasetType, GroupForType] 

1193 # The actual data structure: 

1194 groupedData: GroupedData = defaultdict(dict) 

1195 # And the nested loop that populates it: 

1196 for dataset in datasets: 

1197 # This list intentionally shared across the inner loop, since it's 

1198 # associated with `dataset`. 

1199 resolvedRefs = [] 

1200 for ref in dataset.refs: 

1201 groupedData[ref.datasetType][ref.dataId] = (dataset, resolvedRefs) 

1202 

1203 # Now we can bulk-insert into Registry for each DatasetType. 

1204 allResolvedRefs = [] 

1205 for datasetType, groupForType in groupedData.items(): 

1206 refs = self.registry.insertDatasets(datasetType, 

1207 dataIds=groupForType.keys(), 

1208 run=run, 

1209 recursive=True) 

1210 # Append those resolved DatasetRefs to the new lists we set up for 

1211 # them. 

1212 for ref, (_, resolvedRefs) in zip(refs, groupForType.values()): 

1213 resolvedRefs.append(ref) 

1214 

1215 # Go back to the original FileDatasets to replace their refs with the 

1216 # new resolved ones, and also build a big list of all refs. 

1217 allResolvedRefs = [] 

1218 for groupForType in groupedData.values(): 

1219 for dataset, resolvedRefs in groupForType.values(): 

1220 dataset.refs = resolvedRefs 

1221 allResolvedRefs.extend(resolvedRefs) 

1222 

1223 # Bulk-associate everything with any tagged collections. 

1224 for tag in tags: 

1225 self.registry.associate(tag, allResolvedRefs) 

1226 

1227 # Bulk-insert everything into Datastore. 

1228 self.datastore.ingest(*datasets, transfer=transfer) 

1229 

1230 @contextlib.contextmanager 

1231 def export(self, *, directory: Optional[str] = None, 

1232 filename: Optional[str] = None, 

1233 format: Optional[str] = None, 

1234 transfer: Optional[str] = None) -> ContextManager[RepoExport]: 

1235 """Export datasets from the repository represented by this `Butler`. 

1236 

1237 This method is a context manager that returns a helper object 

1238 (`RepoExport`) that is used to indicate what information from the 

1239 repository should be exported. 

1240 

1241 Parameters 

1242 ---------- 

1243 directory : `str`, optional 

1244 Directory dataset files should be written to if ``transfer`` is not 

1245 `None`. 

1246 filename : `str`, optional 

1247 Name for the file that will include database information associated 

1248 with the exported datasets. If this is not an absolute path and 

1249 ``directory`` is not `None`, it will be written to ``directory`` 

1250 instead of the current working directory. Defaults to 

1251 "export.{format}". 

1252 format : `str`, optional 

1253 File format for the database information file. If `None`, the 

1254 extension of ``filename`` will be used. 

1255 transfer : `str`, optional 

1256 Transfer mode passed to `Datastore.export`. 

1257 

1258 Raises 

1259 ------ 

1260 TypeError 

1261 Raised if the set of arguments passed is inconsistent. 

1262 

1263 Examples 

1264 -------- 

1265 Typically the `Registry.queryDimensions` and `Registry.queryDatasets` 

1266 methods are used to provide the iterables over data IDs and/or datasets 

1267 to be exported:: 

1268 

1269 with butler.export(filename="exports.yaml") as export: 

1270 # Export all flats, and the calibration_label dimensions 

1271 # associated with them. 

1272 export.saveDatasets(butler.registry.queryDatasets("flat"), 

1273 elements=[butler.registry.dimensions["calibration_label"]]) 

1274 # Export all datasets that start with "deepCoadd_" and all of 

1275 # their associated data ID information. 

1276 export.saveDatasets(butler.registry.queryDatasets("deepCoadd_*")) 

1277 """ 

1278 if directory is None and transfer is not None: 

1279 raise TypeError("Cannot transfer without providing a directory.") 

1280 if transfer == "move": 

1281 raise TypeError("Transfer may not be 'move': export is read-only") 

1282 if format is None: 

1283 if filename is None: 

1284 raise TypeError("At least one of 'filename' or 'format' must be provided.") 

1285 else: 

1286 _, format = os.path.splitext(filename) 

1287 elif filename is None: 

1288 filename = f"export.{format}" 

1289 if directory is not None: 

1290 filename = os.path.join(directory, filename) 

1291 BackendClass = getClassOf(self._config["repo_transfer_formats"][format]["export"]) 

1292 with open(filename, 'w') as stream: 

1293 backend = BackendClass(stream) 

1294 try: 

1295 helper = RepoExport(self.registry, self.datastore, backend=backend, 

1296 directory=directory, transfer=transfer) 

1297 yield helper 

1298 except BaseException: 

1299 raise 

1300 else: 

1301 helper._finish() 

1302 

1303 def import_(self, *, directory: Optional[str] = None, 

1304 filename: Optional[str] = None, 

1305 format: Optional[str] = None, 

1306 transfer: Optional[str] = None): 

1307 """Import datasets exported from a different butler repository. 

1308 

1309 Parameters 

1310 ---------- 

1311 directory : `str`, optional 

1312 Directory containing dataset files. If `None`, all file paths 

1313 must be absolute. 

1314 filename : `str`, optional 

1315 Name of the file containing database information associated 

1316 with the exported datasets. If this is not an absolute path, does 

1317 not exist in the current working directory, and ``directory`` is 

1318 not `None`, it is assumed to be in ``directory``. Defaults to 

1319 "export.{format}". 

1320 format : `str`, optional 

1321 File format for the database information file. If `None`, the 

1322 extension of ``filename`` will be used. 

1323 transfer : `str`, optional 

1324 Transfer mode passed to `Datastore.export`. 

1325 

1326 Raises 

1327 ------ 

1328 TypeError 

1329 Raised if the set of arguments passed is inconsistent, or if the 

1330 butler is read-only. 
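
Examples
--------
For example, loading a previously exported repository subset (the paths
below are illustrative)::

    butler.import_(directory="/path/to/exports", filename="export.yaml",
                   transfer="copy")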

1331 """ 

1332 if not self.isWriteable(): 

1333 raise TypeError("Butler is read-only.") 

1334 if format is None: 

1335 if filename is None: 

1336 raise TypeError("At least one of 'filename' or 'format' must be provided.") 

1337 else: 

1338 _, format = os.path.splitext(filename) 

1339 elif filename is None: 

1340 filename = f"export.{format}" 

1341 if directory is not None and not os.path.exists(filename): 

1342 filename = os.path.join(directory, filename) 

1343 BackendClass = getClassOf(self._config["repo_transfer_formats"][format]["import"]) 

1344 with open(filename, 'r') as stream: 

1345 backend = BackendClass(stream, self.registry) 

1346 backend.register() 

1347 with self.transaction(): 

1348 backend.load(self.datastore, directory=directory, transfer=transfer) 

1349 

1350 def validateConfiguration(self, logFailures: bool = False, 

1351 datasetTypeNames: Optional[Iterable[str]] = None, 

1352 ignore: Optional[Iterable[str]] = None): 

1353 """Validate butler configuration. 

1354 

1355 Checks that each `DatasetType` can be stored in the `Datastore`. 

1356 

1357 Parameters 

1358 ---------- 

1359 logFailures : `bool`, optional 

1360 If `True`, output a log message for every validation error 

1361 detected. 

1362 datasetTypeNames : iterable of `str`, optional 

1363 The `DatasetType` names that should be checked. This allows 

1364 only a subset to be selected. 

1365 ignore : iterable of `str`, optional 

1366 Names of DatasetTypes to skip over. This can be used to skip 

1367 known problems. If a named `DatasetType` corresponds to a 

1368 composite, all components of that `DatasetType` will also be 

1369 ignored. 

1370 

1371 Raises 

1372 ------ 

1373 ButlerValidationError 

1374 Raised if there is some inconsistency with how this Butler 

1375 is configured. 
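
Examples
--------
For example, validating a subset of dataset types while skipping a known
problem (the names below are illustrative)::

    butler.validateConfiguration(logFailures=True,
                                 datasetTypeNames=["calexp", "src"],
                                 ignore=["raw"])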

1376 """ 

1377 if datasetTypeNames: 

1378 entities = [self.registry.getDatasetType(name) for name in datasetTypeNames] 

1379 else: 

1380 entities = list(self.registry.queryDatasetTypes()) 

1381 

1382 # filter out anything from the ignore list 

1383 if ignore: 

1384 ignore = set(ignore) 

1385 entities = [e for e in entities if e.name not in ignore and e.nameAndComponent()[0] not in ignore] 

1386 else: 

1387 ignore = set() 

1388 

1389 # Find all the registered instruments 

1390 instruments = set( 

1391 dataId["instrument"] for dataId in self.registry.queryDimensions(["instrument"]) 

1392 ) 

1393 

1394 # For each datasetType that has an instrument dimension, create 

1395 # a DatasetRef for each defined instrument 

1396 datasetRefs = [] 

1397 

1398 for datasetType in entities: 

1399 if "instrument" in datasetType.dimensions: 

1400 for instrument in instruments: 

1401 datasetRef = DatasetRef(datasetType, {"instrument": instrument}, conform=False) 

1402 datasetRefs.append(datasetRef) 

1403 

1404 entities.extend(datasetRefs) 

1405 

1406 datastoreErrorStr = None 

1407 try: 

1408 self.datastore.validateConfiguration(entities, logFailures=logFailures) 

1409 except ValidationError as e: 

1410 datastoreErrorStr = str(e) 

1411 

1412 # Also check that the LookupKeys used by the datastores match 

1413 # registry and storage class definitions 

1414 keys = self.datastore.getLookupKeys() 

1415 

1416 failedNames = set() 

1417 failedDataId = set() 

1418 for key in keys: 

1419 datasetType = None 

1420 if key.name is not None: 

1421 if key.name in ignore: 

1422 continue 

1423 

1424 # skip if specific datasetType names were requested and this 

1425 # name does not match 

1426 if datasetTypeNames and key.name not in datasetTypeNames: 

1427 continue 

1428 

1429 # See if it is a StorageClass or a DatasetType 

1430 if key.name in self.storageClasses: 

1431 pass 

1432 else: 

1433 try: 

1434 self.registry.getDatasetType(key.name) 

1435 except KeyError: 

1436 if logFailures: 

1437 log.fatal("Key '%s' does not correspond to a DatasetType or StorageClass", key) 

1438 failedNames.add(key) 

1439 else: 

1440 # Dimensions are checked for consistency when the Butler 

1441 # is created and rendezvoused with a universe. 

1442 pass 

1443 

1444 # Check that the instrument is a valid instrument 

1445 # Currently only support instrument so check for that 

1446 if key.dataId: 

1447 dataIdKeys = set(key.dataId) 

1448 if set(["instrument"]) != dataIdKeys: 

1449 if logFailures: 

1450 log.fatal("Key '%s' has unsupported DataId override", key) 

1451 failedDataId.add(key) 

1452 elif key.dataId["instrument"] not in instruments: 

1453 if logFailures: 

1454 log.fatal("Key '%s' has unknown instrument", key) 

1455 failedDataId.add(key) 

1456 

1457 messages = [] 

1458 

1459 if datastoreErrorStr: 

1460 messages.append(datastoreErrorStr) 

1461 

1462 for failed, msg in ((failedNames, "Keys without corresponding DatasetType or StorageClass entry: "), 

1463 (failedDataId, "Keys with bad DataId entries: ")): 

1464 if failed: 

1465 msg += ", ".join(str(k) for k in failed) 

1466 messages.append(msg) 

1467 

1468 if messages: 

1469 raise ValidationError(";\n".join(messages)) 

1470 

1471 registry: Registry 

1472 """The object that manages dataset metadata and relationships (`Registry`). 

1473 

1474 Most operations that don't involve reading or writing butler datasets are 

1475 accessible only via `Registry` methods. 

1476 """ 

1477 

1478 datastore: Datastore 

1479 """The object that manages actual dataset storage (`Datastore`). 

1480 

1481 Direct user access to the datastore should rarely be necessary; the primary 

1482 exception is the case where a `Datastore` implementation provides extra 

1483 functionality beyond what the base class defines. 

1484 """ 

1485 

1486 storageClasses: StorageClassFactory 

1487 """An object that maps known storage class names to objects that fully 

1488 describe them (`StorageClassFactory`). 

1489 """ 

1490 

1491 collections: Optional[CollectionSearch] 

1492 """The collections to search and any restrictions on the dataset types to 

1493 search for within them, in order (`CollectionSearch`). 

1494 """ 

1495 

1496 run: Optional[str] 

1497 """Name of the run this butler writes outputs to (`str` or `None`). 

1498 """ 

1499 

1500 tags: Tuple[str, ...] 

1501 """Names of `~CollectionType.TAGGED` collections this butler associates 

1502 with in `put` and `ingest`, and disassociates from in `pruneDatasets` 

1503 (`tuple` [ `str` ]). 

1504 """