# This file is part of daf_butler.
#
# Developed for the LSST Data Management System.
# This product includes software developed by the LSST Project
# (http://www.lsst.org).
# See the COPYRIGHT file at the top-level directory of this distribution
# for details of code ownership.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

"""
Butler top level classes.
"""
from __future__ import annotations

__all__ = ("Butler", "ButlerValidationError")

import os
from collections import defaultdict
import contextlib
import logging
from typing import (
    Any,
    ClassVar,
    ContextManager,
    Dict,
    Iterable,
    List,
    Mapping,
    MutableMapping,
    Optional,
    Tuple,
    Union,
)

try:
    import boto3
except ImportError:
    boto3 = None

from lsst.utils import doImport
from .core import (
    ButlerURI,
    CompositesMap,
    Config,
    ConfigSubset,
    DataCoordinate,
    DataId,
    DatasetRef,
    DatasetType,
    Datastore,
    FileDataset,
    Quantum,
    RepoExport,
    StorageClassFactory,
    ValidationError,
)
from .core.repoRelocation import BUTLER_ROOT_TAG
from .core.safeFileIo import safeMakeDir
from .core.utils import transactional, getClassOf
from .core.s3utils import bucketExists
from ._deferredDatasetHandle import DeferredDatasetHandle
from ._butlerConfig import ButlerConfig
from .registry import Registry, RegistryConfig, CollectionType
from .registry.wildcards import CollectionSearch

log = logging.getLogger(__name__)


class ButlerValidationError(ValidationError):
    """There is a problem with the Butler configuration."""
    pass

class Butler:
    """Main entry point for the data access system.

    Parameters
    ----------
    config : `ButlerConfig`, `Config` or `str`, optional.
        Configuration. Anything acceptable to the
        `ButlerConfig` constructor. If a directory path
        is given the configuration will be read from a ``butler.yaml`` file in
        that location. If `None` is given default values will be used.
    butler : `Butler`, optional.
        If provided, construct a new Butler that uses the same registry and
        datastore as the given one, but with the given collection and run.
        Incompatible with the ``config``, ``searchPaths``, and ``writeable``
        arguments.
    collections : `Any`, optional
        An expression specifying the collections to be searched (in order) when
        reading datasets, and optionally dataset type restrictions on them.
        This may be:
        - a `str` collection name;
        - a tuple of (collection name, *dataset type restriction*);
        - an iterable of either of the above;
        - a mapping from `str` to *dataset type restriction*.

        See :ref:`daf_butler_collection_expressions` for more information,
        including the definition of a *dataset type restriction*. All
        collections must either already exist or be specified to be created
        by other arguments.
    run : `str`, optional
        Name of the run datasets should be output to. If the run
        does not exist, it will be created. If ``collections`` is `None`, it
        will be set to ``[run]``. If this is not set (and ``writeable`` is
        not set either), a read-only butler will be created.
    tags : `Iterable` [ `str` ], optional
        A list of `~CollectionType.TAGGED` collections that datasets should be
        associated with in `put` or `ingest` and disassociated from in
        `pruneDatasets`. If any of these collections does not exist, it will
        be created.
    chains : `Mapping` [ `str`, `Iterable` [ `str` ] ], optional
        A mapping from the names of new `~CollectionType.CHAINED` collections
        to an expression identifying their child collections (which takes the
        same form as the ``collections`` argument). Chains may be nested only
        if children precede their parents in this mapping.
    searchPaths : `list` of `str`, optional
        Directory paths to search when calculating the full Butler
        configuration. Not used if the supplied config is already a
        `ButlerConfig`.
    writeable : `bool`, optional
        Explicitly sets whether the butler supports write operations. If not
        provided, a read-write butler is created if any of ``run``, ``tags``,
        or ``chains`` is non-empty.

    Examples
    --------
    While there are many ways to control exactly how a `Butler` interacts with
    the collections in its `Registry`, the most common cases are still simple.

    For a read-only `Butler` that searches one collection, do::

        butler = Butler("/path/to/repo", collections=["u/alice/DM-50000"])

    For a read-write `Butler` that writes to and reads from a
    `~CollectionType.RUN` collection::

        butler = Butler("/path/to/repo", run="u/alice/DM-50000/a")

    The `Butler` passed to a ``PipelineTask`` is often much more complex,
    because we want to write to one `~CollectionType.RUN` collection but read
    from several others (as well), while defining a new
    `~CollectionType.CHAINED` collection that combines them all::

        butler = Butler("/path/to/repo", run="u/alice/DM-50000/a",
                        collections=["u/alice/DM-50000"],
                        chains={
                            "u/alice/DM-50000": ["u/alice/DM-50000/a",
                                                 "u/bob/DM-49998",
                                                 "raw/hsc"]
                        })

    This butler will `put` new datasets to the run ``u/alice/DM-50000/a``, but
    they'll also be available from the chained collection ``u/alice/DM-50000``.
    Datasets will be read first from that run (since it appears first in the
    chain), and then from ``u/bob/DM-49998`` and finally ``raw/hsc``.
    If ``u/alice/DM-50000`` had already been defined, the ``chains`` argument
    would be unnecessary. We could also construct a butler that performs
    exactly the same `put` and `get` operations without actually creating a
    chained collection, just by passing multiple items in ``collections``::

        butler = Butler("/path/to/repo", run="u/alice/DM-50000/a",
                        collections=["u/alice/DM-50000/a",
                                     "u/bob/DM-49998",
                                     "raw/hsc"])

    Finally, one can always create a `Butler` with no collections::

        butler = Butler("/path/to/repo", writeable=True)

    This can be extremely useful when you just want to use ``butler.registry``,
    e.g. for inserting dimension data or managing collections, or when the
    collections you want to use with the butler are not consistent.
    Passing ``writeable`` explicitly here is only necessary if you want to be
    able to make changes to the repo; usually the value for ``writeable``
    can be guessed from the collection arguments provided, but it defaults to
    `False` when there are no collection arguments.
    """
    def __init__(self, config: Union[Config, str, None] = None, *,
                 butler: Optional[Butler] = None,
                 collections: Any = None,
                 run: Optional[str] = None,
                 tags: Iterable[str] = (),
                 chains: Optional[Mapping[str, Any]] = None,
                 searchPaths: Optional[List[str]] = None,
                 writeable: Optional[bool] = None):
        # Transform any single-pass iterator into an actual sequence so we
        # can see if it's empty.
        self.tags = tuple(tags)
        # Load registry, datastore, etc. from config or existing butler.
        if butler is not None:
            if config is not None or searchPaths is not None or writeable is not None:
                raise TypeError("Cannot pass 'config', 'searchPaths', or 'writeable' "
                                "arguments with 'butler' argument.")
            self.registry = butler.registry
            self.datastore = butler.datastore
            self.storageClasses = butler.storageClasses
            self._composites = butler._composites
            self._config = butler._config
        else:
            self._config = ButlerConfig(config, searchPaths=searchPaths)
            if "root" in self._config:
                butlerRoot = self._config["root"]
            else:
                butlerRoot = self._config.configDir
            if writeable is None:
                writeable = run is not None or chains is not None or self.tags
            self.registry = Registry.fromConfig(self._config, butlerRoot=butlerRoot, writeable=writeable)
            self.datastore = Datastore.fromConfig(self._config, self.registry.getDatastoreBridgeManager(),
                                                  butlerRoot=butlerRoot)
            self.storageClasses = StorageClassFactory()
            self.storageClasses.addFromConfig(self._config)
            self._composites = CompositesMap(self._config, universe=self.registry.dimensions)
        # Check the many collection arguments for consistency and create any
        # needed collections that don't exist.
        if collections is None:
            if run is not None:
                collections = (run,)
            else:
                collections = ()
        self.collections = CollectionSearch.fromExpression(collections)
        if chains is None:
            chains = {}
        self.run = run
        if "run" in self._config or "collection" in self._config:
            raise ValueError("Passing a run or collection via configuration is no longer supported.")
        if self.run is not None:
            self.registry.registerCollection(self.run, type=CollectionType.RUN)
        for tag in self.tags:
            self.registry.registerCollection(tag, type=CollectionType.TAGGED)
        for parent, children in chains.items():
            self.registry.registerCollection(parent, type=CollectionType.CHAINED)
            self.registry.setCollectionChain(parent, children)

    GENERATION: ClassVar[int] = 3
    """This is a Generation 3 Butler.

    This attribute may be removed in the future, once the Generation 2 Butler
    interface has been fully retired; it should only be used in transitional
    code.
    """

    @staticmethod
    def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: bool = False,
                 createRegistry: bool = True, searchPaths: Optional[List[str]] = None,
                 forceConfigRoot: bool = True, outfile: Optional[str] = None,
                 overwrite: bool = False) -> Config:
        """Create an empty data repository by adding a butler.yaml config
        to a repository root directory.

        Parameters
        ----------
        root : `str` or `ButlerURI`
            Path or URI to the root location of the new repository. Will be
            created if it does not exist.
        config : `Config` or `str`, optional
            Configuration to write to the repository, after setting any
            root-dependent Registry or Datastore config options. Can not
            be a `ButlerConfig` or a `ConfigSubset`. If `None`, default
            configuration will be used. Root-dependent config options
            specified in this config are overwritten if ``forceConfigRoot``
            is `True`.
        standalone : `bool`
            If `True`, write all expanded defaults, not just customized or
            repository-specific settings.
            This (mostly) decouples the repository from the default
            configuration, insulating it from changes to the defaults (which
            may be good or bad, depending on the nature of the changes).
            Future *additions* to the defaults will still be picked up when
            initializing `Butlers` to repos created with ``standalone=True``.
        createRegistry : `bool`, optional
            If `True` create a new Registry.
        searchPaths : `list` of `str`, optional
            Directory paths to search when calculating the full butler
            configuration.
        forceConfigRoot : `bool`, optional
            If `False`, any values present in the supplied ``config`` that
            would normally be reset are not overridden and will appear
            directly in the output config. This allows non-standard overrides
            of the root directory for a datastore or registry to be given.
            If this parameter is `True` the values for ``root`` will be
            forced into the resulting config if appropriate.
        outfile : `str`, optional
            If not-`None`, the output configuration will be written to this
            location rather than into the repository itself. Can be a URI
            string. Can refer to a directory that will be used to write
            ``butler.yaml``.
        overwrite : `bool`, optional
            Create a new configuration file even if one already exists
            in the specified output location. Default is to raise
            an exception.

        Returns
        -------
        config : `Config`
            The updated `Config` instance written to the repo.

        Raises
        ------
        ValueError
            Raised if a ButlerConfig or ConfigSubset is passed instead of a
            regular Config (as these subclasses would make it impossible to
            support ``standalone=False``).
        FileExistsError
            Raised if the output config file already exists.
        os.error
            Raised if the directory does not exist, exists but is not a
            directory, or cannot be created.

        Notes
        -----
        Note that when ``standalone=False`` (the default), the configuration
        search path (see `ConfigSubset.defaultSearchPaths`) that was used to
        construct the repository should also be used to construct any Butlers
        to avoid configuration inconsistencies.
        """
        if isinstance(config, (ButlerConfig, ConfigSubset)):
            raise ValueError("makeRepo must be passed a regular Config without defaults applied.")

        # for "file" schemes we are assuming POSIX semantics for paths, for
        # schemeless URIs we are assuming os.path semantics.
        uri = ButlerURI(root, forceDirectory=True)
        if uri.scheme == "file" or not uri.scheme:
            if not os.path.isdir(uri.ospath):
                safeMakeDir(uri.ospath)
        elif uri.scheme == "s3":
            # bucket must already exist
            if not bucketExists(uri.netloc):
                raise ValueError(f"Bucket {uri.netloc} does not exist!")
            s3 = boto3.client("s3")
            # don't create S3 key when root is at the top-level of a bucket
            if not uri.path == "/":
                s3.put_object(Bucket=uri.netloc, Key=uri.relativeToPathRoot)
        else:
            raise ValueError(f"Unrecognized scheme: {uri.scheme}")
        config = Config(config)

        # If we are creating a new repo from scratch with relative roots,
        # do not propagate an explicit root from the config file
        if "root" in config:
            del config["root"]

        full = ButlerConfig(config, searchPaths=searchPaths)  # this applies defaults
        datastoreClass = doImport(full["datastore", "cls"])
        datastoreClass.setConfigRoot(BUTLER_ROOT_TAG, config, full, overwrite=forceConfigRoot)

        # if key exists in given config, parse it, otherwise parse the defaults
        # in the expanded config
        if config.get(("registry", "db")):
            registryConfig = RegistryConfig(config)
        else:
            registryConfig = RegistryConfig(full)
        defaultDatabaseUri = registryConfig.makeDefaultDatabaseUri(BUTLER_ROOT_TAG)
        if defaultDatabaseUri is not None:
            Config.updateParameters(RegistryConfig, config, full,
                                    toUpdate={"db": defaultDatabaseUri},
                                    overwrite=forceConfigRoot)
        else:
            Config.updateParameters(RegistryConfig, config, full, toCopy=("db",),
                                    overwrite=forceConfigRoot)

        if standalone:
            config.merge(full)
        if outfile is not None:
            # When writing to a separate location we must include
            # the root of the butler repo in the config else it won't know
            # where to look.
            config["root"] = uri.geturl()
            configURI = outfile
        else:
            configURI = uri
        config.dumpToUri(configURI, overwrite=overwrite)

        # Create Registry and populate tables
        Registry.fromConfig(config, create=createRegistry, butlerRoot=root)
        return config

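    # A minimal usage sketch for ``makeRepo`` (illustrative; the repository
    # path and run name below are hypothetical, not part of this module):
    #
    #     config = Butler.makeRepo("/path/to/new/repo")
    #     butler = Butler("/path/to/new/repo", run="u/alice/ingest")
    #
    # The first call writes ``butler.yaml`` and creates the Registry; the
    # second constructs a writeable Butler against the new repository.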

    @classmethod
    def _unpickle(cls, config: ButlerConfig, collections: Optional[CollectionSearch], run: Optional[str],
                  tags: Tuple[str, ...], writeable: bool) -> Butler:
        """Callable used to unpickle a Butler.

        We prefer not to use ``Butler.__init__`` directly so we can force some
        of its many arguments to be keyword-only (note that ``__reduce__``
        can only invoke callables with positional arguments).

        Parameters
        ----------
        config : `ButlerConfig`
            Butler configuration, already coerced into a true `ButlerConfig`
            instance (and hence after any search paths for overrides have been
            utilized).
        collections : `CollectionSearch`
            Names of collections to read from.
        run : `str`, optional
            Name of `~CollectionType.RUN` collection to write to.
        tags : `tuple` [`str`]
            Names of `~CollectionType.TAGGED` collections to associate with.
        writeable : `bool`
            Whether the Butler should support write operations.

        Returns
        -------
        butler : `Butler`
            A new `Butler` instance.
        """
        return cls(config=config, collections=collections, run=run, tags=tags, writeable=writeable)

    def __reduce__(self):
        """Support pickling.
        """
        return (Butler._unpickle, (self._config, self.collections, self.run, self.tags,
                                   self.registry.isWriteable()))

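    # Sketch of the pickle round trip enabled by ``__reduce__`` (assumes an
    # already-constructed ``butler``; variable names are illustrative):
    #
    #     import pickle
    #     clone = pickle.loads(pickle.dumps(butler))
    #     assert clone.collections == butler.collections
    #
    # The clone is rebuilt through ``_unpickle`` from the saved config,
    # collections, run, tags, and writeable flag.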

    def __str__(self):
        return "Butler(collections={}, run={}, tags={}, datastore='{}', registry='{}')".format(
            self.collections, self.run, self.tags, self.datastore, self.registry)

    def isWriteable(self) -> bool:
        """Return `True` if this `Butler` supports write operations.
        """
        return self.registry.isWriteable()

    @contextlib.contextmanager
    def transaction(self):
        """Context manager supporting `Butler` transactions.

        Transactions can be nested.
        """
        with self.registry.transaction():
            with self.datastore.transaction():
                yield

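    # Sketch of ``transaction`` usage (the ``put`` calls, dataset type names,
    # and ``dataId`` are hypothetical): if the block raises, the registry and
    # datastore changes made inside it are rolled back together.
    #
    #     with butler.transaction():
    #         butler.put(flat, "flat", dataId)
    #         butler.put(bias, "bias", dataId)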

    def _standardizeArgs(self, datasetRefOrType: Union[DatasetRef, DatasetType, str],
                         dataId: Optional[DataId] = None, **kwds: Any) -> Tuple[DatasetType, DataId]:
        """Standardize the arguments passed to several Butler APIs.

        Parameters
        ----------
        datasetRefOrType : `DatasetRef`, `DatasetType`, or `str`
            When `DatasetRef` the `dataId` should be `None`.
            Otherwise the `DatasetType` or name thereof.
        dataId : `dict` or `DataCoordinate`
            A `dict` of `Dimension` link name, value pairs that label the
            `DatasetRef` within a Collection. When `None`, a `DatasetRef`
            should be provided as the second argument.
        kwds
            Additional keyword arguments used to augment or construct a
            `DataCoordinate`. See `DataCoordinate.standardize`
            parameters.

        Returns
        -------
        datasetType : `DatasetType`
            A `DatasetType` instance extracted from ``datasetRefOrType``.
        dataId : `dict` or `DataId`, optional
            Argument that can be used (along with ``kwds``) to construct a
            `DataId`.

        Notes
        -----
        Butler APIs that conceptually need a DatasetRef also allow passing a
        `DatasetType` (or the name of one) and a `DataId` (or a dict and
        keyword arguments that can be used to construct one) separately. This
        method accepts those arguments and always returns a true `DatasetType`
        and a `DataId` or `dict`.

        Standardization of `dict` vs `DataId` is best handled by passing the
        returned ``dataId`` (and ``kwds``) to `Registry` APIs, which are
        generally similarly flexible.
        """
        externalDatasetType = None
        internalDatasetType = None
        if isinstance(datasetRefOrType, DatasetRef):
            if dataId is not None or kwds:
                raise ValueError("DatasetRef given, cannot use dataId as well")
            externalDatasetType = datasetRefOrType.datasetType
            dataId = datasetRefOrType.dataId
        else:
            # Don't check whether DataId is provided, because Registry APIs
            # can usually construct a better error message when it wasn't.
            if isinstance(datasetRefOrType, DatasetType):
                externalDatasetType = datasetRefOrType
            else:
                internalDatasetType = self.registry.getDatasetType(datasetRefOrType)

        # Check that they are self-consistent
        if externalDatasetType is not None:
            internalDatasetType = self.registry.getDatasetType(externalDatasetType.name)
            if externalDatasetType != internalDatasetType:
                raise ValueError(f"Supplied dataset type ({externalDatasetType}) inconsistent with "
                                 f"registry definition ({internalDatasetType})")

        return internalDatasetType, dataId

    def _findDatasetRef(self, datasetRefOrType: Union[DatasetRef, DatasetType, str],
                        dataId: Optional[DataId] = None, *,
                        collections: Any = None,
                        allowUnresolved: bool = False,
                        **kwds: Any) -> DatasetRef:
        """Shared logic for methods that start with a search for a dataset in
        the registry.

        Parameters
        ----------
        datasetRefOrType : `DatasetRef`, `DatasetType`, or `str`
            When `DatasetRef` the `dataId` should be `None`.
            Otherwise the `DatasetType` or name thereof.
        dataId : `dict` or `DataCoordinate`, optional
            A `dict` of `Dimension` link name, value pairs that label the
            `DatasetRef` within a Collection. When `None`, a `DatasetRef`
            should be provided as the first argument.
        collections : Any, optional
            Collections to be searched, overriding ``self.collections``.
            Can be any of the types supported by the ``collections`` argument
            to butler construction.
        allowUnresolved : `bool`, optional
            If `True`, return an unresolved `DatasetRef` if finding a resolved
            one in the `Registry` fails. Defaults to `False`.
        kwds
            Additional keyword arguments used to augment or construct a
            `DataId`. See `DataId` parameters.

        Returns
        -------
        ref : `DatasetRef`
            A reference to the dataset identified by the given arguments.

        Raises
        ------
        LookupError
            Raised if no matching dataset exists in the `Registry` (and
            ``allowUnresolved is False``).
        ValueError
            Raised if a resolved `DatasetRef` was passed as an input, but it
            differs from the one found in the registry.
        TypeError
            Raised if no collections were provided.
        """
        datasetType, dataId = self._standardizeArgs(datasetRefOrType, dataId, **kwds)
        if isinstance(datasetRefOrType, DatasetRef):
            idNumber = datasetRefOrType.id
        else:
            idNumber = None
        # Expand the data ID first instead of letting registry.findDataset do
        # it, so we get the result even if it returns None.
        dataId = self.registry.expandDataId(dataId, graph=datasetType.dimensions, **kwds)
        if collections is None:
            collections = self.collections
            if not collections:
                raise TypeError("No input collections provided.")
        else:
            collections = CollectionSearch.fromExpression(collections)
        # Always lookup the DatasetRef, even if one is given, to ensure it is
        # present in the current collection.
        ref = self.registry.findDataset(datasetType, dataId, collections=collections)
        if ref is None:
            if allowUnresolved:
                return DatasetRef(datasetType, dataId)
            else:
                raise LookupError(f"Dataset {datasetType.name} with data ID {dataId} "
                                  f"could not be found in collections {collections}.")
        if idNumber is not None and idNumber != ref.id:
            raise ValueError(f"DatasetRef.id provided ({idNumber}) does not match "
                             f"id ({ref.id}) in registry in collections {collections}.")
        return ref

    @transactional
    def put(self, obj: Any, datasetRefOrType: Union[DatasetRef, DatasetType, str],
            dataId: Optional[DataId] = None, *,
            producer: Optional[Quantum] = None,
            run: Optional[str] = None,
            tags: Optional[Iterable[str]] = None,
            **kwds: Any) -> DatasetRef:
        """Store and register a dataset.

        Parameters
        ----------
        obj : `object`
            The dataset.
        datasetRefOrType : `DatasetRef`, `DatasetType`, or `str`
            When `DatasetRef` is provided, ``dataId`` should be `None`.
            Otherwise the `DatasetType` or name thereof.
        dataId : `dict` or `DataCoordinate`
            A `dict` of `Dimension` link name, value pairs that label the
            `DatasetRef` within a Collection. When `None`, a `DatasetRef`
            should be provided as the second argument.
        producer : `Quantum`, optional
            The producer.
        run : `str`, optional
            The name of the run the dataset should be added to, overriding
            ``self.run``.
        tags : `Iterable` [ `str` ], optional
            The names of `~CollectionType.TAGGED` collections to associate
            the dataset with, overriding ``self.tags``. These collections
            must have already been added to the `Registry`.
        kwds
            Additional keyword arguments used to augment or construct a
            `DataCoordinate`. See `DataCoordinate.standardize`
            parameters.

        Returns
        -------
        ref : `DatasetRef`
            A reference to the stored dataset, updated with the correct id if
            given.

        Raises
        ------
        TypeError
            Raised if the butler is read-only or if no run has been provided.
        """
        log.debug("Butler put: %s, dataId=%s, producer=%s, run=%s", datasetRefOrType, dataId, producer, run)
        if not self.isWriteable():
            raise TypeError("Butler is read-only.")
        datasetType, dataId = self._standardizeArgs(datasetRefOrType, dataId, **kwds)
        if isinstance(datasetRefOrType, DatasetRef) and datasetRefOrType.id is not None:
            raise ValueError("DatasetRef must not be in registry, must have None id")

        if run is None:
            if self.run is None:
                raise TypeError("No run provided.")
            run = self.run
        # No need to check type for run; first thing we do is
        # insertDatasets, and that will check for us.

        if tags is None:
            tags = self.tags
        else:
            tags = tuple(tags)
        for tag in tags:
            # Check that these are tagged collections up front, because we want
            # to avoid relying on Datastore transactionality to avoid modifying
            # the repo if there's an error later.
            collectionType = self.registry.getCollectionType(tag)
            if collectionType is not CollectionType.TAGGED:
                raise TypeError(f"Cannot associate into collection '{tag}' of non-TAGGED type "
                                f"{collectionType.name}.")

        # Disable all disassembly at the registry level for now
        isVirtualComposite = False

        # Add Registry Dataset entry. If not a virtual composite, add
        # and attach components at the same time.
        dataId = self.registry.expandDataId(dataId, graph=datasetType.dimensions, **kwds)
        ref, = self.registry.insertDatasets(datasetType, run=run, dataIds=[dataId],
                                            producer=producer,
                                            # Never write components into
                                            # registry
                                            recursive=False)

        # Check to see if this datasetType requires disassembly
        if isVirtualComposite:
            components = datasetType.storageClass.assembler().disassemble(obj)
            componentRefs = {}
            for component, info in components.items():
                compTypeName = datasetType.componentTypeName(component)
                compRef = self.put(info.component, compTypeName, dataId, producer=producer, run=run,
                                   collection=False)  # We don't need to recursively associate.
                componentRefs[component] = compRef
            ref = self.registry.attachComponents(ref, componentRefs)
        else:
            # This is an entity without a disassembler.
            self.datastore.put(obj, ref)

        for tag in tags:
            self.registry.associate(tag, [ref])  # this is already recursive by default

        return ref

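    # Minimal ``put`` sketch (the dataset type name and data ID keys are
    # hypothetical and depend on the repository's dimension configuration):
    #
    #     butler = Butler("/path/to/repo", run="u/alice/DM-50000/a")
    #     ref = butler.put(flat, "flat", instrument="HSC", detector=10)
    #
    # The returned ``ref`` is a resolved `DatasetRef`; its ``id`` comes from
    # the ``insertDatasets`` call above, and the object itself is written by
    # ``self.datastore.put``.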

    def getDirect(self, ref: DatasetRef, *, parameters: Optional[Dict[str, Any]] = None):
        """Retrieve a stored dataset.

        Unlike `Butler.get`, this method allows datasets outside the Butler's
        collection to be read as long as the `DatasetRef` that identifies them
        can be obtained separately.

        Parameters
        ----------
        ref : `DatasetRef`
            Reference to an already stored dataset.
        parameters : `dict`
            Additional StorageClass-defined options to control reading,
            typically used to efficiently read only a subset of the dataset.

        Returns
        -------
        obj : `object`
            The dataset.
        """
        # if the ref exists in the store we return it directly
        if self.datastore.exists(ref):
            return self.datastore.get(ref, parameters=parameters)
        elif ref.isComposite() and ref.components:
            # The presence of components indicates that this dataset
            # was disassembled at the registry level.
            # Check that we haven't got any unknown parameters
            ref.datasetType.storageClass.validateParameters(parameters)
            # Reconstruct the composite
            usedParams = set()
            components = {}
            for compName, compRef in ref.components.items():
                # make a dictionary of parameters containing only the subset
                # supported by the StorageClass of the components
                compParams = compRef.datasetType.storageClass.filterParameters(parameters)
                usedParams.update(set(compParams))
                components[compName] = self.datastore.get(compRef, parameters=compParams)

            # Any unused parameters will have to be passed to the assembler
            if parameters:
                unusedParams = {k: v for k, v in parameters.items() if k not in usedParams}
            else:
                unusedParams = {}

            # Assemble the components
            inMemoryDataset = ref.datasetType.storageClass.assembler().assemble(components)
            return ref.datasetType.storageClass.assembler().handleParameters(inMemoryDataset,
                                                                             parameters=unusedParams)
        else:
            # single entity in datastore
            raise FileNotFoundError(f"Unable to locate dataset '{ref}' in datastore {self.datastore.name}")

    def getDeferred(self, datasetRefOrType: Union[DatasetRef, DatasetType, str],
                    dataId: Optional[DataId] = None, *,
                    parameters: Union[dict, None] = None,
                    collections: Any = None,
                    **kwds: Any) -> DeferredDatasetHandle:
        """Create a `DeferredDatasetHandle` which can later retrieve a dataset

        Parameters
        ----------
        datasetRefOrType : `DatasetRef`, `DatasetType`, or `str`
            When `DatasetRef` the `dataId` should be `None`.
            Otherwise the `DatasetType` or name thereof.
        dataId : `dict` or `DataCoordinate`, optional
            A `dict` of `Dimension` link name, value pairs that label the
            `DatasetRef` within a Collection. When `None`, a `DatasetRef`
            should be provided as the first argument.
        parameters : `dict`
            Additional StorageClass-defined options to control reading,
            typically used to efficiently read only a subset of the dataset.
        collections : Any, optional
            Collections to be searched, overriding ``self.collections``.
            Can be any of the types supported by the ``collections`` argument
            to butler construction.
        kwds
            Additional keyword arguments used to augment or construct a
            `DataId`. See `DataId` parameters.

        Returns
        -------
        obj : `DeferredDatasetHandle`
            A handle which can be used to retrieve a dataset at a later time.

        Raises
        ------
        LookupError
            Raised if no matching dataset exists in the `Registry` (and
            ``allowUnresolved is False``).
        ValueError
            Raised if a resolved `DatasetRef` was passed as an input, but it
            differs from the one found in the registry.
        TypeError
            Raised if no collections were provided.
        """
        ref = self._findDatasetRef(datasetRefOrType, dataId, collections=collections, **kwds)
        return DeferredDatasetHandle(butler=self, ref=ref, parameters=parameters)

    def get(self, datasetRefOrType: Union[DatasetRef, DatasetType, str],
            dataId: Optional[DataId] = None, *,
            parameters: Optional[Dict[str, Any]] = None,
            collections: Any = None,
            **kwds: Any) -> Any:
        """Retrieve a stored dataset.

        Parameters
        ----------
        datasetRefOrType : `DatasetRef`, `DatasetType`, or `str`
            When `DatasetRef` the `dataId` should be `None`.
            Otherwise the `DatasetType` or name thereof.
        dataId : `dict` or `DataCoordinate`
            A `dict` of `Dimension` link name, value pairs that label the
            `DatasetRef` within a Collection. When `None`, a `DatasetRef`
            should be provided as the first argument.
        parameters : `dict`
            Additional StorageClass-defined options to control reading,
            typically used to efficiently read only a subset of the dataset.
        collections : Any, optional
            Collections to be searched, overriding ``self.collections``.
            Can be any of the types supported by the ``collections`` argument
            to butler construction.
        kwds
            Additional keyword arguments used to augment or construct a
            `DataCoordinate`. See `DataCoordinate.standardize`
            parameters.

        Returns
        -------
        obj : `object`
            The dataset.

        Raises
        ------
        ValueError
            Raised if a resolved `DatasetRef` was passed as an input, but it
            differs from the one found in the registry.
        LookupError
            Raised if no matching dataset exists in the `Registry`.
        TypeError
            Raised if no collections were provided.
        """
        log.debug("Butler get: %s, dataId=%s, parameters=%s", datasetRefOrType, dataId, parameters)
        ref = self._findDatasetRef(datasetRefOrType, dataId, collections=collections, **kwds)
        return self.getDirect(ref, parameters=parameters)

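    # Read-side sketch matching the ``put`` sketch above (same hypothetical
    # names):
    #
    #     butler = Butler("/path/to/repo", collections=["u/alice/DM-50000/a"])
    #     flat = butler.get("flat", instrument="HSC", detector=10)
    #     handle = butler.getDeferred("flat", instrument="HSC", detector=10)
    #     flat_again = handle.get()
    #
    # ``getDeferred`` only resolves the `DatasetRef`; the datastore read is
    # postponed until the handle's ``get`` is called.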

    def getURIs(self, datasetRefOrType: Union[DatasetRef, DatasetType, str],
                dataId: Optional[DataId] = None, *,
                predict: bool = False,
                collections: Any = None,
                run: Optional[str] = None,
                **kwds: Any) -> Tuple[Optional[ButlerURI], Dict[str, ButlerURI]]:
        """Return the URIs associated with the dataset.

        Parameters
        ----------
        datasetRefOrType : `DatasetRef`, `DatasetType`, or `str`
            When `DatasetRef` the `dataId` should be `None`.
            Otherwise the `DatasetType` or name thereof.
        dataId : `dict` or `DataCoordinate`
            A `dict` of `Dimension` link name, value pairs that label the
            `DatasetRef` within a Collection. When `None`, a `DatasetRef`
            should be provided as the first argument.
        predict : `bool`
            If `True`, allow URIs to be returned for datasets that have not
            been written.
        collections : Any, optional
            Collections to be searched, overriding ``self.collections``.
            Can be any of the types supported by the ``collections`` argument
            to butler construction.
        run : `str`, optional
            Run to use for predictions, overriding ``self.run``.
        kwds
            Additional keyword arguments used to augment or construct a
            `DataCoordinate`. See `DataCoordinate.standardize`
            parameters.

        Returns
        -------
        primary : `ButlerURI`
            The URI to the primary artifact associated with this dataset.
            If the dataset was disassembled within the datastore this
            may be `None`.
        components : `dict`
            URIs to any components associated with the dataset artifact.
            Can be empty if there are no components.
        """
        ref = self._findDatasetRef(datasetRefOrType, dataId, allowUnresolved=predict,
                                   collections=collections, **kwds)
        if ref.id is None:  # only possible if predict is True
            if run is None:
                run = self.run
                if run is None:
                    raise TypeError("Cannot predict location with run=None.")
            # Lie about ID, because we can't guess it, and only
            # Datastore.getURIs() will ever see it (and it doesn't use it).
            ref = ref.resolved(id=0, run=run)
        return self.datastore.getURIs(ref, predict)

    def getURI(self, datasetRefOrType: Union[DatasetRef, DatasetType, str],
               dataId: Optional[DataId] = None, *,
               predict: bool = False,
               collections: Any = None,
               run: Optional[str] = None,
               **kwds: Any) -> ButlerURI:
        """Return the URI to the Dataset.

        Parameters
        ----------
        datasetRefOrType : `DatasetRef`, `DatasetType`, or `str`
            When `DatasetRef` the `dataId` should be `None`.
            Otherwise the `DatasetType` or name thereof.
        dataId : `dict` or `DataCoordinate`
            A `dict` of `Dimension` link name, value pairs that label the
            `DatasetRef` within a Collection. When `None`, a `DatasetRef`
            should be provided as the first argument.
        predict : `bool`
            If `True`, allow URIs to be returned for datasets that have not
            been written.
        collections : Any, optional
            Collections to be searched, overriding ``self.collections``.
            Can be any of the types supported by the ``collections`` argument
            to butler construction.
        run : `str`, optional
            Run to use for predictions, overriding ``self.run``.
        kwds
            Additional keyword arguments used to augment or construct a
            `DataCoordinate`. See `DataCoordinate.standardize`
            parameters.

        Returns
        -------
        uri : `ButlerURI`
            URI pointing to the Dataset within the datastore. If the
            Dataset does not exist in the datastore, and if ``predict`` is
            `True`, the URI will be a prediction and will include a URI
            fragment "#predicted".
            If the datastore does not have entities that relate well
            to the concept of a URI the returned URI string will be
            descriptive. The returned URI is not guaranteed to be obtainable.

        Raises
        ------
        LookupError
            A URI has been requested for a dataset that does not exist and
            guessing is not allowed.
        ValueError
            Raised if a resolved `DatasetRef` was passed as an input, but it
            differs from the one found in the registry.
        TypeError
            Raised if no collections were provided.
        RuntimeError
            Raised if a URI is requested for a dataset that consists of
            multiple artifacts.
        """
        primary, components = self.getURIs(datasetRefOrType, dataId=dataId, predict=predict,
                                           collections=collections, run=run, **kwds)

        if primary is None or components:
            raise RuntimeError(f"Dataset ({datasetRefOrType}) includes distinct URIs for components. "
                               "Use Butler.getURIs() instead.")
        return primary

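    # URI retrieval sketch (hypothetical names as above):
    #
    #     uri = butler.getURI("flat", instrument="HSC", detector=10)
    #     future = butler.getURI("flat", instrument="HSC", detector=11,
    #                            predict=True, run="u/alice/DM-50000/a")
    #
    # The second call succeeds even though nothing has been written yet; the
    # returned URI is a prediction and carries a "#predicted" fragment.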

    def datasetExists(self, datasetRefOrType: Union[DatasetRef, DatasetType, str],
                      dataId: Optional[DataId] = None, *,
                      collections: Any = None,
                      **kwds: Any) -> bool:
        """Return True if the Dataset is actually present in the Datastore.

        Parameters
        ----------
        datasetRefOrType : `DatasetRef`, `DatasetType`, or `str`
            When `DatasetRef` the `dataId` should be `None`.
            Otherwise the `DatasetType` or name thereof.
        dataId : `dict` or `DataCoordinate`
            A `dict` of `Dimension` link name, value pairs that label the
            `DatasetRef` within a Collection. When `None`, a `DatasetRef`
            should be provided as the first argument.
        collections : Any, optional
            Collections to be searched, overriding ``self.collections``.
            Can be any of the types supported by the ``collections`` argument
            to butler construction.
        kwds
            Additional keyword arguments used to augment or construct a
            `DataCoordinate`. See `DataCoordinate.standardize`
            parameters.

        Raises
        ------
        LookupError
            Raised if the dataset is not even present in the Registry.
        ValueError
            Raised if a resolved `DatasetRef` was passed as an input, but it
            differs from the one found in the registry.
        TypeError
            Raised if no collections were provided.
        """
        ref = self._findDatasetRef(datasetRefOrType, dataId, collections=collections, **kwds)
        return self.datastore.exists(ref)

    def pruneCollection(self, name: str, purge: bool = False, unstore: bool = False):
        """Remove a collection and possibly prune datasets within it.

        Parameters
        ----------
        name : `str`
            Name of the collection to remove. If this is a
            `~CollectionType.TAGGED` or `~CollectionType.CHAINED` collection,
            datasets within the collection are not modified unless ``unstore``
            is `True`. If this is a `~CollectionType.RUN` collection,
            ``purge`` and ``unstore`` must be `True`, and all datasets in it
            are fully removed from the data repository.
        purge : `bool`, optional
            If `True`, permit `~CollectionType.RUN` collections to be removed,
            fully removing datasets within them. Requires ``unstore=True`` as
            well, as an added precaution against accidental deletion. Must be
            `False` (default) if the collection is not a ``RUN``.
        unstore : `bool`, optional
            If `True`, remove all datasets in the collection from all
            datastores in which they appear.

        Raises
        ------
        TypeError
            Raised if the butler is read-only or arguments are mutually
            inconsistent.
        """
        # See pruneDatasets comments for more information about the logic here;
        # the cases are almost the same, but here we can rely on Registry to
        # take care of everything but Datastore deletion when we remove the
        # collection.
        if not self.isWriteable():
            raise TypeError("Butler is read-only.")
        if purge and not unstore:
            raise TypeError("Cannot pass purge=True without unstore=True.")
        collectionType = self.registry.getCollectionType(name)
        if collectionType is CollectionType.RUN and not purge:
            raise TypeError(f"Cannot prune RUN collection {name} without purge=True.")
        if collectionType is not CollectionType.RUN and purge:
            raise TypeError(f"Cannot prune {collectionType.name} collection {name} with purge=True.")
        with self.registry.transaction():
            if unstore:
                for ref in self.registry.queryDatasets(..., collections=name, deduplicate=True):
                    if self.datastore.exists(ref):
                        self.datastore.trash(ref)
            self.registry.removeCollection(name)
        if unstore:
            # Point of no return for removing artifacts
            self.datastore.emptyTrash()

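    # Collection-removal sketch (collection names are hypothetical):
    #
    #     # Remove a TAGGED collection but keep the datasets it pointed to.
    #     butler.pruneCollection("u/alice/best-calibs")
    #     # Fully delete a RUN collection and the datasets stored in it.
    #     butler.pruneCollection("u/alice/scratch", purge=True, unstore=True)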

    def pruneDatasets(self, refs: Iterable[DatasetRef], *,
                      disassociate: bool = True,
                      unstore: bool = False,
                      tags: Optional[Iterable[str]] = None,
                      purge: bool = False,
                      run: Optional[str] = None,
                      recursive: bool = True):
        """Remove one or more datasets from a collection and/or storage.

        Parameters
        ----------
        refs : `~collections.abc.Iterable` of `DatasetRef`
            Datasets to prune. These must be "resolved" references (not just
            a `DatasetType` and data ID).
        disassociate : `bool`, optional
            Disassociate pruned datasets from ``self.tags`` (or the collections
            given via the ``tags`` argument). Ignored if ``refs`` is ``...``.
        unstore : `bool`, optional
            If `True` (`False` is default) remove these datasets from all
            datastores known to this butler. Note that this will make it
            impossible to retrieve these datasets even via other collections.
            Datasets that are already not stored are ignored by this option.
        tags : `Iterable` [ `str` ], optional
            `~CollectionType.TAGGED` collections to disassociate the datasets
            from, overriding ``self.tags``. Ignored if ``disassociate`` is
            `False` or ``purge`` is `True`.
        purge : `bool`, optional
            If `True` (`False` is default), completely remove the dataset from
            the `Registry`. To prevent accidental deletions, ``purge`` may
            only be `True` if all of the following conditions are met:

            - All given datasets are in the given run;
            - ``disassociate`` is `True`;
            - ``unstore`` is `True`.

            This mode may remove provenance information from datasets other
            than those provided, and should be used with extreme care.
        run : `str`, optional
            `~CollectionType.RUN` collection to purge from, overriding
            ``self.run``. Ignored unless ``purge`` is `True`.
        recursive : `bool`, optional
            If `True` (default) also prune component datasets of any given
            composite datasets. This will only prune components that are
            actually attached to the given `DatasetRef` objects, which may
            not reflect what is in the database (especially if they were
            obtained from `Registry.queryDatasets`, which does not include
            components in its results).

        Raises
        ------
        TypeError
            Raised if the butler is read-only, if no collection was provided,
            or the conditions for ``purge=True`` were not met.
        """
        if not self.isWriteable():
            raise TypeError("Butler is read-only.")
        if purge:
            if not disassociate:
                raise TypeError("Cannot pass purge=True without disassociate=True.")
            if not unstore:
                raise TypeError("Cannot pass purge=True without unstore=True.")
            if run is None:
                run = self.run
                if run is None:
                    raise TypeError("No run provided but purge=True.")
            collectionType = self.registry.getCollectionType(run)
            if collectionType is not CollectionType.RUN:
                raise TypeError(f"Cannot purge from collection '{run}' "
                                f"of non-RUN type {collectionType.name}.")
        elif disassociate:
            if tags is None:
                tags = self.tags
            else:
                tags = tuple(tags)
            if not tags:
                raise TypeError("No tags provided but disassociate=True.")
            for tag in tags:
                collectionType = self.registry.getCollectionType(tag)
                if collectionType is not CollectionType.TAGGED:
                    raise TypeError(f"Cannot disassociate from collection '{tag}' "
                                    f"of non-TAGGED type {collectionType.name}.")
        # Pruning a component of a DatasetRef makes no sense since registry
        # doesn't always know about components and datastore might not store
        # components in a separate file
        for ref in refs:
            if ref.datasetType.component():
                raise ValueError(f"Can not prune a component of a dataset (ref={ref})")

        if recursive:
            refs = list(DatasetRef.flatten(refs))
        # We don't need an unreliable Datastore transaction for this, because
        # we've been extra careful to ensure that Datastore.trash only involves
        # mutating the Registry (it can _look_ at Datastore-specific things,
        # but shouldn't change them), and hence all operations here are
        # Registry operations.
        with self.registry.transaction():
            if unstore:
                for ref in refs:
                    # There is a difference between a concrete composite
                    # and virtual composite. In a virtual composite the
                    # datastore is never given the top level DatasetRef. In
                    # the concrete composite the datastore knows all the
                    # refs and will clean up itself if asked to remove the
                    # parent ref. We can not check configuration for this
                    # since we can not trust that the configuration is the
                    # same. We therefore have to ask if the ref exists or
                    # not. This is consistent with the fact that we want
                    # to ignore already-removed-from-datastore datasets
                    # anyway.
                    if self.datastore.exists(ref):
                        self.datastore.trash(ref)
            if purge:
                self.registry.removeDatasets(refs, recursive=False)  # refs is already recursively expanded
            elif disassociate:
                for tag in tags:
                    # recursive=False here because refs is already recursive
                    # if we want it to be.
                    self.registry.disassociate(tag, refs, recursive=False)
        # We've exited the Registry transaction, and apparently committed.
        # (if there was an exception, everything rolled back, and it's as if
        # nothing happened - and we never get here).
        # Datastore artifacts are not yet gone, but they're clearly marked
        # as trash, so if we fail to delete now because of (e.g.) filesystem
        # problems we can try again later, and if manual administrative
        # intervention is required, it's pretty clear what that should entail:
        # deleting everything on disk and in private Datastore tables that is
        # in the dataset_location_trash table.
        if unstore:
            # Point of no return for removing artifacts
            self.datastore.emptyTrash()

    @transactional
    def ingest(self, *datasets: FileDataset, transfer: Optional[str] = None, run: Optional[str] = None,
               tags: Optional[Iterable[str]] = None,):
        """Store and register one or more datasets that already exist on disk.

        Parameters
        ----------
        datasets : `FileDataset`
            Each positional argument is a struct containing information about
            a file to be ingested, including its path (either absolute or
            relative to the datastore root, if applicable), a `DatasetRef`,
            and optionally a formatter class or its fully-qualified string
            name. If a formatter is not provided, the formatter that would be
            used for `put` is assumed. On successful return, all
            `FileDataset.ref` attributes will have their `DatasetRef.id`
            attribute populated and all `FileDataset.formatter` attributes will
            be set to the formatter class used. `FileDataset.path` attributes
            may be modified to put paths in whatever the datastore considers a
            standardized form.
        transfer : `str`, optional
            If not `None`, must be one of 'auto', 'move', 'copy', 'hardlink',
            'relsymlink' or 'symlink', indicating how to transfer the file.
        run : `str`, optional
            The name of the run ingested datasets should be added to,
            overriding ``self.run``.
        tags : `Iterable` [ `str` ], optional
            The names of `~CollectionType.TAGGED` collections to associate
            the dataset with, overriding ``self.tags``. These collections
            must have already been added to the `Registry`.

        Raises
        ------
        TypeError
            Raised if the butler is read-only or if no run was provided.
        NotImplementedError
            Raised if the `Datastore` does not support the given transfer mode.
        DatasetTypeNotSupportedError
            Raised if one or more files to be ingested have a dataset type that
            is not supported by the `Datastore`.
        FileNotFoundError
            Raised if one of the given files does not exist.
        FileExistsError
            Raised if transfer is not `None` but the (internal) location the
            file would be moved to is already occupied.

        Notes
        -----
        This operation is not fully exception safe: if a database operation
        fails, the given `FileDataset` instances may be only partially updated.

        It is atomic in terms of database operations (they will either all
        succeed or all fail) providing the database engine implements
        transactions correctly. It will attempt to be atomic in terms of
        filesystem operations as well, but this cannot be implemented
        rigorously for most datastores.
        """
        if not self.isWriteable():
            raise TypeError("Butler is read-only.")
        if run is None:
            if self.run is None:
                raise TypeError("No run provided.")
            run = self.run
        # No need to check run type, since insertDatasets will do that
        # (safely) for us.
        if tags is None:
            tags = self.tags
        else:
            tags = tuple(tags)
        for tag in tags:
            # Check that these are tagged collections up front, because we want
            # to avoid relying on Datastore transactionality to avoid modifying
            # the repo if there's an error later.
            collectionType = self.registry.getCollectionType(tag)
            if collectionType is not CollectionType.TAGGED:
                raise TypeError(f"Cannot associate into collection '{tag}' of non-TAGGED type "
                                f"{collectionType.name}.")
        # Reorganize the inputs so they're grouped by DatasetType and then
        # data ID. We also include a list of DatasetRefs for each FileDataset
        # to hold the resolved DatasetRefs returned by the Registry, before
        # it's safe to swap them into FileDataset.refs.
        # Some type annotation aliases to make that clearer:
        GroupForType = Dict[DataCoordinate, Tuple[FileDataset, List[DatasetRef]]]
        GroupedData = MutableMapping[DatasetType, GroupForType]
        # The actual data structure:
        groupedData: GroupedData = defaultdict(dict)
        # And the nested loop that populates it:
        for dataset in datasets:
            # This list intentionally shared across the inner loop, since it's
            # associated with `dataset`.
            resolvedRefs = []
            for ref in dataset.refs:
                groupedData[ref.datasetType][ref.dataId] = (dataset, resolvedRefs)

        # Now we can bulk-insert into Registry for each DatasetType.
        allResolvedRefs = []
        for datasetType, groupForType in groupedData.items():
            refs = self.registry.insertDatasets(datasetType,
                                                dataIds=groupForType.keys(),
                                                run=run,
                                                recursive=True)
            # Append those resolved DatasetRefs to the new lists we set up for
            # them.
            for ref, (_, resolvedRefs) in zip(refs, groupForType.values()):
                resolvedRefs.append(ref)

        # Go back to the original FileDatasets to replace their refs with the
        # new resolved ones, and also build a big list of all refs.
        allResolvedRefs = []
        for groupForType in groupedData.values():
            for dataset, resolvedRefs in groupForType.values():
                dataset.refs = resolvedRefs
                allResolvedRefs.extend(resolvedRefs)

        # Bulk-associate everything with any tagged collections.
        for tag in tags:
            self.registry.associate(tag, allResolvedRefs)

        # Bulk-insert everything into Datastore.
        self.datastore.ingest(*datasets, transfer=transfer)

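    # Ingest sketch (file path, dataset type, and data ID are hypothetical;
    # ``rawType`` is assumed to be a `DatasetType` already registered in the
    # repository):
    #
    #     ref = DatasetRef(rawType, {"instrument": "HSC", "exposure": 903334})
    #     butler.ingest(FileDataset(path="/data/raw.fits", refs=[ref]),
    #                   transfer="symlink")
    #
    # On return the entries in ``FileDataset.refs`` are resolved and the file
    # is known to both the registry and the datastore.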

1283 @contextlib.contextmanager 

1284 def export(self, *, directory: Optional[str] = None, 

1285 filename: Optional[str] = None, 

1286 format: Optional[str] = None, 

1287 transfer: Optional[str] = None) -> ContextManager[RepoExport]: 

1288 """Export datasets from the repository represented by this `Butler`. 

1289 

1290 This method is a context manager that returns a helper object 

1291 (`RepoExport`) that is used to indicate what information from the 

1292 repository should be exported. 

1293 

1294 Parameters 

1295 ---------- 

1296 directory : `str`, optional 

1297 Directory dataset files should be written to if ``transfer`` is not 

1298 `None`. 

1299 filename : `str`, optional 

1300 Name for the file that will include database information associated 

1301 with the exported datasets. If this is not an absolute path and 

1302 ``directory`` is not `None`, it will be written to ``directory`` 

1303 instead of the current working directory. Defaults to 

1304 "export.{format}". 

1305 format : `str`, optional 

1306 File format for the database information file. If `None`, the 

1307 extension of ``filename`` will be used. 

1308 transfer : `str`, optional 

1309 Transfer mode passed to `Datastore.export`. 

1310 

1311 Raises 

1312 ------ 

1313 TypeError 

1314 Raised if the set of arguments passed is inconsistent. 

1315 

1316 Examples 

1317 -------- 

1318 Typically the `Registry.queryDimensions` and `Registry.queryDatasets` 

1319 methods are used to provide the iterables over data IDs and/or datasets 

1320 to be exported:: 

1321 

1322 with butler.export("exports.yaml") as export: 

1323 # Export all flats, and the calibration_label dimensions 

1324 # associated with them. 

1325 export.saveDatasets(butler.registry.queryDatasets("flat"), 

1326 elements=[butler.registry.dimensions["calibration_label"]]) 

1327 # Export all datasets that start with "deepCoadd_" and all of 

1328 # their associated data ID information. 

1329 export.saveDatasets(butler.registry.queryDatasets("deepCoadd_*")) 

1330 """ 

1331 if directory is None and transfer is not None: 

1332 raise TypeError("Cannot transfer without providing a directory.") 

1333 if transfer == "move": 

1334 raise TypeError("Transfer may not be 'move': export is read-only") 

1335 if format is None: 

1336 if filename is None: 

1337 raise TypeError("At least one of 'filename' or 'format' must be provided.") 

1338 else: 

1339 format = os.path.splitext(filename)[1].lstrip(".")  # drop splitext's leading "." 

1340 elif filename is None: 

1341 filename = f"export.{format}" 

1342 if directory is not None: 

1343 filename = os.path.join(directory, filename) 
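# The ``repo_transfer_formats`` config section below is expected to map a
# format name (e.g. "yaml") to the export backend class to instantiate.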

1344 BackendClass = getClassOf(self._config["repo_transfer_formats"][format]["export"]) 

1345 with open(filename, 'w') as stream: 

1346 backend = BackendClass(stream) 
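# The bare re-raise below exists only so the ``else`` clause is legal:
# ``helper._finish()``, which finalizes the export file, runs only if the
# caller's ``with`` block completes without raising.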

1347 try: 

1348 helper = RepoExport(self.registry, self.datastore, backend=backend, 

1349 directory=directory, transfer=transfer) 

1350 yield helper 

1351 except BaseException: 

1352 raise 

1353 else: 

1354 helper._finish() 

1355 

1356 def import_(self, *, directory: Optional[str] = None, 

1357 filename: Optional[str] = None, 

1358 format: Optional[str] = None, 

1359 transfer: Optional[str] = None): 

1360 """Import datasets exported from a different butler repository. 

1361 

1362 Parameters 

1363 ---------- 

1364 directory : `str`, optional 

1365 Directory containing dataset files. If `None`, all file paths 

1366 must be absolute. 

1367 filename : `str`, optional 

1368 Name for the file containing database information associated 

1369 with the exported datasets. If this is not an absolute path, does 

1370 not exist in the current working directory, and ``directory`` is 

1371 not `None`, it is assumed to be in ``directory``. Defaults to 

1372 "export.{format}". 

1373 format : `str`, optional 

1374 File format for the database information file. If `None`, the 

1375 extension of ``filename`` will be used. 

1376 transfer : `str`, optional 

1377 Transfer mode passed to `Datastore.ingest`. 

1378 

1379 Raises 

1380 ------ 

1381 TypeError 

1382 Raised if the set of arguments passed is inconsistent, or if the 

1383 butler is read-only. 
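
Examples
--------
A sketch of a typical round trip between two repositories; the butler
variable names, directory, and dataset type name are placeholders::

    with exportButler.export(directory="/tmp/transfer",
                             filename="export.yaml",
                             transfer="copy") as export:
        export.saveDatasets(exportButler.registry.queryDatasets("flat"))
    importButler.import_(directory="/tmp/transfer",
                         filename="export.yaml",
                         transfer="copy")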

1384 """ 

1385 if not self.isWriteable(): 

1386 raise TypeError("Butler is read-only.") 

1387 if format is None: 

1388 if filename is None: 

1389 raise TypeError("At least one of 'filename' or 'format' must be provided.") 

1390 else: 

1391 format = os.path.splitext(filename)[1].lstrip(".")  # drop splitext's leading "." 

1392 elif filename is None: 

1393 filename = f"export.{format}" 

1394 if directory is not None and not os.path.exists(filename): 

1395 filename = os.path.join(directory, filename) 

1396 BackendClass = getClassOf(self._config["repo_transfer_formats"][format]["import"]) 

1397 with open(filename, 'r') as stream: 

1398 backend = BackendClass(stream, self.registry) 

1399 backend.register() 

1400 with self.transaction(): 

1401 backend.load(self.datastore, directory=directory, transfer=transfer) 

1402 

1403 def validateConfiguration(self, logFailures: bool = False, 

1404 datasetTypeNames: Optional[Iterable[str]] = None, 

1405 ignore: Optional[Iterable[str]] = None): 

1406 """Validate butler configuration. 

1407 

1408 Checks that each `DatasetType` can be stored in the `Datastore`. 

1409 

1410 Parameters 

1411 ---------- 

1412 logFailures : `bool`, optional 

1413 If `True`, output a log message for every validation error 

1414 detected. 

1415 datasetTypeNames : iterable of `str`, optional 

1416 The `DatasetType` names that should be checked. This allows 

1417 only a subset to be selected. 

1418 ignore : iterable of `str`, optional 

1419 Names of DatasetTypes to skip over. This can be used to skip 

1420 known problems. If a named `DatasetType` corresponds to a 

1421 composite, all components of that `DatasetType` will also be 

1422 ignored. 

1423 

1424 Raises 

1425 ------ 

1426 ButlerValidationError 

1427 Raised if there is some inconsistency with how this Butler 

1428 is configured. 
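
Examples
--------
A minimal sketch of a typical call; the ignored dataset type name is a
placeholder::

    butler.validateConfiguration(logFailures=True, ignore=["raw"])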

1429 """ 

1430 if datasetTypeNames: 

1431 entities = [self.registry.getDatasetType(name) for name in datasetTypeNames] 

1432 else: 

1433 entities = list(self.registry.queryDatasetTypes()) 

1434 

1435 # filter out anything from the ignore list 

1436 if ignore: 

1437 ignore = set(ignore) 

1438 entities = [e for e in entities if e.name not in ignore and e.nameAndComponent()[0] not in ignore] 

1439 else: 

1440 ignore = set() 

1441 

1442 # Find all the registered instruments 

1443 instruments = set( 

1444 dataId["instrument"] for dataId in self.registry.queryDimensions(["instrument"]) 

1445 ) 

1446 

1447 # For each datasetType that has an instrument dimension, create 

1448 # a DatasetRef for each defined instrument 
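# (file templates and formatters in the datastore configuration can be
# overridden per instrument, so these refs ensure those per-instrument
# lookups are validated too).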

1449 datasetRefs = [] 

1450 

1451 for datasetType in entities: 

1452 if "instrument" in datasetType.dimensions: 

1453 for instrument in instruments: 

1454 datasetRef = DatasetRef(datasetType, {"instrument": instrument}, conform=False) 

1455 datasetRefs.append(datasetRef) 

1456 

1457 entities.extend(datasetRefs) 

1458 

1459 datastoreErrorStr = None 

1460 try: 

1461 self.datastore.validateConfiguration(entities, logFailures=logFailures) 

1462 except ValidationError as e: 

1463 datastoreErrorStr = str(e) 

1464 

1465 # Also check that the LookupKeys used by the datastores match 

1466 # registry and storage class definitions 

1467 keys = self.datastore.getLookupKeys() 

1468 

1469 failedNames = set() 

1470 failedDataId = set() 

1471 for key in keys: 

1472 datasetType = None 

1473 if key.name is not None: 

1474 if key.name in ignore: 

1475 continue 

1476 

1477 # skip if specific datasetType names were requested and this 

1478 # name does not match 

1479 if datasetTypeNames and key.name not in datasetTypeNames: 

1480 continue 

1481 

1482 # See if it is a StorageClass or a DatasetType 

1483 if key.name in self.storageClasses: 

1484 pass 

1485 else: 

1486 try: 

1487 self.registry.getDatasetType(key.name) 

1488 except KeyError: 

1489 if logFailures: 

1490 log.fatal("Key '%s' does not correspond to a DatasetType or StorageClass", key) 

1491 failedNames.add(key) 

1492 else: 

1493 # Dimensions are checked for consistency when the Butler 

1494 # is created and rendezvoused with a universe. 

1495 pass 

1496 

1497 # Check that any dataId override only uses supported keys and refers 

1498 # to a known instrument (currently only "instrument" is supported) 

1499 if key.dataId: 

1500 dataIdKeys = set(key.dataId) 

1501 if set(["instrument"]) != dataIdKeys: 

1502 if logFailures: 

1503 log.fatal("Key '%s' has unsupported DataId override", key) 

1504 failedDataId.add(key) 

1505 elif key.dataId["instrument"] not in instruments: 

1506 if logFailures: 

1507 log.fatal("Key '%s' has unknown instrument", key) 

1508 failedDataId.add(key) 

1509 

1510 messages = [] 

1511 

1512 if datastoreErrorStr: 

1513 messages.append(datastoreErrorStr) 

1514 

1515 for failed, msg in ((failedNames, "Keys without corresponding DatasetType or StorageClass entry: "), 

1516 (failedDataId, "Keys with bad DataId entries: ")): 

1517 if failed: 

1518 msg += ", ".join(str(k) for k in failed) 

1519 messages.append(msg) 

1520 

1521 if messages: 

1522 raise ValidationError(";\n".join(messages)) 

1523 

1524 registry: Registry 

1525 """The object that manages dataset metadata and relationships (`Registry`). 

1526 

1527 Most operations that don't involve reading or writing butler datasets are 

1528 accessible only via `Registry` methods. 

1529 """ 

1530 

1531 datastore: Datastore 

1532 """The object that manages actual dataset storage (`Datastore`). 

1533 

1534 Direct user access to the datastore should rarely be necessary; the primary 

1535 exception is the case where a `Datastore` implementation provides extra 

1536 functionality beyond what the base class defines. 

1537 """ 

1538 

1539 storageClasses: StorageClassFactory 

1540 """An object that maps known storage class names to objects that fully 

1541 describe them (`StorageClassFactory`). 

1542 """ 

1543 

1544 collections: Optional[CollectionSearch] 

1545 """The collections to search and any restrictions on the dataset types to 

1546 search for within them, in order (`CollectionSearch`). 

1547 """ 

1548 

1549 run: Optional[str] 

1550 """Name of the run this butler writes outputs to (`str` or `None`). 

1551 """ 

1552 

1553 tags: Tuple[str, ...] 

1554 """Names of `~CollectionType.TAGGED` collections this butler associates 

1555 with in `put` and `ingest`, and disassociates from in `pruneDatasets` 

1556 (`tuple` [ `str` ]). 

1557 """