|
92 | 92 | "### 1. Create an instance of the Butler\n", |
93 | 93 | "\n", |
94 | 94 | "To create the Butler, we need to provide it with a path to the data set, which is called a \"data repository\".\n", |
95 | | - "Butler repositories can be remote (i.e., pointing to an S3 bucket) or local (i.e., pointing to a directory on the local file system).\n", |
| 95 | + "Butler repositories have both a database component and a file-like storage component; the latter can be remote (i.e., pointing to an S3 bucket) or local (i.e., pointing to a directory on the local file system), and it contains a configuration file (usually `butler.yaml`) that points to the right database.\n", |
96 | 96 | "\n", |
97 | 97 | "S3 (Simple Storage Service) buckets are public cloud storage resources, similar to file folders, that store objects consisting of data and its descriptive metadata.\n", |
98 | 98 | "\n", |
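To make the repository layout described above concrete, the shape of such a configuration file can be sketched as follows. This is an invented, schematic example, not the actual DP0.1 `butler.yaml`; the URIs and values are placeholders:

```yaml
# Schematic sketch of a butler.yaml -- illustrative placeholders only.
datastore:
  # The file-like storage component (a local path or an s3:// URI).
  root: s3://example-bucket/repo
registry:
  # Connection string for the database component.
  db: postgresql://db.example.org/registry
```

The key point is simply that one file ties the file-like storage to its database, so a single repository path is enough to construct a Butler.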
|
122 | 122 | "source": [ |
123 | 123 | "#### 2.1 Butler registry and collections\n", |
124 | 124 | "\n", |
125 | | - "The registry is a database containing information about available data products.\n", |
126 | | - "The registry helps the user to examine what collections of data products exist.\n", |
| 125 | + "The database side of a data repository is called a registry.\n", |
| 126 | + "The registry contains entries for all data products, and organizes them by _collection_, _dataset type_, and _data ID_.\n", |
127 | 127 | "Use the registry to investigate a repository by listing all collections.\n", |
128 | 128 | "\n", |
129 | | - "Find more about the registry schema [here](https://dmtn-073.lsst.io/).\n", |
130 | | - "\n", |
131 | 129 | "Find more about collections [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/organizing.html#collections).\n", |
132 | 130 | "\n", |
133 | | - "Create a registry for the DP0.1 data set using the Butler." |
| 131 | + "A registry client is part of our butler object:" |
134 | 132 | ] |
135 | 133 | }, |
136 | 134 | { |
|
173 | 171 | "* `calib` - refers to calibration products that are used for instrument signature removal\n", |
174 | 172 | "* `runs` - refers to processed data products\n", |
175 | 173 | "* `refcats` - refers to the reference catalogs used for astrometric and photometric calibration\n", |
176 | | - "* `skymaps` - are the geometric representations of the sky coverage\n", |
| 174 | + "* `skymaps` - definitions for the _tract_ and _patch_ grids that coadds are built on\n", |
177 | 175 | "\n", |
178 | | - "Collections are nested, and DP0 delegates can access all the data for DC2 Run 2.2i, which is the DP0.1 data set, by selecting the collection `2.2i/runs/DP0.1`.\n", |
| 176 | + "Some collections are nested, and DP0 delegates can access all the data for DC2 Run 2.2i, which is the DP0.1 data set, by selecting the collection `2.2i/runs/DP0.1`.\n", |
179 | 177 | "\n", |
180 | 178 | "Expand the pointer recursively to show the full contents of the selected collection." |
181 | 179 | ] |
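The idea of recursively expanding a chained collection can be sketched in plain Python. The nested structure below is invented for illustration; in practice the registry does this expansion for you (e.g., via `queryCollections` with `flattenChains=True`):

```python
# Plain-Python sketch of recursively expanding a chained collection.
# The nesting below is invented for illustration only.
chains = {
    "2.2i/runs/DP0.1": ["2.2i/calib", "2.2i/raw", "skymaps"],
    "2.2i/calib": ["2.2i/calib/gen2"],
}

def flatten(name, chains):
    """Yield the leaf (non-chained) collections reachable from `name`."""
    children = chains.get(name)
    if children is None:
        yield name  # a leaf collection
    else:
        for child in children:
            yield from flatten(child, chains)

print(list(flatten("2.2i/runs/DP0.1", chains)))
# → ['2.2i/calib/gen2', '2.2i/raw', 'skymaps']
```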
|
261 | 259 | "cell_type": "markdown", |
262 | 260 | "metadata": {}, |
263 | 261 | "source": [ |
264 | | - "#### 2.3 Butler dataId\n", |
| 262 | + "#### 2.3 Butler data IDs\n", |
265 | 263 | "\n", |
266 | | - "The `dataId` (data identifier) is how specific data within a data set is accessed. Find more about the `dataId` [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/dimensions.html#data-ids).\n", |
| 264 | + "The data ID is a dictionary-like identifier for a data product.\n", |
| 265 | + "Find more about data IDs [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/dimensions.html#data-ids).\n", |
267 | 266 | "\n", |
268 | | - "Each `DatasetType` uses a different set of keys as the `dataId`.\n", |
269 | | - "For example, in the `DatasetType` list printed to screen (above), next to `calexp` in curly brackets is listed the band, instrument, detector, physical_filter, visit_system, and visit. These are the keys of the `dataId` for a `calexp`.\n", |
| 267 | + "Each `DatasetType` uses a different set of keys in its data ID.\n", |
| 268 | + "For example, in the `DatasetType` list printed to screen (above), next to `calexp` the band, instrument, detector, physical_filter, visit_system, and visit are listed in curly brackets.\n", |
| 269 | + "These are the keys of the data ID for a `calexp`, which are also called \"dimensions\".\n", |
270 | 270 | "\n", |
271 | | - "In the following cell, the `DatasetRef` is queried for `calexp` data in our collection of interest, and the full `dataId` are printed to screen (for just a few examples).\n", |
| 271 | + "In the following cell, the `DatasetRef` is queried for `calexp` data in our collection of interest, and the full data IDs are printed to screen (for just a few examples).\n", |
| 272 | + "Data IDs can be represented in code as regular Python `dict` objects, but when returned from the `Butler` the `DataCoordinate` class is used instead.\n", |
272 | 273 | "\n", |
273 | | - "The `dataId` contains both *implied* and *required* keys. For example, the value of *band* would be *implied* by the *visit*, because a single visit refers to a single exposure at a single pointing in a single band. \n", |
274 | | - "In the following cell, printing the `dataId` without specifying `.full` shows only the required keys.\n", |
| 274 | + "The data ID contains both *implied* and *required* keys.\n", |
| 275 | + "For example, the value of *band* would be *implied* by the *visit*, because a single visit refers to a single exposure at a single pointing in a single band. \n", |
| 276 | + "In the following cell, printing the data ID without specifying `.full` shows only the required keys.\n", |
275 | 277 | "The value of a single key, in this case *band*, can also be printed by specifying the key name.\n", |
276 | 278 | "\n", |
277 | 279 | "The following cell will fail and return an error if the query is requesting a `DatasetRef` for data that does not exist." |
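As a concrete picture of the dictionary-like nature of data IDs, a `calexp` data ID can be written as a plain `dict`. This is an illustrative sketch: the instrument name is the one used for the DC2 simulations, and the detector value is arbitrary:

```python
# A data ID is just a mapping from dimension names to values.
# Required keys for a calexp; the specific values here are illustrative.
data_id = {"instrument": "LSSTCam-imSim", "detector": 0, "visit": 971990}

# Implied keys such as "band" can be derived from the required ones,
# because a single visit is taken in a single band.
print(sorted(data_id.keys()))  # → ['detector', 'instrument', 'visit']
```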
|
350 | 352 | "source": [ |
351 | 353 | "<br>\n", |
352 | 354 | "\n", |
353 | | - "The `dataId` can be retrieved directly by using `queryDataIds` instead of `queryDatasets`, as in the following two examples.\n", |
354 | | - "Note the flexibility in the use of the query keys and the where statement.\n", |
355 | | - "Also note that both the `calexp` and `src` data sets can be found by the registry, but this will not always necessarily be the case.\n", |
356 | | - "Queries for non-existent data will cause an error to be returned." |
| 355 | + "Each data ID key-value pair is associated with a metadata row called a `DimensionRecord`.\n", |
| 356 | + "Like dataset types, these exist independently of any collection, and they are also identified by data IDs.\n", |
| 357 | + "\n", |
| 358 | + "The `queryDimensionRecords` method provides a way to query for these records.\n", |
| 359 | + "Most of the arguments accepted by `queryDatasets` can be used here (including `where`).\n", |
| 360 | + "\n", |
| 361 | + "An example of this is provided below:" |
357 | 362 | ] |
358 | 363 | }, |
359 | 364 | { |
|
362 | 367 | "metadata": {}, |
363 | 368 | "outputs": [], |
364 | 369 | "source": [ |
365 | | - "dataIds = registry.queryDataIds([\"visit\", \"detector\", \"band\"], datasets=[\"calexp\"],\n", |
366 | | - " where='visit = 703697', collections=collection)\n", |
367 | | - "for i, dataId in enumerate(dataIds):\n", |
368 | | - " print(dataId.full)\n", |
369 | | - " if i > 2:\n", |
370 | | - " break" |
| 370 | + "for dim in ['exposure', 'visit', 'detector']:\n", |
| 371 | + " print(list(registry.queryDimensionRecords(dim, where='visit = 971990 and detector=0'))[0])\n", |
| 372 | + " print()" |
| 373 | + ] |
| 374 | + }, |
| 375 | + { |
| 376 | + "cell_type": "markdown", |
| 377 | + "metadata": {}, |
| 378 | + "source": [ |
| 379 | + "Another query method, `queryDataIds`, can be used to query for data IDs independent of any dataset, but it's less useful for general data exploration.\n", |
| 380 | + "\n", |
| 381 | + "It is also possible to pass `datasets` and `collections` to both `queryDataIds` and `queryDimensionRecords` in order to return records whose data IDs match those of existing datasets.\n", |
| 382 | + "But this is quite a bit more subtle than searching directly for a dataset, and rarely what is wanted when exploring a data repository.\n", |
| 383 | + "\n", |
| 384 | + "More information on all of the query methods can be found [here](https://pipelines.lsst.io/v/weekly/middleware/faq.html#when-should-i-use-each-of-the-query-methods-commands)." |
| 385 | + ] |
| 386 | + }, |
| 387 | + { |
| 388 | + "cell_type": "markdown", |
| 389 | + "metadata": {}, |
| 390 | + "source": [ |
| 391 | + "#### 2.5 Temporal and spatial queries\n", |
| 392 | + "\n", |
| 393 | + "The following examples show how to query for data sets that include a desired coordinate and observation date.\n", |
| 394 | + "\n", |
| 395 | + "Above, we can see that for visit 971990, the (RA,Dec) are (70.37770,-37.1757) and the observation date is 20251201.\n", |
| 396 | + "But these are just human-readable summaries of the more precise temporal and spatial information stored in the registry, which are represented in Python by `Timespan` and `Region` objects, respectively.\n", |
| 397 | + "`DimensionRecord` objects that represent spatial or temporal concepts (a `visit` is both) have these objects attached to them:" |
371 | 398 | ] |
372 | 399 | }, |
373 | 400 | { |
|
376 | 403 | "metadata": {}, |
377 | 404 | "outputs": [], |
378 | 405 | "source": [ |
379 | | - "dataIds = registry.queryDataIds([\"visit\", \"detector\"], datasets=[\"src\"],\n", |
380 | | - " where=\"band='g' and detector=0 and visit > 700000\",\n", |
381 | | - " collections=collection)\n", |
382 | | - "for i, dataId in enumerate(dataIds):\n", |
383 | | - " print(dataId.full)\n", |
384 | | - " if i > 2:\n", |
385 | | - " break" |
| 406 | + "(record,) = registry.queryDimensionRecords('visit', visit=971990)\n", |
| 407 | + "print(record.timespan)\n", |
| 408 | + "print(record.region)" |
386 | 409 | ] |
387 | 410 | }, |
388 | 411 | { |
389 | 412 | "cell_type": "markdown", |
390 | 413 | "metadata": {}, |
391 | 414 | "source": [ |
392 | | - "<br>\n", |
393 | | - "\n", |
394 | | - "The `queryDimensions` method provides a more flexible way to query for multiple datasets (requiring an instance of all datasets to be available for that `dataId`) or to ask for different `dataId` keys than what is used to identify the dataset (which invokes various built-in relationships).\n", |
395 | | - "\n", |
396 | | - "An example of this is provided below:" |
| 415 | + "If the timespan or spatial region that's being used as a query constraint is already associated with a data ID in the database, spatial and temporal overlap constraints are automatic.\n", |
| 416 | + "For example, if we query for `deepCoadd` datasets with a `visit`+`detector` data ID, we'll get just the ones that overlap that observation and have the same band (because a visit implies a band):" |
397 | 417 | ] |
398 | 418 | }, |
399 | 419 | { |
|
402 | 422 | "metadata": {}, |
403 | 423 | "outputs": [], |
404 | 424 | "source": [ |
405 | | - "for dim in ['exposure', 'visit', 'detector']:\n", |
406 | | - " print(list(registry.queryDimensionRecords(dim, where='visit = 971990 and detector=0'))[0])\n", |
407 | | - " print()" |
| 425 | + "for ref in registry.queryDatasets(\"deepCoadd\", visit=971990, detector=50):\n", |
| 426 | + " print(ref)" |
408 | 427 | ] |
409 | 428 | }, |
410 | 429 | { |
411 | 430 | "cell_type": "markdown", |
412 | 431 | "metadata": {}, |
413 | 432 | "source": [ |
414 | | - "<br>\n", |
| 433 | + "To query for dimension records or datasets that overlap an arbitrary time range, we can use the `bind` argument to pass times through to `where`; we'll use this to look for visits within one minute of this one on either side:" |
| 434 | + ] |
| 435 | + }, |
| 436 | + { |
| 437 | + "cell_type": "code", |
| 438 | + "execution_count": null, |
| 439 | + "metadata": {}, |
| 440 | + "outputs": [], |
| 441 | + "source": [ |
| 442 | + "import astropy.time\n", |
| 443 | + "minute = astropy.time.TimeDelta(60, format=\"sec\")\n", |
| 444 | + "timespan = dafButler.Timespan(record.timespan.begin - minute, record.timespan.end + minute)\n", |
415 | 445 | "\n", |
416 | | - "**NEED HELP HERE WITH THIS FINAL BIT!!**\n", |
| 446 | + "for visit in registry.queryDimensionRecords(\"visit\", where=\"visit.timespan OVERLAPS my_timespan\", bind={\"my_timespan\": timespan}):\n", |
| 447 | + " print(visit.id, visit.timespan, visit.physical_filter)" |
| 448 | + ] |
| 449 | + }, |
| 450 | + { |
| 451 | + "cell_type": "markdown", |
| 452 | + "metadata": {}, |
| 453 | + "source": [ |
| 454 | + "Using `bind` to define an alias for a variable saves us from having to string-format the times into the `where` expression.\n", |
| 455 | + "Unfortunately, there is a bug in `queryDatasets` that prevents `bind` from working there (fixed in `w_2021_25`).\n", |
417 | 456 | "\n", |
418 | | - "The following examples show how to query for data sets that include a desired coordinate and observation date.\n", |
| 457 | + "A `Timespan` can have a `begin` or `end` of `None` if it is unbounded on that side." |
| 458 | + ] |
| 459 | + }, |
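The overlap semantics of `Timespan` (half-open intervals, with `None` meaning unbounded) can be sketched in plain Python. This is an illustrative model of the behavior, not the LSST implementation:

```python
# Plain-Python sketch of half-open interval overlap, with None = unbounded.
def overlaps(a, b):
    a_begin, a_end = a
    b_begin, b_end = b
    starts_before_b_ends = b_end is None or a_begin is None or a_begin < b_end
    ends_after_b_begins = b_begin is None or a_end is None or a_end > b_begin
    return starts_before_b_ends and ends_after_b_begins

print(overlaps((0, 10), (5, 15)))       # → True  (intervals overlap)
print(overlaps((0, 10), (10, 20)))      # → False (half-open: touching ends don't overlap)
print(overlaps((None, 10), (5, None)))  # → True  (unbounded sides still overlap)
```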
| 460 | + { |
| 461 | + "cell_type": "markdown", |
| 462 | + "metadata": {}, |
| 463 | + "source": [ |
| 464 | + "Arbitrary spatial queries are not supported, but we do have a set of dimensions that correspond to different levels of the HTM (hierarchical triangular mesh) pixelization of the sky.\n", |
419 | 465 | "\n", |
420 | | - "Above, we can see that for visit 971990, the (RA,Dec) are (70.37770,-37.1757) and the observation date is 20251201. The following example uses the RA,Dec and date to retrieve the visit." |
| 466 | + "So one can transform a region or point into one or more HTM IDs, and then search using that as a spatial data ID.\n", |
| 467 | + "The `lsst.sphgeom` library is what backs our region objects, and we can also use it to find the HTM ID for a point." |
421 | 468 | ] |
422 | 469 | }, |
423 | 470 | { |
|
426 | 473 | "metadata": {}, |
427 | 474 | "outputs": [], |
428 | 475 | "source": [ |
429 | | - "ra = 70.37770\n", |
430 | | - "dec = -37.1757\n", |
431 | | - "s1 = \"exposure.day_obs = 20251201\"\n", |
432 | | - "s2 = \"exposure.tracking_ra > \"+str(ra-1.0)\n", |
433 | | - "s3 = \"exposure.tracking_ra < \"+str(ra+1.0)\n", |
434 | | - "s4 = \"exposure.tracking_dec > \"+str(dec-1.0)\n", |
435 | | - "s5 = \"exposure.tracking_dec < \"+str(dec+1.0)\n", |
436 | | - "\n", |
437 | | - "results = registry.queryDimensionRecords('visit',\n", |
438 | | - " where=s1+\" AND \"+s2+\" AND \"+s3+\" AND \"+s4+\" AND \"+s5,\n", |
439 | | - " collections=collection)\n", |
| 476 | + "import lsst.sphgeom\n", |
440 | 477 | "\n", |
441 | | - "# Use expandDataId to fill in the implicit dataId keys with values\n", |
442 | | - "for i, ref in enumerate(results):\n", |
443 | | - " tempId = butler.registry.expandDataId(ref.dataId)\n", |
444 | | - " print(tempId.full)\n", |
445 | | - " if i > 10:\n", |
| 478 | + "pixelization = lsst.sphgeom.HtmPixelization(20)" |
| 479 | + ] |
| 480 | + }, |
| 481 | + { |
| 482 | + "cell_type": "code", |
| 483 | + "execution_count": null, |
| 484 | + "metadata": {}, |
| 485 | + "outputs": [], |
| 486 | + "source": [ |
| 487 | + "htm_id = pixelization.index(\n", |
| 488 | + " lsst.sphgeom.UnitVector3d(\n", |
| 489 | + " lsst.sphgeom.LonLat.fromDegrees(70.37699524983329, -37.17573628348882)\n", |
| 490 | + " )\n", |
| 491 | + ")\n", |
| 492 | + "scale = pixelization.triangle(htm_id).getBoundingCircle().getOpeningAngle().asDegrees()*3600\n", |
| 493 | + "print(f'HTM ID={htm_id} at level={pixelization.getLevel()} is a ~{scale:0.2}\" triangle.')" |
| 494 | + ] |
| 495 | + }, |
| 496 | + { |
| 497 | + "cell_type": "markdown", |
| 498 | + "metadata": {}, |
| 499 | + "source": [ |
| 500 | + "And we can use that to query for (e.g.) the set of all `src` data products that overlap this point in the i band:" |
| 501 | + ] |
| 502 | + }, |
| 503 | + { |
| 504 | + "cell_type": "code", |
| 505 | + "execution_count": null, |
| 506 | + "metadata": {}, |
| 507 | + "outputs": [], |
| 508 | + "source": [ |
| 509 | + "for i, src_ref in enumerate(registry.queryDatasets(\"src\", htm20=htm_id, band=\"i\")):\n", |
| 510 | + " print(src_ref)\n", |
| 511 | + " if i > 2:\n", |
446 | 512 | " break" |
447 | 513 | ] |
448 | 514 | }, |
449 | 515 | { |
450 | 516 | "cell_type": "markdown", |
451 | 517 | "metadata": {}, |
452 | 518 | "source": [ |
453 | | - "Above, our query terms were not sufficiently unqiue to return only visit 971990, because there were other images of that sky location obtained on that date. **IS IT WEIRD THEY ARE ALL Z BAND?**\n", |
454 | | - "\n", |
455 | | - "<br>\n", |
456 | | - "\n", |
457 | | - "**TO BE ADDED:**\n", |
458 | | - "\n", |
459 | | - "* use of regions instead of the above kludge with RA and Dec within a degree\n", |
460 | | - "* how to figure out which detector the coordinates are in, instead of matching to exposure center\n", |
461 | | - "* how to use timespan, to be more specific about time instead of just date" |
| 519 | + "The butler's spatial reasoning is designed to work well for regions the size of full data products, like detector- or patch-level images and catalogs, and it's a poor choice for object-scale searches.\n", |
| 520 | + "The query above is slow in large part because it actually searches for all `src` datasets that overlap the much larger htm7 pixel (about a degree on a side), and then filters the results down to the htm20 pixel in Python." |
| 521 | + ] |
| 522 | + }, |
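The scale difference between htm7 and htm20 follows directly from the mesh construction: each HTM level splits every triangle into four, halving the side length. A quick sketch of the arithmetic, assuming the rough one-degree side quoted above for an htm7 triangle:

```python
# Each HTM level subdivides every triangle into 4, halving the side length.
level7_side_deg = 1.0        # htm7 triangle side, roughly a degree (assumed)
factor = 2 ** (20 - 7)       # side shrinks by a factor of 2 per level
level20_side_arcsec = level7_side_deg * 3600 / factor
print(f"htm20 side ~ {level20_side_arcsec:.2f} arcsec, shrink factor {factor}")
# → htm20 side ~ 0.44 arcsec, shrink factor 8192
```

This is why an htm20 pixel is a good stand-in for a point, while the htm7 pixel used for the database-side search covers a much larger area.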
| 523 | + { |
| 524 | + "cell_type": "markdown", |
| 525 | + "metadata": {}, |
| 526 | + "source": [ |
| 527 | + "That said, it's something we'll use frequently below, so it will be useful to wrap this in a function:" |
462 | 528 | ] |
463 | 529 | }, |
464 | 530 | { |
|
588 | 654 | " Parameters\n", |
589 | 655 | " ----------\n", |
590 | 656 | " butler: lsst.daf.persistence.Butler\n", |
591 | | - " Servant providing access to a data repository\n", |
| 657 | + " Client providing access to a data repository\n", |
592 | 658 | " ra: float\n", |
593 | 659 | " Right ascension of the center of the cutout, degrees\n", |
594 | 660 | " dec: float\n", |
|