This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit 05dae9f

Merge pull request #12 from rubin-dp0/tickets/PREOPS-558
PREOPS-558: Updates to butler tutorial notebook.
2 parents ef9766c + f95456d commit 05dae9f

1 file changed

Lines changed: 178 additions & 68 deletions

File tree

04_Intro_to_Butler.ipynb

@@ -6,13 +6,13 @@
66
"source": [
77
"<img align=\"left\" src = https://project.lsst.org/sites/default/files/Rubin-O-Logo_0.png width=250> \n",
88
"<b>Introduction to the LSST data Butler</b> <br>\n",
9-
"Last verified to run on <b>TBD</b> with LSST Science Pipelines release <b>TBD</b> <br>\n",
9+
"Last verified to run on <b>Jun 17 2021</b> with LSST Science Pipelines release <b>w_2021_25</b> <br>\n",
1010
"Contact author: Alex Drlica-Wagner <br>\n",
11-
"Credit: Originally developed by Alex Drlica-Wagner in the context of the LSST Stack Club <br>\n",
11+
"Credit: Originally developed by Alex Drlica-Wagner in the context of the LSST Stack Club. <br>\n",
1212
"Target audience: All DP0 delegates. <br>\n",
1313
"Container Size: medium <br>\n",
14-
"Questions welcome at <a href=\"https://community.lsst.org/c/support/dp0\">community.lsst.org/c/support/dp0</a> <br>\n",
15-
"Find DP0 documentation and resources at <a href=\"https://dp0-1.lsst.io\">dp0-1.lsst.io</a> <br>"
14+
"Questions welcome at <a href=\"https://community.lsst.org/c/support/dp0\">community.lsst.org/c/support/dp0</a>. <br>\n",
15+
"Find DP0 documentation and resources at <a href=\"https://dp0-1.lsst.io\">dp0-1.lsst.io</a>. <br>"
1616
]
1717
},
1818
{
@@ -92,7 +92,7 @@
9292
"### 1. Create an instance of the Butler\n",
9393
"\n",
9494
"To create the Butler, we need to provide it with a path to the data set, which is called a \"data repository\".\n",
95-
"Butler repositories can be remote (i.e., pointing to an S3 bucket) or local (i.e., pointing to a directory on the local file system).\n",
95+
"Butler repositories have both a database component and a file-like storage component; the latter can can be remote (i.e., pointing to an S3 bucket) or local (i.e., pointing to a directory on the local file system), and it contains a configuration file (usually `butler.yaml`) that points to the right database\n",
9696
"\n",
9797
"S3 (Simple Storage Service) buckets are public cloud storage resources that are similar to file folders, store objects, and which consist of data and its descriptive metadata.\n",
9898
"\n",
@@ -122,15 +122,13 @@
122122
"source": [
123123
"#### 2.1 Butler registry and collections\n",
124124
"\n",
125-
"The registry is a database containing information about available data products.\n",
126-
"The registry helps the user to examine what collections of data products exist.\n",
125+
"The database side of a data repository is called a registry.\n",
126+
"The registry contains entries for all data products, and organizes them by _collection_, _dataset type_, and _data ID_.\n",
127127
"Use the registry to investigate a repository by listing all collections.\n",
128128
"\n",
129-
"Find more about the registry schema [here](https://dmtn-073.lsst.io/).\n",
130-
"\n",
131129
"Find more about collections [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/organizing.html#collections).\n",
132130
"\n",
133-
"Create a registry for the DP0.1 data set using the Butler."
131+
"A registry client is part of our butler object:"
134132
]
135133
},
136134
{
@@ -173,9 +171,9 @@
173171
"* `calib` - refers to calibration products that are used for instrument signature removal\n",
174172
"* `runs` - refers to processed data products\n",
175173
"* `refcats` - refers to the reference catalogs used for astrometric and photometric calibration\n",
176-
"* `skymaps` - are the geometric representations of the sky coverage\n",
174+
"* `skymaps` - definitions for the _tract_ and _patch_ grids that coadds are built on\n",
177175
"\n",
178-
"Collections are nested, and DP0 delegates can access all the data for DC2 Run 2.2i, which is the DP0.1 data set, by selecting the collection `2.2i/runs/DP0.1`.\n",
176+
"Some collections are nested, and DP0 delegates can access all the data for DC2 Run 2.2i, which is the DP0.1 data set, by selecting the collection `2.2i/runs/DP0.1`.\n",
179177
"\n",
180178
"Expand the pointer recursively to show the full contents of the selected collection."
181179
]
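The nesting of collections described above can be sketched in plain Python (a hypothetical illustration: the child collection names below are made up, and the real Butler resolves CHAINED collections through the registry database, not a dict):

```python
# A minimal sketch of CHAINED collection resolution. The dict stands in
# for the registry's collection-chain table; only "2.2i/runs/DP0.1" is a
# real DP0.1 name, and the children listed here are hypothetical.
chains = {
    "2.2i/runs/DP0.1": ["2.2i/calib", "2.2i/runs/coadds", "refcats"],
    "2.2i/runs/coadds": ["run1", "run2"],
}

def expand(collection):
    """Recursively flatten a (possibly chained) collection to leaf names."""
    children = chains.get(collection)
    if children is None:  # not chained: a concrete RUN collection
        return [collection]
    leaves = []
    for child in children:
        leaves.extend(expand(child))
    return leaves

print(expand("2.2i/runs/DP0.1"))
# ['2.2i/calib', 'run1', 'run2', 'refcats']
```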
@@ -261,17 +259,21 @@
261259
"cell_type": "markdown",
262260
"metadata": {},
263261
"source": [
264-
"#### 2.3 Butler dataId\n",
262+
"#### 2.3 Butler data IDs\n",
265263
"\n",
266-
"The `dataId` (data identifier) is how specific data within a data set is accessed. Find more about the `dataId` [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/dimensions.html#data-ids).\n",
264+
"The data ID is a dictionary-like identifier for a data product.\n",
265+
"Find more about the data IDs [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/dimensions.html#data-ids).\n",
267266
"\n",
268-
"Each `DatasetType` uses a different set of keys as the `dataId`.\n",
269-
"For example, in the `DatasetType` list printed to screen (above), next to `calexp` in curly brackets is listed the band, instrument, detector, physical_filter, visit_system, and visit. These are the keys of the `dataId` for a `calexp`.\n",
267+
"Each `DatasetType` uses a different set of keys in its data ID.\n",
268+
"For example, in the `DatasetType` list printed to screen (above), next to `calexp` in curly brackets is listed the band, instrument, detector, physical_filter, visit_system, and visit.\n",
269+
"These are the keys of the data ID for a `calexp`, which are also called \"dimensions\".\n",
270270
"\n",
271-
"In the following cell, the `DatasetRef` is queried for `calexp` data in our collection of interest, and the full `dataId` are printed to screen (for just a few examples).\n",
271+
"In the following cell, the `DatasetRef` is queried for `calexp` data in our collection of interest, and the full data IDs are printed to screen (for just a few examples).\n",
272+
"Data IDs can be represented in code as regular Python `dict` objects, but when returned from the `Butler` the `DataCoordinate` class is used instead.\n",
272273
"\n",
273-
"The `dataId` contains both *implied* and *required* keys. For example, the value of *band* would be *implied* by the *visit*, because a single visit refers to a single exposure at a single pointing in a single band. \n",
274-
"In the following cell, printing the `dataId` without specifying `.full` shows only the required keys.\n",
274+
"The data ID contains both *implied* and *required* keys.\n",
275+
"For example, the value of *band* would be *implied* by the *visit*, because a single visit refers to a single exposure at a single pointing in a single band. \n",
276+
"In the following cell, printing the data ID without specifying `.full` shows only the required keys.\n",
275277
"The value of a single key, in this case *band*, can also be printed by specifying the key name.\n",
276278
"\n",
277279
"The following cell will fail and return an error if the query is requesting a `DatasetRef` for data that does not exist."
@@ -350,10 +352,13 @@
350352
"source": [
351353
"<br>\n",
352354
"\n",
353-
"The `dataId` can be retrieved directly by using `queryDataIds` instead of `queryDatasets`, as in the following two examples.\n",
354-
"Note the flexibility in the use of the query keys and the where statement.\n",
355-
"Also note that both the `calexp` and `src` data sets can be found by the registry, but this will not always necessarily be the case.\n",
356-
"Queries for non-existent data will cause an error to be returned."
355+
"Each data ID key-value pair is associated with a metadata row called a `DimensionRecord`.\n",
356+
"Like dataset types, these exist independent of any collection, but they are also identified by data IDs.\n",
357+
"\n",
358+
"The `queryDimensionsRecords` method provides a way to query for these records.\n",
359+
"Most of the arguments accepted by `queryDatasets` can be used here (including `where`).\n",
360+
"\n",
361+
"An example of this is provided below:"
357362
]
358363
},
359364
{
@@ -362,12 +367,38 @@
362367
"metadata": {},
363368
"outputs": [],
364369
"source": [
365-
"dataIds = registry.queryDataIds([\"visit\", \"detector\", \"band\"], datasets=[\"calexp\"],\n",
366-
" where='visit = 703697', collections=collection)\n",
367-
"for i, dataId in enumerate(dataIds):\n",
368-
" print(dataId.full)\n",
369-
" if i > 2:\n",
370-
" break"
370+
"for dim in ['exposure', 'visit', 'detector']:\n",
371+
" print(list(registry.queryDimensionRecords(dim, where='visit = 971990 and detector=0'))[0])\n",
372+
" print()"
373+
]
374+
},
375+
{
376+
"cell_type": "markdown",
377+
"metadata": {},
378+
"source": [
379+
"Another query method, `queryDataIds`, can be used to query for data IDs independent of any dataset, but it's less useful for general data exploration.\n",
380+
"\n",
381+
"It is also possible to pass `datasets` and `collections` to both `queryDataIds` and `queryDimensionRecords` in order to return records whose data IDs match those of existing datasets.\n",
382+
"But this is quite a bit more subtle than searching directly for a dataset, and rarely wanted when exploring a data repository.\n",
383+
"\n",
384+
"More information on all of the query methods can be found [here](https://pipelines.lsst.io/v/weekly/middleware/faq.html#when-should-i-use-each-of-the-query-methods-commands)."
385+
]
386+
},
387+
{
388+
"cell_type": "markdown",
389+
"metadata": {},
390+
"source": [
391+
"#### 2.5 Temporal and spatial queries\n",
392+
"\n",
393+
"The following examples show how to query for data sets that include a desired coordinate and observation date.\n",
394+
"\n",
395+
"##### Temporal queries\n",
396+
"\n",
397+
"Above, we can see that for visit 971990, the (RA,Dec) are (70.37770,-37.1757) and the observation date is 20251201.\n",
398+
"But these are just human-readable summaries of the more precise spatial and temporal information stored in the registry, which are represented in Python by `Timespan` and `Region` objects, respectively.\n",
399+
"`DimensionRecord` objects that represent spatial or temporal concepts (a `visit` is both) have these objects attached to them.\n",
400+
"\n",
401+
"Retrieve the `DimensionRecord` for a visit and show its timespan and region."
371402
]
372403
},
373404
{
@@ -376,24 +407,55 @@
376407
"metadata": {},
377408
"outputs": [],
378409
"source": [
379-
"dataIds = registry.queryDataIds([\"visit\", \"detector\"], datasets=[\"src\"],\n",
380-
" where=\"band='g' and detector=0 and visit > 700000\",\n",
381-
" collections=collection)\n",
382-
"for i, dataId in enumerate(dataIds):\n",
383-
" print(dataId.full)\n",
384-
" if i > 2:\n",
385-
" break"
410+
"(record,) = registry.queryDimensionRecords('visit', visit=971990)\n",
411+
"\n",
412+
"print(record.timespan)\n",
413+
"print(' ')\n",
414+
"print(record.region)"
386415
]
387416
},
388417
{
389418
"cell_type": "markdown",
390419
"metadata": {},
391420
"source": [
392-
"<br>\n",
421+
"If the timespan or spatial region that are being used as query constraints are already associated with a data ID in the database, the spatial and temporal overlap constraints are automatic.\n",
422+
"For example, if we query for `deepCoadd` datasets with a `visit`+`detector` data ID, we'll get just the ones that overlap that observation and have the same band (because a visit implies a band):"
423+
]
424+
},
425+
{
426+
"cell_type": "code",
427+
"execution_count": null,
428+
"metadata": {},
429+
"outputs": [],
430+
"source": [
431+
"for ref in registry.queryDatasets(\"deepCoadd\", visit=971990, detector=50):\n",
432+
" print(ref)"
433+
]
434+
},
435+
{
436+
"cell_type": "markdown",
437+
"metadata": {},
438+
"source": [
439+
"To query for dimension records or datasets that overlap an arbitrary time range, we can use the `bind` argument to pass times through to `where`.\n",
440+
"Using `bind` to define an alias for a variable saves us from having to string-format the times into the `where` expression.\n",
441+
"Note that a `dafButler.Timespan` will accept a `begin` or `end` value that is equal to `None` if it is unbounded on that side.\n",
393442
"\n",
394-
"The `queryDimensions` method provides a more flexible way to query for multiple datasets (requiring an instance of all datasets to be available for that `dataId`) or to ask for different `dataId` keys than what is used to identify the dataset (which invokes various built-in relationships).\n",
443+
"Use `bind` and `where`, along with [astropy.time](https://docs.astropy.org/en/stable/time/index.html), to look for visits within one minute of this one on either side."
444+
]
445+
},
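The OVERLAPS test used in such `where` expressions can be sketched with plain numeric intervals (a simplification of `dafButler.Timespan`, which uses astropy times; here `None` stands for an unbounded side):

```python
from collections import namedtuple

# Simplified stand-in for dafButler.Timespan: plain numbers instead of
# astropy times; None means unbounded on that side.
Timespan = namedtuple("Timespan", ["begin", "end"])

def overlaps(a, b):
    """True if two intervals intersect (roughly the OVERLAPS semantics)."""
    a_starts_before_b_ends = b.end is None or a.begin is None or a.begin < b.end
    b_starts_before_a_ends = a.end is None or b.begin is None or b.begin < a.end
    return a_starts_before_b_ends and b_starts_before_a_ends

visit = Timespan(100.0, 130.0)                   # a ~30 s exposure
window = Timespan(100.0 - 60.0, 130.0 + 60.0)    # one minute on either side
print(overlaps(visit, window))                   # True
print(overlaps(Timespan(300.0, 330.0), window))  # False
```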
446+
{
447+
"cell_type": "code",
448+
"execution_count": null,
449+
"metadata": {},
450+
"outputs": [],
451+
"source": [
452+
"# import astropy.time\n",
453+
"# minute = astropy.time.TimeDelta(60, format=\"sec\")\n",
454+
"# timespan = dafButler.Timespan(record.timespan.begin - minute, record.timespan.end + minute)\n",
395455
"\n",
396-
"An example of this is provided below:"
456+
"# for visit in registry.queryDimensionRecords(\"visit\", where=\"visit.timespan OVERLAPS my_timespan\", \n",
457+
"# bind={\"my_timespan\": timespan}):\n",
458+
"# print(visit.id, visit.timespan, visit.physical_filter)"
397459
]
398460
},
399461
{
@@ -402,22 +464,33 @@
402464
"metadata": {},
403465
"outputs": [],
404466
"source": [
405-
"for dim in ['exposure', 'visit', 'detector']:\n",
406-
" print(list(registry.queryDimensionRecords(dim, where='visit = 971990 and detector=0'))[0])\n",
407-
" print()"
467+
"import astropy.time\n",
468+
"minute = astropy.time.TimeDelta(60, format=\"sec\")\n",
469+
"timespan = dafButler.Timespan(record.timespan.begin - minute, record.timespan.end + minute)\n",
470+
"\n",
471+
"datasetRefs = registry.queryDatasets(\"calexp\", where=\"visit.timespan OVERLAPS my_timespan\",\n",
472+
" bind={\"my_timespan\": timespan})\n",
473+
"\n",
474+
"for i, ref in enumerate(datasetRefs):\n",
475+
" print(ref)\n",
476+
" if i > 6:\n",
477+
" break"
408478
]
409479
},
410480
{
411481
"cell_type": "markdown",
412482
"metadata": {},
413483
"source": [
414-
"<br>\n",
484+
"##### Spatial queries\n",
415485
"\n",
416-
"**NEED HELP HERE WITH THIS FINAL BIT!!**\n",
486+
"Arbitrary spatial queries are not supported at this time, such as the \"POINT() IN (REGION)\" example found in this [Butler queries](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/queries.html) documentation.\n",
487+
"In other words, at this time it is only possible to do queries involving regions that are already \"in\" the data repository, either because they are HTM pixel regions or because they are tract/patch/visit/visit+detector regions.\n",
417488
"\n",
418-
"The following examples show how to query for data sets that include a desired coordinate and observation date.\n",
489+
"Thus, for this example we use the set of dimensions that correspond to different levels of the HTM (hierarchical triangular mesh) pixelization of the sky ([HTM primer](http://www.skyserver.org/htm/)).\n",
490+
"The process is to transform a region or point into one or more HTM identifiers (HTM IDs), and then create a query using the HTM ID as the spatial data ID.\n",
491+
"The `lsst.sphgeom` library supports region objects and HTM pixelization in the LSST Science Pipelines.\n",
419492
"\n",
420-
"Above, we can see that for visit 971990, the (RA,Dec) are (70.37770,-37.1757) and the observation date is 20251201. The following example uses the RA,Dec and date to retrieve the visit."
493+
"Import the `lsst.sphgeom` package, initialize a sky pixelization to level 10 (the level at which one sky pixel is about five arcmin across), and find the HTM ID for a desired sky coordinate."
421494
]
422495
},
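The factor-of-two scaling between HTM levels used in this section can be checked with quick arithmetic (a rough approximation: real trixel sizes vary somewhat across the sky):

```python
# Each HTM level splits every triangle into four, halving the side length,
# and level-0 trixels span roughly 90 degrees. This gives an approximate
# angular scale per level.
def htm_scale_arcsec(level):
    """Approximate side length of a level-`level` HTM trixel, in arcsec."""
    return 90.0 * 3600.0 / 2**level

for level in (7, 10):
    print(f"level {level:2d}: ~{htm_scale_arcsec(level):.0f} arcsec")
```

Level 10 works out to roughly five arcminutes, consistent with the description above, and level 7 to a pixel on the order of two thousand arcseconds.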
423496
{
@@ -426,39 +499,76 @@
426499
"metadata": {},
427500
"outputs": [],
428501
"source": [
429-
"ra = 70.37770\n",
430-
"dec = -37.1757\n",
431-
"s1 = \"exposure.day_obs = 20251201\"\n",
432-
"s2 = \"exposure.tracking_ra > \"+str(ra-1.0)\n",
433-
"s3 = \"exposure.tracking_ra < \"+str(ra+1.0)\n",
434-
"s4 = \"exposure.tracking_dec > \"+str(dec-1.0)\n",
435-
"s5 = \"exposure.tracking_dec < \"+str(dec+1.0)\n",
502+
"import lsst.sphgeom\n",
436503
"\n",
437-
"results = registry.queryDimensionRecords('visit',\n",
438-
" where=s1+\" AND \"+s2+\" AND \"+s3+\" AND \"+s4+\" AND \"+s5,\n",
439-
" collections=collection)\n",
504+
"pixelization = lsst.sphgeom.HtmPixelization(10)"
505+
]
506+
},
507+
{
508+
"cell_type": "code",
509+
"execution_count": null,
510+
"metadata": {},
511+
"outputs": [],
512+
"source": [
513+
"htm_id = pixelization.index(\n",
514+
" lsst.sphgeom.UnitVector3d(\n",
515+
" lsst.sphgeom.LonLat.fromDegrees(70.376995, -37.175736)\n",
516+
" )\n",
517+
")\n",
440518
"\n",
441-
"# Use expandDataId to fill in the implicit dataId keys with values\n",
442-
"for i, ref in enumerate(results):\n",
443-
" tempId = butler.registry.expandDataId(ref.dataId)\n",
444-
" print(tempId.full)\n",
445-
" if i > 10:\n",
519+
"# Obtain and print the scale to provide a sense of the size of the sky pixelization being used\n",
520+
"scale = pixelization.triangle(htm_id).getBoundingCircle().getOpeningAngle().asDegrees()*3600\n",
521+
"print(f'HTM ID={htm_id} at level={pixelization.getLevel()} is a ~{scale:.0f}\" triangle.')"
522+
]
523+
},
524+
{
525+
"cell_type": "code",
526+
"execution_count": null,
527+
"metadata": {},
528+
"outputs": [],
529+
"source": [
530+
"datasetRefs = registry.queryDatasets(\"calexp\", htm20=htm_id,\n",
531+
" where=\"visit.timespan OVERLAPS my_timespan\",\n",
532+
" bind={\"my_timespan\": timespan})\n",
533+
"\n",
534+
"for i, ref in enumerate(datasetRefs):\n",
535+
" print(ref)\n",
536+
" if i > 6:\n",
446537
" break"
447538
]
448539
},
449540
{
450541
"cell_type": "markdown",
451542
"metadata": {},
452543
"source": [
453-
"Above, our query terms were not sufficiently unqiue to return only visit 971990, because there were other images of that sky location obtained on that date. **IS IT WEIRD THEY ARE ALL Z BAND?**\n",
544+
"Thus, with the above query, we have uniquely identified the visit and detector for our desired temporal and spatial constraints.\n",
454545
"\n",
455-
"<br>\n",
546+
"Note that if a smaller HTM level is used (like 7), which is a larger sky pixel (~2200 arcseconds), the above query will return many more visits and detectors which overlap with that larger region. Try it and see!\n",
456547
"\n",
457-
"**TO BE ADDED:**\n",
548+
"Note that queries using the HTM ID can also be used to, e.g., find the set of all i-band `src` catalog data products that overlap this point."
549+
]
550+
},
551+
{
552+
"cell_type": "code",
553+
"execution_count": null,
554+
"metadata": {},
555+
"outputs": [],
556+
"source": [
557+
"for i, src_ref in enumerate(registry.queryDatasets(\"src\", htm20=htm_id, band=\"i\")):\n",
558+
" print(src_ref)\n",
559+
" if i > 2:\n",
560+
" break"
561+
]
562+
},
563+
{
564+
"cell_type": "markdown",
565+
"metadata": {},
566+
"source": [
567+
"Why is does that search take tens of seconds?\n",
568+
"The butler's spatial reasoning is designed to work well for regions the size of full data products, like detector- or patch-level images and catalogs, and it's a poor choice for object-scale searches.\n",
569+
"The above search is slow in part because `queryDatasets` searches for all `src` datasets that overlap a larger region and then filters the results down to the specified HTM ID pixel.\n",
458570
"\n",
459-
"* use of regions instead of the above kludge with RA and Dec within a degree\n",
460-
"* how to figure out which detector the coordinates are in, instead of matching to exposure center\n",
461-
"* how to use timespan, to be more specific about time instead of just date"
571+
"Options for exploring and retrieving catalog data with the Butler is covered in more depth in Section 5."
462572
]
463573
},
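The coarse-query-then-filter behavior described above can be sketched with toy data (hypothetical pixel numbers, not real HTM IDs or Butler tables):

```python
# Toy version of the coarse-then-filter search: each toy "src" dataset
# covers a contiguous run of fine sky pixels.
datasets = [{"id": i, "pixels": set(range(i * 10, i * 10 + 30))}
            for i in range(1000)]

target_pixel = 4215

# Coarse pass: an over-inclusive bounding test, like matching against a
# larger region in the database.
candidates = [d for d in datasets if min(d["pixels"]) <= target_pixel]
# Fine pass: the exact membership test, applied only afterwards.
hits = [d["id"] for d in candidates if target_pixel in d["pixels"]]

print(len(candidates), hits)
# 422 [419, 420, 421]
```

Even though only three datasets truly contain the target pixel, the coarse pass had to consider hundreds of candidates, which is the kind of overhead the paragraph above describes.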
464574
{
@@ -588,7 +698,7 @@
588698
" Parameters\n",
589699
" ----------\n",
590700
" butler: lsst.daf.persistence.Butler\n",
591-
" Servant providing access to a data repository\n",
701+
" Client providing access to a data repository\n",
592702
" ra: float\n",
593703
" Right ascension of the center of the cutout, degrees\n",
594704
" dec: float\n",
