|
92 | 92 | "### 1. Create an instance of the Butler\n", |
93 | 93 | "\n", |
94 | 94 | "To create the Butler, we need to provide it with a path to the data set, which is called a \"data repository\".\n", |
95 | | - "Butler repositories can be remote (i.e., pointing to an S3 bucket) or local (i.e., pointing to a directory on the local file system).\n", |
| 95 | + "Butler repositories have both a database component and a file-like storage component; the latter can be remote (i.e., pointing to an S3 bucket) or local (i.e., pointing to a directory on the local file system), and it contains a configuration file (usually `butler.yaml`) that points to the right database.\n", |
96 | 96 | "\n", |
97 | 97 | "S3 (Simple Storage Service) buckets are public cloud storage resources, similar to file folders, that store objects consisting of data and its descriptive metadata.\n", |
98 | 98 | "\n", |
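To make the repository layout described above concrete, the shape of such a configuration file can be sketched as follows. This is an invented, schematic example, not the actual DP0.1 `butler.yaml`; the URIs and values are placeholders:

```yaml
# Schematic sketch of a butler.yaml -- illustrative placeholders only.
datastore:
  # The file-like storage component (a local path or an s3:// URI).
  root: s3://example-bucket/repo
registry:
  # Connection string for the database component.
  db: postgresql://db.example.org/registry
```

The key point is simply that one file ties the file-like storage to its database, so a single repository path is enough to construct a Butler.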
|
122 | 122 | "source": [ |
123 | 123 | "#### 2.1 Butler registry and collections\n", |
124 | 124 | "\n", |
125 | | - "The registry is a database containing information about available data products.\n", |
126 | | - "The registry helps the user to examine what collections of data products exist.\n", |
| 125 | + "The database side of a data repository is called a registry.\n", |
| 126 | + "The registry contains entries for all data products, and organizes them by _collection_, _dataset type_, and _data ID_.\n", |
127 | 127 | "Use the registry to investigate a repository by listing all collections.\n", |
128 | 128 | "\n", |
129 | | - "Find more about the registry schema [here](https://dmtn-073.lsst.io/).\n", |
130 | | - "\n", |
131 | 129 | "Find more about collections [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/organizing.html#collections).\n", |
132 | 130 | "\n", |
133 | | - "Create a registry for the DP0.1 data set using the Butler." |
| 131 | + "A registry client is part of our butler object:" |
134 | 132 | ] |
135 | 133 | }, |
136 | 134 | { |
|
173 | 171 | "* `calib` - refers to calibration products that are used for instrument signature removal\n", |
174 | 172 | "* `runs` - refers to processed data products\n", |
175 | 173 | "* `refcats` - refers to the reference catalogs used for astrometric and photometric calibration\n", |
176 | | - "* `skymaps` - are the geometric representations of the sky coverage\n", |
| 174 | + "* `skymaps` - definitions for the _tract_ and _patch_ grids that coadds are built on\n", |
177 | 175 | "\n", |
178 | | - "Collections are nested, and DP0 delegates can access all the data for DC2 Run 2.2i, which is the DP0.1 data set, by selecting the collection `2.2i/runs/DP0.1`.\n", |
| 176 | + "Some collections are nested, and DP0 delegates can access all the data for DC2 Run 2.2i, which is the DP0.1 data set, by selecting the collection `2.2i/runs/DP0.1`.\n", |
179 | 177 | "\n", |
180 | 178 | "Expand the pointer recursively to show the full contents of the selected collection." |
181 | 179 | ] |
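The idea of recursively expanding a chained collection can be sketched in plain Python. The nested structure below is invented for illustration; in practice the registry does this expansion for you (e.g., via `queryCollections` with `flattenChains=True`):

```python
# Plain-Python sketch of recursively expanding a chained collection.
# The nesting below is invented for illustration only.
chains = {
    "2.2i/runs/DP0.1": ["2.2i/calib", "2.2i/raw", "skymaps"],
    "2.2i/calib": ["2.2i/calib/gen2"],
}

def flatten(name, chains):
    """Yield the leaf (non-chained) collections reachable from `name`."""
    children = chains.get(name)
    if children is None:
        yield name  # a leaf collection
    else:
        for child in children:
            yield from flatten(child, chains)

print(list(flatten("2.2i/runs/DP0.1", chains)))
# → ['2.2i/calib/gen2', '2.2i/raw', 'skymaps']
```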
|
261 | 259 | "cell_type": "markdown", |
262 | 260 | "metadata": {}, |
263 | 261 | "source": [ |
264 | | - "#### 2.3 Butler dataId\n", |
| 262 | + "#### 2.3 Butler data IDs\n", |
265 | 263 | "\n", |
266 | | - "The `dataId` (data identifier) is how specific data within a data set is accessed. Find more about the `dataId` [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/dimensions.html#data-ids).\n", |
| 264 | + "The data ID is a dictionary-like identifier for a data product.\n", |
| 265 | + "Find more about data IDs [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/dimensions.html#data-ids).\n", |
267 | 266 | "\n", |
268 | | - "Each `DatasetType` uses a different set of keys as the `dataId`.\n", |
269 | | - "For example, in the `DatasetType` list printed to screen (above), next to `calexp` in curly brackets is listed the band, instrument, detector, physical_filter, visit_system, and visit. These are the keys of the `dataId` for a `calexp`.\n", |
| 267 | + "Each `DatasetType` uses a different set of keys in its data ID.\n", |
| 268 | + "For example, in the `DatasetType` list printed to screen (above), next to `calexp` the band, instrument, detector, physical_filter, visit_system, and visit are listed in curly brackets.\n", |
| 269 | + "These are the keys of the data ID for a `calexp`, which are also called \"dimensions\".\n", |
270 | 270 | "\n", |
271 | | - "In the following cell, the `DatasetRef` is queried for `calexp` data in our collection of interest, and the full `dataId` are printed to screen (for just a few examples).\n", |
| 271 | + "In the following cell, the `DatasetRef` is queried for `calexp` data in our collection of interest, and the full data IDs are printed to screen (for just a few examples).\n", |
| 272 | + "Data IDs can be represented in code as regular Python `dict` objects, but when returned from the `Butler` the `DataCoordinate` class is used instead.\n", |
272 | 273 | "\n", |
273 | | - "The `dataId` contains both *implied* and *required* keys. For example, the value of *band* would be *implied* by the *visit*, because a single visit refers to a single exposure at a single pointing in a single band. \n", |
274 | | - "In the following cell, printing the `dataId` without specifying `.full` shows only the required keys.\n", |
| 274 | + "The data ID contains both *implied* and *required* keys.\n", |
| 275 | + "For example, the value of *band* would be *implied* by the *visit*, because a single visit refers to a single exposure at a single pointing in a single band. \n", |
| 276 | + "In the following cell, printing the data ID without specifying `.full` shows only the required keys.\n", |
275 | 277 | "The value of a single key, in this case *band*, can also be printed by specifying the key name.\n", |
276 | 278 | "\n", |
277 | 279 | "The following cell will fail and return an error if the query is requesting a `DatasetRef` for data that does not exist." |
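As a concrete picture of the dictionary-like nature of data IDs, a `calexp` data ID can be written as a plain `dict`. This is an illustrative sketch: the instrument name is the one used for the DC2 simulations, and the detector value is arbitrary:

```python
# A data ID is just a mapping from dimension names to values.
# Required keys for a calexp; the specific values here are illustrative.
data_id = {"instrument": "LSSTCam-imSim", "detector": 0, "visit": 971990}

# Implied keys such as "band" can be derived from the required ones,
# because a single visit is taken in a single band.
print(sorted(data_id.keys()))  # → ['detector', 'instrument', 'visit']
```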
|
350 | 352 | "source": [ |
351 | 353 | "<br>\n", |
352 | 354 | "\n", |
353 | | - "The `dataId` can be retrieved directly by using `queryDataIds` instead of `queryDatasets`, as in the following two examples.\n", |
354 | | - "Note the flexibility in the use of the query keys and the where statement.\n", |
355 | | - "Also note that both the `calexp` and `src` data sets can be found by the registry, but this will not always necessarily be the case.\n", |
356 | | - "Queries for non-existent data will cause an error to be returned." |
| 355 | + "Each data ID key-value pair is associated with a metadata row called a `DimensionRecord`.\n", |
| 356 | + "Like dataset types, these exist independently of any collection, and they are also identified by data IDs.\n", |
| 357 | + "\n", |
| 358 | + "The `queryDimensionRecords` method provides a way to query for these records.\n", |
| 359 | + "Most of the arguments accepted by `queryDatasets` can be used here (including `where`).\n", |
| 360 | + "\n", |
| 361 | + "An example of this is provided below:" |
357 | 362 | ] |
358 | 363 | }, |
359 | 364 | { |
|
362 | 367 | "metadata": {}, |
363 | 368 | "outputs": [], |
364 | 369 | "source": [ |
365 | | - "dataIds = registry.queryDataIds([\"visit\", \"detector\", \"band\"], datasets=[\"calexp\"],\n", |
366 | | - " where='visit = 703697', collections=collection)\n", |
367 | | - "for i, dataId in enumerate(dataIds):\n", |
368 | | - " print(dataId.full)\n", |
369 | | - " if i > 2:\n", |
370 | | - " break" |
| 370 | + "for dim in ['exposure', 'visit', 'detector']:\n", |
| 371 | + " print(list(registry.queryDimensionRecords(dim, where='visit = 971990 and detector=0'))[0])\n", |
| 372 | + " print()" |
| 373 | + ] |
| 374 | + }, |
| 375 | + { |
| 376 | + "cell_type": "markdown", |
| 377 | + "metadata": {}, |
| 378 | + "source": [ |
| 379 | + "Another query method, `queryDataIds`, can be used to query for data IDs independent of any dataset, but it's less useful for general data exploration.\n", |
| 380 | + "\n", |
| 381 | + "It is also possible to pass `datasets` and `collections` to both `queryDataIds` and `queryDimensionRecords` in order to return records whose data IDs match those of existing datasets.\n", |
| 382 | + "But this is quite a bit more subtle than searching directly for a dataset, and rarely what is wanted when exploring a data repository.\n", |
| 383 | + "\n", |
| 384 | + "More information on all of the query methods can be found [here](https://pipelines.lsst.io/v/weekly/middleware/faq.html#when-should-i-use-each-of-the-query-methods-commands)." |
| 385 | + ] |
| 386 | + }, |
| 387 | + { |
| 388 | + "cell_type": "markdown", |
| 389 | + "metadata": {}, |
| 390 | + "source": [ |
| 391 | + "#### 2.5 Temporal and spatial queries\n", |
| 392 | + "\n", |
| 393 | + "The following examples show how to query for data sets that include a desired coordinate and observation date.\n", |
| 394 | + "\n", |
| 395 | + "Above, we can see that for visit 971990, the (RA,Dec) are (70.37770,-37.1757) and the observation date is 20251201.\n", |
| 396 | + "But these are just human-readable summaries of the more precise temporal and spatial information stored in the registry, which are represented in Python by `Timespan` and `Region` objects, respectively.\n", |
| 397 | + "`DimensionRecord` objects that represent spatial or temporal concepts (a `visit` is both) have these objects attached to them:" |
371 | 398 | ] |
372 | 399 | }, |
373 | 400 | { |
|
376 | 403 | "metadata": {}, |
377 | 404 | "outputs": [], |
378 | 405 | "source": [ |
379 | | - "dataIds = registry.queryDataIds([\"visit\", \"detector\"], datasets=[\"src\"],\n", |
380 | | - " where=\"band='g' and detector=0 and visit > 700000\",\n", |
381 | | - " collections=collection)\n", |
382 | | - "for i, dataId in enumerate(dataIds):\n", |
383 | | - " print(dataId.full)\n", |
384 | | - " if i > 2:\n", |
385 | | - " break" |
| 406 | + "(record,) = registry.queryDimensionRecords('visit', visit=971990)\n", |
| 407 | + "print(record.timespan)\n", |
| 408 | + "print(record.region)" |
386 | 409 | ] |
387 | 410 | }, |
388 | 411 | { |
389 | 412 | "cell_type": "markdown", |
390 | 413 | "metadata": {}, |
391 | 414 | "source": [ |
392 | | - "<br>\n", |
393 | | - "\n", |
394 | | - "The `queryDimensions` method provides a more flexible way to query for multiple datasets (requiring an instance of all datasets to be available for that `dataId`) or to ask for different `dataId` keys than what is used to identify the dataset (which invokes various built-in relationships).\n", |
395 | | - "\n", |
396 | | - "An example of this is provided below:" |
| 415 | + "If the timespan or spatial region that's being used as a query constraint is already associated with a data ID in the database, spatial and temporal overlap constraints are automatic.\n", |
| 416 | + "For example, if we query for `deepCoadd` datasets with a `visit`+`detector` data ID, we'll get just the ones that overlap that observation and have the same band (because a visit implies a band):" |
397 | 417 | ] |
398 | 418 | }, |
399 | 419 | { |
|
402 | 422 | "metadata": {}, |
403 | 423 | "outputs": [], |
404 | 424 | "source": [ |
405 | | - "for dim in ['exposure', 'visit', 'detector']:\n", |
406 | | - " print(list(registry.queryDimensionRecords(dim, where='visit = 971990 and detector=0'))[0])\n", |
407 | | - " print()" |
| 425 | + "for ref in registry.queryDatasets(\"deepCoadd\", visit=971990, detector=50):\n", |
| 426 | + " print(ref)" |
408 | 427 | ] |
409 | 428 | }, |
410 | 429 | { |
411 | 430 | "cell_type": "markdown", |
412 | 431 | "metadata": {}, |
413 | 432 | "source": [ |
414 | | - "<br>\n", |
| 433 | + "To query for dimension records or datasets that overlap an arbitrary time range, we can use the `bind` argument to pass times through to `where`; we'll use this to look for visits within one minute of this one on either side:" |
| 434 | + ] |
| 435 | + }, |
| 436 | + { |
| 437 | + "cell_type": "code", |
| 438 | + "execution_count": null, |
| 439 | + "metadata": {}, |
| 440 | + "outputs": [], |
| 441 | + "source": [ |
| 442 | + "import astropy.time\n", |
| 443 | + "minute = astropy.time.TimeDelta(60, format=\"sec\")\n", |
| 444 | + "timespan = dafButler.Timespan(record.timespan.begin - minute, record.timespan.end + minute)\n", |
415 | 445 | "\n", |
416 | | - "**NEED HELP HERE WITH THIS FINAL BIT!!**\n", |
| 446 | + "for visit in registry.queryDimensionRecords(\"visit\", where=\"visit.timespan OVERLAPS my_timespan\", bind={\"my_timespan\": timespan}):\n", |
| 447 | + " print(visit.id, visit.timespan, visit.physical_filter)" |
| 448 | + ] |
| 449 | + }, |
| 450 | + { |
| 451 | + "cell_type": "markdown", |
| 452 | + "metadata": {}, |
| 453 | + "source": [ |
| 454 | + "Using `bind` to define an alias for a variable saves us from having to string-format the times into the `where` expression.\n", |
| 455 | + "Unfortunately, there is a bug in `queryDatasets` that prevents `bind` from working there (fixed in `w_2021_25`).\n", |
417 | 456 | "\n", |
418 | | - "The following examples show how to query for data sets that include a desired coordinate and observation date.\n", |
| 457 | + "A `Timespan` can have a `begin` or `end` of `None` if it is unbounded on that side." |
| 458 | + ] |
| 459 | + }, |
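The overlap semantics of `Timespan` (half-open intervals, with `None` meaning unbounded) can be sketched in plain Python. This is an illustrative model of the behavior, not the LSST implementation:

```python
# Plain-Python sketch of half-open interval overlap, with None = unbounded.
def overlaps(a, b):
    a_begin, a_end = a
    b_begin, b_end = b
    starts_before_b_ends = b_end is None or a_begin is None or a_begin < b_end
    ends_after_b_begins = b_begin is None or a_end is None or a_end > b_begin
    return starts_before_b_ends and ends_after_b_begins

print(overlaps((0, 10), (5, 15)))       # → True  (intervals overlap)
print(overlaps((0, 10), (10, 20)))      # → False (half-open: touching ends don't overlap)
print(overlaps((None, 10), (5, None)))  # → True  (unbounded sides still overlap)
```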
| 460 | + { |
| 461 | + "cell_type": "markdown", |
| 462 | + "metadata": {}, |
| 463 | + "source": [ |
| 464 | + "Arbitrary spatial queries are not supported, but we do have a set of dimensions that correspond to different levels of the HTM (hierarchical triangular mesh) pixelization of the sky.\n", |
419 | 465 | "\n", |
420 | | - "Above, we can see that for visit 971990, the (RA,Dec) are (70.37770,-37.1757) and the observation date is 20251201. The following example uses the RA,Dec and date to retrieve the visit." |
| 466 | + "So one can transform a region or point into one or more HTM IDs, and then search using that as a spatial data ID.\n", |
| 467 | + "The `lsst.sphgeom` library is what backs our region objects, and we can also use it to find the HTM ID for a point." |
421 | 468 | ] |
422 | 469 | }, |
423 | 470 | { |
|
426 | 473 | "metadata": {}, |
427 | 474 | "outputs": [], |
428 | 475 | "source": [ |
429 | | - "ra = 70.37770\n", |
430 | | - "dec = -37.1757\n", |
431 | | - "s1 = \"exposure.day_obs = 20251201\"\n", |
432 | | - "s2 = \"exposure.tracking_ra > \"+str(ra-1.0)\n", |
433 | | - "s3 = \"exposure.tracking_ra < \"+str(ra+1.0)\n", |
434 | | - "s4 = \"exposure.tracking_dec > \"+str(dec-1.0)\n", |
435 | | - "s5 = \"exposure.tracking_dec < \"+str(dec+1.0)\n", |
436 | | - "\n", |
437 | | - "results = registry.queryDimensionRecords('visit',\n", |
438 | | - " where=s1+\" AND \"+s2+\" AND \"+s3+\" AND \"+s4+\" AND \"+s5,\n", |
439 | | - " collections=collection)\n", |
| 476 | + "import lsst.sphgeom\n", |
440 | 477 | "\n", |
441 | | - "# Use expandDataId to fill in the implicit dataId keys with values\n", |
442 | | - "for i, ref in enumerate(results):\n", |
443 | | - " tempId = butler.registry.expandDataId(ref.dataId)\n", |
444 | | - " print(tempId.full)\n", |
445 | | - " if i > 10:\n", |
| 478 | + "pixelization = lsst.sphgeom.HtmPixelization(20)" |
| 479 | + ] |
| 480 | + }, |
| 481 | + { |
| 482 | + "cell_type": "code", |
| 483 | + "execution_count": null, |
| 484 | + "metadata": {}, |
| 485 | + "outputs": [], |
| 486 | + "source": [ |
| 487 | + "htm_id = pixelization.index(\n", |
| 488 | + " lsst.sphgeom.UnitVector3d(\n", |
| 489 | + " lsst.sphgeom.LonLat.fromDegrees(70.37699524983329, -37.17573628348882)\n", |
| 490 | + " )\n", |
| 491 | + ")\n", |
| 492 | + "scale = pixelization.triangle(htm_id).getBoundingCircle().getOpeningAngle().asDegrees()*3600\n", |
| 493 | + "print(f'HTM ID={htm_id} at level={pixelization.getLevel()} is a ~{scale:0.2}\" triangle.')" |
| 494 | + ] |
| 495 | + }, |
| 496 | + { |
| 497 | + "cell_type": "markdown", |
| 498 | + "metadata": {}, |
| 499 | + "source": [ |
| 500 | + "And we can use that to query for (e.g.) the set of all `src` data products that overlap this point in the i band:" |
| 501 | + ] |
| 502 | + }, |
| 503 | + { |
| 504 | + "cell_type": "code", |
| 505 | + "execution_count": null, |
| 506 | + "metadata": {}, |
| 507 | + "outputs": [], |
| 508 | + "source": [ |
| 509 | + "for i, src_ref in enumerate(registry.queryDatasets(\"src\", htm20=htm_id, band=\"i\")):\n", |
| 510 | + " print(src_ref)\n", |
| 511 | + " if i > 2:\n", |
446 | 512 | " break" |
447 | 513 | ] |
448 | 514 | }, |
449 | 515 | { |
450 | 516 | "cell_type": "markdown", |
451 | 517 | "metadata": {}, |
452 | 518 | "source": [ |
453 | | - "Above, our query terms were not sufficiently unqiue to return only visit 971990, because there were other images of that sky location obtained on that date. **IS IT WEIRD THEY ARE ALL Z BAND?**\n", |
454 | | - "\n", |
455 | | - "<br>\n", |
456 | | - "\n", |
457 | | - "**TO BE ADDED:**\n", |
458 | | - "\n", |
459 | | - "* use of regions instead of the above kludge with RA and Dec within a degree\n", |
460 | | - "* how to figure out which detector the coordinates are in, instead of matching to exposure center\n", |
461 | | - "* how to use timespan, to be more specific about time instead of just date" |
| 519 | + "The butler's spatial reasoning is designed to work well for regions the size of full data products, like detector- or patch-level images and catalogs, and it's a poor choice for object-scale searches.\n", |
| 520 | + "The query above is slow in large part because it actually searches for all `src` datasets that overlap the much larger htm7 pixel (about a degree on a side), and then filters the results down to the htm20 pixel in Python." |
| 521 | + ] |
| 522 | + }, |
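The scale difference between htm7 and htm20 follows directly from the mesh construction: each HTM level splits every triangle into four, halving the side length. A quick sketch of the arithmetic, assuming the rough one-degree side quoted above for an htm7 triangle:

```python
# Each HTM level subdivides every triangle into 4, halving the side length.
level7_side_deg = 1.0        # htm7 triangle side, roughly a degree (assumed)
factor = 2 ** (20 - 7)       # side shrinks by a factor of 2 per level
level20_side_arcsec = level7_side_deg * 3600 / factor
print(f"htm20 side ~ {level20_side_arcsec:.2f} arcsec, shrink factor {factor}")
# → htm20 side ~ 0.44 arcsec, shrink factor 8192
```

This is why an htm20 pixel is a good stand-in for a point, while the htm7 pixel used for the database-side search covers a much larger area.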
| 523 | + { |
| 524 | + "cell_type": "markdown", |
| 525 | + "metadata": {}, |
| 526 | + "source": [ |
| 527 | + "That said, it's something we'll use frequently below, so it will be useful to wrap this in a function:" |
462 | 528 | ] |
463 | 529 | }, |
464 | 530 | { |
|
588 | 654 | " Parameters\n", |
589 | 655 | " ----------\n", |
590 | 656 | " butler: lsst.daf.persistence.Butler\n", |
591 | | - " Servant providing access to a data repository\n", |
| 657 | + " Client providing access to a data repository\n", |
592 | 658 | " ra: float\n", |
593 | 659 | " Right ascension of the center of the cutout, degrees\n", |
594 | 660 | " dec: float\n", |
|