This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit 10219d7
Commit message: a few mods to the last section
1 parent: 189cb8f

1 file changed: 04_Intro_to_Butler.ipynb (71 additions, 41 deletions)
@@ -5,8 +5,8 @@
 "metadata": {},
 "source": [
 "<img align=\"left\" src = https://project.lsst.org/sites/default/files/Rubin-O-Logo_0.png width=250> \n",
-"<b>Introduction to the LSST data Butler</b> <br>\n",
-"Last verified to run on <b>TBD</b> with LSST Science Pipelines release <b>TBD</b> <br>\n",
+"<b>Introduction to the LSST Data Butler</b> <br>\n",
+"Last verified to run on <b>2020-06-10</b> with LSST Science Pipelines release <b>w_2020_20</b> <br>\n",
 "Contact author: Alex Drlica-Wagner <br>\n",
 "Credit: Originally developed by Alex Drlica-Wagner in the context of the LSST Stack Club <br>\n",
 "Target audience: All DP0 delegates. <br>\n",
@@ -22,8 +22,8 @@
 "The goals of this notebook are to:<br>\n",
 "1. create an instance of the Butler<br>\n",
 "2. explore the DP0.1 data repository<br>\n",
-"3. retrive and display some image and catalog data<br>\n",
-"4. create in image cutout at a user-specified coordinate<br>\n",
+"3. retrieve and display some image and catalog data<br>\n",
+"4. create an image cutout at a specific location<br>\n",
 "5. retrieve and plot catalog data \n",
 "\n",
 "\n",
@@ -58,6 +58,7 @@
 "source": [
 "# Generic imports\n",
 "import os,glob\n",
+"import numpy as np\n",
 "import pylab as plt\n",
 "plt.rcParams['figure.figsize'] = (8.0, 8.0)"
 ]
@@ -67,9 +68,7 @@
 "metadata": {},
 "source": [
 "We import several packages from the LSST Science Pipelines. \n",
-"The first import gives us access to the Butler, while the second provides tools for displaying data.\n",
-"\n",
-"More details and techniques regarding image display can be found in the `rubin-dp0` GitHub Organization's [tutorial-notebooks](https://github.com/rubin-dp0/tutorial-notebooks) repository."
+"The first import gives us access to the Butler, while the second provides tools for displaying data. More details about image display can be found in the `rubin-dp0` GitHub Organization's [tutorial-notebooks](https://github.com/rubin-dp0/tutorial-notebooks) repository."
 ]
 },
 {
@@ -88,7 +87,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"To create the Butler, we need to provide it with a path to the data set, which is called a \"data repository\". Butler repositories can be remote (i.e., pointing to an S3 bucket) or local (i.e., pointing to a directory on the local file system). In this case, we point to an S3 bucket."
+"To create the Butler, we need to provide it with a path to the data set, which is called a \"data repository\". The Butler can access repositories that are remote (e.g., pointing to an S3 bucket) or local (e.g., pointing to a path on the local file system). In this case, we point to an S3 bucket."
 ]
 },
 {
@@ -101,6 +100,23 @@
 "butler = dafButler.Butler(repo)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We now have an instance of the Butler that we can explore with the Python `help` function."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# Uncomment the following line to see the Butler help documentation\n",
+"#help(butler)"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -118,7 +134,7 @@
 "source": [
 "registry = butler.registry\n",
 "\n",
-"# We can examine the registry with\n",
+"# We can also examine the registry\n",
 "#help(registry)"
 ]
 },
@@ -143,15 +159,15 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"This is our first glimpse at the data contained in the repository, but it doesn't teach us *which* collection we are actually interested in. The names do give us some hints though...\n",
+"This is our first glimpse at the data sets contained in the repository, but it doesn't teach us *which* collection we are actually interested in. The names do give us some hints though...\n",
 "\n",
-"* `2.2i` - refers to the processing run of the LSST DESC DC2 data (the `i` stands for `imSim`)\n",
+"* `2.2i` - refers to the processing run of the LSST DESC DC2 data (the `i` stands for the `imSim` tool that was used to simulate the images)\n",
 "* `calib` - refers to calibration products that are used for instrument signature removal\n",
 "* `runs` - refers to processed data products\n",
 "* `refcats` - refers to the reference catalogs used for astrometric and photometric calibration\n",
 "* `skymaps` - are the geometric representations of the sky coverage\n",
 "\n",
-"Collections can be nested, so we can access to everything for DC2 Run 2.2i (the primary DP0.1 data set) by selecting the collection `2.2i/runs/DP0.1`. This is a pointer to other collections that expand out recursively... More on collections can be found here: https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/organizing.html#collections"
+"Collections can be nested, so we can access everything for DC2 Run 2.2i (the primary DP0.1 data set) by selecting the collection `2.2i/runs/DP0.1`. This is a pointer to other collections that expand out recursively... More on collections can be found [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/organizing.html#collections)."
 ]
 },
 {
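The recursive expansion of nested collections described above can be sketched in plain Python. This is a toy model, not the Butler API (the real expansion happens inside `butler.registry`), and the child collection names below are hypothetical:

```python
# Toy model of chained collections: a chain maps to child collections,
# which may themselves be chains. The child names here are made up.
CHAINS = {
    "2.2i/runs/DP0.1": ["2.2i/calib", "2.2i/runs/DP0.1/step1"],
    "2.2i/runs/DP0.1/step1": ["2.2i/runs/DP0.1/step1/run1"],
}

def flatten(collection):
    """Recursively expand a chained collection into its leaf collections."""
    children = CHAINS.get(collection)
    if children is None:          # not a chain: a concrete RUN collection
        return [collection]
    leaves = []
    for child in children:
        leaves.extend(flatten(child))
    return leaves

print(flatten("2.2i/runs/DP0.1"))
# ['2.2i/calib', '2.2i/runs/DP0.1/step1/run1']
```

Selecting the top-level chain therefore gives access to every run collection it points to, however deeply nested.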
@@ -171,7 +187,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We now create a new Butler instance specifying that we are specifically interested in the `2.2i/runs/DP0.1` data collection. For most uses, this will be the line you will use to create a Butler to work on DP0.1, since you now know that this collection exists."
+"We now create a new Butler instance specifying that we are specifically interested in the `2.2i/runs/DP0.1` data collection. For most uses, this will be the line you will use to create a Butler to work on DP0.1."
 ]
 },
 {
@@ -213,20 +229,20 @@
 "- `src` - refers to the catalog of sources\n",
 "- `skyMap` - refers to geometric representations of the sky coverage\n",
 "\n",
+"You can look up these and other LSST terms in the searchable [LSST Glossary](https://www.lsst.org/scientists/glossary-acronyms).\n",
+"\n",
 "<b> Which data sets are most appropriate for DP0.1? </b><br>\n",
 "Most DP0.1 delegates will only be interested in data sets with types `ExposureF` or `SourceCatalog`. \n",
 "For images, stick to the `calexp` (processed visit images, or PVIs) and `deepCoadd` (stacked PVIs).\n",
-"For catalogs, the `src` should be used with the `calexp` images, and the `deepCoadd_forced_src` are the most appropriate to be used with the `coadds`.\n",
-"More information can be found in the DP0.1 Data Products Definitions Document (DPDD) at [dp0-1.lsst.io](http://dp0-1.lsst.io).\n",
-"\n",
-"<br>\n"
+"For catalogs, the `src` table should be used with the `calexp` images, and the `deepCoadd_forced_src` table is the most appropriate to be used with the `deepCoadd`.\n",
+"More information can be found in the DP0.1 Data Products Definitions Document (DPDD) at [dp0-1.lsst.io](http://dp0-1.lsst.io)."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We access specific data sets through a set of specifications known as a data identifier (`dataId`). Each `DatasetType` can be identified with a different set of properties, so it is important to be able to determine what properties need to be specified to access data of a specific type. It is possible to get all `DatasetRef` (which include the `dataId`) for a specific `datasetType` in a specific collection with a query like this. Note that this doesn't necessarily guarantee that the specific data set exists, so we include a check that the data set has a valid Uniform Resource Identifier (URI)."
+"We access specific data sets through a set of specifications known as a data identifier (`dataId`). Each `DatasetType` can be identified with a different set of properties, so it is important to be able to determine what properties need to be specified to access data of a specific type. It is possible to get all `DatasetRef` (which include the `dataId`) for a specific `datasetType` in a specific collection with a query like this. Note that this doesn't necessarily guarantee that the specific data set exists, so we include a check that the data set has a valid URI."
 ]
 },
 {
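The way a `dataId` selects datasets can be illustrated without the LSST stack at all: it is essentially a dictionary of dimension values that every matching reference must share. A toy sketch with plain dicts (the third visit number below is made up; this mimics the shape of a registry query, not its API):

```python
# Toy stand-in for a dataset query: each "ref" carries a dataId, and a
# query returns the refs consistent with every constraint given.
refs = [
    {"visit": 703697, "detector": 80, "band": "g"},
    {"visit": 703697, "detector": 81, "band": "g"},
    {"visit": 192350, "detector": 80, "band": "i"},  # hypothetical visit
]

def query_datasets(refs, **data_id):
    """Return refs whose dataId matches all of the given key/value pairs."""
    return [r for r in refs if all(r.get(k) == v for k, v in data_id.items())]

print(len(query_datasets(refs, detector=80)))                 # 2
print(query_datasets(refs, visit=703697, detector=80, band="g"))
```

Adding more keys to the `dataId` narrows the selection, exactly as adding constraints to a Butler query does.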
@@ -247,7 +263,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We can get the path from the URI that is returned by `butler.getURI`. Note that this URI does not refer to a local path on the filesystem. We do not need to know exactly where the data live in order to access it. That's the power of the Butler."
+"In the code above we were accessing the Uniform Resource Identifier (URI) for each data product from `butler.getURI`. Note that in this case, the URI is pointing to an S3 bucket and not a local path on the filesystem. We do not need to know exactly where the data live in order to access it. That's the power of the Butler."
 ]
 },
 {
@@ -266,7 +282,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now say we want to restrict our selection to datasets associated with a specific filter. We can add that constraint to our query, but first we need to figure out what the filters are called... Looking at the dataId object, we see the attributes `physical_filter` and `band` look promising."
+"Now say we want to restrict our selection to images that were taken in a specific optical filter. We can add that constraint to our query, but first we need to figure out what the LSST filters are called... Looking at the `dataId`, we see that the attribute `band` looks promising."
 ]
 },
 {
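"Figuring out what the filters are called" amounts to collecting the distinct `band` values across dataIds and then constraining on one of them. A stack-free sketch (the visit numbers and band assignments below are invented for illustration):

```python
# Toy dataIds: discover the distinct band values, then keep one band.
data_ids = [
    {"visit": 1, "band": "g"},
    {"visit": 2, "band": "r"},
    {"visit": 3, "band": "g"},
    {"visit": 4, "band": "i"},
]

bands = sorted({d["band"] for d in data_ids})       # the available filters
g_band = [d for d in data_ids if d["band"] == "g"]  # constrain to one band

print(bands)        # ['g', 'i', 'r']
print(len(g_band))  # 2
```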
@@ -284,7 +300,6 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"print(f\"physical_filter = {ref.dataId['physical_filter']}\")\n",
 "print(f\"band = {ref.dataId['band']}\")"
 ]
 },
@@ -335,7 +350,7 @@
 "# We could pass the datasetRef that we found above, but since the query may \n",
 "# return results in a different order we define the dataId explicitly for reproducibility. \n",
 "dataId = {'visit': '703697', 'detector': 80, 'band':'g'}\n",
-"calexp = butler.get('calexp',dataId=dataId)\n",
+"calexp = butler.get('calexp', dataId=dataId)\n",
 "\n",
 "# This will print a warning related to the gen2 to gen3 Butler conversion that was performed on DP0.1"
 ]
@@ -344,7 +359,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The `calexp` is a calibrated CCD from a single exposure. We'll use the afwDisplay interface to show the pixel values and mask plane (more on afwDisplay can be found in other notebooks)."
+"The `calexp`, also known as a \"Processed Visit Image\" (PVI), is a calibrated CCD from a single exposure. We'll use the afwDisplay interface to show the pixel values and mask plane (more on afwDisplay can be found in other notebooks)."
 ]
 },
 {
@@ -373,10 +388,9 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# We could get the src table using the dataId as we did above for the calexp, \n",
-"# but this would require the butler to perform another query of the database. \n",
-"# Instead, we can just pass the ref itself directly to butler.get\n",
-"src = butler.get('src',dataId)\n",
+"# We can get the src table using the dataId as we did above for the calexp\n",
+"# (note that it is also possible to pass the data ref)\n",
+"src = butler.get('src',dataId=dataId)\n",
 "src = src.copy(True)\n",
 "src.asAstropy()"
 ]
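The `src.copy(True)` call in the hunk above requests a deep copy. The motivation is the same as for any Python container: a deep copy decouples the new object from shared underlying storage. A generic illustration with the standard `copy` module (an analogy, not the `afw` table internals):

```python
import copy

# Shallow vs. deep copy: only the deep copy decouples nested data.
original = {"flux": [1.0, 2.0, 3.0]}
shallow = copy.copy(original)
deep = copy.deepcopy(original)

shallow["flux"][0] = -99.0   # writes through to the original list
print(original["flux"][0])   # -99.0
deep["flux"][1] = -99.0      # the deep copy has its own list
print(original["flux"][1])   # 2.0
```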
@@ -385,7 +399,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We can now plot the `calexp` with the `src` catalog overlaid. We leave an investigation of this image as an exercise to the user :)"
+"We can now plot the `calexp` with the `src` catalog overlaid. We'll leave further investigation of this image as an exercise to the user :)"
 ]
 },
 {
@@ -413,7 +427,7 @@
 "source": [
 "### 4. How to query for multiple data sets\n",
 "\n",
-"In the case above, both the `calexp` and `src` can be found by the registry, but this will not always necessarily be the case. The `queryDimensions` method provides a more flexible way to query for multiple datasets (requiring an instance of all datasets to be available for that dataId) or ask for different dataId keys than what is used to identify the dataset (which invokes various built-in relationships). An example of this is provided below:"
+"In the case above, both the `calexp` and `src` can be found by the registry, but this will not always necessarily be the case. The `queryDimensions` method provides a more flexible way to query for multiple datasets (requiring an instance of all datasets to be available for that `dataId`) or ask for different `dataId` keys than what is used to identify the dataset (which invokes various built-in relationships). An example of this is provided below:"
 ]
 },
 {
@@ -469,7 +483,7 @@
 "source": [
 "### 5. Generate an Image Cutout\n",
 "\n",
-"Say we want to grab a cutout of the DP0.1 coadded images at a specific location. In order to do this, we need a few other packages from the LSST Science Pipelines. In particular, access to the geometry and coordinate packages."
+"Now say we want to grab a cutout of the DP0.1 coadded images at a specific location. In order to do this, we need a few other packages from the LSST Science Pipelines. In particular, we need access to the geometry and coordinate packages."
 ]
 },
 {
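The core geometry inside a cutout function can be sketched without the stack: convert the requested angular size to pixels with the plate scale and clip the box to the image bounds. A minimal sketch (the 0.2 arcsec/pixel value is the approximate LSSTCam plate scale; the function name and signature are illustrative, not the notebook's `cutout_coadd`):

```python
PIXEL_SCALE = 0.2  # arcsec per pixel, approximate LSSTCam plate scale

def cutout_bounds(xc, yc, size_arcsec, width, height):
    """Pixel bounds (x0, y0, x1, y1) of a square cutout centered on
    (xc, yc), clipped to an image of the given width and height."""
    half = int(round(size_arcsec / PIXEL_SCALE / 2.0))
    x0, x1 = max(0, xc - half), min(width - 1, xc + half)
    y0, y1 = max(0, yc - half), min(height - 1, yc + half)
    return x0, y0, x1, y1

print(cutout_bounds(100, 100, 20.0, 4000, 4000))  # (50, 50, 150, 150)
print(cutout_bounds(10, 10, 20.0, 4000, 4000))    # clipped at the image edge
```

In the real pipelines this clipping is what the geometry package's bounding-box classes handle, with the WCS supplying the sky-to-pixel conversion.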
@@ -482,6 +496,13 @@
 "import lsst.afw.coord as afwCoord"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Next we define a `cutout_coadd` function to query an instance of the Butler for a cutout image at a specific location, band, and size."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -565,11 +586,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### 6. Retrieve and plot catalog data from the Butler\n",
+"### 6. Exploring and retrieving catalog data from the Butler\n",
 "\n",
-"The TAP service is the recommended way to retrieve DP0.1 catalog data for a notebook, and there are several other tutuorials that demonstrate how to use the TAP service.\n",
+"The TAP service is the recommended way to retrieve DP0.1 catalog data for a notebook, and there are several other [tutorials](https://github.com/rubin-dp0/tutorial-notebooks) that demonstrate how to use the TAP service.\n",
 "\n",
-"But if Butler access to catalog data is needed, an easy way to start is by retrieving only the schema data set for a Butler catalog, which can be done without specifying the ``dataId``. "
+"However, as we saw above, the Butler can also be used to access catalog data. We can investigate the table schema for a specific source catalog by appending `_schema` to the `datasetType`. Note that this does not require you to specify the ``dataId``. "
 ]
 },
 {
@@ -586,7 +607,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Each of the following lines will print the schema to the screen in different ways."
+"The table `schema` describes the columns stored in the table. Each of the following lines will print the schema to the screen in different ways."
 ]
 },
 {
@@ -643,7 +664,8 @@
 "# Print the associated docstring for each of the named columns of interest\n",
 "for name in ['base_SdssShape_psf_xx','base_SdssShape_psf_yy','base_SdssShape_psf_xy']:\n",
 "    doc = schema_dict[name].getField().getDoc()\n",
-"    print(name, ' = ', doc)"
+"    units = schema_dict[name].getField().getUnits()\n",
+"    print(name, '[%s]'%units, ' = ', doc)"
 ]
 },
 {
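The loop above can be mimicked with a plain dictionary to show what the added `units` formatting produces. The column docs and units below are placeholders standing in for the real pipeline metadata:

```python
# Toy schema: column name -> (units, docstring). Values are placeholders.
schema_doc = {
    "base_SdssShape_psf_xx": ("pixel^2", "adaptive moment xx of the PSF"),
    "base_SdssShape_psf_yy": ("pixel^2", "adaptive moment yy of the PSF"),
}

lines = []
for name, (units, doc) in schema_doc.items():
    lines.append("%s [%s] = %s" % (name, units, doc))

print("\n".join(lines))
```

Including the units next to each column name makes the printed schema far easier to interpret, which is the point of the two added lines in the hunk above.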
@@ -658,7 +680,7 @@
 "metadata": {},
 "source": [
 "<br>\n",
-"The catalogs are very large and it is not feasible to try and retrieve it in its entirety.\n",
+"The full catalogs are very large and it is not feasible to try and retrieve them in their entirety.\n",
 "Instead, in this example we identify the tract and patch of interest and retrieve only catalog data for a small region of sky.\n",
 "Use the same ra and dec coordinates as above to find the patch and tract."
 ]
@@ -684,7 +706,7 @@
 "coaddId = {'tract': tract, 'patch': patch, 'band':'i'}\n",
 "\n",
 "coadd_src = butler.get('deepCoadd_forced_src',coaddId)\n",
-"coadd_src = src.copy(True)"
+"coadd_src = coadd_src.copy(True)"
 ]
 },
 {
@@ -748,7 +770,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Plot the locations of sources in the patch."
+"Plot the locations of sources in the patch as well as the ra,dec that we requested. Note that the `coord_ra` and `coord_dec` are in radians, so we need to convert them to degrees."
 ]
 },
 {
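The radians-to-degrees fix introduced in the next hunk is just `numpy.degrees`; a self-contained check with made-up coordinate values:

```python
import numpy as np

# coord_ra / coord_dec come back from the source tables in radians;
# convert before plotting in degrees. These coordinate values are made up.
coord_ra = np.array([np.pi / 3.0, np.pi / 2.0])
coord_dec = np.array([-np.pi / 6.0, 0.0])

print(np.degrees(coord_ra))   # [60. 90.]
print(np.degrees(coord_dec))  # [-30.   0.]
```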
@@ -758,11 +780,19 @@
 "outputs": [],
 "source": [
 "fig = plt.figure()\n",
-"plt.plot( data['coord_ra'].values, data['coord_dec'].values, 'o', ms=2, alpha=0.5 )\n",
-"plt.xlabel('RA')\n",
-"plt.ylabel('Dec')\n",
+"plt.plot(np.degrees(data['coord_ra'].values), np.degrees(data['coord_dec'].values), 'o', ms=2, alpha=0.5 )\n",
+"plt.plot(ra,dec,'*',ms=25, mec='k')\n",
+"plt.xlabel('RA (deg)')\n",
+"plt.ylabel('Dec (deg)')\n",
 "plt.title('Butler coadd_forced_src objects in tract 4638 patch 43')"
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": []
 }
 ],
 "metadata": {
