|
| 1 | +# Biochemistry Example |
| 2 | + |
| 3 | +This hands-on example demonstrates how to: |
| 4 | + |
| 5 | +1. Import a table from Excel/CSV into MetaConfigurator |
| 6 | +2. Automatically infer the data model and generate a schema |
| 7 | +3. Use the MetaConfigurator UI to make changes to the schema |
| 8 | +4. Export the schema and data to a JSON file |
| 9 | +5. Benefit from having the data and schema in a machine-readable format by running a simple Python script that enriches the synthesis data with additional metadata from the PubChem database |
| 10 | + |
| 11 | +The example is based on a dataset of metal-organic framework (MOF) synthesis data. |
| 12 | +MetaConfigurator itself, however, is a generic tool and can be used for data from any domain. |
| 13 | + |
| 14 | +The goal is to demonstrate how data from Excel/CSV can be made machine-readable using MetaConfigurator. |
| 15 | +We also show how to create a data model (schema), which helps communicate the structure of the data to others. |
| 16 | +Finally, we show that having the data in this machine-readable format allows for easy integration with other tools and services, such as the PubChem database. |
| 17 | +This applies to any kind of data: once it is in a machine-readable format, any tooling, programming language or machine learning method can be applied to it. |
| 18 | + |
| 19 | +## Step 1: Import the data |
| 20 | + |
| 21 | +Download the [ec-mof-synthesis.csv](ec-mof-synthesis.csv) file. |
| 22 | +Note that the data was originally in an Excel file and has already been exported to CSV. |
| 23 | + |
| 24 | +Open MetaConfigurator and click on the "Import Data..." button (not to be confused with the "Open Data" button). |
| 25 | + |
| 26 | + |
| 27 | + |
| 28 | +Select the "Import CSV Data" option and choose the `ec-mof-synthesis.csv` file. |
| 29 | + |
| 30 | + |
| 31 | + |
| 32 | + |
| 33 | + |
| 34 | +Keep the "Independent Table" and "Infer and generate schema for the data" options selected. |
| 35 | + |
| 36 | + |
| 37 | + |
| 38 | +Press the "Import" button to import the data and generate the schema. |
| 39 | + |
| 40 | + |
| 41 | + |
| 42 | +The data has now been imported and the schema has been generated. |
| 43 | + |
| 44 | + |
| 45 | + |
| 46 | + |
| 47 | +## Step 2: Making changes to the schema |
| 48 | + |
| 49 | +The schema generated by MetaConfigurator can be further refined. |
| 50 | +Notice that the `phase_purity` attribute was inferred as a string, although it should be a boolean (true/false) value. |
| 51 | + |
| 52 | +First, let's navigate to the Schema Editor tab by clicking the "Data Editor" button on the top left and then selecting the "Schema Editor" tab. |
| 53 | + |
| 54 | + |
| 55 | + |
| 56 | +This tab has different available views. |
| 57 | +The text view shows the schema in its raw text form in JSON format. |
| 58 | +It is best suited to advanced users who are familiar with the JSON Schema format. |
| 59 | +The GUI view is more user-friendly and assists the user by showing all options available for defining the schema. |
| 60 | +The simplest view is the diagram view, which shows the schema in graphical form. |
| 61 | +Some more advanced schema options, such as conditionals and composition, cannot be expressed in the diagram view and require the GUI or text view. |
| 62 | +For our example, the diagram view is sufficient. |
| 63 | +Hence, let's hide the other views and open only the diagram view. |
| 64 | +This can be done using the top toolbar. |
| 65 | + |
| 66 | + |
| 67 | + |
| 68 | +Click on the buttons to hide the text editor and the GUI editor. |
| 69 | +Afterward, only the diagram view should be visible. |
| 70 | + |
| 71 | + |
| 72 | + |
| 73 | +In the diagram, click on the `phase_purity` attribute to edit it. |
| 74 | +Then, change the type to "boolean". |
| 75 | + |
| 76 | + |
| 77 | + |
| 78 | +In the top menu bar, click the "Show Preview of resulting GUI" button to see what the schema will look like in the GUI view. |
| 79 | + |
| 80 | + |
| 81 | + |
| 82 | +In the GUI view, the `phase_purity` attribute is now represented as a checkbox instead of a text field. |
| 83 | +Todo: auto-convert yes and no values? |
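Because `phase_purity` is now a boolean, string values in the imported data no longer match the schema. The following is a minimal sketch of converting them in Python. It assumes the exported JSON is a list of row objects and that the CSV stored the values as `yes`/`no` style strings; neither is confirmed here, and the helper names `to_bool` and `convert_phase_purity` are hypothetical.

```python
import json

# Assumed spellings; adjust to whatever the CSV actually contains.
TRUE_VALUES = {"yes", "true", "1"}
FALSE_VALUES = {"no", "false", "0"}


def to_bool(value):
    """Convert a yes/no style string to a bool; return anything else unchanged."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        normalized = value.strip().lower()
        if normalized in TRUE_VALUES:
            return True
        if normalized in FALSE_VALUES:
            return False
    return value  # unknown values are kept so nothing is silently lost


def convert_phase_purity(rows, key="phase_purity"):
    """Return copies of the rows with `key` converted to a boolean where possible."""
    converted = []
    for row in rows:
        new_row = dict(row)
        if key in new_row:
            new_row[key] = to_bool(new_row[key])
        converted.append(new_row)
    return converted


if __name__ == "__main__":
    sample = [{"phase_purity": "yes"}, {"phase_purity": "No"}]
    print(json.dumps(convert_phase_purity(sample), indent=2))
```

Values that cannot be interpreted are left as they are, so a later schema validation would still flag them instead of hiding the problem.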
| 84 | + |
| 85 | + |
| 86 | + |
| 87 | +In our data, we also notice that `metal_salt_mass_unit` and `linker_mass_unit` both seem to use the same kind of unit. |
| 88 | +Rather than allowing an arbitrary string for these fields, we can define a common unit type that is used by both attributes. |
| 89 | +To do so, add a new enum to the schema by clicking the "Add Enum" button in the diagram view. |
| 90 | + |
| 91 | + |
| 92 | + |
| 93 | +A new placeholder enumeration is added to the schema. |
| 94 | +Change the name of the enumeration to `mass_unit` and add the possible values `kg`, `g` and `mg`. |
| 95 | + |
| 96 | + |
| 97 | + |
| 98 | +Now, change the type of the `metal_salt_mass_unit` and `linker_mass_unit` attributes to the newly defined `mass_unit` enumeration. |
| 99 | + |
| 100 | + |
| 101 | + |
| 102 | +This will automatically create new edges from the attributes to the enumeration in the diagram. |
| 103 | +Notice that in the GUI preview, the `metal_salt_mass_unit` and `linker_mass_unit` attributes are now represented as dropdowns with the possible values. |
| 104 | + |
| 105 | + |
| 106 | + |
| 107 | +In the same manner, create a new enum `time_unit` with the values `s`, `min`, `h`, `day` and `week`. |
| 108 | +Then change the type of the `time_unit` attribute to the new enumeration. |
| 109 | + |
| 110 | +Do the same for the `temperature_unit` attribute, using the values `K`, `deg C` and `deg F`. |
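The effect of these enumerations can be sketched in plain Python: a unit field is only valid if its value is one of the listed options. The field names and allowed values are taken from this example; the row layout (a list of row objects) and the helper name `find_invalid_units` are assumptions made for illustration.

```python
# Allowed values per unit field, mirroring the enumerations defined above.
MASS_UNITS = {"kg", "g", "mg"}
ALLOWED_UNITS = {
    "metal_salt_mass_unit": MASS_UNITS,
    "linker_mass_unit": MASS_UNITS,
    "time_unit": {"s", "min", "h", "day", "week"},
    "temperature_unit": {"K", "deg C", "deg F"},
}


def find_invalid_units(rows):
    """Return (row index, field, value) for every unit value outside its enumeration."""
    problems = []
    for index, row in enumerate(rows):
        for field, allowed in ALLOWED_UNITS.items():
            if field in row and row[field] not in allowed:
                problems.append((index, field, row[field]))
    return problems


if __name__ == "__main__":
    rows = [
        {"time_unit": "h", "linker_mass_unit": "mg"},
        {"time_unit": "hours", "temperature_unit": "deg C"},
    ]
    for index, field, value in find_invalid_units(rows):
        print(f"row {index}: {field} has unexpected value {value!r}")
```

A check like this is useful before switching the attribute types, because it lists the rows whose values would no longer be accepted.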
| 111 | + |
| 112 | +The resulting schema is provided in the [ecmofsynthesis.schema.json](ecmofsynthesis.schema.json) file. |
| 113 | +The data is provided in the [ecmofsynthesis.json](ecmofsynthesis.json) file. |
| 114 | + |
| 115 | + |
| 116 | + |
| 117 | +## Step 3: Enriching the data with additional metadata |
| 118 | + |
| 119 | +With this initial data model (schema) and data in place, we now want to run a Python script that extends all compounds (metal salt and linker) with additional metadata from the PubChem database: |
| 120 | +- InChI code |
| 121 | +- SMILES code |
| 122 | +- Molecular weight |
| 123 | +- CID (PubChem Compound ID) |
| 124 | + |
| 125 | +Let's first adapt our data model and add the new attributes to the schema. |
| 126 | +Because the new attributes apply to both the `metal_salt` and `linker` compounds, we define a new schema object `compound`. |
| 127 | + |
| 128 | +Click on the "Add Object" button in the diagram view to add a new object to the schema. |
| 129 | +Change the name of the object to `compound`. |
| 130 | +Add the following attributes: |
| 131 | +- `inchi_code` (string) |
| 132 | +- `smiles_code` (string) |
| 133 | +- `molecular_weight` (number) |
| 134 | +- `cid` (number) |
| 135 | + |
| 136 | + |
| 137 | + |
| 138 | +Now we can add a new property `metal_salt` of type `compound` to the schema. |
| 139 | +Let's do the same for the `linker` property. |
| 140 | + |
| 141 | + |
| 142 | + |
| 143 | +The resulting schema is provided in the [ecmofsynthesis_enriched.schema.json](ecmofsynthesis_enriched.schema.json) file. |
| 144 | + |
| 145 | +Note that the design of the data model depends on preference and use case. |
| 146 | +In this example, we added a new object `compound` to the schema, which is used by both the `metal_salt` and `linker` properties. |
| 147 | +We did not change any existing properties; we only added new ones. |
| 148 | +It would also be a valid choice to move the `metal_salt_mass_unit`, `metal_salt_mass`, `linker_mass_unit` and `linker_mass` attributes to the `compound` object. |
| 149 | +The same applies to the `linker_name` and `metal_salt_name` properties. |
| 150 | +They could also be removed entirely, as the `cid` attribute identifies the compounds (with one exception: if no corresponding molecule is found in PubChem, the name would be lost). |
| 151 | + |
| 152 | +Now, let's write a Python script to enrich the data with the additional metadata. |
| 153 | +The script is provided in the [enrich_data.py](enrich_data.py) file. |
| 154 | +The resulting JSON file is provided in the [ecmofsynthesis_enriched.json](ecmofsynthesis_enriched.json) file. |
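The overall shape of such an enrichment step can be sketched as follows. This is not the provided `enrich_data.py`, only a minimal illustration under stated assumptions: the data is a list of row objects, compound names are stored in `metal_salt_name` and `linker_name`, and the PubChem query is hidden behind a hypothetical `lookup` callable (here replaced by a stub with placeholder values, so the sketch runs without network access).

```python
import json


def enrich_rows(rows, lookup):
    """Add `metal_salt` and `linker` compound objects to copies of the rows.

    `lookup` is any callable that maps a compound name to a dict with the keys
    inchi_code, smiles_code, molecular_weight and cid, or to None if the
    compound is not found. A real lookup would query PubChem.
    """
    enriched = []
    for row in rows:
        new_row = dict(row)  # existing properties stay untouched
        for prop in ("metal_salt", "linker"):
            name = row.get(f"{prop}_name")
            compound = lookup(name) if name else None
            if compound is not None:
                new_row[prop] = compound
        enriched.append(new_row)
    return enriched


if __name__ == "__main__":
    # Stub lookup with placeholder values, NOT real PubChem data.
    stub = {
        "example salt": {
            "inchi_code": "placeholder",
            "smiles_code": "placeholder",
            "molecular_weight": 0.0,
            "cid": 0,
        }
    }
    rows = [{"metal_salt_name": "example salt", "linker_name": "unknown linker"}]
    print(json.dumps(enrich_rows(rows, stub.get), indent=2))
```

Keeping the lookup separate from the row handling means the same function works with a cached table, a test stub or a live PubChem client. Rows whose compounds are not found keep their name fields and simply receive no `compound` object.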
| 155 | + |
| 156 | +If we load the enriched data into MetaConfigurator (using the "Import Data" button), we can see that the `metal_salt` and `linker` properties now have the additional metadata. |
| 157 | + |
| 158 | + |
| 159 | + |
| 160 | +## Takeaways |
| 161 | + |
| 162 | +This example demonstrated how to import data from Excel/CSV into MetaConfigurator, generate a schema and make changes to the schema. |
| 163 | +It also showed how to enrich the data with additional metadata from the PubChem database using a Python script. |
| 164 | + |
| 165 | +Once the data is in a machine-readable format, it is easy to apply additional tools and services to it. |
| 166 | +This can be useful for data integration, data analysis, data visualization, machine learning and many other applications. |
| 167 | +MetaConfigurator is a generic tool and can be used for data from any domain. |