Commit 73aacab

Merge pull request #292 from nhsengland/mt_update_cme: MT update cme page
2 parents 162d65b + 727cbd3

6 files changed: 90 additions & 26 deletions

docs/our_work/clinical_measurement_extractor.md

summary: 'Building a LLM-based named-entity extraction pipeline using AWS resour
tags: ['WIP','MACHINE LEARNING','NATURAL LANGUAGE PROCESSING', 'LLM','PYTHON', 'TEXT DATA']
---

## About

The National Disease Registration Service (NDRS) processes hundreds of thousands of free-text pathology reports every year, where key metrics and information are manually recorded and logged. Given the increasing data quantities and the lack of scalability of the current process, using Large Language Models (LLMs) to support the registration process could improve efficiency and reduce cost. By using an LLM for some of the more straightforward information extraction tasks, registration staff time could be freed up for more complex cases, the range of collected metrics could be expanded, and the time lag between receiving a report and logging its information could be reduced.

This project is a proof-of-concept LLM-based named-entity extraction pipeline that extracts clinical measurements from free-text pathology reports. The codebase is written in Python, developed in Amazon SageMaker, and uses Amazon Bedrock to access LLMs.

This proof of concept focusses on extracting measurements from breast cancer pathology reports. For simplicity, we restricted the task to five metrics:

- ER Status
- ER Score
- PR Status
- PR Score
- HER2 Status

## Data

A sample of 500 breast cancer pathology reports was taken, with 300 reserved for prompt engineering and 200 set aside as an evaluation (hold-out) set. The evaluation set covered two different time periods: the first 100 reports came from the same time range as the 300 records used in prompt engineering, and the last 100 came from a year later, to assess how data drift may affect the results.

All data was anonymised by redacting sensitive information such as names, locations and dates.

## Methodology

![High-level structure of the pipeline created in AWS](../images/clinical_measurement_extractor/clinical_measurement_pipeline.png)
<figcaption>Figure 1: High-level steps of the LLM-based named-entity extraction pipeline</figcaption>

1. **Pre-processing reports**: Apply some simple cleaning steps to the free-text input data.
2. **Detecting multiple tumours**:
    * Prompt an LLM to identify reports that discuss multiple tumours.
    * Reports with multiple tumours do not progress further in the pipeline.
3. **Entity extraction**:
    * Prompt an LLM hosted on Bedrock to extract entities from the free-text data:
        * Provide the model with the relevant guidelines and instructions for extracting the metrics.
        * Provide the model with the accepted values for each metric and the JSON structure to guide the response.
        * Provide examples to the model to improve performance and to further guide the output structure.
    * Parse the output into a JSON/Python dictionary and handle any incorrectly structured output accordingly.
4. **Post-Processing**:
    * Validate that the keys and values in the JSON structure are as expected.
    * Certain values can be automatically mapped to the correct format.
    * Some values can be inferred from other extracted metrics.

## Evaluation Approach

We aim to evaluate the outputs of this work using multiple methods:

1. **Accuracy, Precision, Recall, and F1 Score** for each extracted entity, and for every potential value of each metric, to assess whether performance differs across them.
2. **Out-of-Distribution Performance Evaluation**: We evaluated performance on hold-out sets from both the same time period and a later one, to assess the impact of data drift.
3. **Ensemble/consensus approach**: We will have two different LLMs perform the same extraction and compare the agreement rate between them. We can then see whether this approach improves performance or provides more confidence in the final answer.
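
The per-metric "% Correct" figures reported in the results are a straightforward comparison of each extraction against a gold-standard annotation. A minimal sketch (the function name and dictionary layout are illustrative, not taken from the project code):

```python
from collections import defaultdict

def percent_correct(predictions: list[dict], gold: list[dict]) -> dict:
    """Per-metric percentage of predicted values matching the gold annotations."""
    hits: dict = defaultdict(int)
    totals: dict = defaultdict(int)
    for pred, truth in zip(predictions, gold):
        for metric, expected in truth.items():
            totals[metric] += 1
            hits[metric] += pred.get(metric) == expected
    return {metric: round(100 * hits[metric] / totals[metric], 1) for metric in totals}

preds = [{"er_status": "positive"}, {"er_status": "negative"}, {"er_status": "positive"}]
gold = [{"er_status": "positive"}, {"er_status": "positive"}, {"er_status": "positive"}]
print(percent_correct(preds, gold))  # {'er_status': 66.7}
```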
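
The ensemble/consensus approach could be implemented by running two models independently over the same reports and measuring how often their per-metric answers coincide; a hedged sketch under assumed names (this is not the project's implementation):

```python
def agreement_rate(model_a: list[dict], model_b: list[dict]) -> float:
    """Fraction of per-metric answers on which two LLMs agree."""
    agree = total = 0
    for a, b in zip(model_a, model_b):
        for metric in a.keys() & b.keys():
            total += 1
            agree += a[metric] == b[metric]
    return agree / total if total else 0.0

def consensus(a: dict, b: dict) -> dict:
    """Keep a value only where both models agree; None flags it for manual review."""
    return {metric: a[metric] if a.get(metric) == b.get(metric) else None for metric in a}

runs_a = [{"er_status": "positive", "her2_status": "negative"}]
runs_b = [{"er_status": "positive", "her2_status": "borderline"}]
print(agreement_rate(runs_a, runs_b))   # 0.5
print(consensus(runs_a[0], runs_b[0]))  # {'er_status': 'positive', 'her2_status': None}
```

Disagreements need not be discarded: routing them to registration staff keeps the automated path high-confidence while preserving human oversight for ambiguous reports.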
## Results

#### Performance achieved during prompt engineering

| Dataset | Record count | Multi-tumour records flagged |
| ------ | -------- | -------- |
| Prompt engineering | 300 | 40 |

| Metric | % Correct |
| ------ | -------- |
| ER Status | 98.1 |
| ER Score | 98.5 |
| PR Status | 98.8 |
| PR Score | 100 |
| HER2 Status | 98.8 |

??? info "Graph View"

    ![Prompt engineering results](../images/clinical_measurement_extractor/prompt_engineering_results.png)
    <figcaption>Figure 2: Two bar charts showing the results from prompt engineering</figcaption>

#### Performance on evaluation datasets

**Eval set 1:** Data from the same time period as the prompt engineering data.
**Eval set 2:** Data from a time period one year later than all other data.

| Dataset | Record count | Multi-tumour records flagged |
| ------ | -------- | -------- |
| Eval set 1 | 100 | 14 |
| Eval set 2 | 100 | 17 |

| Metric | % Correct (eval set 1) | % Correct (eval set 2) |
| ------ | -------- | -------- |
| ER Status | 100 | 97.6 |
| ER Score | 100 | 95.2 |
| PR Status | 100 | 97.6 |
| PR Score | 100 | 96.4 |
| HER2 Status | 90.7 | 95.2 |

??? info "Graph View Evaluation Set 1"

    ![Evaluation results (same time period)](../images/clinical_measurement_extractor/batch4_results.png)
    <figcaption>Two bar charts showing the results from the evaluation set (same time period as the prompt engineering data)</figcaption>

??? info "Graph View Evaluation Set 2"

    ![Evaluation results (different time period)](../images/clinical_measurement_extractor/batch5_results.png)
    <figcaption>Two bar charts showing the results from the evaluation set (different time period)</figcaption>

### Outputs

| Output | Link |
| --------- | ------ |
| Public CME Repo | [GitHub Repo](https://github.com/NHSE-NDRS/clinical_measurement_extractor_public) |
