summary: 'Building an LLM-based named-entity extraction pipeline using AWS resources'
tags: ['WIP','MACHINE LEARNING','NATURAL LANGUAGE PROCESSING', 'LLM','PYTHON', 'TEXT DATA']
---

## About

The National Disease Registration Service (NDRS) processes hundreds of thousands of free-text pathology reports every year, from which key metrics and information are manually recorded and logged. Given the increasing data volumes and the limited scalability of the current process, using Large Language Models (LLMs) to support the registration process could improve both efficiency and cost. By handing some of the more straightforward information extraction tasks to an LLM, registration staff time could be freed up to work on more complex cases, to expand the range of metrics collected, and to reduce the time lag between receiving a report and logging its information.

This project is a proof-of-concept LLM-based named-entity extraction pipeline that extracts clinical measurements from free-text pathology reports. The codebase is written in Python, developed in Amazon SageMaker, and uses Amazon Bedrock to access LLMs.
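As a minimal sketch of how a pipeline like this might call a Bedrock-hosted model (the model ID, prompt wording and inference settings here are illustrative assumptions, not the project's actual configuration):

```python
def build_user_message(report_text: str) -> str:
    """Wrap a report in a minimal extraction instruction (illustrative wording)."""
    return (
        "Extract ER Status, ER Score, PR Status, PR Score and HER2 Status from the "
        "pathology report below. Reply with JSON only.\n\n" + report_text
    )


def extract_entities(report_text: str,
                     model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0") -> str:
    """Send one report to a Bedrock-hosted LLM and return the raw text reply."""
    import boto3  # imported lazily so build_user_message stays usable without AWS

    client = boto3.client("bedrock-runtime")
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": build_user_message(report_text)}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 512},  # deterministic output for extraction
    )
    return response["output"]["message"]["content"][0]["text"]
```

Running `extract_entities` requires AWS credentials with Bedrock access; the prompt-building step is a plain function so it can be tested offline.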

The proof of concept focuses on extracting measurements from breast cancer pathology reports. For simplicity, we restricted the task to five metrics:

- ER Status
- ER Score
- PR Status
- PR Score
- HER2 Status

## Data

A sample of 500 breast cancer pathology reports was taken, with 300 used for prompt engineering and 200 set aside as an evaluation (hold-out) set. The evaluation set covered two time periods: the first 100 reports came from the same period as the 300 prompt-engineering records, and the last 100 from a year later, to assess how data drift may affect the results.

All data was anonymised by redacting all sensitive information such as names, locations and dates.
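The report does not describe the redaction mechanism itself; as a toy illustration of the idea, a date-only regex redactor might look like the following (real anonymisation of names and locations needs proper NER, not regexes):

```python
import re

# Mask obvious numeric date patterns such as 12/03/2021 or 12-03-21.
# This is only a sketch of the redaction idea, not the project's method.
DATE_PATTERN = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")


def redact_dates(text: str) -> str:
    """Replace date-like strings with a placeholder token."""
    return DATE_PATTERN.sub("[REDACTED DATE]", text)
```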
## Methodology

![Entity extraction pipeline](../images/ndrs_entity_extraction_pipeline.png)
<figcaption>Figure 1: High-level steps of the LLM-based named-entity extraction pipeline</figcaption>

1. **Pre-processing reports**: Apply some simple cleaning steps to the free-text input data.
2. **Detecting multiple tumours**:
    * Prompt an LLM to identify reports that discuss multiple tumours.
    * Reports with multiple tumours do not progress further in the pipeline.
3. **Entity extraction**:
    * Prompt an LLM hosted on Bedrock to extract entities from the free-text data:
        * Provide the model with the relevant guidelines and instructions for extracting the metrics.
        * Provide the model with the accepted values for each metric and the JSON structure to guide the response.
        * Provide examples to the model to improve performance and to further guide the output structure.
    * Parse the output into a JSON/Python dictionary and handle any incorrect structure accordingly.
4. **Post-processing**:
    * Validate that the keys and values in the JSON structure are as expected.
    * Certain values can be automatically mapped to the correct format.
    * Some values can be inferred from other extracted metrics.
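The extraction and post-processing steps above can be sketched as follows. The accepted value sets, JSON template and inference rule are illustrative placeholders; the project's real guidelines and value sets are not reproduced here.

```python
import json

# Illustrative accepted values -- placeholders, not the NDRS guideline values.
ACCEPTED_VALUES = {
    "er_status": {"positive", "negative", "not performed"},
    "er_score": {str(n) for n in range(9)},  # e.g. a 0-8 Allred-style score
    "pr_status": {"positive", "negative", "not performed"},
    "pr_score": {str(n) for n in range(9)},
    "her2_status": {"positive", "negative", "borderline"},
}

JSON_TEMPLATE = {key: "<value or 'not stated'>" for key in ACCEPTED_VALUES}


def build_prompt(report_text: str) -> str:
    """Step 3: instructions, accepted values and target JSON structure in one prompt."""
    return (
        "Extract the metrics below from the pathology report.\n"
        f"Accepted values: {json.dumps({k: sorted(v) for k, v in ACCEPTED_VALUES.items()})}\n"
        f"Reply using exactly this JSON structure: {json.dumps(JSON_TEMPLATE)}\n\n"
        f"Report:\n{report_text}"
    )


def parse_and_validate(raw_reply: str) -> dict:
    """Step 4: parse the model reply and validate keys and values."""
    try:
        extracted = json.loads(raw_reply)
    except json.JSONDecodeError:
        return {}  # incorrect structure is handled (here: flagged as empty)
    if set(extracted) != set(ACCEPTED_VALUES):
        return {}
    for key, value in extracted.items():
        if value not in ACCEPTED_VALUES[key] | {"not stated"}:
            return {}
    # Illustrative inference rule: a non-zero ER score implies a positive
    # ER status when the status itself was not stated.
    if extracted["er_status"] == "not stated" and extracted["er_score"] not in {"0", "not stated"}:
        extracted["er_status"] = "positive"
    return extracted
```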

## Evaluation Approach

We aim to evaluate the outputs of this work using multiple methods:

1. **Accuracy, Precision, Recall and F1 Score** for each extracted entity, and for every potential value of each metric, to assess whether performance differs across values.
2. **Out-of-Distribution Performance Evaluation**: We evaluated performance on a hold-out set drawn from both the same time period as the prompt-engineering data and a later one, to assess the impact of data drift.
3. **Ensemble/Consensus Approach**: We will have two different LLMs perform the same extraction and compare their agreement rate, to see whether this improves performance or provides more confidence in the final answer.
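A sketch of how the per-entity metrics and the ensemble agreement rate might be computed; the choice of positive label and the list-based inputs are assumptions for illustration.

```python
from typing import Dict, List


def entity_metrics(gold: List[str], predicted: List[str],
                   positive_label: str) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for one entity, one value treated as positive."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == p == positive_label)
    fp = sum(1 for g, p in zip(gold, predicted) if p == positive_label and g != positive_label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive_label and p != positive_label)
    accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def agreement_rate(model_a: List[str], model_b: List[str]) -> float:
    """Share of reports where two LLMs extracted the same value (consensus approach)."""
    return sum(1 for a, b in zip(model_a, model_b) if a == b) / len(model_a)
```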

## Results

#### Performance achieved during prompt engineering

| Dataset | Record count | Multi-tumour records flagged |
0 commit comments