Repository for classifying a set of patents according to the NACE Rev. 2.1 / ATECO classification, combining:
- filtered raw patent dataset download from Hugging Face;
- stratified patent sample creation;
- NACE taxonomy preprocessing;
- NACE class embedding generation;
- semantic retrieval of candidate classes;
- final patent classification through OpenAI models.
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
python src/patents_download_filtered.py
python src/patents_build_sample.py
python src/preprocess_nace.py
python src/build_nace_embeddings.py
python src/classify_patents_ateco.py

For production runs, make sure that src/classify_patents_ateco.py is configured with:

MAX_ROWS = None

Project structure:

patents-ateco/
|-- data/
| |-- raw/ # raw inputs and filtered patent datasets
| |-- interim/ # intermediate files generated by the scripts
| `-- processed/ # final classification outputs
|-- resources/
| `-- classification/
| `-- NACE_Rev2_1_Structure_Explanatory_Notes.xlsx
|-- src/
| |-- utils/
| | |-- config.py
| | |-- prompting.py
| | |-- retrieval.py
| | `-- validation.py
| |-- patents_download_filtered.py
| |-- patents_build_sample.py
| |-- preprocess_nace.py
| |-- build_nace_embeddings.py
| |-- classify_patents_ateco.py
| `-- test.py
|-- requirements.txt
`-- .env
Inputs:
resources/classification/NACE_Rev2_1_Structure_Explanatory_Notes.xlsx: source NACE classification file.
Generated raw datasets:
data/raw/patents_filtered.parquet: filtered patent dataset downloaded from Hugging Face.
data/raw/patents_sample_20pct_stratified_by_year.csv: stratified 20% patent sample built from the filtered dataset.
Generated intermediate and final files:
data/interim/nace_level4_preprocessed.parquet: preprocessed level-4 NACE dataset.
data/interim/nace_level4_embeddings.parquet: NACE dataset enriched with embeddings.
data/processed/patents_ateco_predictions_test.csv: classification output in test mode.
data/processed/patents_ateco_predictions.csv: classification output for the full run.
The project requires Python 3.10+ and the libraries used by the scripts, including:
- pandas
- numpy
- python-dotenv
- openai
- pyarrow
- openpyxl
- datasets
Environment setup:
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

The .env file must include at least:

OPENAI_API_KEY=your_api_key_here

To run the raw patent download step, you also need:

HF_TOKEN=your_huggingface_token_here

The main scripts should be executed in the following order.
Script: src/patents_download_filtered.py
What it does:
- reads the istat-ai/ai-patents dataset from Hugging Face;
- filters records based on dataset quality constraints;
- excludes records with ita_only != 0;
- removes records without an abstract;
- removes records with abstracts shorter than the configured threshold;
- derives the year from priority date or grant date;
- saves the filtered dataset into data/raw/.
Command:
python src/patents_download_filtered.py

Output:
data/raw/patents_filtered.parquet
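The filtering rules above can be sketched with pandas. The column names (ita_only, abstract, priority date, grant date) come from the description; the length threshold and the date parsing are assumptions, and the actual script may differ:

```python
import pandas as pd

MIN_ABSTRACT_LEN = 100  # assumed threshold; the script's configured value may differ

def filter_patents(df: pd.DataFrame, min_len: int = MIN_ABSTRACT_LEN) -> pd.DataFrame:
    """Apply the quality filters described above to a raw patent frame."""
    out = df[df["ita_only"] == 0]                    # exclude records with ita_only != 0
    out = out[out["abstract"].notna()]               # require an abstract
    out = out[out["abstract"].str.len() >= min_len]  # drop abstracts below the threshold
    # derive the year from priority date, falling back to grant date
    dates = out["priority date"].fillna(out["grant date"])
    return out.assign(year=pd.to_datetime(dates, errors="coerce").dt.year)
```

The fallback order (priority date first, then grant date) mirrors the bullet list above but is not guaranteed to match the script.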
Script: src/patents_build_sample.py
What it does:
- loads the filtered raw dataset from data/raw/patents_filtered.parquet;
- computes the number of available patents per year;
- builds a 20% stratified sample by year;
- writes the sample dataset used by the classification pipeline.
Command:
python src/patents_build_sample.py

Output:
data/raw/patents_sample_20pct_stratified_by_year.csv
Note:
- this step depends on data/raw/patents_filtered.parquet;
- it can be rerun independently if you want to rebuild the sample from the filtered raw dataset.
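Year-stratified sampling like the step above is typically a per-group draw; a minimal pandas sketch (the random seed is an assumption, not the script's actual value):

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, frac: float = 0.2, seed: int = 42) -> pd.DataFrame:
    """Draw `frac` of the patents within each year, so every year keeps
    the same proportion it has in the filtered dataset."""
    return (
        df.groupby("year", group_keys=False)
          .sample(frac=frac, random_state=seed)
          .reset_index(drop=True)
    )
```

Fixing the seed makes the sample reproducible across reruns.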
Script: src/preprocess_nace.py
What it does:
- reads the NACE Excel file from resources/classification/;
- filters level-4 classes;
- builds a semantic text field with code, division, title, and explanatory notes;
- saves the result as parquet.
Command:
python src/preprocess_nace.py

Output:
data/interim/nace_level4_preprocessed.parquet
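The semantic text field concatenates the fields listed above into one string per class. The field order, prefix, and separator below are assumptions for illustration; the script may format the text differently:

```python
def build_nace_text(code: str, division: str, title: str, notes: str) -> str:
    """Concatenate code, division, title, and explanatory notes into the
    single text used for embedding. Empty fields are skipped."""
    parts = [f"NACE {code}", division, title, notes]
    return " | ".join(p.strip() for p in parts if p and p.strip())
```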
Script: src/build_nace_embeddings.py
What it does:
- loads the preprocessed NACE dataset;
- generates embeddings for the text field using text-embedding-3-small;
- saves the dataframe with the embedding column.
Command:
python src/build_nace_embeddings.py

Output:
data/interim/nace_level4_embeddings.parquet
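Embedding many texts usually means batching requests to the embeddings endpoint. A minimal sketch, where `client` is an openai.OpenAI() instance (constructed with OPENAI_API_KEY from .env) and the batch size is an assumption:

```python
def batched(items, size):
    """Yield successive chunks so each request stays within API limits."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_texts(texts, client, model="text-embedding-3-small", batch_size=100):
    """Embed each text with the OpenAI embeddings endpoint, preserving order."""
    vectors = []
    for batch in batched(texts, batch_size):
        resp = client.embeddings.create(model=model, input=batch)
        vectors.extend(d.embedding for d in resp.data)
    return vectors
```

Passing the client in (rather than constructing it inside) keeps the helper easy to test with a stub.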
Script: src/classify_patents_ateco.py
What it does:
- loads the patent dataset from data/raw/;
- builds the patent text from title and abstract;
- retrieves the top-k most similar NACE classes via cosine similarity on embeddings;
- sends the patent and candidate classes to an OpenAI model;
- validates and saves the final output as CSV.
Command:
python src/classify_patents_ateco.py

Output:
data/processed/patents_ateco_predictions_test.csv when the script is run in test mode;
data/processed/patents_ateco_predictions.csv when it is run on the full dataset.
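The cosine-similarity retrieval step can be sketched with numpy. The function name and the value of k are illustrative; the script's retrieval lives in src/utils/retrieval.py and may differ:

```python
import numpy as np

def top_k_nace(patent_vec, nace_matrix, codes, k=2):
    """Rank NACE classes by cosine similarity between one patent embedding
    and the matrix of class embeddings; return the k best (code, score) pairs."""
    q = np.asarray(patent_vec, dtype=float)
    m = np.asarray(nace_matrix, dtype=float)
    sims = m @ q / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]  # indices of the k highest similarities
    return [(codes[i], float(sims[i])) for i in order]
```

The top-1 and top-2 results here correspond to the top1_code/top2_code and top1_similarity/top2_similarity columns described below.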
At the moment, src/classify_patents_ateco.py is configured with:
MAX_ROWS = 40

This means:
- only the first 40 patents are classified;
- the output is written to data/processed/patents_ateco_predictions_test.csv.

To run the full dataset classification, set:

MAX_ROWS = None

In that case, the output will be written to data/processed/patents_ateco_predictions.csv.
Script: src/test.py

Small utility script to inspect the output CSV and check the total number of rows and duplicates on row_id.
src/utils/config.py: defines project paths and directories.
src/utils/retrieval.py: patent text embedding and top-k NACE retrieval.
src/utils/prompting.py: builds the classification prompt.
src/utils/validation.py: cleans and validates the codes returned by the model.
The final output file includes, among others, the following columns:
id: patent identifier;
title, abstract, year: patent metadata;
primary_code: assigned main NACE code;
secondary_codes: optional secondary codes;
top_k_codes: candidate codes passed to the model;
top1_code, top2_code: top retrieved classes;
top1_similarity, top2_similarity: similarity scores.
classify_patents_ateco.py supports:
- incremental checkpoint saving every few records;
- restart from an existing output CSV;
- final deduplication by id.
This makes it possible to stop and resume a run without losing already written results.