Repository for classifying a set of patents according to the NACE Rev. 2.1 / ATECO classification, combining:
- filtered raw patent dataset download from Hugging Face;
- stratified patent sample creation;
- NACE taxonomy preprocessing;
- NACE class embedding generation;
- semantic retrieval of candidate classes;
- final patent classification through OpenAI models.
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
python src/patents_download_filtered.py
python src/patents_build_sample.py
python src/preprocess_nace.py
python src/build_nace_embeddings.py
python src/classify_patents_ateco.py

For production runs, make sure that src/classify_patents_ateco.py is configured with:

MAX_ROWS = None

Project structure:

patents-ateco/
|-- data/
| |-- raw/ # raw inputs and filtered patent datasets
| |-- interim/ # intermediate files generated by the scripts
| `-- processed/ # final classification outputs
|-- resources/
| `-- classification/
| `-- NACE_Rev2_1_Structure_Explanatory_Notes.xlsx
|-- src/
| |-- utils/
| | |-- config.py
| | |-- prompting.py
| | |-- retrieval.py
| | `-- validation.py
| |-- patents_download_filtered.py
| |-- patents_build_sample.py
| |-- preprocess_nace.py
| |-- build_nace_embeddings.py
| |-- classify_patents_ateco.py
| `-- test.py
|-- requirements.txt
`-- .env
Inputs:
resources/classification/NACE_Rev2_1_Structure_Explanatory_Notes.xlsx: source NACE classification file.
Generated raw datasets:
data/raw/patents_filtered.parquet: filtered patent dataset downloaded from Hugging Face.
data/raw/patents_sample_20pct_stratified_by_year.csv: stratified 20% patent sample built from the filtered dataset.
Generated intermediate and final files:
data/interim/nace_level4_preprocessed.parquet: preprocessed level-4 NACE dataset.
data/interim/nace_level4_embeddings.parquet: NACE dataset enriched with embeddings.
data/processed/patents_ateco_predictions_test.csv: classification output in test mode.
data/processed/patents_ateco_predictions.csv: classification output for the full run.
The project requires Python 3.10+ and the libraries used by the scripts, including:
- pandas
- numpy
- python-dotenv
- openai
- pyarrow
- openpyxl
- datasets
Environment setup:
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

The .env file must include at least:

OPENAI_API_KEY=your_api_key_here

To run the raw patent download step, you also need:

HF_TOKEN=your_huggingface_token_here

The main scripts should be executed in the following order.
Script: src/patents_download_filtered.py
What it does:
- reads the istat-ai/ai-patents dataset from Hugging Face;
- filters records based on dataset quality constraints;
- excludes records with ita_only != 0;
- removes records without an abstract;
- removes records with abstracts shorter than the configured threshold;
- derives the year from priority date or grant date;
- saves the filtered dataset into data/raw/.
Command:
python src/patents_download_filtered.py

Output:
data/raw/patents_filtered.parquet
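The filtering rules above can be sketched with pandas. The column names (ita_only, abstract, priority date, grant date) come from the description; the length threshold and the date parsing are assumptions, and the actual script may differ:

```python
import pandas as pd

MIN_ABSTRACT_LEN = 100  # assumed threshold; the script's configured value may differ

def filter_patents(df: pd.DataFrame, min_len: int = MIN_ABSTRACT_LEN) -> pd.DataFrame:
    """Apply the quality filters described above to a raw patent frame."""
    out = df[df["ita_only"] == 0]                    # exclude records with ita_only != 0
    out = out[out["abstract"].notna()]               # require an abstract
    out = out[out["abstract"].str.len() >= min_len]  # drop abstracts below the threshold
    # derive the year from priority date, falling back to grant date
    dates = out["priority date"].fillna(out["grant date"])
    return out.assign(year=pd.to_datetime(dates, errors="coerce").dt.year)
```

The fallback order (priority date first, then grant date) mirrors the bullet list above but is not guaranteed to match the script.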
Script: src/patents_build_sample.py
What it does:
- loads the filtered raw dataset from data/raw/patents_filtered.parquet;
- computes the number of available patents per year;
- builds a 20% stratified sample by year;
- writes the sample dataset used by the classification pipeline.
Command:
python src/patents_build_sample.py

Output:
data/raw/patents_sample_20pct_stratified_by_year.csv
Note:
- this step depends on data/raw/patents_filtered.parquet;
- it can be rerun independently if you want to rebuild the sample from the filtered raw dataset.
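Year-stratified sampling like the step above is typically a per-group draw; a minimal pandas sketch (the random seed is an assumption, not the script's actual value):

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, frac: float = 0.2, seed: int = 42) -> pd.DataFrame:
    """Draw `frac` of the patents within each year, so every year keeps
    the same proportion it has in the filtered dataset."""
    return (
        df.groupby("year", group_keys=False)
          .sample(frac=frac, random_state=seed)
          .reset_index(drop=True)
    )
```

Fixing the seed makes the sample reproducible across reruns.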
Script: src/preprocess_nace.py
What it does:
- reads the NACE Excel file from resources/classification/;
- filters level-4 classes;
- builds a semantic text field with code, division, title, and explanatory notes;
- saves the result as parquet.
Command:
python src/preprocess_nace.py

Output:
data/interim/nace_level4_preprocessed.parquet
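The semantic text field concatenates the fields listed above into one string per class. The field order, prefix, and separator below are assumptions for illustration; the script may format the text differently:

```python
def build_nace_text(code: str, division: str, title: str, notes: str) -> str:
    """Concatenate code, division, title, and explanatory notes into the
    single text used for embedding. Empty fields are skipped."""
    parts = [f"NACE {code}", division, title, notes]
    return " | ".join(p.strip() for p in parts if p and p.strip())
```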
Script: src/build_nace_embeddings.py
What it does:
- loads the preprocessed NACE dataset;
- generates embeddings for the text field using text-embedding-3-small;
- saves the dataframe with the embedding column.
Command:
python src/build_nace_embeddings.py

Output:
data/interim/nace_level4_embeddings.parquet
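Embedding many texts usually means batching requests to the embeddings endpoint. A minimal sketch, where `client` is an openai.OpenAI() instance (constructed with OPENAI_API_KEY from .env) and the batch size is an assumption:

```python
def batched(items, size):
    """Yield successive chunks so each request stays within API limits."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_texts(texts, client, model="text-embedding-3-small", batch_size=100):
    """Embed each text with the OpenAI embeddings endpoint, preserving order."""
    vectors = []
    for batch in batched(texts, batch_size):
        resp = client.embeddings.create(model=model, input=batch)
        vectors.extend(d.embedding for d in resp.data)
    return vectors
```

Passing the client in (rather than constructing it inside) keeps the helper easy to test with a stub.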
Script: src/classify_patents_ateco.py
What it does:
- loads the patent dataset from data/raw/;
- builds the patent text from title and abstract;
- retrieves the top-k most similar NACE classes via cosine similarity on embeddings;
- sends the patent and candidate classes to an OpenAI model;
- validates and saves the final output as CSV.
Command:
python src/classify_patents_ateco.py

Output:
data/processed/patents_ateco_predictions_test.csv when the script is run in test mode;
data/processed/patents_ateco_predictions.csv when it is run on the full dataset.
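The cosine-similarity retrieval step can be sketched with numpy. The function name and the value of k are illustrative; the script's retrieval lives in src/utils/retrieval.py and may differ:

```python
import numpy as np

def top_k_nace(patent_vec, nace_matrix, codes, k=2):
    """Rank NACE classes by cosine similarity between one patent embedding
    and the matrix of class embeddings; return the k best (code, score) pairs."""
    q = np.asarray(patent_vec, dtype=float)
    m = np.asarray(nace_matrix, dtype=float)
    sims = m @ q / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]  # indices of the k highest similarities
    return [(codes[i], float(sims[i])) for i in order]
```

The top-1 and top-2 results here correspond to the top1_code/top2_code and top1_similarity/top2_similarity columns described below.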
At the moment, src/classify_patents_ateco.py is configured with:
MAX_ROWS = 40

This means:
- only the first 40 patents are classified;
- the output is written to data/processed/patents_ateco_predictions_test.csv.

To run the full dataset classification, set:

MAX_ROWS = None

In that case, the output will be written to data/processed/patents_ateco_predictions.csv.
Script: src/test.py

Small utility script to inspect the output CSV and check the total number of rows and duplicates on row_id.
src/utils/config.py: defines project paths and directories.
src/utils/retrieval.py: patent text embedding and top-k NACE retrieval.
src/utils/prompting.py: builds the classification prompt.
src/utils/validation.py: cleans and validates the codes returned by the model.
The final output file includes, among others, the following columns:
id: patent identifier;
title, abstract, year: patent metadata;
primary_code: assigned main NACE code;
secondary_codes: optional secondary codes;
top_k_codes: candidate codes passed to the model;
top1_code, top2_code: top retrieved classes;
top1_similarity, top2_similarity: similarity scores.
classify_patents_ateco.py supports:
- incremental checkpoint saving every few records;
- restart from an existing output CSV;
- final deduplication by id.
This makes it possible to stop and resume a run without losing already written results.