
patents-ateco

Repository for classifying a set of patents according to the NACE Rev. 2.1 / ATECO classification. The pipeline combines:

  • downloading and filtering the raw patent dataset from Hugging Face;
  • building a stratified patent sample;
  • preprocessing the NACE taxonomy;
  • generating embeddings for the NACE classes;
  • semantically retrieving candidate classes;
  • assigning the final classification with OpenAI models.

Quick Start

python -m venv .venv
.venv\Scripts\activate      # Windows; on macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt
python src/patents_download_filtered.py
python src/patents_build_sample.py
python src/preprocess_nace.py
python src/build_nace_embeddings.py
python src/classify_patents_ateco.py

For production runs, make sure that src/classify_patents_ateco.py is configured with:

MAX_ROWS = None

Repository Structure

patents-ateco/
|-- data/
|   |-- raw/          # raw inputs and filtered patent datasets
|   |-- interim/      # intermediate files generated by the scripts
|   `-- processed/    # final classification outputs
|-- resources/
|   `-- classification/
|       `-- NACE_Rev2_1_Structure_Explanatory_Notes.xlsx
|-- src/
|   |-- utils/
|   |   |-- config.py
|   |   |-- prompting.py
|   |   |-- retrieval.py
|   |   `-- validation.py
|   |-- patents_download_filtered.py
|   |-- patents_build_sample.py
|   |-- preprocess_nace.py
|   |-- build_nace_embeddings.py
|   |-- classify_patents_ateco.py
|   `-- test.py
|-- requirements.txt
`-- .env

Main Inputs and Outputs

Inputs:

  • resources/classification/NACE_Rev2_1_Structure_Explanatory_Notes.xlsx: source NACE classification file.

Generated raw datasets:

  • data/raw/patents_filtered.parquet: filtered patent dataset downloaded from Hugging Face.
  • data/raw/patents_sample_20pct_stratified_by_year.csv: stratified 20% patent sample built from the filtered dataset.

Generated intermediate and final files:

  • data/interim/nace_level4_preprocessed.parquet: preprocessed level-4 NACE dataset.
  • data/interim/nace_level4_embeddings.parquet: NACE dataset enriched with embeddings.
  • data/processed/patents_ateco_predictions_test.csv: classification output in test mode.
  • data/processed/patents_ateco_predictions.csv: classification output for the full run.

Requirements

The project requires Python 3.10+ and the libraries used by the scripts, including:

  • pandas
  • numpy
  • python-dotenv
  • openai
  • pyarrow
  • openpyxl
  • datasets

Environment setup:

python -m venv .venv
.venv\Scripts\activate      # Windows; on macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt

The .env file must include at least:

OPENAI_API_KEY=your_api_key_here

To run the raw patent download step, you also need:

HF_TOKEN=your_huggingface_token_here

Workflow

The main scripts should be executed in the following order.

1. Download and Filter Raw Patent Dataset

Script: src/patents_download_filtered.py

What it does:

  • reads the istat-ai/ai-patents dataset from Hugging Face;
  • filters records based on dataset quality constraints;
  • excludes records with ita_only != 0;
  • removes records without an abstract;
  • removes records with abstracts shorter than the configured threshold;
  • derives the year from priority date or grant date;
  • saves the filtered dataset into data/raw/.

Command:

python src/patents_download_filtered.py

Output:

  • data/raw/patents_filtered.parquet
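The filtering rules above can be sketched with pandas on a toy frame. Column names and the length threshold are assumptions, not taken from the script, and the real script also handles the Hugging Face download:

```python
import pandas as pd

MIN_ABSTRACT_LEN = 20  # assumed threshold, not the script's actual value

df = pd.DataFrame({
    "ita_only": [0, 1, 0, 0],
    "abstract": ["A sufficiently long abstract here.", "dropped", None, "too short"],
    "priority_date": ["2019-03-01", None, "2020-05-02", "2021-01-01"],
    "grant_date": ["2019-09-01", "2018-01-01", "2020-11-02", "2021-06-01"],
})

# Apply the documented filters: ita_only == 0, abstract present and long enough
filtered = df[df["ita_only"] == 0]
filtered = filtered[filtered["abstract"].notna()]
filtered = filtered[filtered["abstract"].str.len() >= MIN_ABSTRACT_LEN]

# Derive the year from the priority date, falling back to the grant date
filtered = filtered.assign(
    year=pd.to_datetime(
        filtered["priority_date"].fillna(filtered["grant_date"])
    ).dt.year
)
```

Only the first toy row survives all three filters and gets year 2019.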

2. Build Patent Sample

Script: src/patents_build_sample.py

What it does:

  • loads the filtered raw dataset from data/raw/patents_filtered.parquet;
  • computes the number of available patents per year;
  • builds a 20% stratified sample by year;
  • writes the sample dataset used by the classification pipeline.

Command:

python src/patents_build_sample.py

Output:

  • data/raw/patents_sample_20pct_stratified_by_year.csv

Note:

  • this step depends on data/raw/patents_filtered.parquet;
  • it can be rerun independently if you want to rebuild the sample starting from the filtered raw dataset.
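The stratified draw can be sketched with a pandas GroupBy.sample on the year column (the script's actual column names and random seed are unknown; the seed below just makes this sketch reproducible):

```python
import pandas as pd

df = pd.DataFrame({"id": range(10), "year": [2019] * 5 + [2020] * 5})

# 20% sample within each year stratum
sample = df.groupby("year").sample(frac=0.2, random_state=42)
```

With five patents per year, this draws one patent from each stratum.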

3. Preprocess NACE

Script: src/preprocess_nace.py

What it does:

  • reads the NACE Excel file from resources/classification/;
  • filters level-4 classes;
  • builds a semantic text field with code, division, title, and explanatory notes;
  • saves the result as parquet.

Command:

python src/preprocess_nace.py

Output:

  • data/interim/nace_level4_preprocessed.parquet
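The semantic text field can be sketched as a concatenation of the class attributes. Column names, the separator, and the example class are illustrative, not taken from the script:

```python
import pandas as pd

nace = pd.DataFrame({
    "code": ["62.10"],
    "division": ["Computer programming, consultancy and related activities"],
    "title": ["Computer programming activities"],
    "notes": ["Includes writing, modifying and testing of software."],
})

# One searchable text per level-4 class: code, division, title, notes
nace["semantic_text"] = (
    nace["code"] + " | " + nace["division"] + " | "
    + nace["title"] + " | " + nace["notes"]
)
```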

4. Build NACE Embeddings

Script: src/build_nace_embeddings.py

What it does:

  • loads the preprocessed NACE dataset;
  • generates embeddings for the text field using text-embedding-3-small;
  • saves the dataframe with the embedding column.

Command:

python src/build_nace_embeddings.py

Output:

  • data/interim/nace_level4_embeddings.parquet
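Batched embedding generation might look like the function below. It assumes an OpenAI client is constructed elsewhere and passed in; only the model name comes from the README, the batch size and function signature are assumptions:

```python
def embed_texts(client, texts, model="text-embedding-3-small", batch_size=100):
    """Return one embedding vector per input text, batching the API calls."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return vectors
```

The resulting list of vectors can then be stored as an extra column of the NACE dataframe before writing the parquet file.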

5. Classify Patents ATECO

Script: src/classify_patents_ateco.py

What it does:

  • loads the patent dataset from data/raw/;
  • builds the patent text from title and abstract;
  • retrieves the top-k most similar NACE classes via cosine similarity on embeddings;
  • sends the patent and candidate classes to an OpenAI model;
  • validates and saves the final output as CSV.

Command:

python src/classify_patents_ateco.py

Output:

  • data/processed/patents_ateco_predictions_test.csv when the script is run in test mode;
  • data/processed/patents_ateco_predictions.csv when it is run on the full dataset.
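The retrieval step amounts to cosine similarity between one patent embedding and the matrix of NACE class embeddings. A minimal sketch (the actual interface of src/utils/retrieval.py is not shown in the README):

```python
import numpy as np

def top_k_codes(patent_vec, class_matrix, codes, k=2):
    """Return the k most similar NACE codes with their cosine similarities."""
    p = patent_vec / np.linalg.norm(patent_vec)
    m = class_matrix / np.linalg.norm(class_matrix, axis=1, keepdims=True)
    sims = m @ p                       # cosine similarity per class
    order = np.argsort(sims)[::-1][:k] # indices of the top-k classes
    return [(codes[i], float(sims[i])) for i in order]
```

The returned codes are the candidates placed in the prompt, and the scores populate the top1_similarity / top2_similarity output columns.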

Test vs Production

At the moment, src/classify_patents_ateco.py is configured with:

MAX_ROWS = 40

This means:

  • only the first 40 patents are classified;
  • the output is written to data/processed/patents_ateco_predictions_test.csv.

To run the full-dataset (production) classification, set:

MAX_ROWS = None

In that case, the output is written to data/processed/patents_ateco_predictions.csv.

Supporting Scripts

src/test.py

Small utility script to inspect the output CSV and check the total number of rows and duplicates on row_id.
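The check is essentially the following, shown here on an in-memory frame (the real script reads the output CSV from data/processed/):

```python
import pandas as pd

df = pd.DataFrame({"row_id": [1, 2, 2, 3]})  # stand-in for the output CSV
total_rows = len(df)
duplicate_rows = int(df["row_id"].duplicated().sum())
print(total_rows, duplicate_rows)
```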

Utility Modules

  • src/utils/config.py: defines project paths and directories.
  • src/utils/retrieval.py: patent text embedding and top-k NACE retrieval.
  • src/utils/prompting.py: builds the classification prompt.
  • src/utils/validation.py: cleans and validates the codes returned by the model.
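A hypothetical sketch of what validation.py might do: keep a returned code only if it matches the dd.dd NACE pattern and belongs to the candidate set (the real logic may differ):

```python
import re

def clean_code(raw, valid_codes):
    """Extract a dd.dd NACE code from the model's raw answer, or return None."""
    match = re.search(r"\b\d{2}\.\d{2}\b", str(raw))
    if match and match.group(0) in valid_codes:
        return match.group(0)
    return None
```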

Classification Output

The final output file includes, among others, the following columns:

  • id: patent identifier;
  • title, abstract, year: patent metadata;
  • primary_code: assigned main NACE code;
  • secondary_codes: optional secondary codes;
  • top_k_codes: candidate codes passed to the model;
  • top1_code, top2_code: top retrieved classes;
  • top1_similarity, top2_similarity: similarity scores.

Resume and Checkpoints

classify_patents_ateco.py supports:

  • incremental checkpoint saving every few records;
  • restart from an existing output CSV;
  • final deduplication by id.

This makes it possible to stop and resume a run without losing already written results.
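The resume pattern can be sketched as: load the partial output, skip already-classified ids, append the new rows, and deduplicate. This is a sketch; the script's checkpoint interval and exact logic are not documented in the README:

```python
import pandas as pd

existing = pd.DataFrame({"id": [1, 2], "primary_code": ["62.10", "26.11"]})
todo = pd.DataFrame({"id": [1, 2, 3]})

# Restart: only classify ids not already present in the checkpoint output
remaining = todo[~todo["id"].isin(existing["id"])]

# After the run: deduplicate by id, keeping the first written row
new_rows = pd.DataFrame({"id": [3], "primary_code": ["72.19"]})
final = pd.concat([existing, new_rows]).drop_duplicates("id", keep="first")
```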
