Human Pose Classifier using Visual Transformers (ViT)

This project implements a human pose classifier using Visual Transformers (ViT), covering all required steps from data preprocessing to real-world inference.

The best-performing model achieves high classification metrics:

Metric	Value
Accuracy	0.796
Precision (weighted)	0.801
Recall (weighted)	0.796
F1-Score (weighted)	0.798

The model can be deployed on an Amazon EC2 Instance, and a live prototype is accessible via Streamlit Server: https://cv-human-pose-classifier-vit-aws-c.streamlit.app/

Key Features & Technologies

Data Source: Human action images are obtained from the Bingsu/Human_Action_Recognition dataset, loaded and preprocessed with the Hugging Face library.
Model Architecture: A Vision Transformer (ViT) fine-tuned to classify images into 15 action categories, such as calling, clapping, cycling, dancing, drinking, eating, fighting, hugging, laughing, listening_to_music, running, sitting, sleeping, texting, using_laptop.
AWS Integration: Trained models are automatically uploaded to S3 Bucket using boto3, enabling easy retrieval during deployment.
Deployment: Model can be served on EC2 Instance
Web Interface: Interactive inference via FastAPI backend and Streamlit frontend.

Pipelines Overview

Preprocessing Pipeline (src/preprocessing_pipeline.py)

The preprocessing stage is designed to be fully configurable through the JSON file config/preprocessing_config.json. This allows the dataset loading and splitting process to be easily adapted without modifying the code.

Parameter	Type	Default	Description
`huggingface_dataset_name`	`str`	`"Bingsu/Human_Action_Recognition"`	Name of the dataset on Hugging Face Hub.
`test_size`	`float`	`0.2`	Proportion of the dataset reserved for testing. Must be strictly between 0 and 1.
`validation_size`	`float`	`0.2`	Proportion of the training set reserved for validation. Must be strictly between 0 and 1.
`output_dir`	`str`	`"data"`	Directory where the processed splits will be saved.

The main steps of the data preprocessing pipeline are as follows:

Download the dataset from Hugging Face using the name specified in the configuration
Train/Validation/Test Split:
- First split: dataset is divided into training and testing sets (test_size ratio).
- Second split: training set is further divided into training and validation subsets (validation_size ratio).
- Ensures that all subsets (train, validation, test) contain the same label classes to avoid imbalance issues.
⚠️ Note (August 2025): The official test set of the Bingsu/Human_Action_Recognition dataset is currently erroneous. It only contains a single class ("calling") but with images from all 15 classes. For this reason, the test set in our pipeline is re-sampled from the training split to ensure proper evaluation.
Dataset Description: prints a statistical overview of each split (number of samples per class, total size, etc.).
Save to Disk: store the three subsets in the specified output_dir as train/, val/, and test/.

Training Pipeline (src/training_pipeline.py)

The training process is fully configurable through the JSON config file config/training_config.json, which is loaded using Pydantic schemas. This allows changing model, training parameters, and output directories without touching the core code.

Config Section	Field	Type	Description
directories_config	`input_dir`	`str`	Path to load preprocessed `train/` and `val/` datasets
	`clean_train_dir_before_training`	`bool`	Whether to clean checkpoints before training (default: `True`)
	`train_dir`	`str`	Directory to save training checkpoints
	`training_curve_path`	`str`	Path to save training & validation loss/accuracy curves
	`best_model_path`	`str`	Path to save the best model after training
model_params	`model_name`	`str`	Pretrained model name (e.g., `google/vit-base-patch16-224`)
	`nb_layers_to_freeze`	`int`	Number of ViT encoder layers to freeze (0–12) (default: `None`, i.e., train all encoder layers)
training_config	`enable_gpu`	`bool`	Enable GPU training if available (default: `False`)
	`learning_rate`	`float`	Learning rate for the optimizer
	`batch_size`	`int`	Training and validation batch size
	`num_train_epochs`	`int`	Number of training epochs

The main steps of the train pipeline are as follows:

Load Preprocessed Data
- Train and validation datasets are loaded from the specified input_dir.
- Checks that both sets contain the same label classes.
- Creates a mapping label2id and id2label for consistent training and evaluation.
Model Building
- Loads the pretrained model (ViT) from Hugging Face.
- Optionally freezes the first nb_layers_to_freeze encoder layers.
- Initializes image preprocessing transforms (RandomResizedCrop, ToTensor, Normalize).
- Selects device (CPU or GPU).
Apply Transforms
- Register the transformation steps in both train and validation sets (applied on-the-fly during training).
Training Arguments Setup
- Configures TrainingArguments (optimizer, logging, evaluation & save strategy).
- Sets metric for best model selection (accuracy).
Training Loop
- Cleans previous checkpoints if clean_train_dir_before_training=True.
- Trains the model with Hugging Face Trainer.
Metrics & Curves
- Tracks train loss, validation loss, and validation accuracy.
- Plots training/validation curves and saves them to training_curve_path.
Save the pipeline
- Saves the best model to best_model_path.
- Saves preprocessing transforms (transforms.pkl) for later inference and testing.

Testing Pipeline (src/testing_pipeline.py)

The testing process is fully configurable through the JSON config file config/testing_config.json, which is loaded using Pydantic schemas. This allows changing model pushing conditions without modifying the core code.

Parameter	Type	Description
`input_dir`	`str`	Path to load the preprocessed test dataset
`trained_model_path`	`str`	Path to the trained model to be loaded for testing
`metrics_output_file`	`str`	File path to save evaluation results (Excel)
`push_model_s3.enabled`	`bool`	If `true`, allows pushing the model to an S3 bucket if defined conditions are met
`push_model_s3.conditions`	`list`	List of metric-based conditions that must be satisfied to trigger a model push
`push_model_s3.conditions.metric`	`str`	Name of the metric to check (e.g., `accuracy`, `precision`)
`push_model_s3.conditions.threshold`	`float`	Minimum required value for the metric to allow model upload to S3 Bucket
`push_model_s3.bucket_name`	`str`	Name of the S3 bucket where the model will be uploaded
`push_model_s3.prefix`	`str`	Folder or path prefix in the bucket under which the model will be stored

The main steps of the testing pipeline are as follows:

Load Test Data: loads the preprocessed test dataset from the specified input_dir.
Load Model and Transforms
- Loads the trained ViT model from trained_model_path.
- Loads the preprocessing transforms (transforms.pkl) saved during training.
Predictions
- Calculates predictions on the entire test dataset.
- Collects both predicted and ground truth labels.
Metrics & Reporting
- Computes global metrics (accuracy, precision, recall, F1-score).
- Generates confusion matrix and per-class accuracy.
- Saves all results into the Excel file defined by metrics_output_file. An example output file can be found in data/output folder.
Model Push to S3 (Optional)
- If enabled, checks whether the evaluation metrics meet the configured thresholds.
- If all conditions are satisfied, uploads the model directory to the specified S3 bucket.

Inference using FastAPI / Streamlit Application

Experiments and Performance Analysis

1. Impact of Freezing Layers (Fine-Tuning Strategy)

To identify the best-performing model, the impact of the number of frozen layers during fine-tuning was analyzed. As shown in the performance curve, the highest accuracy is obtained when freezing 2 layers, while keeping the rest of the network trainable. This configuration provides a good trade-off between preserving pretrained features and adapting to the target dataset.

The observed trend shows that freezing too many layers leads to a drop in performance and this is mainly due to the limited adaptability of the model to the target data. Freezing fewer layers increases the model’s capacity to fit the dataset. However, it also leads to longer training time and a higher risk of overfitting (see train/validation curves in figs folder), since the network can more easily memorize dataset-specific patterns instead of learning generalizable features.

2. Performance Analysis (Test Set)

The confusion matrix (data/output/performance_metrics_15_2.xlsx) reveals that most classes are correctly classified, but several activities are frequently confused due to visual similarity and overlapping contexts.

In particular, calling, texting, using laptop, and listening to music show mutual confusion. These actions often involve similar postures (e.g., holding a phone, sitting, minimal body motion), which makes them harder to distinguish visually. Similar confusion is also observed between running and cycling, as both involve fast motion and similar body dynamics in outdoor scenes. Likewise, dancing and fighting are occasionally confused, likely due to similar arm movements and high-energy gestures. To test this behavior, feel free to use the following demo: https://cv-human-pose-classifier-vit-aws-c.streamlit.app/

Overall, the lowest-performing classes in terms of accuracy are:

using_laptop (69.9%)
calling (71.1%)
sitting (71.6%)
texting (72.5%)
listening_to_music (72.9%)

These classes correspond to activities with subtle visual differences, highlighting the difficulty of fine-grained action recognition when motions are small or visually ambiguous.

AWS Services Configuration for Model Deployment

Create a User using AWS IAM Service with the following permissions:
- AmazonEC2FullAccess
- AmazonS3FullAccess
(You can also create a custom policy for more restricted access if needed.)
Generate an Access Key for this IAM user and save the following securely:
- Access Key ID
- Secret Access Key
The S3 Bucket will be created dynamically during the execution of the testing_pipeline.py. If the best found model during training achieves the required score in the testing_config.json, it will be uploaded to this bucket.
⚠️ (Make sure the bucket name you configure is globally unique to avoid conflicts.)
Create an EC2 instance with the following specifications:
- AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.7 (Ubuntu 22.04)
- Instance type: t3.medium
- Key Pair: Use the key pair you downloaded to connect via SSH
- Security Group:
  - Allow inbound rules for ports: 22, 80, 8501, 8502 (all TCP)
- Storage: 120 GiB (gp3)

Run the EC2 instance and interact with it via SSH or any other remote access method to set up the environment:

Create a working directory:
```
mkdir mlops
cd mlops
```

Clone the GitHub repo:

git clone https://github.com/Lahdhirim/CV-human-pose-classifier-ViT-aws.git
cd CV-human-pose-classifier-ViT-aws

Install dependencies:
```
pip install -r requirements.txt
```

Configure the AWS credentials using AWS CLI:

aws configure Press ENTER
AWS Access Key ID: ************
AWS Secret Access Key: ************
Default region name: Press ENTER
Default output format: Press ENTER

Add Streamlit to PATH (for command-line use):

echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

(Optional but recommended) Run the application from the terminal to ensure that everything is working correctly:
```
python3 src/web_app/server.py
```
in a new terminal:
```
streamlit run src/web_app/interface.py
```
Normally, at this stage, if everything works fine, the application is accessible at http://<public IPv4 address>:8501

Automatically launch Streamlit on instance reboot:

Create the startup script:

nano /home/ubuntu/start_streamlit.sh

Paste the following:

  #!/bin/bash
  cd /home/ubuntu/mlops/CV-human-pose-classifier-ViT-aws
  source /home/ubuntu/.bashrc
  nohup /usr/bin/python3 src/web_app/server.py >> /home/ubuntu/fastapi.log 2>&1 &
  sleep 20
  nohup /home/ubuntu/.local/bin/streamlit run src/web_app/interface.py --server.port 8501 >> /home/ubuntu/streamlit.log 2>&1 &

Make the script executable:

chmod +x /home/ubuntu/start_streamlit.sh

Add the script to crontab for reboot:
```
crontab -e
```
Add the following line at the end of the file:
```
@reboot /home/ubuntu/start_streamlit.sh
```

Each time the instance is rebooted, Streamlit will automatically launch the web application at the address http://<public IPv4 address>:8501. Two log files named fastapi.log and streamlit.log will be created in the /home/ubuntu directory. These files can be used to monitor the application’s status and debug any errors.
The application will be publicly accessible to anyone with the instance’s public IP address. Access can be controlled via the EC2 Security Group:

To allow access from any IP address, set the Source to 0.0.0.0/0 on TCP port 8501.
⚠️ Use 0.0.0.0/0 only if you're aware of the security implications. For more restricted access, specify your own IP or a limited range.

Alternative Model Deployment (Docker-Based)

After training phase, the model can be deployed easily using Docker which removes the need for manual environment setup and complex startup scripts. This is the recommended deployment method for both local and cloud environments:

Clone the GitHub repo:

git clone https://github.com/Lahdhirim/CV-human-pose-classifier-ViT-aws.git
cd CV-human-pose-classifier-ViT-aws

Configure AWS Credentials (Required to download the trained model from S3)
1. Install AWS CLI
2. Configure your credentials using aws configure
3. Verify the configuration using cat ~/.aws/credentials
Build and run the containers:
```
docker compose up --build -d
```
Access the application:
- Streamlit UI: http://<PUBLIC_IP>:8501
- FastAPI docs: http://<PUBLIC_IP>:8502/docs

Running the Pipelines

There are four main execution modes, recommended in the following order:

Preprocess the Data (mandatory to collect the data)

python main.py preprocess_data

Run the Training Pipeline (mandatory to find the best model)

python main.py train

Run the Testing Pipeline (mandatory to push the best model to S3 Bucket)

python main.py test

Run the Web Application

python main.py inference

Human Pose Classification & Beyond

This classifier is designed not only for human action recognition but also as a flexible image classification framework. Thanks to its modular pipeline and configuration-driven design:

Configurable Pipelines: Each stage of the workflow (preprocessing, training, testing and inference) is controlled by its own configuration file (preprocessing_config.json, training_config.json, testing_config.json, inference_config.json).
No Code Changes Required: You can adapt the application to new datasets or classification tasks simply by updating the configuration files.
End-to-End Workflow: From data preprocessing to model training, evaluation, and deployment, all steps are fully automated and modular.
Rapid Deployment: The same FastAPI + Streamlit interface can serve any trained model without modification, making it suitable for a wide range of computer vision tasks.

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
assets		assets
config		config
data		data
figs		figs
src		src
.dockerignore		.dockerignore
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile.fastapi		Dockerfile.fastapi
Dockerfile.streamlit		Dockerfile.streamlit
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
main.py		main.py
requirements.txt		requirements.txt
streamlit_deployment.py		streamlit_deployment.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Human Pose Classifier using Visual Transformers (ViT)

Key Features & Technologies

Pipelines Overview

Preprocessing Pipeline (src/preprocessing_pipeline.py)

Training Pipeline (src/training_pipeline.py)

Testing Pipeline (src/testing_pipeline.py)

Inference using FastAPI / Streamlit Application

Experiments and Performance Analysis

1. Impact of Freezing Layers (Fine-Tuning Strategy)

2. Performance Analysis (Test Set)

AWS Services Configuration for Model Deployment

Alternative Model Deployment (Docker-Based)

Running the Pipelines

Preprocess the Data (mandatory to collect the data)

Run the Training Pipeline (mandatory to find the best model)

Run the Testing Pipeline (mandatory to push the best model to S3 Bucket)

Run the Web Application

Human Pose Classification & Beyond

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Human Pose Classifier using Visual Transformers (ViT)

Key Features & Technologies

Pipelines Overview

Preprocessing Pipeline (src/preprocessing_pipeline.py)

Training Pipeline (src/training_pipeline.py)

Testing Pipeline (src/testing_pipeline.py)

Inference using FastAPI / Streamlit Application

Experiments and Performance Analysis

1. Impact of Freezing Layers (Fine-Tuning Strategy)

2. Performance Analysis (Test Set)

AWS Services Configuration for Model Deployment

Alternative Model Deployment (Docker-Based)

Running the Pipelines

Preprocess the Data (mandatory to collect the data)

Run the Training Pipeline (mandatory to find the best model)

Run the Testing Pipeline (mandatory to push the best model to S3 Bucket)

Run the Web Application

Human Pose Classification & Beyond

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages