🧾 Data Cleaning Analyst Environment (OpenEnv)

title	Data Cleaning Agent
emoji	🧹
colorFrom	blue
colorTo	green
sdk	docker
app_file	inference.py
pinned	true

Overview

This project implements an OpenEnv environment that simulates a real-world task: a data analyst cleaning messy, inconsistent datasets.

Data cleaning is one of the most time-consuming and important steps in real-world data workflows. This environment allows AI agents to learn and be evaluated on:

Handling missing values
Standardizing inconsistent formats
Converting incorrect data types
Removing duplicates

Why this matters

In real-world data science pipelines:

By commonly cited estimates, up to 80% of an analyst's time is spent cleaning data

This environment models that workflow in a structured, testable way, which makes it useful for training and evaluating AI agents. The agent implemented here handles routine but necessary cleaning tasks, and the set of supported operations can be extended in future work.

Environment Design

Observation Space

{
  "dataset": [...],
  "step_count": int,
  "remaining_errors": int
}
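For readers who want a typed view of this observation, here is a minimal sketch in Python. The `Observation` class, the `is_done` helper, and the assumption that each dataset row is a dict are illustrative and not taken from the repository; only the three field names come from the schema above.

```python
from typing import Any, TypedDict


class Observation(TypedDict):
    """Typed mirror of the observation schema (hypothetical helper type)."""

    dataset: list[dict[str, Any]]  # assumed: one dict per row
    step_count: int
    remaining_errors: int


def is_done(obs: Observation) -> bool:
    """Assumed convention: the episode is finished when no errors remain."""
    return obs["remaining_errors"] == 0


example: Observation = {
    "dataset": [{"name": "JOHN    DOE", "age": "thirty", "date": "03-12-24"}],
    "step_count": 0,
    "remaining_errors": 3,
}
```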

Action Space

{
  "action_type": "fill_missing | standardize_name | convert_type | fix_date_format | remove_duplicates",
  "column": "optional"
}
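A small helper can guard against malformed actions before they are sent to the environment. The sketch below is illustrative: `make_action` and `ACTION_TYPES` are hypothetical names, and the only facts taken from the schema above are the five action types and the optional `column` field.

```python
from typing import Optional

# The five action types listed in the action space.
ACTION_TYPES = {
    "fill_missing",
    "standardize_name",
    "convert_type",
    "fix_date_format",
    "remove_duplicates",
}


def make_action(action_type: str, column: Optional[str] = None) -> dict:
    """Build an action payload, rejecting unknown action types."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    action = {"action_type": action_type}
    # "column" is optional, so it is only included when provided.
    if column is not None:
        action["column"] = column
    return action
```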

Reward Function

The reward is dense and progressive, encouraging efficient and correct cleaning:

+0.15 per error fixed
+0.05 efficiency bonus
-0.05 for ineffective actions
-0.25 for worsening dataset
+0.5 completion bonus
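The list above gives the reward terms but not how they combine, so the following is only one plausible reading. It assumes the reward is computed from the error count before and after a step, and that the efficiency bonus applies to productive steps taken within a step budget; `step_reward`, `step_budget`, and that bonus condition are assumptions, not the repository's implementation.

```python
PER_ERROR_FIXED = 0.15
EFFICIENCY_BONUS = 0.05
INEFFECTIVE_PENALTY = -0.05
WORSENING_PENALTY = -0.25
COMPLETION_BONUS = 0.5


def step_reward(errors_before: int, errors_after: int,
                step_count: int, step_budget: int) -> float:
    """Combine the listed reward terms for a single step (assumed rules)."""
    fixed = errors_before - errors_after
    if fixed < 0:
        # The action introduced new errors.
        return WORSENING_PENALTY
    if fixed == 0:
        # The action changed nothing that counts as an error.
        return INEFFECTIVE_PENALTY
    reward = PER_ERROR_FIXED * fixed
    if step_count <= step_budget:  # assumed condition for the efficiency bonus
        reward += EFFICIENCY_BONUS
    if errors_after == 0:
        reward += COMPLETION_BONUS
    return round(reward, 2)
```

Under these assumptions, a step that fixes two of three errors within budget earns 0.35, and a step that fixes the last remaining error within budget earns 0.7.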

Tasks

Easy — Missing Values

Single error type
Objective: fill missing values

Medium — Mixed Issues

Missing values + inconsistent formats
Requires multi-step reasoning

Hard — Noisy Data

Duplicates
Incorrect types
Multiple date formats
Extra spaces and inconsistencies

Grading System

Evaluation is deterministic and returns a score between 0.0 and 1.0.

Field-level accuracy scoring
Partial credit for partially correct rows
Exact match yields full score
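Field-level scoring with partial credit can be sketched as the fraction of expected fields that the cleaned dataset gets right. The `grade` function below is a hypothetical illustration rather than the code in `graders`; it assumes rows are compared by position, that missing rows count as entirely wrong, and that extra rows are ignored.

```python
from typing import Any


def grade(cleaned: list[dict[str, Any]], expected: list[dict[str, Any]]) -> float:
    """Return the share of expected fields matched exactly, in [0.0, 1.0]."""
    total = 0
    correct = 0
    for index, want in enumerate(expected):
        # A row missing from the cleaned data contributes no correct fields.
        got = cleaned[index] if index < len(cleaned) else {}
        for key, value in want.items():
            total += 1
            if key in got and got[key] == value:
                correct += 1
    # With nothing to check, treat the result as a full score.
    return correct / total if total else 1.0
```

For example, a row with a correct name and date but an unconverted age scores 2/3, while an exact match scores 1.0.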

Baseline Agent

A hybrid baseline agent is provided:

Uses rule-based heuristics for stability
Falls back to an LLM for generalization

This is intended to keep baseline scores reproducible and meaningful.
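The hybrid pattern can be sketched as "try the rules first, call the model only when no rule fires". The two rules and all function names below are hypothetical and do not reflect the logic in `inference.py`; the LLM is represented by a plain callable so the sketch runs without any model.

```python
from typing import Any, Callable, Optional

Action = dict[str, Any]


def rule_based_action(dataset: list[dict[str, Any]]) -> Optional[Action]:
    """Return an action if a simple rule applies, otherwise None."""
    # Rule 1 (assumed): any empty cell triggers fill_missing on that column.
    for row in dataset:
        for column, value in row.items():
            if value is None or value == "":
                return {"action_type": "fill_missing", "column": column}
    # Rule 2 (assumed): any exact duplicate row triggers remove_duplicates.
    seen = set()
    for row in dataset:
        key = tuple(sorted((k, repr(v)) for k, v in row.items()))
        if key in seen:
            return {"action_type": "remove_duplicates"}
        seen.add(key)
    return None


def choose_action(observation: dict[str, Any],
                  llm_fallback: Callable[[dict[str, Any]], Action]) -> Action:
    """Prefer deterministic rules; defer to the LLM only when none match."""
    action = rule_based_action(observation["dataset"])
    return action if action is not None else llm_fallback(observation)
```

Keeping the rules ahead of the model means the common cases produce the same action on every run, and only the leftover cases depend on model output.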

API Endpoints

Endpoint	Description
`/reset`	Initialize environment
`/step`	Apply action
`/state`	Get current state
`/tasks`	List tasks
`/grader`	Get final score
`/baseline`	Run baseline agent
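A client for these endpoints might look like the sketch below, using only the Python standard library. The HTTP methods (POST for `/reset` and `/step`, GET for the rest), the request bodies, and the response shapes are assumptions, since the table above does not specify them; the base URL is uvicorn's default host and port.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "http://127.0.0.1:8000"  # uvicorn's default; adjust for your deployment


def build_request(path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """POST JSON when a payload is given, otherwise GET (assumed convention)."""
    url = BASE_URL + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def call(path: str, payload: Optional[dict] = None) -> Any:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(path, payload)) as response:
        return json.load(response)


def run_one_step() -> Any:
    """Reset, apply one action, and fetch the score (requires a running server)."""
    call("/reset", {})  # hypothetical body; the real endpoint may expect a task id
    call("/step", {"action_type": "fill_missing", "column": "age"})
    return call("/grader")
```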

Setup Instructions

1. Install dependencies

pip install -r requirements.txt

2. Run locally

uvicorn api.app:app --reload

3. Run baseline

python inference.py

Example (Before → After)

Input

{"name": "JOHN    DOE", "age": "thirty", "date": "03-12-24"}

Output

{"name": "John Doe", "age": 30, "date": "2024-03-12"}
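The transformation above can be reproduced with a few lines of standard-library Python. This is a sketch of the idea, not the environment's code: the helper names are hypothetical, the number-word lookup covers only a handful of values, and the input date is assumed to be in `MM-DD-YY` order, which is what the example output implies.

```python
from datetime import datetime

# Tiny illustrative lookup; a real cleaner would need a fuller mapping.
NUMBER_WORDS = {"twenty": 20, "thirty": 30, "forty": 40}


def clean_name(raw: str) -> str:
    """Collapse repeated whitespace and apply title case."""
    return " ".join(raw.split()).title()


def clean_age(raw) -> int:
    """Convert digit strings or known number words to an int."""
    if isinstance(raw, int):
        return raw
    text = str(raw).strip().lower()
    if text.isdigit():
        return int(text)
    return NUMBER_WORDS[text]  # raises KeyError for words outside the lookup


def clean_date(raw: str) -> str:
    """Parse an assumed MM-DD-YY date and return it in ISO format."""
    return datetime.strptime(raw, "%m-%d-%y").date().isoformat()


def clean_record(record: dict) -> dict:
    return {
        "name": clean_name(record["name"]),
        "age": clean_age(record["age"]),
        "date": clean_date(record["date"]),
    }


cleaned = clean_record({"name": "JOHN    DOE", "age": "thirty", "date": "03-12-24"})
print(cleaned)  # {'name': 'John Doe', 'age': 30, 'date': '2024-03-12'}
```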

Key Highlights

Deterministic grading
Dense reward shaping
Multi-step reasoning environment
Fully OpenEnv compliant
Dockerized + deployable

Deployment

This environment is containerized and deployable on Hugging Face Spaces using Docker.

Team

This project was built for the Meta PyTorch OpenEnv Hackathon x Scaler School of Technology.
Created by Team Bug Smashers: Rian, Amogh, Elveena
