---
title: Adaptive Tutor Env
emoji: 🌖
colorFrom: blue
colorTo: pink
sdk: docker
pinned: false
license: mit
short_description: Personalized AI tutoring RL Environment
---
An OpenEnv-compliant reinforcement learning environment for personalized AI tutoring — simulating real-world EdTech dynamics with psychometric student modeling and multi-objective pedagogical optimization.
- Problem Statement
- Abstract
- Why This Matters
- Environment Overview
- The Student Model
- Action Space
- Observation Space
- Reward Function
- Tasks & Grading
- API Reference
- Setup & Usage
- Baseline Performance
- Architecture
- Design Philosophy
Personalized tutoring is one of the most effective pedagogical interventions known — studies show one-on-one tutoring yields two standard deviations of improvement over classroom instruction (Bloom, 1984). Yet human tutors scale poorly: globally, 500 million students lack access to qualified tutors, and the global EdTech market — projected to exceed $400 billion by 2027 — is powered largely by static, one-size-fits-none content sequencing.
The core challenge is sequential instructional decision-making: a tutor must, at each interaction, decide what to teach, how to teach it, and when to revisit it — all while tracking a student's evolving knowledge state, engagement, and fatigue. This is a genuinely hard sequential decision problem with real-world stakes.
Existing RL environments in education are typically:
- Toy problems with no real student modeling
- Static curricula that don't adapt to individual learners
- Game-based (e.g., MathWorld, PEARL) with no transfer to real pedagogy
There is no open-source, OpenEnv-compliant RL environment that models real educational dynamics at this fidelity.
AdaptiveTutor-Env fills this gap. It is a full-featured, OpenEnv-compliant reinforcement learning environment for personalized AI tutoring, submitted to the Meta PyTorch OpenEnv Hackathon × SST.
The environment simulates a personalized AI tutor interacting with a single student across 5 academic subjects and 25 topics over a bounded episode. Unlike toy environments, AdaptiveTutor-Env embeds:
- Item Response Theory (IRT) 2PL — the gold standard of educational measurement — to govern learning transitions and model student ability as a continuous latent variable.
- Ebbinghaus exponential forgetting — real knowledge decays between visits — forcing agents to schedule proactive spaced repetition.
- Dynamic engagement and fatigue — student state evolves with teaching choices, creating genuine multi-objective optimization.
- A composite dense reward function rewarding learning gains, engagement maintenance, subject balance, and retention simultaneously.
Three graded tasks span easy → medium → hard difficulty, with deterministic programmatic graders scoring in [0.0, 1.0]. A baseline inference script provides reproducible LLM-driven and heuristic agent scores.
| Dimension | Traditional LMS | AdaptiveTutor-Env |
|---|---|---|
| Student model | Rule-based tags | Probabilistic IRT |
| Knowledge decay | Ignored | Ebbinghaus exponential |
| Engagement | Static | Dynamic with fatigue |
| Reward signal | Quiz score only | Multi-objective composite |
| RL training | Not supported | First-class OpenEnv |
| Reproducibility | Non-deterministic | Seed-controlled |
An RL agent trained on AdaptiveTutor-Env learns genuine pedagogical skills: difficulty calibration, spaced repetition scheduling, engagement management, and curriculum balancing — transferable to real educational AI products.
┌─────────────────────────────────────────────────────────────────┐
│ AdaptiveTutor-Env │
│ │
│ ┌──────────────┐ ┌─────────────┐ ┌───────────────────┐ │
│ │ RL Agent │───▶│ Environment │───▶│ Student Model │ │
│ │ (policy π) │◀───│ step/reset │◀───│ IRT + Forgetting │ │
│ └──────────────┘ └─────────────┘ └───────────────────┘ │
│ │ │
│ ┌─────────▼──────────┐ │
│ │ Reward Computer │ │
│ │ + Task Grader │ │
│ └────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Episode flow:
- `reset(task_id, seed)` → initial student state (low masteries, high engagement)
- Agent selects a 5D teaching action per step
- Environment applies IRT learning + Ebbinghaus decay + engagement dynamics
- Dense reward emitted; step counter incremented
- Episode ends at `max_steps` or early termination (mastery ≥ 0.92)
- `grade()` computes final task score
The Item Response Theory (2-Parameter Logistic) model governs every learning interaction. Student ability (θ) is a continuous latent variable linked to mastery via the logit transform:
P(correct | θ, b) = σ(a × (θ − b))
Where:
a = 1.7 (discrimination — fixed)
b = {-1.0, 0.0, +1.5} (difficulty per level: easy, medium, hard)
θ = logit(mastery) (ability from mastery ∈ [0,1])
σ = sigmoid
When the agent's chosen difficulty b matches the student's ability θ, learning efficiency is maximized (Zone of Proximal Development). Misaligned difficulty yields reduced or negative learning gains.
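As a concrete sketch, the 2PL response probability above can be computed directly (the clipping epsilon that keeps the logit finite is an assumption; the exact value used in `student_model.py` may differ):

```python
import math

def p_correct(mastery: float, b: float, a: float = 1.7) -> float:
    """2PL probability of a correct response given mastery and item difficulty b."""
    # Clip mastery away from {0, 1} so the logit stays finite (assumed epsilon).
    m = min(max(mastery, 1e-6), 1 - 1e-6)
    theta = math.log(m / (1 - m))                 # ability = logit(mastery)
    return 1 / (1 + math.exp(-a * (theta - b)))   # sigma(a * (theta - b))

# A student at mastery 0.5 has theta = 0, so on a medium item (b = 0)
# the success probability is exactly 0.5.
```

Note how the ZPD effect falls out of the formula: for a fixed mastery, an easy item (b = −1.0) always yields a higher success probability than a hard one (b = +1.5).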
Unvisited topics decay exponentially between teaching steps:
R(t) = exp(−t / (τ × strength))
decayed_mastery = max(mastery × R(t), 0.05 × original_mastery)
Where:
t = steps elapsed since last visit
τ = 100.0 (base time constant)
strength = 0.7 + 0.5 × mastery (stronger memories decay slower)
0.05 = procedural memory floor (never fully lost)
This creates a genuine trade-off between exploration and retention — the agent must revisit topics before they decay too far, but can't revisit everything simultaneously within a bounded episode.
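The decay rule above can be sketched as a single function (the call signature is illustrative; `student_model.py` may structure this differently):

```python
import math

def decayed_mastery(mastery: float, original: float, t: int, tau: float = 100.0) -> float:
    """Apply Ebbinghaus decay to an unvisited topic after t elapsed steps."""
    strength = 0.7 + 0.5 * mastery            # stronger memories decay slower
    retention = math.exp(-t / (tau * strength))
    # Procedural memory floor: mastery never drops below 5% of its original value.
    return max(mastery * retention, 0.05 * original)
```

With τ = 100 and strength near 1, a topic loses roughly 1% of its mastery per unvisited step, so neglecting a topic for tens of steps produces a visible drop.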
| Activity | Learning Rate | Fatigue Δ | Notes |
|---|---|---|---|
| Video lesson | 0.21 | +0.02 | Low intensity |
| Practice exercise | 0.42 | +0.05 | Highest gain, most fatigue |
| Quiz | 0.33 | +0.04 | Tests and reinforces |
| Revision | 0.27 | +0.03 | Low fatigue, efficient |
Engagement recovers passively (+0.04/step on success) but decays with a base rate of −0.03/step × 0.3 (~−0.009/step effective) plus a fatigue drag proportional to current fatigue. A disengaged or fatigued student learns less effectively — the reward signal reflects this.
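A minimal sketch of one engagement update consistent with the numbers above (the fatigue-drag coefficient is an assumed parameter, since only its proportionality to fatigue is stated):

```python
def update_engagement(engagement: float, fatigue: float, success: bool,
                      drag_coeff: float = 0.05) -> float:
    """One-step engagement update: base decay, fatigue drag, recovery on success."""
    delta = -0.03 * 0.3                # base decay (~-0.009/step effective)
    delta -= drag_coeff * fatigue      # fatigue drag (coefficient is an assumption)
    if success:
        delta += 0.04                  # passive recovery on a successful step
    return min(max(engagement + delta, 0.0), 1.0)
```

Under this sketch, a rested student who keeps succeeding drifts upward, while a fatigued student loses engagement even on success once the drag outweighs the recovery term.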
Each action is a 5-dimensional discrete composite:
Action = {
subject: str (5 choices) — which subject to teach
topic: str (5 per subject) — which topic within the subject
activity_type: int {0, 1, 2, 3} — video, practice, quiz, revision
difficulty: int {0, 1, 2} — easy (b=−1.0), medium (b=0), hard (b=+1.5)
strategy: int {0, 1, 2} — introduce, reinforce, spaced_repetition
}
Total action space size: 5 × 5 × 4 × 3 × 3 = 900 distinct actions
# Example action
{
"subject": "mathematics",
"topic": "algebra",
"activity_type": 1, # practice_exercise
"difficulty": 0, # easy (b = -1.0)
"strategy": 0 # introduce_new_concept
}

| Strategy | When to Use | Effect |
|---|---|---|
| 0 — Introduce | mastery < 0.30 | New concept delivery, low fatigue |
| 1 — Reinforce | mastery 0.30–0.70 | Practice with feedback, mastery building |
| 2 — Spaced Repetition | topic not visited in 6+ steps | Counteract Ebbinghaus decay |
The agent receives a rich state observation at every step:
Observation {
masteries: dict[subject][topic] → [0.0, 1.0]
engagement: float ∈ [0.0, 1.0]
fatigue: float ∈ [0.0, 1.0]
step: int (current step, 0-indexed)
subject_masteries: dict[subject] → mean mastery
overall_mastery: float (global mean)
time_since_last_visit: dict[subject][topic] → int (steps)
last_action: dict | null (previous action details)
task_id: str (active task identifier)
max_steps: int (episode budget)
}
State size: 25 topic masteries + 5 subject means + 25 time-since-visit counters + 2 scalar fields (engagement, fatigue) = 57 continuous observable variables
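For policy training, the nested observation can be flattened into the 57-dimensional vector counted above; a sketch (the ordering is an assumption, not a canonical layout defined by the environment):

```python
def flatten_observation(obs: dict) -> list[float]:
    """Flatten the nested observation dict into a fixed-length feature vector."""
    subjects = sorted(obs["masteries"])               # stable subject ordering
    vec = []
    for s in subjects:                                # 25 topic masteries
        vec += [obs["masteries"][s][t] for t in sorted(obs["masteries"][s])]
    vec += [obs["subject_masteries"][s] for s in subjects]   # 5 subject means
    for s in subjects:                                # 25 time-since-visit counters
        vec += [float(obs["time_since_last_visit"][s][t])
                for t in sorted(obs["time_since_last_visit"][s])]
    vec += [obs["engagement"], obs["fatigue"]]        # 2 scalars -> 57 total
    return vec
```

Scalar bookkeeping fields (`step`, `task_id`, `max_steps`) are left out here; a policy would typically receive `step / max_steps` as an extra normalized feature.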
The reward is a dense, multi-objective composite emitted at every step:
R_total = 0.50 × learning_gain
+ 0.20 × engagement_bonus
+ 0.20 × balance_reward
− 0.15 × retention_penalty
− 0.3 × (step / max_steps) ← progressive step penalty
| Component | Formula | Range |
|---|---|---|
| `learning_gain` | clip(Δmastery / 0.15, −1, 1) | [−1, 1] |
| `engagement_bonus` | clip(2 × engagement − 1, −0.5, 0.5) | [−0.5, 0.5] |
| `balance_reward` | 1 − Gini(subject_means) | [0, 1] |
| `retention_penalty` | mean(Δmastery / masteries_before) on unvisited topics | [0, 1] |
| `step_penalty` | −0.3 × (step / max_steps) | [−0.3, 0] |
Total reward clipped to [−1.0, 1.0]
The progressive step penalty increases urgency as the episode nears its end, preventing agents from wasting steps on low-value activities.
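Putting the pieces together, the composite can be sketched as one function over the already-computed components (a sketch of the documented weighting, not the code in `reward.py`):

```python
def total_reward(learning_gain: float, engagement_bonus: float,
                 balance_reward: float, retention_penalty: float,
                 step: int, max_steps: int) -> float:
    """Combine the reward components with the documented weights, clip to [-1, 1]."""
    r = (0.50 * learning_gain
         + 0.20 * engagement_bonus
         + 0.20 * balance_reward
         - 0.15 * retention_penalty
         - 0.30 * (step / max_steps))     # progressive step penalty
    return max(min(r, 1.0), -1.0)
```

Since the positive weights sum to 0.90 and both penalties grow over the episode, sustained high rewards require steady learning gains, not just an engaged student.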
Three tasks with escalating difficulty and distinct optimization objectives:
Objective: Maximize mean mastery across all 5 Mathematics topics.
Grader:
score = mean(maths_masteries)
+ 0.05 if all topics ≥ 0.60
What it tests: Basic IRT-aligned difficulty selection and learning progression.
Objective: Raise global mastery AND minimize inter-subject inequality.
Grader:
score = 0.60 × overall_mastery
+ 0.40 × (1 − Gini(subject_means))
What it tests: The agent's ability to balance a curriculum — agents that over-invest in one subject score poorly on balance.
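The balance term relies on the Gini coefficient of the five subject means; one standard mean-absolute-difference formulation (the environment's exact implementation may differ):

```python
def gini(values: list[float]) -> float:
    """Gini coefficient via mean absolute difference; 0 = perfectly equal."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0                         # no mastery anywhere: treat as equal
    mad = sum(abs(x - y) for x in values for y in values) / (n * n)
    return mad / (2 * mean)

# Equal subject means give Gini = 0, so balance_reward = 1 - 0 = 1 (maximal);
# concentrating all mastery in one subject pushes Gini toward its upper bound.
```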
Objective: Maximize mastery AND prevent topics from decaying below 60% of their peak.
Grader:
retention_rate = count(topics where final ≥ 0.60 × peak) / 25
score = 0.40 × overall_mastery
+ 0.20 × min_subject_mastery
+ 0.40 × retention_rate
What it tests: Proactive spaced-repetition scheduling — agents must revisit topics before Ebbinghaus decay erodes them. This is the hardest task and the most educationally meaningful.
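A sketch of the Task 3 grader assembled from the formulas above (variable names and the nested-dict shapes are illustrative, not the signatures in `tasks.py`):

```python
def grade_long_term_retention(final: dict, peak: dict,
                              overall_mastery: float,
                              min_subject_mastery: float) -> float:
    """Task 3 score: overall achievement + weakest subject + retention of peaks."""
    topics = [(s, t) for s in final for t in final[s]]            # all 25 topics
    retained = sum(final[s][t] >= 0.60 * peak[s][t] for s, t in topics)
    retention_rate = retained / len(topics)
    return (0.40 * overall_mastery
            + 0.20 * min_subject_mastery
            + 0.40 * retention_rate)
```

Because 40% of the score hinges on `retention_rate`, an agent that maximizes peaks but never revisits topics is heavily penalized once Ebbinghaus decay pulls final masteries below 60% of those peaks.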
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Environment metadata |
| GET | `/health` | Health check (for HF Spaces) |
| GET | `/tasks` | List all task descriptors |
| POST | `/reset` | Start new episode |
| POST | `/step` | Execute teaching action |
| GET | `/state` | Full state snapshot |
| POST | `/grade` | Grade current episode |
| GET | `/docs` | Swagger/OpenAPI UI |
# Start a new episode (Task 1)
curl -X POST http://localhost:7860/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "single_subject_mastery", "seed": 42}'
# Take a teaching action
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{
"action": {
"subject": "mathematics",
"topic": "algebra",
"activity_type": 1,
"difficulty": 0,
"strategy": 0
}
}'
# Inspect full state
curl http://localhost:7860/state
# Grade the episode
curl -X POST http://localhost:7860/grade

# Clone and run locally
git clone https://github.com/your-username/adaptive-tutor-env
cd adaptive-tutor-env
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860

# Or build and run with Docker
docker build -t adaptive-tutor-env .
docker run -p 7860:7860 adaptive-tutor-env

# Run the test suite and pre-submission validation
pytest tests/ -v
python validate.py

# Run the baseline LLM inference script
export HF_TOKEN="hf_..."  # REQUIRED
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export ENV_BASE_URL="http://localhost:7860"
python inference.py

Expected output format:
[START] task=single_subject_mastery env=adaptive-tutor-env model=gpt-4o-mini
[STEP] step=1 action=mathematics.algebra:1:0:0 reward=0.12 done=false error=null
...
[END] success=true steps=20 score=0.72 rewards=0.12,0.08,...
adaptive-tutor-env/
├── server/
│ ├── __init__.py
│ ├── app.py # FastAPI application & OpenEnv REST endpoints
│ ├── models.py # Pydantic models (Action, Observation, Reward, etc.)
│ ├── environment.py # Core RL environment: reset() / step() / state()
│ ├── student_model.py # IRT 2PL + Ebbinghaus forgetting + engagement dynamics
│ ├── reward.py # Dense multi-objective composite reward function
│ └── tasks.py # 3 graded tasks with deterministic programmatic graders
├── tests/
│ ├── __init__.py
│ ├── test_env.py # Environment & model unit tests
│ └── test_server.py # REST API integration tests
├── inference.py # Baseline LLM inference script (OpenAI-compatible)
├── validate.py # Pre-submission validation (9 checks)
├── openenv.yaml # OpenEnv specification metadata
├── Dockerfile # Container for Hugging Face Spaces deployment
├── requirements.txt # Python dependencies
└── LICENSE # MIT License
| Decision | Rationale |
|---|---|
| IRT 2PL over simpler models | Produces realistic, calibrated learning curves; difficulty-ability alignment directly maps to Zone of Proximal Development |
| Ebbinghaus over forgetting suppression | Forces genuine long-horizon scheduling; agents must actively prevent decay |
| Gini coefficient for balance | Continuous, well-understood inequality measure in [0,1]; directly incentivizes proportional curriculum |
| Dense rewards over sparse | Long-horizon RL is intractable with final-only rewards; dense signals accelerate policy learning |
| 5D composite action space | Captures the full richness of teaching decisions without exploding into combinatorial space |
Realism first. Every modeling choice — IRT, forgetting curves, engagement dynamics — reflects genuine educational psychology research. We did not build a toy; we built a simulation of a real tutoring interaction.
Grounded in evaluation science. The grading rubrics mirror how real educational assessments are constructed: weighted combinations of achievement, equity, and retention. A perfect score requires genuine mastery, not score-hacking.
OpenEnv-native. Full spec compliance, typed Pydantic models, standard reset/step/state API, and REST deployment — ready for agentic evaluation against other environments in the benchmark.
Reproducibility by design. Every stochastic element (initial masteries, student responses) is controlled by an explicit seed. Any researcher can reproduce exact trajectories.
MIT License — see LICENSE.
AdaptiveTutor-Env: A Real-World OpenEnv RL Environment for Personalized AI Tutoring.
Meta PyTorch OpenEnv Hackathon × SST, 2026.