---
title: Adaptive Tutor Env
emoji: 🌖
colorFrom: blue
colorTo: pink
sdk: docker
pinned: false
license: mit
short_description: Personalized AI tutoring RL Environment
---
An OpenEnv-compliant reinforcement learning environment for personalized AI tutoring — simulating real-world EdTech dynamics with psychometric student modeling and multi-objective pedagogical optimization.
- Problem Statement
- Abstract
- Why This Matters
- Environment Overview
- The Student Model
- Action Space
- Observation Space
- Reward Function
- Tasks & Grading
- API Reference
- Setup & Usage
- Baseline Performance
- Architecture
- Design Philosophy
Personalized tutoring is one of the most effective pedagogical interventions known — studies show one-on-one tutoring yields two standard deviations of improvement over classroom instruction (Bloom, 1984). Yet human tutors scale poorly: globally, 500 million students lack access to qualified tutors, and the global EdTech market — projected to exceed $400 billion by 2027 — is powered largely by static, one-size-fits-none content sequencing.
The core challenge is sequential instructional decision-making: a tutor must, at each interaction, decide what to teach, how to teach it, and when to revisit it — all while tracking a student's evolving knowledge state, engagement, and fatigue. This is a genuinely hard sequential decision problem with real-world stakes.
Existing RL environments in education are typically:
- Toy problems with no real student modeling
- Static curricula that don't adapt to individual learners
- Game-based (e.g., MathWorld, PEARL) with no transfer to real pedagogy
There is no open-source, OpenEnv-compliant RL environment that models real educational dynamics at this fidelity.
AdaptiveTutor-Env fills this gap. It is a full-featured, OpenEnv-compliant reinforcement learning environment for personalized AI tutoring, submitted to the Meta PyTorch OpenEnv Hackathon × SST.
The environment simulates a personalized AI tutor interacting with a single student across 5 academic subjects and 25 topics over a bounded episode. Unlike toy environments, AdaptiveTutor-Env embeds:
- Item Response Theory (IRT) 2PL — the gold standard of educational measurement — to govern learning transitions and model student ability as a continuous latent variable.
- Ebbinghaus exponential forgetting — real knowledge decays between visits — forcing agents to schedule proactive spaced repetition.
- Dynamic engagement and fatigue — student state evolves with teaching choices, creating genuine multi-objective optimization.
- A composite dense reward function rewarding learning gains, engagement maintenance, subject balance, and retention simultaneously.
Three graded tasks span easy → medium → hard difficulty, with deterministic programmatic graders scoring in [0.0, 1.0]. A baseline inference script provides reproducible LLM-driven and heuristic agent scores.
| Dimension | Traditional LMS | AdaptiveTutor-Env |
|---|---|---|
| Student model | Rule-based tags | Probabilistic IRT |
| Knowledge decay | Ignored | Ebbinghaus exponential |
| Engagement | Static | Dynamic with fatigue |
| Reward signal | Quiz score only | Multi-objective composite |
| RL training | Not supported | First-class OpenEnv |
| Reproducibility | Non-deterministic | Seed-controlled |
An RL agent trained on AdaptiveTutor-Env learns genuine pedagogical skills: difficulty calibration, spaced repetition scheduling, engagement management, and curriculum balancing — transferable to real educational AI products.
┌─────────────────────────────────────────────────────────────────┐
│ AdaptiveTutor-Env │
│ │
│ ┌──────────────┐ ┌─────────────┐ ┌───────────────────┐ │
│ │ RL Agent │───▶│ Environment │───▶│ Student Model │ │
│ │ (policy π) │◀───│ step/reset │◀───│ IRT + Forgetting │ │
│ └──────────────┘ └─────────────┘ └───────────────────┘ │
│ │ │
│ ┌─────────▼──────────┐ │
│ │ Reward Computer │ │
│ │ + Task Grader │ │
│ └────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Episode flow:
- `reset(task_id, seed)` → initial student state (low masteries, high engagement)
- Agent selects a 5D teaching action per step
- Environment applies IRT learning + Ebbinghaus decay + engagement dynamics
- Dense reward emitted; step counter incremented
- Episode ends at `max_steps` or early termination (mastery ≥ 0.92)
- `grade()` computes final task score
The Item Response Theory (2-Parameter Logistic) model governs every learning interaction. Student ability (θ) is a continuous latent variable linked to mastery via the logit transform:
P(correct | θ, b) = σ(a × (θ − b))
Where:
a = 1.7 (discrimination — fixed)
b = {-1.0, 0.0, +1.5} (difficulty per level: easy, medium, hard)
θ = logit(mastery) (ability from mastery ∈ [0,1])
σ = sigmoid
When the agent's chosen difficulty b matches the student's ability θ, learning efficiency is maximized (Zone of Proximal Development). Misaligned difficulty yields reduced or negative learning gains.
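As a concrete sketch, the 2PL response probability above can be computed directly (the clipping epsilon that keeps the logit finite is an assumption; the exact value used in `student_model.py` may differ):

```python
import math

def p_correct(mastery: float, b: float, a: float = 1.7) -> float:
    """2PL probability of a correct response given mastery and item difficulty b."""
    # Clip mastery away from {0, 1} so the logit stays finite (assumed epsilon).
    m = min(max(mastery, 1e-6), 1 - 1e-6)
    theta = math.log(m / (1 - m))                 # ability = logit(mastery)
    return 1 / (1 + math.exp(-a * (theta - b)))   # sigma(a * (theta - b))

# A student at mastery 0.5 has theta = 0, so on a medium item (b = 0)
# the success probability is exactly 0.5.
```

Note how the ZPD effect falls out of the formula: for a fixed mastery, an easy item (b = −1.0) always yields a higher success probability than a hard one (b = +1.5).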
Unvisited topics decay exponentially between teaching steps:
R(t) = exp(−t / (τ × strength))
decayed_mastery = max(mastery × R(t), 0.05 × original_mastery)
Where:
t = steps elapsed since last visit
τ = 100.0 (base time constant)
strength = 0.7 + 0.5 × mastery (stronger memories decay slower)
0.05 = procedural memory floor (never fully lost)
This creates a genuine trade-off between exploration and retention — the agent must revisit topics before they decay too far, but can't revisit everything simultaneously within a bounded episode.
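The decay rule above can be sketched as a single function (the call signature is illustrative; `student_model.py` may structure this differently):

```python
import math

def decayed_mastery(mastery: float, original: float, t: int, tau: float = 100.0) -> float:
    """Apply Ebbinghaus decay to an unvisited topic after t elapsed steps."""
    strength = 0.7 + 0.5 * mastery            # stronger memories decay slower
    retention = math.exp(-t / (tau * strength))
    # Procedural memory floor: mastery never drops below 5% of its original value.
    return max(mastery * retention, 0.05 * original)
```

With τ = 100 and strength near 1, a topic loses roughly 1% of its mastery per unvisited step, so neglecting a topic for tens of steps produces a visible drop.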
| Activity | Learning Rate | Fatigue Δ | Notes |
|---|---|---|---|
| Video lesson | 0.21 | +0.02 | Low intensity |
| Practice exercise | 0.42 | +0.05 | Highest gain, most fatigue |
| Quiz | 0.33 | +0.04 | Tests and reinforces |
| Revision | 0.27 | +0.03 | Low fatigue, efficient |
Engagement recovers passively (+0.04/step on success) but decays with a base rate of −0.03/step × 0.3 (~−0.009/step effective) plus a fatigue drag proportional to current fatigue. A disengaged or fatigued student learns less effectively — the reward signal reflects this.
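A minimal sketch of one engagement update consistent with the numbers above (the fatigue-drag coefficient is an assumed parameter, since only its proportionality to fatigue is stated):

```python
def update_engagement(engagement: float, fatigue: float, success: bool,
                      drag_coeff: float = 0.05) -> float:
    """One-step engagement update: base decay, fatigue drag, recovery on success."""
    delta = -0.03 * 0.3                # base decay (~-0.009/step effective)
    delta -= drag_coeff * fatigue      # fatigue drag (coefficient is an assumption)
    if success:
        delta += 0.04                  # passive recovery on a successful step
    return min(max(engagement + delta, 0.0), 1.0)
```

Under this sketch, a rested student who keeps succeeding drifts upward, while a fatigued student loses engagement even on success once the drag outweighs the recovery term.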
Each action is a 5-dimensional discrete composite:
Action = {
subject: str (5 choices) — which subject to teach
topic: str (5 per subject) — which topic within the subject
activity_type: int {0, 1, 2, 3} — video, practice, quiz, revision
difficulty: int {0, 1, 2} — easy (b=−1.0), medium (b=0), hard (b=+1.5)
strategy: int {0, 1, 2} — introduce, reinforce, spaced_repetition
}
Total action space size: 5 × 5 × 4 × 3 × 3 = 900 distinct actions
# Example action
{
"subject": "mathematics",
"topic": "algebra",
"activity_type": 1, # practice_exercise
"difficulty": 0, # easy (b = -1.0)
"strategy": 0 # introduce_new_concept
}

| Strategy | When to Use | Effect |
|---|---|---|
| 0 — Introduce | mastery < 0.30 | New concept delivery, low fatigue |
| 1 — Reinforce | mastery 0.30–0.70 | Practice with feedback, mastery building |
| 2 — Spaced Repetition | topic not visited in 6+ steps | Counteract Ebbinghaus decay |
The agent receives a rich state observation at every step:
Observation {
masteries: dict[subject][topic] → [0.0, 1.0]
engagement: float ∈ [0.0, 1.0]
fatigue: float ∈ [0.0, 1.0]
step: int (current step, 0-indexed)
subject_masteries: dict[subject] → mean mastery
overall_mastery: float (global mean)
time_since_last_visit: dict[subject][topic] → int (steps)
last_action: dict | null (previous action details)
task_id: str (active task identifier)
max_steps: int (episode budget)
}
State size: 25 topic masteries + 5 subject means + 25 time-since-visit counters + 2 scalar fields (engagement, fatigue) = 57 continuous observable variables
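For policy training, the nested observation can be flattened into the 57-dimensional vector counted above; a sketch (the ordering is an assumption, not a canonical layout defined by the environment):

```python
def flatten_observation(obs: dict) -> list[float]:
    """Flatten the nested observation dict into a fixed-length feature vector."""
    subjects = sorted(obs["masteries"])               # stable subject ordering
    vec = []
    for s in subjects:                                # 25 topic masteries
        vec += [obs["masteries"][s][t] for t in sorted(obs["masteries"][s])]
    vec += [obs["subject_masteries"][s] for s in subjects]   # 5 subject means
    for s in subjects:                                # 25 time-since-visit counters
        vec += [float(obs["time_since_last_visit"][s][t])
                for t in sorted(obs["time_since_last_visit"][s])]
    vec += [obs["engagement"], obs["fatigue"]]        # 2 scalars -> 57 total
    return vec
```

Scalar bookkeeping fields (`step`, `task_id`, `max_steps`) are left out here; a policy would typically receive `step / max_steps` as an extra normalized feature.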
The reward is a dense, multi-objective composite emitted at every step:
R_total = 0.50 × learning_gain
+ 0.20 × engagement_bonus
+ 0.20 × balance_reward
− 0.15 × retention_penalty
− 0.3 × (step / max_steps) ← progressive step penalty
| Component | Formula | Range |
|---|---|---|
| `learning_gain` | clip(Δmastery / 0.15, −1, 1) | [−1, 1] |
| `engagement_bonus` | clip(2 × engagement − 1, −0.5, 0.5) | [−0.5, 0.5] |
| `balance_reward` | 1 − Gini(subject_means) | [0, 1] |
| `retention_penalty` | mean(Δmastery / masteries_before) on unvisited topics | [0, 1] |
| `step_penalty` | −0.3 × (step / max_steps) | [−0.3, 0] |
Total reward clipped to [−1.0, 1.0]
The progressive step penalty increases urgency as the episode nears its end, preventing agents from wasting steps on low-value activities.
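Putting the pieces together, the composite can be sketched as one function over the already-computed components (a sketch of the documented weighting, not the code in `reward.py`):

```python
def total_reward(learning_gain: float, engagement_bonus: float,
                 balance_reward: float, retention_penalty: float,
                 step: int, max_steps: int) -> float:
    """Combine the reward components with the documented weights, clip to [-1, 1]."""
    r = (0.50 * learning_gain
         + 0.20 * engagement_bonus
         + 0.20 * balance_reward
         - 0.15 * retention_penalty
         - 0.30 * (step / max_steps))     # progressive step penalty
    return max(min(r, 1.0), -1.0)
```

Since the positive weights sum to 0.90 and both penalties grow over the episode, sustained high rewards require steady learning gains, not just an engaged student.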
Three tasks with escalating difficulty and distinct optimization objectives:
Objective: Maximize mean mastery across all 5 Mathematics topics.
Grader:
score = mean(maths_masteries)
+ 0.05 if all topics ≥ 0.60
What it tests: Basic IRT-aligned difficulty selection and learning progression.
Objective: Raise global mastery AND minimize inter-subject inequality.
Grader:
score = 0.60 × overall_mastery
+ 0.40 × (1 − Gini(subject_means))
What it tests: The agent's ability to balance a curriculum — agents that over-invest in one subject score poorly on balance.
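The balance term relies on the Gini coefficient of the five subject means; one standard mean-absolute-difference formulation (the environment's exact implementation may differ):

```python
def gini(values: list[float]) -> float:
    """Gini coefficient via mean absolute difference; 0 = perfectly equal."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0                         # no mastery anywhere: treat as equal
    mad = sum(abs(x - y) for x in values for y in values) / (n * n)
    return mad / (2 * mean)

# Equal subject means give Gini = 0, so balance_reward = 1 - 0 = 1 (maximal);
# concentrating all mastery in one subject pushes Gini toward its upper bound.
```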
Objective: Maximize mastery AND prevent topics from decaying below 60% of their peak.
Grader:
retention_rate = count(topics where final ≥ 0.60 × peak) / 25
score = 0.40 × overall_mastery
+ 0.20 × min_subject_mastery
+ 0.40 × retention_rate
What it tests: Proactive spaced-repetition scheduling — agents must revisit topics before Ebbinghaus decay erodes them. This is the hardest task and the most educationally meaningful.
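A sketch of the Task 3 grader assembled from the formulas above (variable names and the nested-dict shapes are illustrative, not the signatures in `tasks.py`):

```python
def grade_long_term_retention(final: dict, peak: dict,
                              overall_mastery: float,
                              min_subject_mastery: float) -> float:
    """Task 3 score: overall achievement + weakest subject + retention of peaks."""
    topics = [(s, t) for s in final for t in final[s]]            # all 25 topics
    retained = sum(final[s][t] >= 0.60 * peak[s][t] for s, t in topics)
    retention_rate = retained / len(topics)
    return (0.40 * overall_mastery
            + 0.20 * min_subject_mastery
            + 0.40 * retention_rate)
```

Because 40% of the score hinges on `retention_rate`, an agent that maximizes peaks but never revisits topics is heavily penalized once Ebbinghaus decay pulls final masteries below 60% of those peaks.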
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Environment metadata |
| GET | `/health` | Health check (for HF Spaces) |
| GET | `/tasks` | List all task descriptors |
| POST | `/reset` | Start new episode |
| POST | `/step` | Execute teaching action |
| GET | `/state` | Full state snapshot |
| POST | `/grade` | Grade current episode |
| GET | `/docs` | Swagger/OpenAPI UI |
# Start a new episode (Task 1)
curl -X POST http://localhost:7860/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "single_subject_mastery", "seed": 42}'
# Take a teaching action
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{
"action": {
"subject": "mathematics",
"topic": "algebra",
"activity_type": 1,
"difficulty": 0,
"strategy": 0
}
}'
# Inspect full state
curl http://localhost:7860/state
# Grade the episode
curl -X POST http://localhost:7860/grade

# Clone and run locally
git clone https://github.com/your-username/adaptive-tutor-env
cd adaptive-tutor-env
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860

# Or build and run with Docker
docker build -t adaptive-tutor-env .
docker run -p 7860:7860 adaptive-tutor-env

# Run the test suite and pre-submission validation
pytest tests/ -v
python validate.py

# Run the baseline LLM inference script
export HF_TOKEN="hf_..."  # REQUIRED
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export ENV_BASE_URL="http://localhost:7860"
python inference.py

Expected output format:
[START] task=single_subject_mastery env=adaptive-tutor-env model=gpt-4o-mini
[STEP] step=1 action=mathematics.algebra:1:0:0 reward=0.12 done=false error=null
...
[END] success=true steps=20 score=0.72 rewards=0.12,0.08,...
adaptive-tutor-env/
├── server/
│ ├── __init__.py
│ ├── app.py # FastAPI application & OpenEnv REST endpoints
│ ├── models.py # Pydantic models (Action, Observation, Reward, etc.)
│ ├── environment.py # Core RL environment: reset() / step() / state()
│ ├── student_model.py # IRT 2PL + Ebbinghaus forgetting + engagement dynamics
│ ├── reward.py # Dense multi-objective composite reward function
│ └── tasks.py # 3 graded tasks with deterministic programmatic graders
├── tests/
│ ├── __init__.py
│ ├── test_env.py # Environment & model unit tests
│ └── test_server.py # REST API integration tests
├── inference.py # Baseline LLM inference script (OpenAI-compatible)
├── validate.py # Pre-submission validation (9 checks)
├── openenv.yaml # OpenEnv specification metadata
├── Dockerfile # Container for Hugging Face Spaces deployment
├── requirements.txt # Python dependencies
└── LICENSE # MIT License
| Decision | Rationale |
|---|---|
| IRT 2PL over simpler models | Produces realistic, calibrated learning curves; difficulty-ability alignment directly maps to Zone of Proximal Development |
| Ebbinghaus over forgetting suppression | Forces genuine long-horizon scheduling; agents must actively prevent decay |
| Gini coefficient for balance | Continuous, well-understood inequality measure in [0,1]; directly incentivizes proportional curriculum |
| Dense rewards over sparse | Long-horizon RL is intractable with final-only rewards; dense signals accelerate policy learning |
| 5D composite action space | Captures the full richness of teaching decisions without exploding into combinatorial space |
Realism first. Every modeling choice — IRT, forgetting curves, engagement dynamics — reflects genuine educational psychology research. We did not build a toy; we built a simulation of a real tutoring interaction.
Grounded in evaluation science. The grading rubrics mirror how real educational assessments are constructed: weighted combinations of achievement, equity, and retention. A perfect score requires genuine mastery, not score-hacking.
OpenEnv-native. Full spec compliance, typed Pydantic models, standard reset/step/state API, and REST deployment — ready for agentic evaluation against other environments in the benchmark.
Reproducibility by design. Every stochastic element (initial masteries, student responses) is controlled by an explicit seed. Any researcher can reproduce exact trajectories.
MIT License — see LICENSE.
AdaptiveTutor-Env: A Real-World OpenEnv RL Environment for Personalized AI Tutoring.
Meta PyTorch OpenEnv Hackathon × SST, 2026.