
---
title: Adaptive Tutor Env
emoji: 🌖
colorFrom: blue
colorTo: pink
sdk: docker
pinned: false
license: mit
short_description: Personalized AI tutoring RL Environment
---

AdaptiveTutor-Env 🎓

An OpenEnv-compliant reinforcement learning environment for personalized AI tutoring — simulating real-world EdTech dynamics with psychometric student modeling and multi-objective pedagogical optimization.

OpenEnv PyTorch Python 3.11 License: MIT Hugging Face


Table of Contents

  1. Problem Statement
  2. Abstract
  3. Why This Matters
  4. Environment Overview
  5. The Student Model
  6. Action Space
  7. Observation Space
  8. Reward Function
  9. Tasks & Grading
  10. API Reference
  11. Setup & Usage
  12. Baseline Performance
  13. Architecture
  14. Design Philosophy

Problem Statement

Personalized tutoring is one of the most effective pedagogical interventions known — studies show one-on-one tutoring yields two standard deviations of improvement over classroom instruction (Bloom, 1984). Yet human tutors scale poorly: globally, 500 million students lack access to qualified tutors, and the global EdTech market — projected to exceed $400 billion by 2027 — is powered largely by static, one-size-fits-none content sequencing.

The core challenge is sequential instructional decision-making: a tutor must, at each interaction, decide what to teach, how to teach it, and when to revisit it — all while tracking a student's evolving knowledge state, engagement, and fatigue. This is a genuinely hard sequential decision problem with real-world stakes.

Existing RL environments in education are typically:

  • Toy problems with no real student modeling
  • Static curricula that don't adapt to individual learners
  • Game-based (e.g., MathWorld, PEARL) with no transfer to real pedagogy

There is no open-source, OpenEnv-compliant RL environment that models real educational dynamics at this fidelity.


Abstract

AdaptiveTutor-Env fills this gap. It is a full-featured, OpenEnv-compliant reinforcement learning environment for personalized AI tutoring, submitted to the Meta PyTorch OpenEnv Hackathon × SST.

The environment simulates a personalized AI tutor interacting with a single student across 5 academic subjects and 25 topics over a bounded episode. Unlike toy environments, AdaptiveTutor-Env embeds:

  1. Item Response Theory (IRT) 2PL — the gold standard of educational measurement — to govern learning transitions and model student ability as a continuous latent variable.
  2. Ebbinghaus exponential forgetting — real knowledge decays between visits — forcing agents to schedule proactive spaced repetition.
  3. Dynamic engagement and fatigue — student state evolves with teaching choices, creating genuine multi-objective optimization.
  4. A composite dense reward function rewarding learning gains, engagement maintenance, subject balance, and retention simultaneously.

Three graded tasks span easy → medium → hard difficulty, with deterministic programmatic graders scoring in [0.0, 1.0]. A baseline inference script provides reproducible LLM-driven and heuristic agent scores.


Why This Matters

| Dimension | Traditional LMS | AdaptiveTutor-Env |
|---|---|---|
| Student model | Rule-based tags | Probabilistic IRT |
| Knowledge decay | Ignored | Ebbinghaus exponential |
| Engagement | Static | Dynamic with fatigue |
| Reward signal | Quiz score only | Multi-objective composite |
| RL training | Not supported | First-class OpenEnv |
| Reproducibility | Non-deterministic | Seed-controlled |

An RL agent trained on AdaptiveTutor-Env learns genuine pedagogical skills: difficulty calibration, spaced repetition scheduling, engagement management, and curriculum balancing — transferable to real educational AI products.


Environment Overview

┌─────────────────────────────────────────────────────────────────┐
│                     AdaptiveTutor-Env                           │
│                                                                 │
│  ┌──────────────┐    ┌─────────────┐    ┌───────────────────┐  │
│  │   RL Agent   │───▶│  Environment │───▶│  Student Model    │  │
│  │  (policy π)  │◀───│  step/reset  │◀───│  IRT + Forgetting │  │
│  └──────────────┘    └─────────────┘    └───────────────────┘  │
│                              │                                  │
│                    ┌─────────▼──────────┐                       │
│                    │  Reward Computer   │                       │
│                    │  + Task Grader     │                       │
│                    └────────────────────┘                       │
└─────────────────────────────────────────────────────────────────┘

Episode flow:

  1. reset(task_id, seed) → initial student state (low masteries, high engagement)
  2. Agent selects a 5D teaching action per step
  3. Environment applies IRT learning + Ebbinghaus decay + engagement dynamics
  4. Dense reward emitted; step counter incremented
  5. Episode ends at max_steps or early termination (mastery ≥ 0.92)
  6. grade() computes final task score

The Student Model

IRT 2PL — Ability & Difficulty

The Item Response Theory (2-Parameter Logistic) model governs every learning interaction. Student ability (θ) is a continuous latent variable linked to mastery via the logit transform:

P(correct | θ, b) = σ(a × (θ − b))

Where:
  a  = 1.7   (discrimination — fixed)
  b  = {-1.0, 0.0, +1.5}  (difficulty per level: easy, medium, hard)
  θ  = logit(mastery)     (ability from mastery ∈ [0,1])
  σ  = sigmoid

When the agent's chosen difficulty b matches the student's ability θ, learning efficiency is maximized (Zone of Proximal Development). Misaligned difficulty yields reduced or negative learning gains.
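The mastery→ability link and the 2PL success probability can be sketched in a few lines. Constants come from the listing above; the clamping inside `logit` is our assumption, added only to keep the transform finite at mastery 0 or 1:

```python
import math

A = 1.7                                   # discrimination (fixed)
DIFFICULTY_B = {0: -1.0, 1: 0.0, 2: 1.5}  # easy, medium, hard

def logit(p: float, eps: float = 1e-6) -> float:
    """Map mastery in [0, 1] to ability theta on the real line (clamped)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def p_correct(mastery: float, difficulty: int) -> float:
    """2PL probability of a correct response: sigma(a * (theta - b))."""
    theta = logit(mastery)
    b = DIFFICULTY_B[difficulty]
    return 1.0 / (1.0 + math.exp(-A * (theta - b)))
```

A student at mastery 0.5 facing a medium item (b = 0) sits exactly at the inflection point, so the predicted success probability is 0.5.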

Ebbinghaus Forgetting Curve

Unvisited topics decay exponentially between teaching steps:

R(t) = exp(−t / (τ × strength))

decayed_mastery = max(mastery × R(t),  0.05 × original_mastery)

Where:
  t       = steps elapsed since last visit
  τ       = 100.0 (base time constant)
  strength = 0.7 + 0.5 × mastery  (stronger memories decay slower)
  0.05    = procedural memory floor (never fully lost)

This creates a genuine trade-off between exploration and retention — the agent must revisit topics before they decay too far, but can't revisit everything simultaneously within a bounded episode.
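The decay rule above, as a direct sketch (constants from the listing; the function name is ours):

```python
import math

TAU = 100.0    # base time constant
FLOOR = 0.05   # procedural memory floor (never fully lost)

def decayed_mastery(mastery: float, steps_since_visit: int) -> float:
    """Exponential retention with mastery-dependent memory strength."""
    strength = 0.7 + 0.5 * mastery  # stronger memories decay slower
    retention = math.exp(-steps_since_visit / (TAU * strength))
    return max(mastery * retention, FLOOR * mastery)
```

Note the floor is relative to the original mastery, so a well-learned topic never collapses to zero even after a very long gap.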

Engagement & Fatigue Dynamics

| Activity | Learning Rate | Fatigue Δ | Notes |
|---|---|---|---|
| Video lesson | 0.21 | +0.02 | Low intensity |
| Practice exercise | 0.42 | +0.05 | Highest gain, most fatigue |
| Quiz | 0.33 | +0.04 | Tests and reinforces |
| Revision | 0.27 | +0.03 | Low fatigue, efficient |

Engagement recovers passively (+0.04/step on success) but decays with a base rate of −0.03/step × 0.3 (~−0.009/step effective) plus a fatigue drag proportional to current fatigue. A disengaged or fatigued student learns less effectively — the reward signal reflects this.
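One plausible per-step update consistent with these numbers can be sketched as follows. The fatigue-drag coefficient `DRAG` is an assumed illustrative value — the text says only that the drag is proportional to current fatigue:

```python
FATIGUE_DELTA = {0: 0.02, 1: 0.05, 2: 0.04, 3: 0.03}  # video, practice, quiz, revision
BASE_DECAY = 0.03 * 0.3   # ~0.009 effective engagement loss per step
RECOVERY = 0.04           # passive recovery on a successful interaction
DRAG = 0.05               # ASSUMED proportionality constant (not specified in text)

def update_engagement(engagement: float, fatigue: float,
                      activity: int, success: bool) -> tuple[float, float]:
    """Advance engagement/fatigue one step; both clamped to [0, 1]."""
    fatigue = min(1.0, fatigue + FATIGUE_DELTA[activity])
    engagement -= BASE_DECAY + DRAG * fatigue   # base decay plus fatigue drag
    if success:
        engagement += RECOVERY
    return max(0.0, min(1.0, engagement)), fatigue
```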


Action Space

Each action is a 5-dimensional discrete composite:

Action = {
  subject:       str   (5 choices)     — which subject to teach
  topic:         str   (5 per subject) — which topic within the subject
  activity_type: int   {0, 1, 2, 3}   — video, practice, quiz, revision
  difficulty:    int   {0, 1, 2}       — easy (b=−1.0), medium (b=0), hard (b=+1.5)
  strategy:      int   {0, 1, 2}       — introduce, reinforce, spaced_repetition
}

Total action space size: 5 × 5 × 4 × 3 × 3 = 900 distinct actions

# Example action
{
    "subject": "mathematics",
    "topic": "algebra",
    "activity_type": 1,   # practice_exercise
    "difficulty": 0,      # easy (b = -1.0)
    "strategy": 0         # introduce_new_concept
}

Strategy Semantics

| Strategy | When to Use | Effect |
|---|---|---|
| 0 — Introduce | mastery < 0.30 | New concept delivery, low fatigue |
| 1 — Reinforce | mastery 0.30–0.70 | Practice with feedback, mastery building |
| 2 — Spaced Repetition | topic not visited in 6+ steps | Counteracts Ebbinghaus decay |

Observation Space

The agent receives a rich state observation at every step:

Observation {
  masteries:            dict[subject][topic] → [0.0, 1.0]
  engagement:           float ∈ [0.0, 1.0]
  fatigue:              float ∈ [0.0, 1.0]
  step:                 int    (current step, 0-indexed)
  subject_masteries:    dict[subject] → mean mastery
  overall_mastery:      float (global mean)
  time_since_last_visit: dict[subject][topic] → int (steps)
  last_action:          dict | null (previous action details)
  task_id:              str   (active task identifier)
  max_steps:            int   (episode budget)
}

State size: 25 topic masteries + 5 subject means + 25 time-since-visit + scalar fields = 57 observable variables


Reward Function

The reward is a dense, multi-objective composite emitted at every step:

R_total =  0.50 × learning_gain
         + 0.20 × engagement_bonus
         + 0.20 × balance_reward
         − 0.15 × retention_penalty
         − 0.3  × (step / max_steps)   ← progressive step penalty
| Component | Formula | Range |
|---|---|---|
| learning_gain | clip(Δmastery / 0.15, −1, 1) | [−1, 1] |
| engagement_bonus | clip(2 × engagement − 1, −0.5, 0.5) | [−0.5, 0.5] |
| balance_reward | 1 − Gini(subject_means) | [0, 1] |
| retention_penalty | mean(Δmastery / mastery_before) over unvisited topics | [0, 1] |
| step_penalty | −0.3 × (step / max_steps) | [−0.3, 0] |

Total reward clipped to [−1.0, 1.0]

The progressive step penalty increases urgency as the episode nears its end, preventing agents from wasting steps on low-value activities.
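The composite above can be sketched directly from the weights and clipping (the function signature is ours; component values are assumed to be precomputed per the table):

```python
def composite_reward(learning_gain: float, engagement_bonus: float,
                     balance_reward: float, retention_penalty: float,
                     step: int, max_steps: int) -> float:
    """Weighted multi-objective composite, clipped to [-1, 1]."""
    r = (0.50 * learning_gain
         + 0.20 * engagement_bonus
         + 0.20 * balance_reward
         - 0.15 * retention_penalty
         - 0.30 * (step / max_steps))   # progressive step penalty
    return max(-1.0, min(1.0, r))
```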


Tasks & Grading

Three tasks with escalating difficulty and distinct optimization objectives:

Task 1 — Single Subject Mastery (Easy · 20 steps · Pass ≥ 0.60)

Objective: Maximize mean mastery across all 5 Mathematics topics.

Grader:

score = mean(maths_masteries)
      + 0.05  if all topics ≥ 0.60

What it tests: Basic IRT-aligned difficulty selection and learning progression.


Task 2 — Multi-Subject Balancing (Medium · 25 steps · Pass ≥ 0.50)

Objective: Raise global mastery AND minimize inter-subject inequality.

Grader:

score = 0.60 × overall_mastery
      + 0.40 × (1 − Gini(subject_means))

What it tests: The agent's ability to balance a curriculum — agents that over-invest in one subject score poorly on balance.
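The balance term hinges on the Gini coefficient; a sketch via the mean-absolute-difference definition, with the Task 2 grader on top (function names are ours):

```python
def gini(values: list[float]) -> float:
    """Gini coefficient: 0 = perfectly equal, approaching 1 = maximally unequal."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    # Mean absolute difference over all ordered pairs, normalized by 2 * mean
    mad = sum(abs(a - b) for a in values for b in values) / (n * n)
    return mad / (2 * mean)

def grade_task2(subject_means: list[float], overall_mastery: float) -> float:
    """Task 2: weighted mix of global mastery and inter-subject equality."""
    return 0.60 * overall_mastery + 0.40 * (1 - gini(subject_means))
```

An agent that leaves one subject untouched pays on both terms: the untouched subject drags down overall mastery and inflates the Gini coefficient.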


Task 3 — Long-Horizon Retention (Hard · 40 steps · Pass ≥ 0.45)

Objective: Maximize mastery AND prevent topics from decaying below 60% of their peak.

Grader:

retention_rate = count(topics where final ≥ 0.60 × peak) / 25

score = 0.40 × overall_mastery
      + 0.20 × min_subject_mastery
      + 0.40 × retention_rate

What it tests: Proactive spaced-repetition scheduling — agents must revisit topics before Ebbinghaus decay erodes them. This is the hardest task and the most educationally meaningful.
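The Task 3 grader, sketched directly from the formulas above (argument names are ours; per-topic peak masteries are assumed to be tracked over the episode):

```python
def grade_task3(final_masteries: list[float],
                peak_masteries: list[float],
                subject_means: list[float]) -> float:
    """Task 3: mastery + weakest-subject floor + retention vs. per-topic peaks."""
    n = len(final_masteries)
    retained = sum(f >= 0.60 * p
                   for f, p in zip(final_masteries, peak_masteries))
    retention_rate = retained / n
    overall = sum(final_masteries) / n
    return (0.40 * overall
            + 0.20 * min(subject_means)
            + 0.40 * retention_rate)
```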


API Reference

REST Endpoints

| Method | Endpoint | Description |
|---|---|---|
| GET | / | Environment metadata |
| GET | /health | Health check (for HF Spaces) |
| GET | /tasks | List all task descriptors |
| POST | /reset | Start a new episode |
| POST | /step | Execute a teaching action |
| GET | /state | Full state snapshot |
| POST | /grade | Grade the current episode |
| GET | /docs | Swagger/OpenAPI UI |

Quick Start with curl

# Start a new episode (Task 1)
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "single_subject_mastery", "seed": 42}'

# Take a teaching action
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "action": {
      "subject": "mathematics",
      "topic": "algebra",
      "activity_type": 1,
      "difficulty": 0,
      "strategy": 0
    }
  }'

# Inspect full state
curl http://localhost:7860/state

# Grade the episode
curl -X POST http://localhost:7860/grade

Setup & Usage

Local Development

git clone https://github.com/your-username/adaptive-tutor-env
cd adaptive-tutor-env
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860

Docker

docker build -t adaptive-tutor-env .
docker run -p 7860:7860 adaptive-tutor-env

Run Tests

pytest tests/ -v

Run Validation

python validate.py

Run Baseline Inference

export HF_TOKEN="hf_..."              # REQUIRED
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export ENV_BASE_URL="http://localhost:7860"

python inference.py

Expected output format:

[START] task=single_subject_mastery env=adaptive-tutor-env model=gpt-4o-mini
[STEP] step=1 action=mathematics.algebra:1:0:0 reward=0.12 done=false error=null
...
[END] success=true steps=20 score=0.72 rewards=0.12,0.08,...

Architecture

adaptive-tutor-env/
├── server/
│   ├── __init__.py
│   ├── app.py                # FastAPI application & OpenEnv REST endpoints
│   ├── models.py             # Pydantic models (Action, Observation, Reward, etc.)
│   ├── environment.py        # Core RL environment: reset() / step() / state()
│   ├── student_model.py      # IRT 2PL + Ebbinghaus forgetting + engagement dynamics
│   ├── reward.py             # Dense multi-objective composite reward function
│   └── tasks.py              # 3 graded tasks with deterministic programmatic graders
├── tests/
│   ├── __init__.py
│   ├── test_env.py           # Environment & model unit tests
│   └── test_server.py        # REST API integration tests
├── inference.py              # Baseline LLM inference script (OpenAI-compatible)
├── validate.py               # Pre-submission validation (9 checks)
├── openenv.yaml              # OpenEnv specification metadata
├── Dockerfile                # Container for Hugging Face Spaces deployment
├── requirements.txt          # Python dependencies
└── LICENSE                   # MIT License

Key Design Decisions

| Decision | Rationale |
|---|---|
| IRT 2PL over simpler models | Produces realistic, calibrated learning curves; difficulty–ability alignment maps directly to the Zone of Proximal Development |
| Ebbinghaus over forgetting suppression | Forces genuine long-horizon scheduling; agents must actively prevent decay |
| Gini coefficient for balance | Continuous, well-understood inequality measure in [0, 1]; directly incentivizes a proportional curriculum |
| Dense rewards over sparse | Long-horizon RL is intractable with final-only rewards; dense signals accelerate policy learning |
| 5D composite action space | Captures the full richness of teaching decisions without a combinatorial explosion |

Design Philosophy

Realism first. Every modeling choice — IRT, forgetting curves, engagement dynamics — reflects genuine educational psychology research. We did not build a toy; we built a simulation of a real tutoring interaction.

Grounded in evaluation science. The grading rubrics mirror how real educational assessments are constructed: weighted combinations of achievement, equity, and retention. A perfect score requires genuine mastery, not score-hacking.

OpenEnv-native. Full spec compliance, typed Pydantic models, standard reset/step/state API, and REST deployment — ready for agentic evaluation against other environments in the benchmark.

Reproducibility by design. Every stochastic element (initial masteries, student responses) is controlled by an explicit seed. Any researcher can reproduce exact trajectories.


License

MIT License — see LICENSE.


Citation

AdaptiveTutor-Env: A Real-World OpenEnv RL Environment for Personalized AI Tutoring.
Meta PyTorch OpenEnv Hackathon × SST, 2026.
