A production-grade OpenEnv environment that simulates real-world email triage — the daily task of processing, prioritizing, and responding to a mixed work inbox. Built for the OpenEnv Hackathon with 3 difficulty-graded tasks, continuous partial rewards, and dynamic mid-episode events.
Email triage is a task professionals perform daily: scanning an inbox, deciding what to archive, what needs a reply, coordinating calendar availability, and handling urgent escalations. This makes it an ideal testbed for evaluating agent decision-making, prioritization, and multi-step planning under changing conditions.
The environment defines 3 benchmark tasks with increasing difficulty:
| Task ID | Name | Emails | Max Steps | Dynamic Events | Description |
|---|---|---|---|---|---|
| `easy` | Quick Sort | 3 | 6 | ❌ | Archive 3 spam/newsletter emails. Tests basic categorization. |
| `medium` | Priority Triage | 5 | 10 | ❌ | Triage 5 mixed-priority emails with calendar scheduling. Tests reading, drafting, and archiving decisions. |
| `hard` | Dynamic Crisis | 7–10 | 12 | ✅ | Handle a full inbox with mid-episode urgent emails and calendar changes. Tests adaptation and escalation handling. |
In addition, the root submission manifest defines 3 deterministic validator tasks used for task/grader compliance checks:
| Task ID | Module | Grader |
|---|---|---|
| `email_classification` | `tasks.email_classification:solve` | `graders.email_classification_grader:grade` |
| `priority_detection` | `tasks.priority_detection:solve` | `graders.priority_detection_grader:grade` |
| `response_generation` | `tasks.response_generation:solve` | `graders.response_generation_grader:grade` |
All grader scores are normalized to the range 0.0-1.0.
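Each validator above is referenced by a `module:function` entrypoint string. A minimal resolver for that convention, shown here as a sketch (the hackathon harness has its own loader), looks like this:

```python
import importlib


def load_entrypoint(spec: str):
    """Resolve a 'package.module:function' entrypoint string, the
    convention used by the task/grader references in the manifest."""
    module_name, func_name = spec.split(":", 1)
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# Demonstrated with a stdlib function; the submission's own specs would be
# strings like "graders.email_classification_grader:grade".
basename = load_entrypoint("os.path:basename")
```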
The agent sends an `EmailtriageAction` with these fields:

| Field | Type | Description |
|---|---|---|
| `action_type` | `"read"` \| `"archive"` \| `"query_calendar"` \| `"draft_email"` | The tool/action to execute |
| `target_email_id` | `int` | Email ID to act on (`-1` for `query_calendar`) |
| `draft_content` | `str` | Reply text for `draft_email` actions |
| `proposed_slot` | `str` | Calendar slot for scheduling drafts |
After each step the agent receives an `EmailtriageObservation`:

| Field | Type | Description |
|---|---|---|
| `inbox_preview` | `List[Dict]` | Metadata for up to 5 unread emails (id, sender, subject, priority, status) |
| `returned_emails` | `List[str]` | Full email text from read actions |
| `calendar_slots` | `List[str]` | Available calendar slots |
| `last_action_result` | `str` | Grader feedback for the most recent action |
| `inbox_remaining` | `int` | Count of unread emails |
| `conversation_history` | `List[str]` | Recent action/feedback trace |
| `reward` | `float` | Step reward in [0, 1] |
| `done` | `bool` | Whether the episode has ended |
Rewards are continuous and partially informative (not binary pass/fail):
- Archive spam/newsletters: 0.62–0.80 per correct archive
- Read emails: 0.09–0.25 depending on priority (higher for critical emails)
- Query calendar: 0.10–0.46 based on pending scheduling workload
- Draft replies: multi-factor scoring based on:
  - Task appropriateness (is this email worth drafting?)
  - Draft quality (length, professionalism, keyword relevance)
  - Calendar awareness (did you check availability first?)
  - Valid proposed slot
  - Urgency handling for escalations
- Progress bonus: +0.12 for each email successfully processed
- Completion bonus: +0.10 when all inbox items are triaged
- Penalties: Archiving important emails scores 0.03–0.08 (not zero)
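As a toy illustration of these bands, a shaped-reward function could be sketched as follows (the actual grader in `server/EmailTriage_environment.py` is richer; the specific values and the `pending_meetings` scaling here are assumptions within the documented ranges):

```python
def shaped_reward(action_type, email=None, pending_meetings=0):
    """Toy model of the reward bands documented above."""
    if action_type == "archive":
        if email["category"] in ("spam", "newsletter"):
            return 0.71  # inside the 0.62-0.80 correct-archive band
        return 0.05      # 0.03-0.08 penalty band for important mail
    if action_type == "read":
        # 0.09-0.25 depending on priority, higher for critical emails.
        return {"critical": 0.25, "high": 0.17}.get(email["priority"], 0.09)
    if action_type == "query_calendar":
        # 0.10-0.46 scaling with pending scheduling workload (assumed linear).
        return min(0.10 + 0.12 * pending_meetings, 0.46)
    return 0.0
```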
- Python 3.10+
- Docker (for containerized deployment)
- `openenv-core` and `uv` installed
```bash
# Root-level (for inference script)
pip install -r requirements.txt

# Environment (using uv)
cd EmailTriage
uv sync
```

Start the environment server:

```bash
cd EmailTriage
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```

Set the required environment variables and run all 3 tasks:

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-hf-token"
export LOCAL_IMAGE_NAME="emailtriage-env:latest"

# Run all 3 tasks
python inference.py
```

Build the container image:

```bash
docker build -t emailtriage-env:latest .
```

Validate the environment and run the deterministic sanity check:

```bash
# Environment validation (inner package)
cd EmailTriage
openenv validate

# Return to root and run deterministic task/grader sanity check
cd ..
python -c "from tasks import list_tasks; print([t['id'] for t in list_tasks()])"
```

Push the environment to the Hub:

```bash
cd EmailTriage
openenv push --repo-id OMCHOKSI108/Emailopenenvrl
```

Project structure:

```
Galcogens-OpenEnv/
├── inference.py                # Hackathon inference script (runs 3 tasks)
├── openenv.yaml                # Root submission manifest (entrypoint/endpoints/tasks/graders)
├── Dockerfile                  # Root container definition
├── requirements.txt            # Inference-only dependencies
├── tasks/                      # Deterministic validator task definitions
├── graders/                    # Deterministic validator graders (score in [0.0, 1.0])
├── README.md                   # This file
└── EmailTriage/
    ├── __init__.py             # Package exports
    ├── client.py               # EnvClient implementation
    ├── models.py               # Pydantic Action/Observation/State models
    ├── openenv.yaml            # Inner OpenEnv manifest
    ├── pyproject.toml          # Package configuration
    ├── README.md               # HF Space README
    └── server/
        ├── app.py              # FastAPI server
        ├── EmailTriage_environment.py  # Core environment + 3 task graders
        └── Dockerfile          # Server container definition
```
- Real-world task simulation (email triage)
- Full OpenEnv spec: typed models, step()/reset()/state(), openenv.yaml
- 3 benchmark tasks (easy → medium → hard) with continuous grading
- 3 deterministic submission validator tasks with matching graders (scores 0.0-1.0)
- Meaningful reward function with partial progress signals
- Baseline inference script with reproducible scores
- Dockerfile builds
- README with environment description, action/observation spaces, setup instructions
Use this command to reproduce baseline scores:

```bash
python inference.py
```

Environment used for reproducible runs:

```bash
API_BASE_URL=https://router.huggingface.co/v1
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
HF_TOKEN=<your-token>
ENV_BASE_URL=http://localhost:8000  # or your deployed Space URL
```
Recorded scores from a successful local containerized run (emailtriage-env:local-check):
| Task | Score | Notes |
|---|---|---|
| easy | 0.72 | 6 steps |
| medium | 0.60 | 10 steps |
| hard | 0.62 | 12 steps |
Aggregate baseline (mean across tasks): 0.65
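The aggregate is the unweighted mean of the three recorded task scores:

```python
# Recorded baseline scores from the table above.
scores = {"easy": 0.72, "medium": 0.60, "hard": 0.62}

aggregate = sum(scores.values()) / len(scores)
print(round(aggregate, 2))  # → 0.65
```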
The inference logger prints scores in [0.00, 1.00] and emits strict `[START]`, `[STEP]`, and `[END]` stdout lines for evaluator parsing.
| Variable | Required | Default | Description |
|---|---|---|---|
| `API_BASE_URL` | No | `https://router.huggingface.co/v1` | LLM API endpoint |
| `MODEL_NAME` | No | `Qwen/Qwen2.5-72B-Instruct` | Model identifier |
| `HF_TOKEN` | Yes | — | Hugging Face API key |
| `LOCAL_IMAGE_NAME` | No | `emailtriage-env:latest` | Docker image name |