Commit cb433b6

Merge pull request #11 from scaleapi/lhaw/swebench-pro-instructions

Update swebench-pro instructions

1 parent: 82324c3

6 files changed: 37 additions & 32 deletions

.gitignore

Lines changed: 1 addition & 4 deletions
@@ -31,9 +31,6 @@ initial-data/
 # Generated underspecified variants
 task_pairs_agentcompany/underspecified/
 
-# SWEBench repo + evaluation + user simulator code
-swebenchpro/SWE-bench_Pro-os
-
 # MCP Atlas repo (clone separately, see experiments/mcpatlas/README.md)
 experiments/mcpatlas/mcp-atlas/
 
@@ -56,4 +53,4 @@ experiments/mcpatlas/reports/
 
 hf_variants/
 __pycache__
-.env
+.env

.gitmodules

Lines changed: 2 additions & 2 deletions
@@ -1,3 +1,3 @@
-[submodule "research/lhaw/swebenchpro/SWE-bench_Pro-os"]
+[submodule "swebenchpro/SWE-bench_Pro-os"]
 	path = swebenchpro/SWE-bench_Pro-os
-	url = https://github.com/scaleapi/SWE-bench_Pro-os.git
+	url = https://github.com/scaleapi/SWE-bench_Pro-os.git

experiments/swebench/README.md

Lines changed: 13 additions & 18 deletions
@@ -4,36 +4,28 @@ End-to-end pipeline for generating, running, evaluating, and classifying undersp
 
 ## Setup
 
-```bash
-cd lhaw
-
-# Python 3.11+ environment (conda or venv)
-conda create -n lhaw311 python=3.11 -y && conda activate lhaw311
-# OR: python3.11 -m venv .venv311 && source .venv311/bin/activate
+Follow the environment setup in the [root README](../../README.md#setup), then activate and install SWE-bench-specific dependencies:
 
-python -m pip install -r requirements.txt
+```bash
+# In LHAW root
+source .venv/bin/activate
 
 # SWE-bench Pro + SWE-agent (submodules)
+git submodule sync
 git submodule update --init swebenchpro/SWE-bench_Pro-os
 cd swebenchpro/SWE-bench_Pro-os && git submodule update --init SWE-agent
 
 # Switch SWE-agent to ask_user fork branch
 cd SWE-agent
-git remote add fork https://github.com/yash-scaleai/SWE-agent.git
-git fetch fork yash/ask-user-host-interception
-git checkout -b yash/ask-user-host-interception fork/yash/ask-user-host-interception
+git fetch origin
+git checkout -b lhaw/ask-user-tool origin/lhaw/ask-user-tool
 cd ../../..
 
-# Install SWE-agent (requires Python >=3.11)
-# Use python -m pip to ensure the conda env's pip is used
-python -m pip install -e swebenchpro/SWE-bench_Pro-os/SWE-agent
+# Install SWE-agent
+uv pip install -e swebenchpro/SWE-bench_Pro-os/SWE-agent
 
 # Modal auth (for container deployment)
 modal token new
-
-# Environment variables (add to .env — see .env.example)
-export LLM_API_KEY="your-api-key"
-# export LLM_BASE_URL="https://your-litellm-proxy-url"  # optional, for LiteLLM proxy
 ```
 
 Source `.env` before every session:
@@ -49,7 +41,7 @@ Run baseline SWE-agent on original tasks and export `.traj` files for grounded s
 Mirrors TAC step 1 (`tac.sh` + `export_tac_golden_trajectories.py`).
 Results are written to `baseline_N/` directories (not `exp_N/`) so they coexist with Stage 3's underspec trials in the same directory — no copying needed.
 
-See `run_swebench_example.sh` step 1 for the full commands.
+See `bash run_swebench_example.sh` step 1 for the full commands.
 
 **Produces:**
 - `baseline_1/`, `baseline_2/`, `baseline_3/` — baseline trial results (preds.json, trajectories)
@@ -95,6 +87,7 @@ python task_completion_swebench.py --run \
 ### Stage 4: Evaluate predictions
 
 Runs SWE-bench Pro Docker evaluation on all patches. Handles both variant (`exp_N/`) and baseline (`baseline_N/`) predictions.
+The source dataset file `swe_bench_pro_full.csv` is downloaded automatically on first evaluation if it is missing.
 
 ```bash
 # Evaluate only (no classification)
@@ -109,6 +102,8 @@ python scripts/process_swebench_underspec.py \
   --run-eval --dockerhub-username jefzda --judge
 ```
 
+If `--eval-only` fails for any trial or baseline, the command now exits non-zero so downstream summary steps do not continue with missing eval outputs.
+
 **Produces:** `exp_N/eval_results/` and `baseline_N/eval_results/` directories with per-instance `*_output.json` files.
 
 ### Stage 5: Classify variants

run_swebench_example.sh

Lines changed: 1 addition & 6 deletions
@@ -5,12 +5,7 @@
 #
 # Mirrors run_tac_example.sh for the SWE-Bench Pro pipeline.
 #
-# Prerequisites:
-# - conda activate lhaw311 (Python 3.11+)
-# - python -m pip install -e swebenchpro/SWE-bench_Pro-os/SWE-agent
-# - modal token new (Modal auth for container orchestration)
-# - LLM_API_KEY and LLM_BASE_URL set (or OPENAI_API_KEY fallback)
-# - source .env
+# Prerequisites: Refer to the setup instructions in experiments/swebench/README.md
 #
 # Task selection:
 # BASELINE_MODELS controls which models run baselines. The paper required

scripts/process_swebench_underspec.py

Lines changed: 19 additions & 2 deletions
@@ -377,6 +377,7 @@ def run_evaluation(
     redo: bool = False,
 ) -> bool:
     """Run SWE-bench Pro evaluation on a trial's predictions."""
+    ensure_swebench_csv()
     preds_path = exp_dir / f"exp_{trial_num}" / "preds.json"
     eval_output_dir = exp_dir / f"exp_{trial_num}" / "eval_results"
 
@@ -456,6 +457,7 @@ def run_baseline_evaluation(
     Baselines use original instance IDs (no variant suffix stripping needed).
     Output goes to baseline_N/eval_results/ with prefix 'baselineN'.
     """
+    ensure_swebench_csv()
     baseline_dir = exp_dir / f"baseline_{baseline_num}"
     preds_path = baseline_dir / "preds.json"
     eval_output_dir = baseline_dir / "eval_results"
@@ -1065,6 +1067,7 @@ def main():
         if not args.dockerhub_username:
             print("Error: --dockerhub-username required with --run-eval")
             sys.exit(1)
+        ensure_swebench_csv()
         # Detect num trials from exp_N dirs (only match exp_1, exp_2, etc.)
         num_trials = len(
             [
@@ -1080,15 +1083,29 @@
             print(f"Error: No exp_* or baseline_* directories found in {exp_dir}")
             sys.exit(1)
 
+        eval_failures = []
         if num_trials > 0:
             print(f"Running variant evaluation ({num_trials} trials)")
             for trial_num in range(1, num_trials + 1):
-                run_evaluation(exp_dir, trial_num, args.dockerhub_username, args.num_workers)
+                if not run_evaluation(
+                    exp_dir, trial_num, args.dockerhub_username, args.num_workers
+                ):
+                    eval_failures.append(f"exp_{trial_num}")
 
         if baseline_nums:
             print(f"Running baseline evaluation ({len(baseline_nums)} baselines)")
             for bnum in baseline_nums:
-                run_baseline_evaluation(exp_dir, bnum, args.dockerhub_username, args.num_workers)
+                if not run_baseline_evaluation(
+                    exp_dir, bnum, args.dockerhub_username, args.num_workers
+                ):
+                    eval_failures.append(f"baseline_{bnum}")
+
+        if eval_failures:
+            print(
+                f"\nEvaluation failed for: {', '.join(eval_failures)}",
+                file=sys.stderr,
+            )
+            sys.exit(1)
 
         print("\nEvaluation complete.")
         sys.exit(0)
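The `main()` hunk above applies a collect-then-fail pattern: every trial and baseline still runs, failures are accumulated, and a single non-zero exit is raised at the end. Stripped of the script's specifics, the control flow can be sketched as follows (names here are illustrative, not from the repo):

```python
import sys
from typing import Callable, Iterable

def run_all(labels: Iterable[str], run_one: Callable[[str], bool]) -> list[str]:
    """Run every unit even when some fail; return the labels that failed."""
    return [label for label in labels if not run_one(label)]

# Simulate three trials where only exp_2 fails
failures = run_all(["exp_1", "exp_2", "exp_3"], lambda label: label != "exp_2")
if failures:
    print(f"Evaluation failed for: {', '.join(failures)}", file=sys.stderr)
    # sys.exit(1)  # in the script, a non-zero exit gates downstream summary steps
else:
    print("Evaluation complete.")
```

Returning the full failure list rather than exiting on the first failure preserves partial eval results while still letting callers (such as `run_swebench_example.sh`) stop on the exit code.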

swebenchpro/SWE-bench_Pro-os

Submodule SWE-bench_Pro-os added at 66a831e
