AssayBench

A benchmark for evaluating machine learning models on phenotypic screen prediction.

1. Installation

Install directly from the repository:

pip install git+ssh://git@github.com/Genentech/AssayBench.git

or clone the repo and install in editable mode:

git clone git@github.com:Genentech/AssayBench.git && cd AssayBench
pip install -e .

Both commands install the assaybench package, which provides:

  • AssayBenchDataset — loads screens and splits from HuggingFace (Genentech/assaybench)
  • RankingMetrics — computes ranking metrics (adjusted nDCG, precision, FDR, etc.)
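
A quick way to confirm the installation succeeded is to import these two entry points (a sanity check only; the import paths are the same ones used in the usage example below):

# Sanity check: both entry points should import without errors
from assaybench import AssayBenchDataset
from assaybench.benchmark.metrics import RankingMetrics

print(AssayBenchDataset, RankingMetrics)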

With uv

You can add it to your project's pyproject.toml dependencies with:

dependencies = [
    "assaybench @ git+ssh://git@github.com/Genentech/AssayBench.git",
]
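
Alternatively, assuming a standard uv-managed project, the equivalent command-line form should be:

uv add "assaybench @ git+ssh://git@github.com/Genentech/AssayBench.git"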

2. Usage

Loading data and scoring a model

Each example in the dataset contains a question prompt describing a CRISPR screen, along with ground-truth relevance_genes and relevance_scores. To evaluate a model, pass its predicted gene ranking (a plain list[str]) together with the ground-truth genes and scores to RankingMetrics.evaluate():

from assaybench import AssayBenchDataset
from assaybench.benchmark.metrics import RankingMetrics

# Load the dataset with year-based splits
ds = AssayBenchDataset(
    dataset_name="biogrid",
    split_type="year",
    fold=0,
    novel_dataset_name="LaTest",
)
train, val, test, latest = ds.get_train_test_split()

# Define your model — any function that returns a ranked list of gene names
def my_model(prompt: str) -> list[str]:
    return ["BRCA1", "TP53", "MYC", ...]  # top predicted genes

# Score predictions
metrics = RankingMetrics(k_values=[10, 100])

for example in val:
    predicted_genes = my_model(example["question"])
    scores = metrics.evaluate(
        predicted_genes=predicted_genes,
        ground_truth_genes=example["relevance_genes"],
        relevance_scores=example["relevance_scores"],
    )
    print(f"Screen {example['dataset_name']}: AnDCG@100 = {scores['adjusted_ndcg@100']:.4f}")

See examples/load_data.ipynb for a complete walkthrough.

Dataset fields

Each screen returned by get_train_test_split() is a dictionary with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| question | str | The prompt describing the screen and ranking task |
| relevance_genes | list[str] | All genes in the screen library |
| relevance_scores | list[float] | Thresholded percentile scores for each gene (higher = more relevant) |
| hit | list[bool] | Whether each gene is a hit in the screen |
| dataset_name | str | Screen identifier |
| screen_ids | list[int] | BioGRID screen ID(s) (>1 for merged duplicate screens) |
| phenotype | str | Full phenotype description |
| cleaned_phenotype | str | Coarse phenotype category (e.g. "Fitness / Proliferation / Viability") |
| condition_clause | str | Experimental condition (e.g. drug treatment, dose) |
| cell_type | str | Cell type used in the screen |
| cell_line | str | Cell line name |
| screen_type | str | Selection type (e.g. "Positive Selection", "Negative Selection") |
| library_methodology | str | Screen methodology (e.g. "Knockout", "Activation") |
| screen_rationale | str | Scientific rationale for the screen |
| screen_category | str | Screen directionality (e.g. "unidirectional", "bidirectional") |
| num_genes | int | Number of genes in the screen library |
| author | str | Publication author and year (e.g. "Wang T (2014)") |
| source_id | str | PubMed ID of the source publication |
| split | str | Data split assignment: train, validation, test, or novel_dataset |
| answer | str | Top 10 genes by relevance score (comma-separated, for reference) |
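
As a quick illustration of these fields, you can peek at a single screen; this continues the val split from the usage example above and relies only on iteration and dict-style access, both shown there:

# Print a few metadata fields for one screen, using the field names from the table
example = next(iter(val))
for field in ("dataset_name", "phenotype", "cell_line", "screen_type", "num_genes", "split"):
    print(f"{field}: {example[field]}")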

Metrics

RankingMetrics.evaluate() returns a dictionary of scores. The primary metrics (computed at each k in k_values) are:

| Metric | Description |
| --- | --- |
| ndcg@k | Normalized Discounted Cumulative Gain: measures ranking quality using graded relevance scores |
| adjusted_ndcg@k | nDCG adjusted for chance performance; the main benchmark metric (AnDCG) |
| precision@k | Fraction of top-k predictions that are hits |
| normalized_precision@k | Precision normalized by the number of true positives (NPrecision) |
| fdr@k | Fraction of top-k predictions that are non-hits (False Discovery Rate) |
| normalized_fdr@k | FDR normalized by the number of true negatives |
| recall@k | Fraction of true hits recovered in the top-k predictions |
| auroc | Area Under the ROC Curve over the full ranked list |
| mrr | Mean Reciprocal Rank: reciprocal of the rank of the first hit |
| hallucination_rate | Fraction of predicted genes not found in the screen library |
| hit_scaled_ndcg@k | nDCG computed using binary hit labels instead of graded relevance |
| hit_scaled_adjusted_ndcg@k | Adjusted nDCG using binary hit labels |

By default all metric groups are computed. Pass metric_groups={"adjusted_ndcg", "precision"} to restrict to a subset.
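
For example, the following sketch restricts scoring to two groups; it assumes metric_groups is accepted by the RankingMetrics constructor, which is how the note above reads:

# Only compute the adjusted nDCG and precision metric families
metrics = RankingMetrics(
    k_values=[10, 100],
    metric_groups={"adjusted_ndcg", "precision"},
)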

Custom prompts

By default, AssayBenchDataset formats each screen's question field using a built-in prompt template (see src/assaybench/data/prompts/objective_prompts.yaml). You can override it by passing a prompt_template string to the constructor:

my_template = """
You are a genetics expert. Given the following CRISPR screen:
- Cell line: {cell_line} ({cell_type})
- Library: {library_type} ({library_methodology})
- Phenotype: {phenotype}

Rank the top 100 genes most likely to be hits.
Format: GENE1, GENE2, ..., GENE100
"""

ds = AssayBenchDataset(
    dataset_name="biogrid",
    split_type="year",
    fold=0,
    prompt_template=my_template,
)

The template is formatted with Python's str.format() using each screen's metadata fields; a short rendering sketch follows the table below. Available placeholders:

| Placeholder | Description |
| --- | --- |
| {cell_line} | Cell line name |
| {cell_type} | Cell type description |
| {library_type} | Library type (e.g. "CRISPRn") |
| {library_methodology} | Methodology (e.g. "Knockout", "Activation") |
| {experimental_setup} | Experimental design (e.g. "Drug Exposure") |
| {duration} | Screen duration (e.g. "12 Days") |
| {condition_clause} | Condition details (e.g. " under Etoposide treatment (130.0 nM)") |
| {phenotype} | Phenotype description |
| {significance_criteria} | Statistical threshold for hit calling |
| {ranking_rationale} | What makes a gene rank highly |
| {notes} | Additional screen notes |
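
To illustrate the mechanism only: screen_metadata below is a hypothetical dict with made-up values, whereas inside the library the values come from each screen's own metadata.

# str.format() fills the {placeholders} in my_template (defined above)
screen_metadata = {
    "cell_line": "K562",
    "cell_type": "leukemia cell line",
    "library_type": "CRISPRn",
    "library_methodology": "Knockout",
    "phenotype": "cell fitness under drug treatment",
}
prompt = my_template.format(**screen_metadata)
print(prompt)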

3. Paper reproduction

All figure scripts live in figures/ and read from a results cache built from the prediction files in benchmarking/predictions/.

Step 1: Build the results cache

cd figures
python generate_results_cache.py

This scores all prediction files against the ground truth and saves the results to figures/journal_figures_cache/results_cache.pkl.
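
Given the .pkl extension, the cache is presumably a standard pickle file, so it can be inspected afterwards with something like the sketch below; the internal structure of the cached object is not documented here, so treat this as exploratory:

import pickle

with open("figures/journal_figures_cache/results_cache.pkl", "rb") as f:
    results_cache = pickle.load(f)

print(type(results_cache))  # inspect the top-level structure before digging in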

To rescore only specific models (faster):

python generate_results_cache.py --model "gemini-3-pro" --model "gpt-5.4"

Step 2: Generate figures and tables

python plot0_proportions.py
python plot1_selected_methods.py
python plot2_phenotype_bar_plot_year.py
python plot3_duplicate_transfer_vs_model.py
python plot4_memorization_analysis.py
python plot5_scaling_laws.py
python plot6_bias.py

Outputs (PNG, PDF, LaTeX tables) are saved to figures/journal_figures/.

| Script | Description |
| --- | --- |
| plot0_proportions.py | Dataset statistics table and phenotype composition pie charts |
| plot1_selected_methods.py | Main benchmark bar plot + LaTeX tables for selected methods |
| plot2_phenotype_bar_plot_year.py | Per-phenotype performance bar plot (year split) |
| plot3_duplicate_transfer_vs_model.py | Duplicate-screen cross-transfer vs model performance |
| plot4_memorization_analysis.py | Regression of performance on publication year, phenotype, and citations |
| plot5_scaling_laws.py | Qwen3.5 scaling laws (AnDCG@100 vs model size) |
| plot6_bias.py | Gene-level prediction bias analysis across models |
