The experiments for the paper Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al., 2024).
-
Implementation of different safety guardrails for a multi-armed bandit agent:
- Cheating guardrail (using ground truth)
- Posterior predictive guardrail
- IID guardrail (Prop 3.4)
- Non-IID guardrail (Prop 4.6)
- New Non-IID guardrail with configurable harm estimation aggregation methods
-
Configurable exploding bandit environment
-
Plotting utilities
-
Experimentation framework with:
- Hyperparameter sweeps using Weights & Biases
- Custom metrics for reward/safety trade-offs
conda env create -f environment.yaml- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Unix/macOS
.\venv\Scripts\activate # On Windows- Install PyTorch:
# For CUDA 11.8
pip install torch==2.3.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# For CPU only
pip install torch==2.3.1 torchvision torchaudio
# For MacOS with MPS (Apple Silicon)
pip install torch==2.2.2 torchvision torchaudio- Install the remaining requirements:
pip install -r requirements.txtThe code supports CPU, CUDA, and Apple Silicon (MPS) devices. The device can be configured in configs/config.yaml or via arguments:
device: "cpu"- CPUdevice: "cuda"- NVIDIA GPUdevice: "mps"- Apple Silicon GPUdevice: "auto"- Automatically select best available device
Run non-IID experiments with default parameters:
# Generates the results in Figure 1
python run_non_iid_experiment.pyCreate plots from the results:
python make_non_iid_plots.pyTo also include the experimental new non-IID guardrail in the comparison (not included in the main paper's figures):
python run_non_iid_experiment.py --include_new_non_iidCreate plots from the results:
# Generates the results in Figure 2
python run_tightness_experiment.pyCreate plots from the results:
python make_tightness_plots.pyFor the new non-IID guardrails (implemented in the NewNonIidGuardrail class in agents/guardrails.py):
python run_new_non_iid.pyThis script uses parameters defined in configs/config.yaml.
Key Parameters in configs/config.yaml:
guardrail_new_non_iid.mean_type: Specifies the type of mean used for aggregating harm estimates (options: "arithmetic", "geometric", "harmonic").guardrail_new_non_iid.posterior_increases: A boolean determining whether the aggregation (for the harm estimate) is over posterior probabilities increases. What we mean by "increases" is the pointwise absolute value of the difference between the posterior probabilities and the prior (i.e. from the previous step) probabilities.guardrail_new_non_iid.softmax_temperature: The temperature parameter for the softmax function, controlling the distribution's sharpness.guardrail_new_non_iid.power_mean_exponent: The exponent applied in the power mean calculation.guardrail_new_non_iid.quantile: The quantile used in quantile-based harm estimation (e.g., 0.8 for the 80th percentile).guardrail_new_non_iid.harm_estimates_weights: Weights assigned to different harm estimation methods (max, mean, quantile).
An example of how these parameters can be varied for hyperparameter tuning is shown in configs/sweep.yaml (see the hyperparameter sweep section below).
Harm Estimation Methods:
The guardrail estimates harm using a combination of the following methods:
- Maximum Harm Estimate: Uses the highest harm probability from the posterior.
- Quantile Harm Estimate: Uses the specified quantile (e.g.,
0.8) of the harm probabilities. - Mean Harm Estimate: Uses arithmetic, geometric, or harmonic means based on the
mean_typeparameter, with a given power mean exponent (power_mean_exponent, defaulting to1.0) and/or softmax temperature (softmax_temperature, defaulting to1.0).
The weights for these methods are defined in harm_estimates_weights. These weights can be based on the posteriors of the theories or on the posterior increases (depending on posterior_increases), favoring theories with rapidly increasing posteriors.
The overall harm estimate is a weighted combination of these individual estimates. The goal is to find a harm bound that mitigates risky behaviors (high reward/high death rate) without resulting in overly conservative actions (low reward/low death rate).
Custom Metric for Evaluation:
A custom metric can be used to roughly approximate the performance of different NewNonIidGuardrail configurations:
custom_metric = reward_mean / exp(2 * death_rate)The point of this metric is to (roughly) prioritize higher rewards at the expense of increased death rates (trying to strike a balanced reward/safety trade-off). It is normalized against the score of a "cheating" guardrail, which should, by definition, have the highest possible metric value. This metric is useful for guiding hyperparameter search and evaluating whether various NewNonIidGuardrail configurations achieve a good balance between safety and performance.
Hyperparameter sweeps use Weights & Biases (W&B) to explore a range of guardrail configurations. You can view an example sweep instance here.
Launch a W&B sweep to explore different NewNonIidGuardrail configurations:
wandb sweep configs/sweep.yaml
wandb agent <sweep_id>Key parameters can be configured in configs/config.yaml:
- Experiment settings (episodes, thresholds, etc.)
experiment.n_episodes: Number of episodes to runexperiment.episode_length: Length of each episodeexperiment.n_actors: Number of parallel actors to use (if null, calculated automatically)experiment.max_actor_fraction: Fraction of CPU cores to use as maximum actors (default: 0.85)
- Environment parameters (dimensions, reward variance, etc.)
- Guardrail parameters (mean types, weights, etc.)
- Plotting parameters (sizes, fonts, styles, etc.)
The parallel execution can be configured in two ways:
- Set
experiment.n_actorsto a specific number to use exactly that many parallel actors - Set
experiment.max_actor_fractionto control what fraction of CPU cores to use (e.g., 0.5 for 50% of cores) - Leave both as default to automatically use 85% of available CPU cores
Run the test suite:
pytest.
├── agents/ # Agent and guardrail implementations
├── configs/ # Configuration files
├── envs/ # Environment implementations
├── utils/ # Helper functions and plotting
├── run_*.py # Main experiment scripts
├── make_*_plots.py # Plotting scripts
└── test_*.py # Test files
@article{bengio2024bayesian,
title={Can a Bayesian Oracle Prevent Harm from an Agent?},
author={Bengio, Yoshua and Cohen, Michael K and Malkin, Nikolay and MacDermott, Matthew and Fornasiere, Damiano and Greiner, Pietro and Kaddar, Younesse},
journal={arXiv preprint arXiv:2408.05284},
year={2024}
}