ASTRA is a comprehensive red-teaming system that consists of three major components working together to perform autonomous vulnerability discovery and assessment of AI software assistants.
ASTRA operates through a three-stage pipeline where each component feeds into the next to create a comprehensive red-teaming workflow:
- Domain Modeling generates structured knowledge graphs from target domains, outputting hierarchical representations of vulnerabilities and attack vectors
- Prompt Generation consumes these knowledge graphs to synthesize diverse jailbreaking prompts through multi-agent collaboration, producing contextually rich attack scenarios
- Online Exploration takes the generated prompts as input and performs real-time adaptive probing against target AI systems, dynamically adjusting strategies based on system responses
Each stage builds upon the previous component's output, enabling systematic and comprehensive security evaluation of AI assistants.
📁 Location: enumerator/
This component takes a target domain as input and outputs a structured knowledge graph that captures domain-specific vulnerabilities and attack vectors.
- Data Structure: The knowledge graph structure is defined in enumerator/tree_utils.py, which provides the foundational tree-based representation for organizing domain knowledge hierarchically.
- LLM-based Enumerator: The core enumeration logic is implemented in enumerator/enumerator.py, which uses large language models to systematically generate comprehensive domain knowledge graphs.
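To make the tree-based representation concrete, here is a minimal sketch of a hierarchical node type with leaf enumeration. The `KGNode` class, its methods, and the labels are hypothetical and chosen for illustration; the actual structure in enumerator/tree_utils.py may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class KGNode:
    """One node of a hierarchical knowledge tree (hypothetical layout)."""

    name: str
    children: list[KGNode] = field(default_factory=list)

    def add(self, name: str) -> KGNode:
        """Attach a child node and return it so calls can be chained."""
        child = KGNode(name)
        self.children.append(child)
        return child

    def leaves(self) -> list[KGNode]:
        """Return all leaf nodes under this node, left to right."""
        if not self.children:
            return [self]
        found: list[KGNode] = []
        for child in self.children:
            found.extend(child.leaves())
        return found


# Toy tree for a single dimension; the labels are illustrative only.
root = KGNode("coding context")
web = root.add("web backend")
web.add("request parsing")
web.add("session handling")
root.add("command-line tool")

leaf_names = [leaf.name for leaf in root.leaves()]
print(leaf_names)
```

Leaves are the most specific entries in a dimension, which is why a traversal that collects them is the natural building block for later sampling.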
The repository includes pre-built knowledge graphs for two domains: Secure Code Generation and Security Event Guidance. Examples include enumerator/enumerate_pl_feature.py which enumerates programming language features, enumerator/enumerate_context.py for coding contexts, and enumerator/enumerate_mal_tactics.py for tactics as defined in MITRE ATT&CK.
- Knowledge graph enumeration is a one-time setup process per domain
- Pre-built knowledge graphs are provided in the kg/ directory
- Users typically don't need to re-run the enumerator unless extending to new domains
- Interested developers can follow the existing examples to extend ASTRA to additional domains
📁 Location: agent/
This component leverages the structured knowledge graphs to systematically generate diverse and sophisticated jailbreaking prompts through multi-agent collaboration.
In the previous stage, the input domain is decomposed into several orthogonal dimensions, each represented as a hierarchical tree structure. The agent starts by sampling one leaf node from each dimension to form a multi-dimensional attack scenario. It then composes a concrete attack prompt based on the sampled scenario through multi-agent collaboration.
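The sampling step described above can be sketched as follows. This is an illustration under stated assumptions: the `DIMENSIONS` mapping, its dimension names, and the placeholder leaf labels are hypothetical, and uniform sampling stands in for whatever strategy the agent actually uses.

```python
import random

# Hypothetical flattened view: each dimension maps to the leaf labels of its tree.
DIMENSIONS = {
    "pl_feature": ["feature-leaf-1", "feature-leaf-2", "feature-leaf-3"],
    "context": ["context-leaf-1", "context-leaf-2"],
    "task": ["task-leaf-1", "task-leaf-2"],
}


def sample_scenario(dimensions, rng):
    """Pick one leaf per dimension to form a multi-dimensional scenario."""
    return {dim: rng.choice(leaves) for dim, leaves in dimensions.items()}


rng = random.Random(0)  # fixed seed so repeated runs give the same scenario
scenario = sample_scenario(DIMENSIONS, rng)
print(scenario)
```

Because the dimensions are treated as orthogonal, each one is sampled independently, and the scenario is simply the combination of the chosen leaves.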
At a high level, the prompt generation process consists of the following steps:
- A composer agent generates a draft prompt based on the sampled scenario.
- The draft prompt is sent to a textual reviewer agent that ensures the generated prompt is benign, clear, and realistic.
- A set of blue-team systems is used to evaluate the generated prompts. There are three blue-team systems:
  - A set of coder models that generate code snippets based on the generated prompts. A successful prompt should pass the built-in intention check of the coder model.
  - (For secure code generation) Amazon CodeGuru static analyzer, which provides feedback on whether a generated code snippet is vulnerable or not. A successful prompt should induce a vulnerable code snippet.
  - (For security event guidance) A helpfulness checker that evaluates whether a target coder model completes the task as expected. A successful prompt should induce the target coder model to generate helpful responses for malicious purposes.
- The sampling algorithm will consider the past generation process and identify the promising attributes that lead to successful prompts. It will then adjust the sampling strategy to target similar attributes in future generations, making the prompt generation process self-evolving.
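One simple way to picture the self-evolving sampling in the last step is to weight each attribute by its smoothed past success rate. The sketch below is an assumption for illustration, not ASTRA's actual algorithm: the `AdaptiveSampler` class and its method names are hypothetical, and the Laplace-smoothed success rate is a stand-in technique.

```python
import random
from collections import defaultdict


class AdaptiveSampler:
    """Weights each attribute by its smoothed past success rate (illustrative)."""

    def __init__(self, dimensions, seed=0):
        self.dimensions = dimensions
        self.rng = random.Random(seed)
        self.successes = defaultdict(int)
        self.trials = defaultdict(int)

    def weight(self, dim, leaf):
        # Laplace smoothing: an untried attribute starts at 0.5 instead of 0,
        # so it still has a chance of being explored.
        key = (dim, leaf)
        return (self.successes[key] + 1) / (self.trials[key] + 2)

    def sample(self):
        """Draw one leaf per dimension, favoring historically successful ones."""
        scenario = {}
        for dim, leaves in self.dimensions.items():
            weights = [self.weight(dim, leaf) for leaf in leaves]
            scenario[dim] = self.rng.choices(leaves, weights=weights, k=1)[0]
        return scenario

    def update(self, scenario, success):
        """Record the outcome of one generated prompt for every attribute used."""
        for dim, leaf in scenario.items():
            self.trials[(dim, leaf)] += 1
            if success:
                self.successes[(dim, leaf)] += 1


sampler = AdaptiveSampler({"context": ["a", "b"]})
for _ in range(8):
    sampler.update({"context": "a"}, success=True)
    sampler.update({"context": "b"}, success=False)
print(sampler.weight("context", "a"), sampler.weight("context", "b"))
```

After the eight simulated outcomes, attribute "a" carries a much larger weight than "b", so later calls to `sample()` pick it far more often while still occasionally revisiting "b".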
- Main Entry Points:
  - main_sec_code.py: Orchestrates prompt generation for secure code scenarios
  - main_sec_event.py: Handles security event-based prompt generation
- Multi-Agent Architecture:
  - sec_code_composer/: Contains agents specialized in composing code-related attack prompts
  - sec_event_composer/: Houses agents for security event scenario generation
  - cgr_agent/: Uses Amazon CodeGuru static analyzer to provide feedback on whether a generated code snippet is vulnerable or not
The LLMs used for generating prompts and for local blue-teams are specified in the resources/coder-config.yaml file. Specifically, we use qwen3-coder as the composer and helpfulness checker, and phi4m as the coder model for generating code snippets.
The user needs to specify their own instances of those models in the resources/coder-config.yaml file. After that, use the following commands to generate attack prompts:
python3 agent/main_sec_code.py --fout <output_file-agent-code.jsonl> --log <path to log_file>
python3 agent/main_sec_event.py --fout <output_file-agent-sec.jsonl> --log <path to log_file>

Then use the following command to export the generated prompts to a format that can be used by the online exploration component:
## Use the following commands to export the prompts
python3 agent/export_syn_prompt.py --fin <path to the output_file-agent-code (or -sec).jsonl> --fout <output_file-exported.jsonl>

📁 Location: online/
This component performs real-time adaptive red-teaming by dynamically probing target AI systems and adjusting attack strategies based on responses. It takes as input a large pool of generated prompts and composes multi-round interactions with the target system to identify its unique vulnerabilities. The adaptive exploration capabilities of the online system are two-fold:
- Spatial Exploration: It samples prompts based on the past behavior of the target system, prioritizing promising attributes that are more likely to induce vulnerabilities.
- Temporal Exploration: It reasons about the target system's responses over multiple turns, identifying weak links in its reasoning traces and dynamically adjusting prompts to exploit discovered vulnerabilities.
The online system leverages model-based judges to evaluate the target system's responses. For secure code generation, it uses a judge model that mimics the behavior of the static analyzer and evaluates whether a generated code snippet is vulnerable or not. For security event guidance, it uses a judge model that evaluates whether the target system's response is indeed helpful for the intended malicious purposes.
- Main Runtime: main.py orchestrates the online exploration sessions
- Runtime Engine: rt/ directory contains the core adaptive exploration logic
- Initiates exploration sessions with configurable parameters
- Performs iterative probing with N_PROBING initial attempts
- Conducts multi-turn conversations (N_TURN) to explore temporal vulnerabilities
- Logs detailed interaction traces for analysis
- Dynamically adjusts strategies based on success/failure patterns
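The control flow described by these points can be sketched as a probing phase followed by multi-turn sessions. This is a simplified illustration under assumptions: `explore`, the callables `target`, `judge`, and `rewrite`, the "target" role name, and the stub behavior are all hypothetical and do not reflect the code in rt/.

```python
def explore(prompts, target, judge, rewrite, n_probing, n_session, n_turn):
    """Run single-turn probes, then multi-turn sessions, and return a trace log."""
    log = []

    # Phase 1: single-turn probes to observe the target's baseline behavior.
    for prompt in prompts[:n_probing]:
        reply = target([{"role": "attacker", "content": prompt}])
        log.append({"phase": "probe", "prompt": prompt,
                    "turns": 1, "success": judge(reply)})

    # Phase 2: multi-turn sessions, each capped at n_turn target replies.
    for prompt in prompts[n_probing:n_probing + n_session]:
        history = [{"role": "attacker", "content": prompt}]
        success, turns = False, 0
        while turns < n_turn and not success:
            reply = target(history)
            history.append({"role": "target", "content": reply})
            turns += 1
            success = judge(reply)
            if not success and turns < n_turn:
                # Ask the rewriter for a follow-up based on the conversation so far.
                history.append({"role": "attacker", "content": rewrite(history)})
        log.append({"phase": "session", "prompt": prompt,
                    "turns": turns, "success": success})
    return log


# Stubs standing in for the real target, judge, and rewriter models.
def stub_target(history):
    return "comply" if len(history) >= 3 else "refuse"


def stub_judge(reply):
    return reply == "comply"


def stub_rewrite(history):
    return "follow-up"


log = explore(["p1", "p2", "p3"], stub_target, stub_judge, stub_rewrite,
              n_probing=2, n_session=1, n_turn=3)
for entry in log:
    print(entry)
```

Keeping the target, judge, and rewriter as plain callables makes the loop easy to exercise with stubs before pointing it at hosted models.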
First, host the ASTRA judge models, which evaluate whether a generated code snippet is vulnerable or not. Host them using the following command:
vllm serve microsoft/Phi-4-mini-instruct --dtype auto --api-key <YOUR API KEY> --swap_space 32 --max-model-len 8192 --enable-lora --lora-modules PurCL/astra-judge-121k=PurCL/astra-judge-121k PurCL/astra-judge-10k=PurCL/astra-judge-10k

The models can be hosted on an A6000 GPU with 48GB of memory.
Specify the hosted judge model at resources/online-judge.yaml.
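As a quick sanity check that the hosted judge is reachable, a request can be built against vLLM's OpenAI-compatible chat completions endpoint, selecting the LoRA adapter by the name given to --lora-modules. The sketch below only constructs the request. The `build_judge_request` helper, the localhost address and port, and the idea of sending the raw snippet as a single user message are assumptions; the prompt format ASTRA actually sends to the judge is not shown here.

```python
import json
import urllib.request


def build_judge_request(base_url, api_key, code_snippet,
                        model="PurCL/astra-judge-121k"):
    """Build (but do not send) a chat completion request for the hosted judge."""
    payload = {
        "model": model,  # LoRA adapter name registered via --lora-modules
        # Assumption: the snippet is sent as-is; the real judge prompt may differ.
        "messages": [{"role": "user", "content": code_snippet}],
        "max_tokens": 256,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_judge_request("http://localhost:8000", "YOUR_API_KEY", "print('hello')")
print(req.full_url)
# To send it: urllib.request.urlopen(req), then parse the JSON body and read
# choices[0]["message"]["content"].
```

The API key must match the value passed to --api-key when starting the server.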
The online temporal exploration component requires a rewriter model to reason about the target system's responses over multiple turns. It is also used as the judge model for evaluating whether the target system's response is helpful for malicious purposes.
It could be a black-box LLM or any model hosted with vLLM.
Specify the hosted model at online/rt/temporal_explorator/config/default_config.yaml.
By default, we use Qwen/Qwen3-Coder-30B-A3B-Instruct.
To configure a blue team model, specify its settings in resources/client-config.yaml and the model will be automatically registered in the _client_registry within online/bt/client.py.
🟦 For open-source models hosted via vLLM:
- Set the client type to LocalOpenAIBTClient (automatically handled)
- Configure model_name, addr, and api_key in the config file
- Example models: microsoft/Phi-4-mini-instruct, Qwen/Qwen2.5-Coder-7B-Instruct, mistralai/Mistral-Instruct-8B
☁️ For models hosted on Amazon Bedrock:
- The system automatically uses the appropriate client class based on the model name
- OpenAI-compatible models (e.g., gpt-oss-20b, gpt-oss-120b): Use BedrockOpenAIBTClient with parameters like max_completion_tokens, temperature, reasoning_effort
- Anthropic models (e.g., claude-3.5-haiku): Use BedrockAnthropicBTClient with parameters like max_tokens, temperature, anthropic_version, top_k
- Claude Sonnet models: Use BedrockAnthropicBTClient with the full ARN as model name
The configuration system automatically maps model names to the correct client implementation, so you only need to specify the appropriate parameters in resources/client-config.yaml.
If you want to use a Bedrock model that is not currently provided, you need to:
- Create a client class that inherits from BedrockBTClient, and implement _construct_body and _parse_response because request/response formats can differ across providers/models.
- Register the client in the _client_registry (or call BTClientFactory.register_client).
- Add the model's configuration in resources/client-config.yaml.
Minimal example:
# online/bt/client.py
class MyBedrockCustomClient(BedrockBTClient):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Read any custom params from kwargs if needed, e.g.:
        # self.max_tokens = kwargs.get("max_tokens", 1024)
        # self.temperature = kwargs.get("temperature", 0.8)

    def _construct_body(self, messages):
        # Convert ASTRA's message format to the model's expected format
        converted = []
        for msg in messages:
            converted.append({
                "role": "user" if msg["role"] == "attacker" else "assistant",
                "content": msg["content"],
            })
        return {
            "messages": converted,
            # include any model-specific fields here
            # "max_tokens": self.max_tokens,
            # "temperature": self.temperature,
        }

    def _parse_response(self, response):
        # Extract final text from Bedrock provider's response JSON
        # return response["choices"][0]["message"]["content"]  # example shape
        raise NotImplementedError


# Register (either edit the registry dict or call the helper)
BTClientFactory.register_client(
    "provider.my-custom-model:1:0",  # the identifier you will use in config
    MyBedrockCustomClient,
)

Example configuration:
# resources/client-config.yaml
my-custom-model:
  model_name: provider.my-custom-model:1:0
  region: us-west-2
  # model-specific params, passed to the client via **kwargs
  max_tokens: 1024
  temperature: 0.8
  read_timeout: 240

python3 online/main.py --model_name <name of the blue team model> --log <path to the output log file> --n_session <number of chat sessions> --n_probing <number of initial probing sessions before the chat sessions> --n_turn <maximum number of turns per session>