This is the official implementation of the paper "You Know What I'm Saying - Jailbreak Attack via Implicit Reference".
| Model | LLaMA-3-8B | LLaMA-3-70B | Qwen-2-7B | Qwen-2-72B | GPT-4o-mini | GPT-4o | Claude-3.5-Sonnet |
|---|---|---|---|---|---|---|---|
| Attack Success Rate (%) | 77 | 84 | 80 | 81 | 87 | 95 | 93 |
| Model | Claude-3.5-Sonnet | LLaMA-3-8B |
|---|---|---|
| Attack Success Rate (%) | 96 | 83 |
- Anaconda/Miniconda
- Python 3.10
- **Create a Conda Environment**

  Open your terminal and create a new Conda environment with Python 3.10:

  ```bash
  conda create -n your_environment_name python=3.10
  conda activate your_environment_name
  ```
- **Install Required Packages**

  Install all necessary packages using pip:

  ```bash
  pip install -r requirements.txt
  ```
- **Configure Environment Variables**

  Copy the `.env_template` file to `.env`:

  ```bash
  cp .env_template .env
  ```

  Edit the `.env` file to set the required environment variables:

  ```bash
  nano .env
  ```

  - `GPT`: Rewrite Model
  - `TARGET`: Target Model
  - `CONTEXT`: Context Model (can be ignored if not doing a cross-model attack)
  - `LLAMA3_JUDGE`: LLaMA-3-70B Judge (can be ignored if using GPT as the judge)
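  For reference, a minimal `.env` might look like the sketch below. The variable names follow the template above; the model identifiers are placeholder assumptions, not values taken from this repository.

  ```bash
  # Example .env — placeholder values; substitute the models you actually use.
  GPT=gpt-4o                # Rewrite model
  TARGET=gpt-4o-mini        # Target model
  CONTEXT=gpt-4o-mini       # Context model (only needed for cross-model attacks)
  LLAMA3_JUDGE=llama-3-70b  # LLaMA-3-70B judge (only needed when judging with LLaMA)
  ```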
- **Execute the Auto Attack Script**

  After running the model script, execute the auto attack script for your target model:

  ```bash
  bash scripts/run_attack_{target_model}.sh
  ```

  Replace `{target_model}` with the appropriate target model name.
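  For example, if the target is GPT-4o-mini, the call might look like the following; the exact script file name is an assumption, so check the `scripts/` directory for the names actually provided:

  ```bash
  # Hypothetical example — confirm the real script name under scripts/ before running.
  bash scripts/run_attack_gpt4o_mini.sh
  ```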
The project uses the following command-line arguments to control various aspects of the attack:
- `--n_requests`: Number of requests. (default: 100)
- `--n_restarts`: Number of restarts. (default: 20)
- `--attack_method`: Attack type. Choices are "direct", "k2", "k3", "k4", "k5", "k6". (default: "k4")
- `--target_model`: Name of target model.
- `--target_base_url`: Base URL of target model.
- `--context_model`: Name of context model.
- `--judge`: Judge type. Choices are "gpt" or "llama". (default: "gpt")
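As a sketch of how these flags fit together, a direct invocation might look like the example below; the entry-point file name (`main.py`) and the model and URL values are assumptions for illustration, not taken from the attack scripts themselves.

```bash
# Hypothetical invocation — entry-point name and argument values are placeholders.
python main.py \
  --n_requests 100 \
  --n_restarts 20 \
  --attack_method k4 \
  --target_model gpt-4o-mini \
  --target_base_url https://api.openai.com/v1 \
  --judge gpt
```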