This is the official implementation of the paper "You Know What I'm Saying - Jailbreak Attack via Implicit Reference".
| Model | LLaMA-3-8B | LLaMA-3-70B | Qwen-2-7B | Qwen-2-72B | GPT-4o-mini | GPT-4o | Claude-3.5-Sonnet |
|---|---|---|---|---|---|---|---|
| Attack Success Rate (%) | 77 | 84 | 80 | 81 | 87 | 95 | 93 |
| Model | Claude-3.5-Sonnet | LLaMA-3-8B |
|---|---|---|
| Attack Success Rate (%) | 96 | 83 |
- Anaconda/Miniconda
- Python 3.10
- **Create a Conda Environment**

  Open your terminal and create a new Conda environment with Python 3.10:

  ```bash
  conda create -n your_environment_name python=3.10
  conda activate your_environment_name
  ```
- **Install Required Packages**

  Install all necessary packages using pip:

  ```bash
  pip install -r requirements.txt
  ```
- **Configure Environment Variables**

  Copy the `.env_template` file to `.env`:

  ```bash
  cp .env_template .env
  ```

  Edit the `.env` file to set the required environment variables:

  ```bash
  nano .env
  ```

  - `GPT`: Rewrite Model
  - `TARGET`: Target Model
  - `CONTEXT`: Context Model (can be ignored if not doing a cross-model attack)
  - `LLAMA3_JUDGE`: LLaMA-3-70B Judge (can be ignored if using GPT as the judge)
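  For reference, a minimal `.env` might look like the sketch below. The variable names follow the template above; the model identifiers are placeholder assumptions, not values taken from this repository.

  ```bash
  # Example .env — placeholder values; substitute the models you actually use.
  GPT=gpt-4o                # Rewrite model
  TARGET=gpt-4o-mini        # Target model
  CONTEXT=gpt-4o-mini       # Context model (only needed for cross-model attacks)
  LLAMA3_JUDGE=llama-3-70b  # LLaMA-3-70B judge (only needed when judging with LLaMA)
  ```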
- **Execute the Auto Attack Script**

  After running the model script, execute the auto attack script for your target model:

  ```bash
  bash scripts/run_attack_{target_model}.sh
  ```

  Replace `{target_model}` with the appropriate target model name.
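  For example, if the target is GPT-4o-mini, the call might look like the following; the exact script file name is an assumption, so check the `scripts/` directory for the names actually provided:

  ```bash
  # Hypothetical example — confirm the real script name under scripts/ before running.
  bash scripts/run_attack_gpt4o_mini.sh
  ```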
The project uses the following command-line arguments to control various aspects of the attack:
- `--n_requests`: Number of requests. (default: 100)
- `--n_restarts`: Number of restarts. (default: 20)
- `--attack_method`: Attack type. Choices are "direct", "k2", "k3", "k4", "k5", "k6". (default: "k4")
- `--target_model`: Name of target model.
- `--target_base_url`: Base URL of target model.
- `--context_model`: Name of context model.
- `--judge`: Judge type. Choices are "gpt" or "llama". (default: "gpt")
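As a sketch of how these flags fit together, a direct invocation might look like the example below; the entry-point file name (`main.py`) and the model and URL values are assumptions for illustration, not taken from the attack scripts themselves.

```bash
# Hypothetical invocation — entry-point name and argument values are placeholders.
python main.py \
  --n_requests 100 \
  --n_restarts 20 \
  --attack_method k4 \
  --target_model gpt-4o-mini \
  --target_base_url https://api.openai.com/v1 \
  --judge gpt
```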