Official repository for the paper "CodeCrash: Exposing LLM Fragility to Misleading Natural Language in Code Reasoning"
🏠 Home Page • 💻 Data • 🏆 Leaderboard
CodeCrash provides a unified stress-testing benchmark for evaluating the robustness of Large Language Models (LLMs) in code reasoning through code execution tasks. CodeCrash targets deeper comprehension by applying logic-preserving structural changes and misleading textual cues to real code. We systematically perturb two established benchmarks — CRUXEval and LiveCodeBench — with controlled distractions, and evaluate 17 LLMs across input and output prediction tasks. CodeCrash reveals key failure modes in modern LLMs and Large Reasoning Models (LRMs), including overreliance on natural-language cues in LLMs and reasoning collapse in QwQ-32B.
```sh
git clone https://github.com/CUHK-ARISE/CodeCrash.git
cd CodeCrash
conda create -n codecrash python=3.10
conda activate codecrash
pip install -r requirements.txt
```

In CodeCrash, we prepared three kinds of perturbations:
| Tag | Full Name | Type |
|---|---|---|
| `REN` | Renaming Entities | Structural |
| `RTF` | Reformatting Conditional Expressions | Structural |
| `GBC` | Inserting Garbage Code Segments | Structural |
| `PSC_ALL` | Aggregated Structural Perturbation | Structural |
| `MCC` | Misleading Code Comments | Contextual-Level |
| `MPS` | Misleading Print Statements | Contextual-Level |
| `MHC` | Misleading Hint Comments | Reasoning-Level |
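For intuition, the hand-written sketch below (illustrative only, not actual `perturb.py` output) shows how a contextual-level perturbation such as MPS keeps the program logic intact while injecting a misleading natural-language cue:

```python
# Original CRUXEval-style snippet: the task is to predict f("abc").
def f(s):
    return s[::-1]

# MPS-style perturbed variant (illustrative): the logic is unchanged, but a
# never-executed print statement hints at a wrong behavior.
def f(s):
    if not isinstance(s, str):
        print("f returns the input unchanged")  # misleading distractor
    return s[::-1]

assert f("abc") == "cba"  # behavior is preserved under the perturbation
```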
> [!TIP]
> See the 🎭 Perturbations Introduction section for example usage for each perturbation.
```sh
# Apply a perturbation to a pre-defined dataset
python perturb.py \
    --dataset [crux|lcb] \
    --perturbation [REN|RTF|GBC|PSC_ALL|MCC|MPS] \
    --output-name "<output_name>"
```
```sh
# Apply MHC perturbation using GPT-4o to a customized dataset
python perturb.py \
    --dataset-path ".../crux.jsonl" \
    --perturbation MHC \
    --model "<model_name>" \
    --platform [openai|anthropic|gemini|azure|deepinfra|deepseek|qwen|sglang] \
    --task [input|output] \
    --output-name "<output_name>" \
    --max-workers 5
```

> [!TIP]
> See the 🚀 Perturb a Dataset section for more details.
- All perturbed datasets are saved in the `customize_datasets` directory.
- 📁 Folder Structure:

  ```
  customize_datasets/
  └── {output_name}.jsonl
  ```
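Since perturbed datasets are plain JSONL, they are easy to inspect programmatically. A minimal sketch (the path is illustrative and depends on your `--output-name`; the field names follow the source benchmark's schema):

```python
import json

# Load a perturbed dataset written by perturb.py (path is illustrative).
with open("customize_datasets/my_crux_mcc.jsonl") as fp:
    samples = [json.loads(line) for line in fp]

print(f"{len(samples)} samples")
print(samples[0].keys())  # inspect which fields the perturbed records carry
```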
```sh
# Run perturbation experiments with evaluation
python process.py \
    --dataset [crux|lcb] \
    --perturbation [VAN|REN|RTF|GBC|PSC_ALL|MCC|MPS|MHC] \
    --model "<model_name>" \
    --platform [openai|anthropic|gemini|azure|deepinfra|deepseek|qwen|sglang] \
    --task [input|output] \
    --infer-mode [direct|cot] \
    --num-samples 2 \
    --max-workers 10 \
    --load-existing \
    --evaluate
```
```sh
# Evaluate a saved output file independently
python eval.py \
    --filepath "<filepath>" \
    --task [input|output] \
    --max_workers 10
```

> [!TIP]
> See the 🧪 Generate Outputs section for more details.
> See the 📊 Evaluate a File section for more details.
- All results are saved in the `results` directory.
- Generated outputs are stored as `{dataset}_{task}_{perturbation}_{infer_mode}.jsonl` unless `--output-name` is specified.
- Evaluation results are saved as `{dataset}_{task}_{perturbation}_{infer_mode}_eval.json` or `{output_name}_eval.json`.
- 📁 Folder Structure:

  ```
  results/
  └── {model_folder_name}/
      ├── {output_name}.jsonl
      └── {output_name}_eval.json
  ```
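Both artifacts are plain JSON/JSONL, so results can be inspected without re-running anything. A minimal sketch (the path is a placeholder, and the exact schema is defined by `eval.py`, so treat the field access as exploratory):

```python
import json

# Placeholder path; substitute your model folder and output name.
eval_path = "results/my_model/crux_output_MCC_cot_eval.json"

with open(eval_path) as fp:
    report = json.load(fp)

# Print the top-level structure to discover the schema before digging in.
if isinstance(report, dict):
    print(sorted(report.keys()))
else:
    print(f"{len(report)} records")
```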
All experiments were conducted through API access (including OpenAI, Anthropic, Gemini, Azure, DeepInfra, DeepSeek, and Qwen), as well as via SGLang, which lets you deploy and host locally trained or Hugging Face LLMs.
To use these APIs, you must create an account and configure your API keys in a `.env` file:
```sh
OPENAI_API_KEY="<your_openai_api_key>"
ANTHROPIC_API_KEY="<your_anthropic_api_key>"
GEMINI_API_KEY="<your_gemini_api_key>"
AZURE_OPENAI_API_KEY="<your_azure_openai_api_key>"
AZURE_ENDPOINT="<your_azure_endpoint>"
AZURE_VERSION="<your_azure_version>"
DEEPINFRA_API_KEY="<your_deepinfra_api_key>"
DEEPSEEK_API_KEY="<your_deepseek_api_key>"
QWEN_API_KEY="<your_qwen_api_key>"
```
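If you script against the repository directly, the keys can be loaded with `python-dotenv` (a sketch assuming `python-dotenv` is installed; how CodeCrash itself consumes the keys is defined by its platform clients):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory
print("OpenAI key configured:", bool(os.getenv("OPENAI_API_KEY")))
```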
If you find CodeCrash useful, please cite our paper:

```bibtex
@article{lam2025codecrash,
  author  = {Man Ho Lam and Chaozheng Wang and Jen{-}tse Huang and Michael R. Lyu},
  title   = {CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations},
  journal = {arXiv preprint arXiv:2504.14119},
  year    = {2025}
}
```