This repo contains code and experiment utilities accompanying *Dynamic Power Sampling for LLM Reasoning*.
This project builds on the power sampling method and its public implementation from:
- *Reasoning with Sampling: Your Base Model is Smarter Than You Think* (Karan & Du, 2025)
- Project page: https://aakaran.github.io/reasoning_with_sampling/
The DPS paper keeps the base model, the power $\alpha$, the proposal distribution, and the MH schedule unchanged, and only changes how the trajectory budget is chosen per problem.
- Paper: `Dynamic_Power_Sampling.pdf`
- Authors: Tianyu Wu, Yuzhen Chen
Power sampling targets a sharpened distribution over reasoning traces

$$\pi_\alpha(x \mid q) \propto p_\theta(x \mid q)^\alpha, \qquad \alpha > 1,$$

and approximately samples from it using Metropolis–Hastings (MH) updates in token space.
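To make the MH step concrete, here is a minimal sketch (not the repo's actual implementation) of one update targeting $\pi_\alpha$; `logp_base` and `propose` are hypothetical helpers for base-model scoring and trajectory proposals:

```python
import math
import random

def mh_step(x, logp_base, propose, alpha=4.0):
    """One Metropolis-Hastings step targeting pi_alpha(x) ~ p_theta(x | q)^alpha.

    x          -- current token sequence (e.g., a list of token ids)
    logp_base  -- hypothetical helper: log p_theta(x | q) under the base model
    propose    -- hypothetical helper returning (x_new, log_q_fwd, log_q_rev)
    """
    x_new, log_q_fwd, log_q_rev = propose(x)
    # Raising the target to the power alpha multiplies the base-model
    # log-probability difference by alpha in the acceptance ratio.
    log_ratio = alpha * (logp_base(x_new) - logp_base(x)) + (log_q_rev - log_q_fwd)
    if math.log(random.random()) < min(0.0, log_ratio):
        return x_new, True   # accept the proposal
    return x, False          # reject and keep the current trace
```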
The original setup uses a fixed number of trajectories $k$ for every problem (e.g., $k = 8$), which wastes compute on easy instances.
Dynamic Power Sampling (DPS) keeps the base model and MH schedule fixed, but chooses the trajectory budget per problem:
- Start with $k = 2$.
- At a few checkpoints, a small controller predicts whether the current budget is sufficient.
- If not, increase $k$ by 2 (up to $k = 8$).
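Put together, the schedule can be pictured as the loop below (a sketch with hypothetical helpers `run_mh_chains`, `checkpoint_features`, and `select_answer`, not the repo's actual code):

```python
def dynamic_power_sampling(question, controller, tau=0.5, k_max=8, k_step=2):
    """Sketch of DPS's per-problem trajectory-budget schedule (hypothetical helpers)."""
    trajectories = run_mh_chains(question, num_chains=2)  # start with k = 2
    k = 2
    while True:
        feats = checkpoint_features(trajectories, k)      # scalar diagnostics at a checkpoint
        p_enough = controller.predict(feats)              # P(current budget suffices)
        if (1.0 - p_enough) > tau and k < k_max:
            # Controller predicts the budget is insufficient: add k_step more chains.
            trajectories += run_mh_chains(question, num_chains=k_step)
            k += k_step
        else:
            return select_answer(trajectories)            # hypothetical answer aggregation
```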
This repo provides:

- Fixed-$k$ power sampling runners for MATH500/HumanEval/GPQA/AlpacaEval under `llm_experiments/`.
- Logging of DPS-relevant scalar diagnostics (entropy/perplexity/self-confidence, MH acceptance ratio, runtime) into CSVs.
- Lightweight hidden-state extraction utilities (note: the paper uses block-wise averaged layer-11 features aggregated across trajectories; this repo currently logs a single vector per sample for analysis; see the sketch after this list).
- Optional: metric-based early stopping for a single MH chain (separate from DPS's dynamic-$k$ scheduling).
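For the hidden-state features, a minimal extraction with Hugging Face `transformers` could look like the following (the model name is a placeholder; the repo's own utilities differ in how features are aggregated):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model name; substitute the base model used in your runs.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-7B")

text = "Solve: 2x + 3 = 11."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding layer, so index 11 is the 11th transformer block.
layer11 = out.hidden_states[11]            # shape: (1, seq_len, hidden_dim)
feature = layer11.mean(dim=1).squeeze(0)   # mean-pool over tokens -> single vector
```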
All controllers share the same scalar features (computed at each checkpoint):

- `mcmc_perplexity`: average token-level perplexity under the base model
- `mcmc_entropy`: average token-level entropy of the next-token distribution
- `mcmc_self_confidence`: average top-1 probability
- `acceptance_ratio`: fraction of accepted MH proposals
- `question_time_sec`: elapsed wall-clock time
- `current_k`: current trajectory budget
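As an illustration only, the three model-derived diagnostics can be computed from next-token logits roughly as follows (a sketch, not the repo's exact logging code):

```python
import torch
import torch.nn.functional as F

def scalar_diagnostics(logits, token_ids):
    """Per-trace diagnostics from next-token logits (sketch).

    logits    -- (seq_len, vocab) next-token logits from the base model
    token_ids -- (seq_len,) ids of the tokens actually in the trace
    """
    logprobs = F.log_softmax(logits, dim=-1)
    probs = logprobs.exp()
    token_lp = logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    return {
        # Perplexity as exp of the mean token negative log-likelihood.
        "mcmc_perplexity": torch.exp(-token_lp.mean()).item(),
        # Mean entropy of the next-token distribution.
        "mcmc_entropy": -(probs * logprobs).sum(-1).mean().item(),
        # Mean top-1 probability ("self-confidence").
        "mcmc_self_confidence": probs.max(-1).values.mean().item(),
    }
```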
Controller variants:
- DPS-GBDT: gradient-boosted decision tree on the scalar features
- DPS-MLP-H: MLP on scalar + layer-11 hidden-state features
- DPS-MLP-HQ: MLP on scalar + hidden-state + question embedding
Decision rule (paper notation):

$$\hat{p}_{\text{inc}} = 1 - \hat{p}_{\text{enough}}, \qquad \text{increase } k \text{ if } \hat{p}_{\text{inc}} > \tau \text{ and } k < 8.$$
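With an sklearn-style binary classifier trained to predict "budget is enough", the rule reduces to a simple threshold check (a sketch; names are illustrative):

```python
def should_increase_k(controller, features, k, tau=0.5, k_max=8):
    """Apply the DPS decision rule given a binary 'enough' classifier (sketch)."""
    p_enough = controller.predict_proba([features])[0, 1]  # P(enough | checkpoint features)
    p_inc = 1.0 - p_enough
    return p_inc > tau and k < k_max
```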
Results on MATH500:

| Method | Accuracy (%) | Time (s/question) |
|---|---|---|
| Fixed-$k$ power sampling ($k = 8$) | 69.7 | 263.2 |
| DPS-GBDT | 68.7 | 73.2 |
| DPS-MLP-H | 69.7 | 104.2 |
| DPS-MLP-HQ | 70.7 | 87.8 |
Most MATH500 questions are solved with $k = 2$, and the controller only escalates on a small fraction of harder instances.
Setup:

```bash
conda env create -f environment.yml
conda activate cloudspace
python llm_experiments/scripts/download_eval_data.py  # fetch HumanEval/GPQA/AlpacaEval data
```

The `llm_experiments/` folder contains runnable scripts for power sampling on:
- MATH500 (`llm_experiments/power_samp_math.py`, dataset included in `llm_experiments/data/MATH500.json`)
- HumanEval (`llm_experiments/power_samp_he.py`)
- GPQA Diamond (`llm_experiments/power_samp_gpqa.py`)
- AlpacaEval 2.0 (`llm_experiments/power_samp_alpaca.py`)
To run MATH500 on a Slurm cluster (5 shards × 8 seeds):
```bash
sbatch llm_experiments/scripts/power_samp_math.sh
```

Single-shot grading:

```bash
python llm_experiments/eval_math.py --folder=results/qwen_math/MATH
```

Pass@k plots:

```bash
python llm_experiments/passk_math.py --folder=results/qwen_math/MATH
```

This repo logs entropy/perplexity/self-confidence (and MH acceptance ratios) into the result CSVs to support controller training. You can aggregate those metrics and fit a lightweight predictor:
```bash
# Aggregate metrics into analysis/sampling_predictors/sampling_metrics.csv
python llm_experiments/sampling_analysis/collect_metrics.py --folder results/qwen_math/MATH

# Train/evaluate a linear predictor for sampling counts
python llm_experiments/sampling_analysis/eval_predictor.py
```

Per-metric slices are also exported under `analysis/entropy_predictor/`, `analysis/perplexity_predictor/`, and `analysis/self_confidence_predictor/` for further experimentation.
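Conceptually, the predictor step boils down to something like the sketch below (column names are assumed to match the feature list above; the target column is illustrative, not the repo's exact schema):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumes the aggregated CSV produced by collect_metrics.py.
df = pd.read_csv("analysis/sampling_predictors/sampling_metrics.csv")
features = ["mcmc_perplexity", "mcmc_entropy", "mcmc_self_confidence", "acceptance_ratio"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["current_k"], test_size=0.2, random_state=0
)

reg = LinearRegression().fit(X_train, y_train)
print("Held-out R^2:", reg.score(X_test, y_test))
```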
This repo also contains an experimental early stopping controller that halts MH sampling when entropy/perplexity/self-confidence crosses a threshold.
This is not the same as the paper's DPS controller (which schedules the trajectory budget $k$).
Wrappers live in `llm_experiments/dynamic_runs/` (for local scripts):

```bash
bash llm_experiments/dynamic_runs/run_entropy.sh ./llm_experiments/local_scripts/power_samp_gpqa_local.sh
```
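Internally, the early-stopping behavior amounts to a loop like the one below (a sketch with hypothetical helpers, not the wrapper's actual code):

```python
def run_chain_with_early_stop(question, max_steps=100, entropy_threshold=0.5):
    """Halt a single MH chain once its running entropy crosses a threshold (sketch)."""
    x = init_sample(question)            # hypothetical chain initializer
    for _ in range(max_steps):
        x, _accepted = mh_step_fn(x)     # hypothetical single MH update
        if running_entropy(x) < entropy_threshold:
            break                        # trace looks confident enough; stop early
    return x
```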