A fork of nanoVLM exploring efficiency optimizations for the A-OKVQA dataset.
This repository demonstrates how to cut FLOPs by 65% while staying within 2.6 accuracy points of the baseline on visual question answering, through post-hoc optimization and fine-tuning.
This project explores making vision-language models more efficient for the A-OKVQA dataset (visual question answering with multiple choice). Starting from lusxvr/nanoVLM (460M parameters), we achieved:
| Method | Accuracy | FLOP Reduction | Speedup | Status |
|---|---|---|---|---|
| Baseline (nanoVLM) | 72.49% | 0% | 1x | - |
| Post-hoc adaptive_6x6 | 66.03% | 36.3% | ~1.5x | ⭐ |
| Fine-tuned adaptive_4x4 | 69.87% | 65.1% | ~3x | ⭐⭐⭐ |
Key Achievement: 65.1% fewer FLOPs at 69.87% accuracy (only 2.6 points below baseline) through adaptive 4×4 token pooling plus fine-tuning.
We systematically tested multiple post-hoc optimization techniques:
Spatial pooling of visual tokens before LLM processing
- Method: Adaptive pooling from 8×8 (64 tokens) to 4×4 (16 tokens); see the sketch after this list
- Post-hoc results: 63.32% accuracy, 65.1% FLOP reduction
- After fine-tuning: 69.87% accuracy, 65.1% FLOP reduction
- Key insight: Fine-tuning recovers accuracy lost from aggressive pooling
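To make the pooling step concrete, here is a minimal sketch (not the repository's code) of adaptive pooling applied to an 8×8 grid of visual tokens; the batch size and embedding dimension below are arbitrary assumptions:

```python
import torch
import torch.nn as nn

# Assumed shapes: batch of 2 images, 64 visual tokens (an 8x8 grid), 768-dim embeddings
visual_tokens = torch.randn(2, 64, 768)

# Reshape the token sequence back into its 2D grid: [B, 64, D] -> [B, D, 8, 8]
grid = visual_tokens.transpose(1, 2).reshape(2, 768, 8, 8)

# Adaptive average pooling down to a 4x4 grid (the adaptive_4x4 configuration)
pooled = nn.AdaptiveAvgPool2d((4, 4))(grid)        # [B, D, 4, 4]

# Flatten back into a token sequence: 16 tokens instead of 64 reach the LLM
pooled_tokens = pooled.flatten(2).transpose(1, 2)
print(pooled_tokens.shape)                         # torch.Size([2, 16, 768])
```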
Using smaller input images
- Tested: 512px → 384px, 336px, 288px, 224px, etc.
- Best: 384px achieved 63.93% accuracy (below the 65% threshold)
- Conclusion: too much accuracy is lost, so resolution reduction is not viable on its own (patch-count arithmetic sketched below)
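The savings come from the quadratic relationship between resolution and patch count. Assuming the 16px patch size implied by the results table (512px → 1024 patches), the arithmetic works out as follows:

```python
# Patch count as a function of input resolution, assuming square images and a
# 16px patch size (inferred from 512px -> 1024 patches in the results table below).
PATCH_SIZE = 16

def num_patches(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of vision-encoder patches (visual tokens before any pooling)."""
    return (resolution // patch_size) ** 2

for res in (512, 384, 320, 224):
    print(res, num_patches(res))  # 512 -> 1024, 384 -> 576, 320 -> 400, 224 -> 196
```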
Removing transformer layers from vision encoder or LLM
- Vision pruning: no effect (the encoder has only 12 layers, so the tested 24/21/18-layer configurations removed nothing)
- LLM pruning: catastrophic accuracy drop (72% → 30%)
- Conclusion: too risky for this model architecture (see the sketch after this list)
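For context, post-hoc layer pruning boils down to truncating a model's list of transformer blocks before evaluation. This is a hedged sketch, not the toolkit's implementation; the attribute path in the usage comment is hypothetical:

```python
import torch.nn as nn

def prune_layers(blocks: nn.ModuleList, keep: int) -> nn.ModuleList:
    """Keep only the first `keep` transformer blocks (a no-op if fewer blocks exist)."""
    return nn.ModuleList(list(blocks)[:keep])

# Hypothetical usage -- the real attribute paths differ per model:
# model.decoder.blocks = prune_layers(model.decoder.blocks, keep=20)
# Truncating the 12-layer vision encoder "to 18/21/24 layers" is a no-op, matching the results above.
```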
Stacking multiple techniques
- Created a framework for testing combinations (illustrated below)
- Finding: token pooling alone + fine-tuning outperformed every combination tested
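Conceptually, each combination is just a joint setting of the knobs above. The configuration format below is purely illustrative (the actual scripts take these options as command-line flags):

```python
# Illustrative combinations of the individual knobs (not the toolkit's actual config format).
combinations = [
    {"token_pooling": "adaptive", "target_grid": 6, "resolution": 512},  # pooling only
    {"token_pooling": "adaptive", "target_grid": 4, "resolution": 384},  # pooling + smaller images
    {"token_pooling": "adaptive", "target_grid": 6, "resolution": 384},  # milder pooling + smaller images
]

for cfg in combinations:
    print(cfg)  # each combination is applied post-hoc and evaluated like a single technique
```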
```bash
# Clone the repository
git clone https://github.com/victorknox/picoVLM.git
cd picoVLM

# Setup environment (we use uv)
uv venv
source .venv/bin/activate
uv pip install torch numpy torchvision pillow datasets huggingface-hub transformers safetensors tqdm pandas einops
```

```python
from models.vision_language_model import VisionLanguageModel

# Load the fine-tuned efficient model
model = VisionLanguageModel.from_pretrained("checkpoints_aokvqa_4x4_finetuned/mcq_finetuned")
# Use it just like the original nanoVLM!
# This model is ~3x faster with only a 2.6-point accuracy drop
```

```bash
# Test token pooling (7 configurations)
python -m aokvqa.posthoc_token_pooling --batch_test --num_samples -1
# Test resolution reduction (5 configurations)
python -m aokvqa.posthoc_resolution_reduction --batch_test --num_samples -1
# Test layer pruning (9 configurations)
python -m aokvqa.posthoc_layer_pruning --batch_test --num_samples -1
# Test combined optimizations (7 strategic combinations)
python -m aokvqa.posthoc_combined --batch_test --num_samples -1
```

```bash
# Fine-tune adaptive 4x4 pooling (our best configuration)
python -m aokvqa.run_finetune_4x4
# Or customize:
python -m aokvqa.finetune_aokvqa \
--model_id lusxvr/nanoVLM \
--token_pooling adaptive \
--target_grid 4 \
--epochs 2 \
    --freeze_vision
```

```
picoVLM/
├── aokvqa/                              # A-OKVQA optimization toolkit
│   ├── posthoc_token_pooling.py         # Token reduction experiments
│   ├── posthoc_resolution_reduction.py  # Image resolution experiments
│   ├── posthoc_layer_pruning.py         # Layer removal experiments
│   ├── posthoc_combined.py              # Combined optimization experiments
│   ├── finetune_aokvqa.py               # Fine-tuning script with pooling support
│   ├── run_finetune_4x4.py              # Convenience wrapper for 4x4 fine-tuning
│   ├── experiment_tracker.py            # Experiment logging and comparison
│   ├── evaluate.py                      # Evaluation with TFLOPS measurement
│   ├── QUICKSTART.md                    # 5-minute getting started guide
│   ├── OPTIMIZATION_GUIDE.md            # Comprehensive optimization manual
│   └── TOOLKIT_REFERENCE.md             # Complete command reference
├── models/                              # Model architecture (from nanoVLM)
│   ├── vision_language_model.py         # Main VLM class
│   ├── modality_projector.py            # Enhanced with token pooling support
│   └── config.py                        # Enhanced with pooling parameters
├── checkpoints_aokvqa_4x4_finetuned/    # Best model checkpoint
└── README.md                            # This file
```
| Configuration | Accuracy | TFLOPS | FLOP Reduction | Notes |
|---|---|---|---|---|
| Baseline (8×8) | 72.49% | 1.79 | 0% | No pooling |
| Adaptive 8×8 | 72.49% | 1.63 | 8.8% | Same token count, just pooling layer |
| Adaptive 6×6 | 66.03% | 1.14 | 36.3% | Post-hoc sweet spot |
| Adaptive 4×4 (post-hoc) | 63.32% | 0.62 | 65.1% | Below threshold |
| Adaptive 4×4 (fine-tuned) | 69.87% | 0.62 | 65.1% | Winner! |
| Resolution | Accuracy | Patches | Notes |
|---|---|---|---|
| 512px (default) | 72.49% | 1024 | Baseline |
| 384px | 63.93% | 576 | Below 65% threshold |
| 320px | 58.08% | 400 | Significant drop |
| Lower | <50% | - | Too aggressive |
| Configuration | Accuracy | Notes |
|---|---|---|
| Vision 24/21/18 layers | 72.49% | No effect (model has only 12 layers) |
| LLM 24 layers | 30.92% | Catastrophic failure |
| LLM 20 layers | 24.54% | Even worse |
Added spatial pooling support to the modality projector:
```python
# models/modality_projector.py
if self.token_pooling == "adaptive":
    self.spatial_pool = nn.AdaptiveAvgPool2d((cfg.mp_target_grid, cfg.mp_target_grid))
elif self.token_pooling in ["avg", "max"]:
    self.spatial_pool = nn.AvgPool2d(kernel_size=cfg.mp_pool_kernel, stride=cfg.mp_pool_stride)
```

Scripts to test optimizations without retraining (a sketch of the sweep follows the list):
- Apply pooling/pruning/resolution changes
- Evaluate on validation set
- Track TFLOPS and accuracy
- Compare multiple configurations automatically
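A rough sketch of what such a post-hoc sweep looks like; the three helper functions are stubs standing in for the toolkit's real loading, patching, and evaluation code:

```python
# Hypothetical post-hoc sweep: apply a change, evaluate, record accuracy and TFLOPS.
# The three helpers are stubs standing in for the toolkit's real functions.

def load_baseline():
    """Stub: would load the lusxvr/nanoVLM checkpoint."""
    return object()

def apply_pooling(model, mode, grid):
    """Stub: would patch the modality projector with the chosen pooling."""

def evaluate_mcq(model):
    """Stub: would return (accuracy, TFLOPS) on the A-OKVQA validation set."""
    return 0.0, 0.0

results = []
for mode, grid in [("adaptive", 8), ("adaptive", 6), ("adaptive", 4)]:
    model = load_baseline()
    apply_pooling(model, mode, grid)
    accuracy, tflops = evaluate_mcq(model)
    results.append({"config": f"{mode}_{grid}x{grid}", "accuracy": accuracy, "tflops": tflops})

print(*results, sep="\n")
```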
Modified the training script to:
- Initialize the model with the pooling layer
- Preserve the pretrained projection weights
- Update the expected token count
- Train only the LLM + projector (vision encoder frozen); a sketch of the freezing step follows
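As a minimal sketch of the freezing step (the attribute name `vision_encoder` is an assumption, not necessarily the repository's):

```python
import torch

def freeze_vision_encoder(model: torch.nn.Module) -> None:
    """Freeze the vision tower so only the projector and LLM receive gradients."""
    for param in model.vision_encoder.parameters():  # attribute name is an assumption
        param.requires_grad = False

# After freezing, pass only trainable parameters to the optimizer, e.g.:
# optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-5)
```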
Comprehensive logging of all experiments:
```python
from aokvqa.experiment_tracker import ExperimentTracker

tracker = ExperimentTracker()
tracker.log_experiment("adaptive_4x4_finetuned", "token_pooling", {...}, 69.87, 0.62, 1145)
tracker.print_summary()
```

- QUICKSTART.md: Get started in 5 minutes
- OPTIMIZATION_GUIDE.md: Comprehensive optimization manual
- TOOLKIT_REFERENCE.md: Complete command reference
- TRAINING_RESULTS.md: Detailed training logs and analysis
- Token pooling is highly effective: 65% FLOP reduction with fine-tuning
- Post-hoc testing is fast: Test many configurations in hours, not days
- Fine-tuning recovers accuracy: +6.6 points over post-hoc pooling (63.32% → 69.87%)
- Not all optimizations work: Resolution/layer pruning too aggressive for this model
- Preserve pretrained weights: Critical for maintaining model knowledge
This project builds on:
- nanoVLM by Luis Wiedmann, Aritra Roy Gosthipaty, and Andrés Marafioti
- A-OKVQA dataset by the Allen Institute for AI (hosted on the HuggingFace Hub)
- SigLIP vision encoder by Google
- SmolLM2 language model by HuggingFace
If you use this work, please cite both this project and the original nanoVLM:
```bibtex
@misc{knox2025picovlm,
  author       = {Vamshi Krishna Bonagiri},
  title        = {picoVLM: Efficient Vision-Language Model Optimization},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/victorknox/picoVLM}}
}

@misc{wiedmann2025nanovlm,
  author       = {Luis Wiedmann and Aritra Roy Gosthipaty and Andrés Marafioti},
  title        = {nanoVLM},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/nanoVLM}}
}
```

This project inherits the license from nanoVLM. See the LICENSE file for details.
This is a research/educational project demonstrating efficiency optimization techniques. Feel free to:
- Open issues for bugs or questions
- Submit PRs for improvements
- Use the code for your own experiments
For contributions to the base nanoVLM framework, please see the original repository.