A fork of nanoVLM exploring efficiency optimizations for the A-OKVQA dataset.
This repository demonstrates how to cut FLOPs by 65% while staying within 2.6 accuracy points of the baseline on visual question answering, through post-hoc optimization and fine-tuning.
This project explores making vision-language models more efficient for the A-OKVQA dataset (visual question answering with multiple choice). Starting from lusxvr/nanoVLM (460M parameters), we achieved:
| Method | Accuracy | FLOP Reduction | Speedup | Status |
|---|---|---|---|---|
| Baseline (nanoVLM) | 72.49% | 0% | 1x | - |
| Post-hoc adaptive_6x6 | 66.03% | 36.3% | ~1.5x | ⭐ |
| Fine-tuned adaptive_4x4 | 69.87% | 65.1% | ~3x | ⭐⭐⭐ |
Key Achievement: 65.1% fewer FLOPs at 69.87% accuracy (only 2.6 points below baseline) through adaptive 4×4 token pooling plus fine-tuning.
We systematically tested multiple post-hoc optimization techniques:
Spatial pooling of visual tokens before LLM processing
- Method: Adaptive pooling from 8×8 (64 tokens) to 4×4 (16 tokens); see the sketch after this list
- Post-hoc results: 63.32% accuracy, 65.1% FLOP reduction
- After fine-tuning: 69.87% accuracy, 65.1% FLOP reduction
- Key insight: Fine-tuning recovers accuracy lost from aggressive pooling
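To make the pooling step concrete, here is a minimal sketch (not the repository's code) of adaptive pooling applied to an 8×8 grid of visual tokens; the batch size and embedding dimension below are arbitrary assumptions:

```python
import torch
import torch.nn as nn

# Assumed shapes: batch of 2 images, 64 visual tokens (an 8x8 grid), 768-dim embeddings
visual_tokens = torch.randn(2, 64, 768)

# Reshape the token sequence back into its 2D grid: [B, 64, D] -> [B, D, 8, 8]
grid = visual_tokens.transpose(1, 2).reshape(2, 768, 8, 8)

# Adaptive average pooling down to a 4x4 grid (the adaptive_4x4 configuration)
pooled = nn.AdaptiveAvgPool2d((4, 4))(grid)        # [B, D, 4, 4]

# Flatten back into a token sequence: 16 tokens instead of 64 reach the LLM
pooled_tokens = pooled.flatten(2).transpose(1, 2)
print(pooled_tokens.shape)                         # torch.Size([2, 16, 768])
```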
Using smaller input images
- Tested: 512px → 384px, 336px, 288px, 224px, etc.
- Best: 384px achieved 63.93% accuracy (below the 65% threshold)
- Conclusion: too much accuracy is lost, so resolution reduction is not viable on its own (patch-count arithmetic sketched below)
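The savings come from the quadratic relationship between resolution and patch count. Assuming the 16px patch size implied by the results table (512px → 1024 patches), the arithmetic works out as follows:

```python
# Patch count as a function of input resolution, assuming square images and a
# 16px patch size (inferred from 512px -> 1024 patches in the results table below).
PATCH_SIZE = 16

def num_patches(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of vision-encoder patches (visual tokens before any pooling)."""
    return (resolution // patch_size) ** 2

for res in (512, 384, 320, 224):
    print(res, num_patches(res))  # 512 -> 1024, 384 -> 576, 320 -> 400, 224 -> 196
```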
Removing transformer layers from vision encoder or LLM
- Vision pruning: no effect (the encoder has only 12 layers, so the tested 24/21/18-layer configurations removed nothing)
- LLM pruning: catastrophic accuracy drop (72% → 30%)
- Conclusion: too risky for this model architecture (see the sketch after this list)
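For context, post-hoc layer pruning boils down to truncating a model's list of transformer blocks before evaluation. This is a hedged sketch, not the toolkit's implementation; the attribute path in the usage comment is hypothetical:

```python
import torch.nn as nn

def prune_layers(blocks: nn.ModuleList, keep: int) -> nn.ModuleList:
    """Keep only the first `keep` transformer blocks (a no-op if fewer blocks exist)."""
    return nn.ModuleList(list(blocks)[:keep])

# Hypothetical usage -- the real attribute paths differ per model:
# model.decoder.blocks = prune_layers(model.decoder.blocks, keep=20)
# Truncating the 12-layer vision encoder "to 18/21/24 layers" is a no-op, matching the results above.
```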
Stacking multiple techniques
- Created a framework for testing combinations (illustrated below)
- Finding: token pooling alone + fine-tuning outperformed every combination tested
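Conceptually, each combination is just a joint setting of the knobs above. The configuration format below is purely illustrative (the actual scripts take these options as command-line flags):

```python
# Illustrative combinations of the individual knobs (not the toolkit's actual config format).
combinations = [
    {"token_pooling": "adaptive", "target_grid": 6, "resolution": 512},  # pooling only
    {"token_pooling": "adaptive", "target_grid": 4, "resolution": 384},  # pooling + smaller images
    {"token_pooling": "adaptive", "target_grid": 6, "resolution": 384},  # milder pooling + smaller images
]

for cfg in combinations:
    print(cfg)  # each combination is applied post-hoc and evaluated like a single technique
```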
```bash
# Clone the repository
git clone https://github.com/victorknox/picoVLM.git
cd picoVLM

# Setup environment (we use uv)
uv venv
source .venv/bin/activate
uv pip install torch numpy torchvision pillow datasets huggingface-hub transformers safetensors tqdm pandas einops
```

```python
from models.vision_language_model import VisionLanguageModel

# Load the fine-tuned efficient model
model = VisionLanguageModel.from_pretrained("checkpoints_aokvqa_4x4_finetuned/mcq_finetuned")
# Use it just like the original nanoVLM!
# This model is ~3x faster with only a 2.6-point accuracy drop
```

```bash
# Test token pooling (7 configurations)
python -m aokvqa.posthoc_token_pooling --batch_test --num_samples -1
# Test resolution reduction (5 configurations)
python -m aokvqa.posthoc_resolution_reduction --batch_test --num_samples -1
# Test layer pruning (9 configurations)
python -m aokvqa.posthoc_layer_pruning --batch_test --num_samples -1
# Test combined optimizations (7 strategic combinations)
python -m aokvqa.posthoc_combined --batch_test --num_samples -1
```

```bash
# Fine-tune adaptive 4x4 pooling (our best configuration)
python -m aokvqa.run_finetune_4x4
# Or customize:
python -m aokvqa.finetune_aokvqa \
--model_id lusxvr/nanoVLM \
--token_pooling adaptive \
--target_grid 4 \
--epochs 2 \
    --freeze_vision
```

```
picoVLM/
├── aokvqa/                              # A-OKVQA optimization toolkit
│   ├── posthoc_token_pooling.py         # Token reduction experiments
│   ├── posthoc_resolution_reduction.py  # Image resolution experiments
│   ├── posthoc_layer_pruning.py         # Layer removal experiments
│   ├── posthoc_combined.py              # Combined optimization experiments
│   ├── finetune_aokvqa.py               # Fine-tuning script with pooling support
│   ├── run_finetune_4x4.py              # Convenience wrapper for 4x4 fine-tuning
│   ├── experiment_tracker.py            # Experiment logging and comparison
│   ├── evaluate.py                      # Evaluation with TFLOPS measurement
│   ├── QUICKSTART.md                    # 5-minute getting started guide
│   ├── OPTIMIZATION_GUIDE.md            # Comprehensive optimization manual
│   └── TOOLKIT_REFERENCE.md             # Complete command reference
├── models/                              # Model architecture (from nanoVLM)
│   ├── vision_language_model.py         # Main VLM class
│   ├── modality_projector.py            # Enhanced with token pooling support
│   └── config.py                        # Enhanced with pooling parameters
├── checkpoints_aokvqa_4x4_finetuned/    # Best model checkpoint
└── README.md                            # This file
```
| Configuration | Accuracy | TFLOPS | FLOP Reduction | Notes |
|---|---|---|---|---|
| Baseline (8×8) | 72.49% | 1.79 | 0% | No pooling |
| Adaptive 8×8 | 72.49% | 1.63 | 8.8% | Same token count, just pooling layer |
| Adaptive 6×6 | 66.03% | 1.14 | 36.3% | Post-hoc sweet spot |
| Adaptive 4×4 (post-hoc) | 63.32% | 0.62 | 65.1% | Below threshold |
| Adaptive 4×4 (fine-tuned) | 69.87% | 0.62 | 65.1% | Winner! |
| Resolution | Accuracy | Patches | Notes |
|---|---|---|---|
| 512px (default) | 72.49% | 1024 | Baseline |
| 384px | 63.93% | 576 | Below 65% threshold |
| 320px | 58.08% | 400 | Significant drop |
| Lower | <50% | - | Too aggressive |
| Configuration | Accuracy | Notes |
|---|---|---|
| Vision 24/21/18 layers | 72.49% | No effect (model has only 12 layers) |
| LLM 24 layers | 30.92% | Catastrophic failure |
| LLM 20 layers | 24.54% | Even worse |
Added spatial pooling support to the modality projector:
```python
# models/modality_projector.py
if self.token_pooling == "adaptive":
    self.spatial_pool = nn.AdaptiveAvgPool2d((cfg.mp_target_grid, cfg.mp_target_grid))
elif self.token_pooling in ["avg", "max"]:
    self.spatial_pool = nn.AvgPool2d(kernel_size=cfg.mp_pool_kernel, stride=cfg.mp_pool_stride)
```

Scripts to test optimizations without retraining (a sketch of the sweep follows the list):
- Apply pooling/pruning/resolution changes
- Evaluate on validation set
- Track TFLOPS and accuracy
- Compare multiple configurations automatically
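A rough sketch of what such a post-hoc sweep looks like; the three helper functions are stubs standing in for the toolkit's real loading, patching, and evaluation code:

```python
# Hypothetical post-hoc sweep: apply a change, evaluate, record accuracy and TFLOPS.
# The three helpers are stubs standing in for the toolkit's real functions.

def load_baseline():
    """Stub: would load the lusxvr/nanoVLM checkpoint."""
    return object()

def apply_pooling(model, mode, grid):
    """Stub: would patch the modality projector with the chosen pooling."""

def evaluate_mcq(model):
    """Stub: would return (accuracy, TFLOPS) on the A-OKVQA validation set."""
    return 0.0, 0.0

results = []
for mode, grid in [("adaptive", 8), ("adaptive", 6), ("adaptive", 4)]:
    model = load_baseline()
    apply_pooling(model, mode, grid)
    accuracy, tflops = evaluate_mcq(model)
    results.append({"config": f"{mode}_{grid}x{grid}", "accuracy": accuracy, "tflops": tflops})

print(*results, sep="\n")
```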
Modified the training script to:
- Initialize the model with the pooling layer
- Preserve the pretrained projection weights
- Update the expected token count
- Train only the LLM + projector (vision encoder frozen); a sketch of the freezing step follows
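As a minimal sketch of the freezing step (the attribute name `vision_encoder` is an assumption, not necessarily the repository's):

```python
import torch

def freeze_vision_encoder(model: torch.nn.Module) -> None:
    """Freeze the vision tower so only the projector and LLM receive gradients."""
    for param in model.vision_encoder.parameters():  # attribute name is an assumption
        param.requires_grad = False

# After freezing, pass only trainable parameters to the optimizer, e.g.:
# optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-5)
```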
Comprehensive logging of all experiments:
```python
from aokvqa.experiment_tracker import ExperimentTracker

tracker = ExperimentTracker()
tracker.log_experiment("adaptive_4x4_finetuned", "token_pooling", {...}, 69.87, 0.62, 1145)
tracker.print_summary()
```

- QUICKSTART.md: Get started in 5 minutes
- OPTIMIZATION_GUIDE.md: Comprehensive optimization manual
- TOOLKIT_REFERENCE.md: Complete command reference
- TRAINING_RESULTS.md: Detailed training logs and analysis
- Token pooling is highly effective: 65% FLOP reduction with fine-tuning
- Post-hoc testing is fast: Test many configurations in hours, not days
- Fine-tuning recovers accuracy: +6.6 points over post-hoc pooling (63.32% → 69.87%)
- Not all optimizations work: Resolution/layer pruning too aggressive for this model
- Preserve pretrained weights: Critical for maintaining model knowledge
This project builds on:
- nanoVLM by Luis Wiedmann, Aritra Roy Gosthipaty, and Andrés Marafioti
- A-OKVQA dataset by the Allen Institute for AI (hosted on the HuggingFace Hub)
- SigLIP vision encoder by Google
- SmolLM2 language model by HuggingFace
If you use this work, please cite both this project and the original nanoVLM:
```bibtex
@misc{knox2025picovlm,
  author       = {Vamshi Krishna Bonagiri},
  title        = {picoVLM: Efficient Vision-Language Model Optimization},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/victorknox/picoVLM}}
}

@misc{wiedmann2025nanovlm,
  author       = {Luis Wiedmann and Aritra Roy Gosthipaty and Andrés Marafioti},
  title        = {nanoVLM},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/nanoVLM}}
}
```

This project inherits the license from nanoVLM. See the LICENSE file for details.
This is a research/educational project demonstrating efficiency optimization techniques. Feel free to:
- Open issues for bugs or questions
- Submit PRs for improvements
- Use the code for your own experiments
For contributions to the base nanoVLM framework, please see the original repository.