picoVLM: Efficient Vision-Language Model Optimization

A fork of nanoVLM exploring efficiency optimizations for the A-OKVQA dataset.

This repository demonstrates how to achieve a 65% FLOP reduction while keeping accuracy within 2.6 points of the baseline on visual question answering, through post-hoc optimization and fine-tuning.


🎯 Project Overview

This project explores making vision-language models more efficient for the A-OKVQA dataset (visual question answering with multiple choice). Starting from lusxvr/nanoVLM (460M parameters), we achieved:

πŸ† Final Results

Method                     Accuracy   FLOP Reduction   Speedup   Status
Baseline (nanoVLM)         72.49%     0%               1x        -
Post-hoc adaptive_6x6      66.03%     36.3%            ~1.5x     ✅
Fine-tuned adaptive_4x4    69.87%     65.1%            ~3x       ✅✅✅

Key Achievement: 65.1% fewer FLOPs at 69.87% accuracy (only a 2.6-point drop from the 72.49% baseline), via adaptive 4x4 token pooling plus fine-tuning.


🔬 Optimization Methods Explored

We systematically tested multiple post-hoc optimization techniques:

1. ✅ Token Pooling (Winner!)

Spatial pooling of visual tokens before LLM processing

  • Method: Adaptive pooling from 8×8 (64 tokens) down to 4×4 (16 tokens); see the sketch below
  • Post-hoc results: 63.32% accuracy, 65.1% FLOP reduction
  • After fine-tuning: 69.87% accuracy, 65.1% FLOP reduction
  • Key insight: Fine-tuning recovers the accuracy lost to aggressive pooling
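
A minimal sketch of what this pooling does to the token tensor, assuming the 64 visual tokens arrive as a [batch, 64, dim] sequence laid out on an 8×8 grid (shapes here are illustrative, not lifted from the nanoVLM code):

import torch
import torch.nn as nn

# Illustrative shapes: 64 visual tokens on an 8x8 grid, hidden size 768
batch, grid, dim = 2, 8, 768
visual_tokens = torch.randn(batch, grid * grid, dim)               # [B, 64, D]

pool = nn.AdaptiveAvgPool2d((4, 4))                                # target 4x4 grid

x = visual_tokens.transpose(1, 2).reshape(batch, dim, grid, grid)  # [B, D, 8, 8]
x = pool(x)                                                        # [B, D, 4, 4]
pooled_tokens = x.flatten(2).transpose(1, 2)                       # [B, 16, D]

print(pooled_tokens.shape)  # torch.Size([2, 16, 768]): 4x fewer tokens reach the LLM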

2. ❌ Resolution Reduction

Using smaller input images

  • Tested: 512px → 384px, 336px, 288px, 224px, etc.
  • Best: 384px achieved 63.93% accuracy (below the 65% accuracy threshold)
  • Conclusion: Significant accuracy loss; not viable on its own (the patch counts behind these numbers are worked out below)
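
The patch counts reported in the detailed results below are consistent with a 16-pixel patch size (512/16 = 32 patches per side, 32 x 32 = 1024 patches total); a quick sanity check:

# Patch-count arithmetic, assuming 16px patches as implied by the results tables
patch = 16
for res in (512, 384, 320):
    side = res // patch
    print(f"{res}px -> {side}x{side} = {side * side} patches")
# 512px -> 32x32 = 1024 patches
# 384px -> 24x24 = 576 patches
# 320px -> 20x20 = 400 patches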

3. ❌ Layer Pruning

Removing transformer layers from vision encoder or LLM

  • Vision pruning: No effect (the tested configurations asked for more layers than the 12-layer encoder has, so they were no-ops)
  • LLM pruning: Catastrophic accuracy drop (72% → 30%)
  • Conclusion: Too risky for this model architecture (see the truncation sketch below)
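
Post-hoc layer pruning amounts to truncating the list of transformer blocks; this is a hedged sketch with a hypothetical blocks attribute (the actual module paths inside nanoVLM's encoder and decoder may differ):

import torch.nn as nn

def prune_to(blocks: nn.ModuleList, keep: int) -> nn.ModuleList:
    """Keep only the first `keep` transformer blocks (illustrative helper)."""
    return nn.ModuleList(list(blocks)[:keep])

# Hypothetical usage, assuming the decoder exposes its layers as an nn.ModuleList:
# model.decoder.blocks = prune_to(model.decoder.blocks, keep=24)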

4. 🔄 Combined Optimizations

Stacking multiple techniques

  • Created framework for testing combinations
  • Finding: Token pooling alone + fine-tuning outperformed all combinations

🚀 Quick Start

Environment Setup

# Clone the repository
git clone https://github.com/victorknox/picoVLM.git
cd picoVLM

# Set up the environment (we use uv)
uv venv
source .venv/bin/activate
uv pip install torch numpy torchvision pillow datasets huggingface-hub transformers safetensors tqdm pandas einops

Try the Optimized Model

from models.vision_language_model import VisionLanguageModel

# Load the fine-tuned efficient model
model = VisionLanguageModel.from_pretrained("checkpoints_aokvqa_4x4_finetuned/mcq_finetuned")

# Use it just like the original nanoVLM!
# This model runs ~3x faster with only a 2.6-point accuracy drop

Run Post-hoc Optimization Experiments

# Test token pooling (7 configurations)
python -m aokvqa.posthoc_token_pooling --batch_test --num_samples -1

# Test resolution reduction (5 configurations)
python -m aokvqa.posthoc_resolution_reduction --batch_test --num_samples -1

# Test layer pruning (9 configurations)
python -m aokvqa.posthoc_layer_pruning --batch_test --num_samples -1

# Test combined optimizations (7 strategic combinations)
python -m aokvqa.posthoc_combined --batch_test --num_samples -1

Fine-tune with Token Pooling

# Fine-tune adaptive 4x4 pooling (our best configuration)
python -m aokvqa.run_finetune_4x4

# Or customize:
python -m aokvqa.finetune_aokvqa \
    --model_id lusxvr/nanoVLM \
    --token_pooling adaptive \
    --target_grid 4 \
    --epochs 2 \
    --freeze_vision

πŸ“ Repository Structure

picoVLM/
├── aokvqa/                              # A-OKVQA optimization toolkit
│   ├── posthoc_token_pooling.py         # Token reduction experiments
│   ├── posthoc_resolution_reduction.py  # Image resolution experiments
│   ├── posthoc_layer_pruning.py         # Layer removal experiments
│   ├── posthoc_combined.py              # Combined optimization experiments
│   ├── finetune_aokvqa.py               # Fine-tuning script with pooling support
│   ├── run_finetune_4x4.py              # Convenience wrapper for 4x4 fine-tuning
│   ├── experiment_tracker.py            # Experiment logging and comparison
│   ├── evaluate.py                      # Evaluation with TFLOPS measurement
│   ├── QUICKSTART.md                    # 5-minute getting-started guide
│   ├── OPTIMIZATION_GUIDE.md            # Comprehensive optimization manual
│   └── TOOLKIT_REFERENCE.md             # Complete command reference
├── models/                              # Model architecture (from nanoVLM)
│   ├── vision_language_model.py         # Main VLM class
│   ├── modality_projector.py            # Enhanced with token pooling support
│   └── config.py                        # Enhanced with pooling parameters
├── checkpoints_aokvqa_4x4_finetuned/    # Best model checkpoint
└── README.md                            # This file

📊 Detailed Results

Token Pooling Configurations (Full Dataset)

Configuration               Accuracy   TFLOPS   FLOP Reduction   Notes
Baseline (8×8)              72.49%     1.79     0%               No pooling
Adaptive 8×8                72.49%     1.63     8.8%             Same token count, just pooling layer
Adaptive 6×6                66.03%     1.14     36.3%            Post-hoc sweet spot
Adaptive 4×4 (post-hoc)     63.32%     0.62     65.1%            Below threshold
Adaptive 4×4 (fine-tuned)   69.87%     0.62     65.1%            Winner!
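
The FLOP reduction column is the relative drop in measured TFLOPS against the 1.79 TFLOPS baseline; for adaptive 4×4, for example:

# FLOP reduction relative to the unpooled baseline (values from the table above)
baseline_tflops, pooled_tflops = 1.79, 0.62
reduction = 1 - pooled_tflops / baseline_tflops
print(f"{reduction:.1%}")  # ~65.4% with the rounded TFLOPS shown here; the table's 65.1% presumably comes from unrounded measurements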

Resolution Reduction (Full Dataset)

Resolution        Accuracy   Patches   Notes
512px (default)   72.49%     1024      Baseline
384px             63.93%     576       Below 65% threshold
320px             58.08%     400       Significant drop
Lower             <50%       -         Too aggressive

Layer Pruning (Full Dataset)

Configuration            Accuracy   Notes
Vision 24/21/18 layers   72.49%     No effect (model has only 12 layers)
LLM 24 layers            30.92%     Catastrophic failure
LLM 20 layers            24.54%     Even worse

πŸ› οΈ Key Technical Contributions

1. Enhanced Modality Projector

Added spatial pooling support to the modality projector:

# models/modality_projector.py
# In the projector's __init__: choose a spatial pooling layer from the config.
if self.token_pooling == "adaptive":
    # Pool the visual token grid down to mp_target_grid x mp_target_grid
    self.spatial_pool = nn.AdaptiveAvgPool2d((cfg.mp_target_grid, cfg.mp_target_grid))
elif self.token_pooling == "avg":
    self.spatial_pool = nn.AvgPool2d(kernel_size=cfg.mp_pool_kernel, stride=cfg.mp_pool_stride)
elif self.token_pooling == "max":
    self.spatial_pool = nn.MaxPool2d(kernel_size=cfg.mp_pool_kernel, stride=cfg.mp_pool_stride)
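
Where the pooling slots into the projector's forward pass, shown as a simplified reconstruction rather than the exact nanoVLM code (the reshape logic and the final self.proj projection are illustrative):

# Simplified, illustrative forward pass; not the exact nanoVLM projector implementation
def forward(self, x):                                    # x: [B, N, D] visual tokens
    if getattr(self, "spatial_pool", None) is not None:
        B, N, D = x.shape
        side = int(N ** 0.5)                             # e.g. 8 for 64 tokens
        x = x.transpose(1, 2).reshape(B, D, side, side)  # [B, D, 8, 8]
        x = self.spatial_pool(x)                         # e.g. [B, D, 4, 4]
        x = x.flatten(2).transpose(1, 2)                 # [B, 16, D]
    return self.proj(x)                                  # map to the LLM embedding size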

2. Post-hoc Optimization Framework

Scripts to test optimizations without retraining:

  • Apply pooling/pruning/resolution changes
  • Evaluate on validation set
  • Track TFLOPS and accuracy (one way to measure FLOPs is sketched below)
  • Compare multiple configurations automatically
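
One general-purpose way to obtain per-configuration FLOP numbers is PyTorch's built-in profiler; this sketch is an assumption about how such a measurement could be done, not necessarily what evaluate.py does internally:

import torch
from torch.profiler import profile, ProfilerActivity

def measure_tflops(model, *example_inputs):
    """Rough forward-pass FLOP count via the PyTorch profiler (one possible approach)."""
    model.eval()
    with torch.no_grad(), profile(activities=[ProfilerActivity.CPU], with_flops=True) as prof:
        model(*example_inputs)
    flops = sum(evt.flops for evt in prof.key_averages() if evt.flops)
    return flops / 1e12  # TFLOPS for a single forward pass

# tflops = measure_tflops(model, input_ids, image)  # hypothetical inputs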

3. Fine-tuning with Pooling

Modified training script to:

  • Initialize with pooling layer
  • Preserve pretrained projection weights
  • Update expected token count
  • Train only the LLM + projector (vision encoder frozen); see the freezing sketch below
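
Freezing the vision tower is plain requires_grad toggling; in this hedged sketch the vision_encoder attribute name is an assumption about the model's module layout, not something taken from the code:

# Hedged sketch: freeze the vision encoder, keep projector + LLM trainable.
# The attribute name `vision_encoder` is assumed, not verified against the code.
def freeze_vision(model):
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]

# optimizer = torch.optim.AdamW(freeze_vision(model), lr=1e-5)  # illustrative usage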

4. Experiment Tracking System

Comprehensive logging of all experiments:

from aokvqa.experiment_tracker import ExperimentTracker
tracker = ExperimentTracker()
# Log: experiment name, method, config dict, accuracy (%), TFLOPS, and number of evaluated samples
tracker.log_experiment("adaptive_4x4_finetuned", "token_pooling", {...}, 69.87, 0.62, 1145)
tracker.print_summary()

📖 Documentation

  • aokvqa/QUICKSTART.md: 5-minute getting-started guide
  • aokvqa/OPTIMIZATION_GUIDE.md: Comprehensive optimization manual
  • aokvqa/TOOLKIT_REFERENCE.md: Complete command reference

🎓 Lessons Learned

  1. Token pooling is highly effective: 65% FLOP reduction with fine-tuning
  2. Post-hoc testing is fast: Test many configurations in hours, not days
  3. Fine-tuning recovers accuracy: a 6.5-point gain over post-hoc pooling (63.32% → 69.87%)
  4. Not all optimizations work: Resolution/layer pruning too aggressive for this model
  5. Preserve pretrained weights: Critical for maintaining model knowledge

πŸ™ Acknowledgments

This project builds on:

  • nanoVLM by Luis Wiedmann, Aritra Roy Gosthipaty, and Andrés Marafioti
  • A-OKVQA dataset by the Allen Institute for AI (available on the Hugging Face Hub)
  • SigLIP vision encoder by Google
  • SmolLM2 language model by Hugging Face

πŸ“ Citation

If you use this work, please cite both this project and the original nanoVLM:

@misc{knox2025picovlm,
  author = {Vamshi Krishna Bonagiri},
  title = {picoVLM: Efficient Vision-Language Model Optimization},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/victorknox/picoVLM}}
}

@misc{wiedmann2025nanovlm,
  author = {Luis Wiedmann and Aritra Roy Gosthipaty and AndrΓ©s Marafioti},
  title = {nanoVLM},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/nanoVLM}}
}

📄 License

This project inherits the license from nanoVLM. See LICENSE file for details.


🤝 Contributing

This is a research/educational project demonstrating efficiency optimization techniques. Feel free to:

  • Open issues for bugs or questions
  • Submit PRs for improvements
  • Use the code for your own experiments

For contributions to the base nanoVLM framework, please see the original repository.
