NeurIPS 2025 • Spotlight
📄 Paper: link coming soon (arXiv / OpenReview)
We show that large pretrained Vision Transformers (especially self-supervised ones like DINOv2) naturally learn object binding — they internally represent whether two patches belong to the same object (IsSameObject) without any explicit object-level supervision.
We provide a single setup script that installs all dependencies (PyTorch, mmseg, DINOv2, xformers, etc.). During setup, libs/dinov2/requirements.txt is replaced with requirements_dino.txt so that the install uses the exact dependency versions tested in this repo.
```bash
bash scripts/setup.sh
```

The ADE20K dataset is expected to follow this layout:

```
<DATASET_ROOT>/
└── ADE20K_2021_17_01/
    ├── images/ADE/training/
    ├── images/ADE/validation/
    ├── objectInfo150.csv      # <-- MUST BE PLACED HERE
    ├── objects.txt
    ├── index_ade20k.mat
    └── index_ade20k.pkl
```
objectInfo150.csv must be placed at the location marked in the tree above. It maps ADE20K's full label space onto the canonical 150-class version used for probing.
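To sanity-check that the file ended up in the right place (the path follows the layout above):

```bash
ls <DATASET_ROOT>/ADE20K_2021_17_01/objectInfo150.csv
```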
Edit the `wandb` fields in `cfgs/config.yaml`.
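The relevant section looks roughly like the sketch below; the exact key names may differ, so match whatever `cfgs/config.yaml` actually contains (the `entity` and `project` values here are placeholders):

```yaml
# Sketch of the wandb section in cfgs/config.yaml -- key names and values
# are illustrative placeholders, not the repo's exact schema.
wandb:
  entity: your-username    # your W&B account or team (placeholder)
  project: your-project    # W&B project to log runs to (placeholder)
```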
All training/evaluation entry points are provided in the `scripts/` directory.
| Command | Description |
|---|---|
| `bash train.sh` | Train a probe |
| `bash eval.sh` | Evaluate a trained checkpoint |
You select which probing task to run via the `mode=` argument:

| `mode=` | Description |
|---|---|
| `train` | Pairwise (IsSameObject) probing |
| `train_class` | Pointwise class probing |
| `train_identity` | Pointwise identity probing |
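For example, to train a pointwise class probe by invoking `main.py` directly (following the same override pattern as the visualization command below):

```bash
python main.py mode=train_class
```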
Choose the probe architecture via the `probe.mode=` argument: `linear`, `diag_quadratic`, or `quadratic`.
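The two arguments compose; for instance, a pairwise IsSameObject probe with a quadratic head (again assuming overrides are passed straight to `main.py`):

```bash
python main.py mode=train probe.mode=quadratic
```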
To visualize layer-wise IsSameObject scores, run:

```bash
python main.py mode=vis
```