Vision Foundation Models
Vision systems trained once at scale and applied to novel tasks, objects, or domains without retraining — generalising the language-model recipe of pretrain-once-prompt-many to dense vision problems.
A vision foundation model is trained once on a broad data distribution and then applied to novel tasks at test time without per-task fine-tuning, by analogy with large language models that are conditioned on a prompt to handle unfamiliar tasks. The category is well established for general-purpose representations and promptable models (CLIP, DINO, SAM), but it is increasingly being pushed into structured-output vision problems such as 6D pose, medical segmentation, and scene-level dense prediction, where per-task training has historically been the norm. This topic tracks methodology contributions that adapt the foundation-model recipe to such structured outputs.
Foundation models for 6D object pose
Estimating an object’s 6D pose (three rotational and three translational degrees of freedom) is a core requirement for robotic manipulation and AR. Until recently, each new object required either retraining a network or a careful test-time optimisation pipeline. Wen et al. (FoundationPose, 2024) propose a single model that handles a novel object at test time with no fine-tuning, in two operating modes: model-based, when a CAD model is provided, and model-free, when only a small set of reference images is available. A unified architecture covers both regimes: in the model-free case, a neural implicit representation built from the reference images supports novel view synthesis, so the downstream pose estimation modules stay the same as when a CAD model is rendered. The training recipe is large-scale synthetic data generation, in which a large language model aids the data pipeline and a high-fidelity renderer produces training scenes that vary object shape, material, lighting, and clutter. The system unifies pose estimation and tracking in a single framework, generalises to objects unseen during training, and is reported by the authors to outperform methods specialised for each task on the public benchmarks they evaluate. The methodological contribution is the demonstration that a foundation-model recipe (large-scale pretraining plus a test-time conditioning interface) works for a structured 6D output, not only for classification or open-vocabulary recognition.
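To make the "structured 6D output" concrete, the following is a minimal sketch of what a 6D pose is and how an estimate is typically scored against ground truth. It is not FoundationPose code: the function names (`rotation_z`, `apply_pose`, `rotation_error_deg`, `translation_error`) are hypothetical, and the rotation is restricted to the z-axis purely to keep the example short.

```python
import math

def rotation_z(angle_rad):
    """3x3 rotation about the z-axis (one of the three rotational DoF)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_pose(R, t, point):
    """Map a point from the object frame to the camera frame: R @ p + t."""
    return [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle between two rotations, computed from trace(R_est^T R_gt)."""
    trace = sum(R_est[k][i] * R_gt[k][i] for i in range(3) for k in range(3))
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and true translations."""
    return math.dist(t_est, t_gt)

# Illustrative values only: a ground-truth pose and a slightly wrong estimate.
R_gt, t_gt = rotation_z(math.radians(30.0)), [0.0, 0.0, 1.0]
R_est, t_est = rotation_z(math.radians(35.0)), [0.0, 0.0, 1.5]
print(rotation_error_deg(R_est, R_gt), translation_error(t_est, t_gt))
```

A pose estimator, whatever its architecture, outputs the pair `(R, t)`; the model-based and model-free modes differ only in what the model is conditioned on, not in this output.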
In-context learning for medical segmentation
Medical image segmentation is the textbook case of a per-task vision problem: each new modality, anatomy, and label set has traditionally required a fresh model. Butoi et al. (UniverSeg, 2023) propose a single network that segments an unseen task at test time, given a small support set of paired example images and segmentation maps. The architecture uses a CrossBlock module that lets features of the query image interact with features of the support pairs throughout the network, so the segmentation prediction is conditioned on the examples in much the same way that a language model’s output is conditioned on in-context demonstrations. The model is trained on a curated collection of 53 open medical datasets covering many anatomies, modalities, and label conventions, and at test time it segments tasks it has never seen during training, including new anatomies, label sets, and imaging modalities, without any gradient updates. UniverSeg reframes medical segmentation methodology: instead of training a specialist per dataset, train a generalist once and condition it with a handful of labelled examples per deployment. The contribution is the demonstration that in-context learning, the trait most associated with large language models, transfers to dense pixel-level vision outputs.
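The calling convention described above (a query image plus a support set of image and mask pairs, with no gradient updates) can be sketched with a deliberately simple stand-in. This is not UniverSeg and does not implement CrossBlock: `segment_in_context` is a hypothetical nearest-mean-intensity classifier, used only to show that the task is defined by the support set and not by the model's weights.

```python
def segment_in_context(query, support):
    """Label each query pixel using intensity statistics from the support set.

    query:   2D list of pixel intensities.
    support: list of (image, mask) pairs; masks hold 1 for foreground, 0 otherwise.
    Toy stand-in for a learned in-context segmenter: nothing is trained or updated.
    """
    foreground, background = [], []
    for image, mask in support:
        for image_row, mask_row in zip(image, mask):
            for value, label in zip(image_row, mask_row):
                (foreground if label else background).append(value)
    if not foreground or not background:
        raise ValueError("support set must contain both foreground and background pixels")

    fg_mean = sum(foreground) / len(foreground)
    bg_mean = sum(background) / len(background)
    # Assign each query pixel to whichever class mean is closer.
    return [
        [1 if abs(v - fg_mean) < abs(v - bg_mean) else 0 for v in row]
        for row in query
    ]

# Two "tasks" on the same kind of image: bright structures vs dark structures.
image = [[0.9, 0.1], [0.8, 0.2]]
bright_task = [(image, [[1, 0], [1, 0]])]
dark_task = [(image, [[0, 1], [0, 1]])]
query = [[0.2, 0.7]]

print(segment_in_context(query, bright_task))  # [[0, 1]]
print(segment_in_context(query, dark_task))    # [[1, 0]]
```

The same function produces opposite masks for the same query because only the support set changed, which is the behaviour the in-context framing relies on. A learned model replaces the intensity heuristic with features that transfer across anatomies and modalities.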
What the cluster shares
The two papers above are united by a methodological pattern that defines this topic: a single model is trained at scale once, and at test time is conditioned to a novel object, task, or domain through an interface — CAD models or reference images for FoundationPose, an in-context support set for UniverSeg — that does not involve gradient updates. The active research surface includes scaling laws for these recipes, more efficient conditioning interfaces, prompt-tunable open-vocabulary detectors and segmenters, and the integration of vision foundation models with language-model agents that issue the prompts.
Sources
- paper · primary · 2024wen-2024