Self-Supervised Vision

Self-supervised vision covers methods that learn visual representations from unlabeled images, chiefly through pretext tasks and contrastive pretraining. It sits within Computer Vision (Deep Learning) and inherits that area’s core questions about correctness, scale, and tractability. This page surveys the conceptual axes of the topic and points to the references that frame ongoing research and teaching. It is meant both as an entry point for newcomers and as an index for practitioners checking their mental model against the field’s primary sources.

Work on self-supervised vision can be organised around a few interlocking concerns: the formal objects under study, the algorithms or systems that compute over them, the resource trade-offs (time, memory, communication, statistical efficiency), and the empirical or theoretical guarantees that practitioners rely on. The sources cited below approach the topic from a mix of these angles.
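As a rough illustration of the contrastive pretraining mentioned above, the following sketch computes an InfoNCE-style loss over a small batch of embedding pairs. It is a minimal example under stated assumptions, not the formulation of any specific paper: the function names (`l2_normalize`, `info_nce_loss`), the temperature value, and the use of plain Python lists in place of tensors are all choices made here for illustration. Each anchor is pulled toward its own positive (a second view of the same image), while the other positives in the batch act as negatives.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss over a batch.

    positives[i] is treated as the positive for anchors[i]; every other
    row of `positives` serves as a negative for that anchor.
    The temperature of 0.1 is an illustrative default, not a recommendation.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i with every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index i as the target class.
        total += log_denom - logits[i]
    return total / len(a)


if __name__ == "__main__":
    views_a = [[1.0, 0.0], [0.0, 1.0]]
    views_b = [[0.9, 0.1], [0.1, 0.9]]  # slightly perturbed second views
    print(info_nce_loss(views_a, views_b))
```

The loss is small when each anchor is most similar to its own positive and grows when the pairing is scrambled, which is the signal a contrastive encoder is trained on.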

Foundational references

Masked Autoencoders Are Scalable Vision Learners (He et al., 2022) introduces the masked autoencoder (MAE), which hides a large fraction of image patches and trains an encoder-decoder to reconstruct the missing pixels. Emerging Properties in Self-Supervised Vision Transformers (Caron et al., 2021), the DINO paper, trains a student network to match the output of a momentum teacher without using labels. Both are primary methodological references; readers should consult them directly for the precise formulations and results.
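To make the masked-autoencoder pretext task more concrete, the sketch below shows only the random patch-masking step: it splits patch indices into a visible set (fed to the encoder) and a masked set (to be reconstructed). The function name `random_patch_mask` and the `seed` argument are hypothetical, introduced here for illustration; the 75% default follows the high masking ratio discussed in the MAE paper, and the 196-patch figure assumes a 224x224 image cut into 16x16 patches. The reference should be consulted for the actual architecture and training details.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) by uniform random sampling.

    A hypothetical helper: it only chooses which patches are hidden and does
    not implement the encoder, decoder, or reconstruction loss.
    """
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = list(range(num_patches))
    rng.shuffle(perm)
    visible = sorted(perm[:num_keep])  # indices the encoder would see
    masked = sorted(perm[num_keep:])   # indices whose pixels are reconstructed
    return visible, masked


if __name__ == "__main__":
    # Assumed setup: 14 x 14 = 196 patches per image.
    visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
    print(len(visible), len(masked))
```

Because only the visible indices are passed to the encoder in this scheme, a higher mask ratio directly reduces the number of tokens the encoder has to process.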

Open methodological questions in self-supervised vision cluster around how to combine the techniques in these references under realistic constraints: scale, adversarial inputs, partial observability, and shifting workloads. The cited references give the precise statements, proofs, and empirical evaluations that this overview only sketches; downstream topic pages cover specific subfields in more detail.
