2024-07-29

SAM 2: Segment Anything in Images and Videos

Overall reaction

Very positive response; many see SAM/SAM2 as among the most practically useful open models to date.
Users report heavy real-world usage of SAM1 and are eager to adopt SAM2, especially for video and speed improvements.
Some concern about licensing control and regulatory-driven access limits.

What SAM2 does and how it differs from SAM1

Unified, promptable object segmentation for both images and videos, with real-time tracking of objects across frames.
Images are treated as single-frame videos; video support adds “memory attention” with object tokens stored in a FIFO memory bank to track objects over time.
The paper claims modestly better segmentation quality (mIoU) than SAM1 and roughly a 6x speedup on images, mainly from a more efficient encoder.
Can track objects even when they leave and re-enter the frame, but performance depends on memory/cache settings and remains imperfect.
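
A toy sketch of the FIFO memory bank idea described above, to show why tracking depends on the memory size. This is not SAM2's actual implementation: the class name, the `capacity` parameter, and the string "tokens" are all hypothetical stand-ins for illustration. Once the bank is full, each new frame evicts the oldest one.

```python
from collections import deque


class FifoMemoryBank:
    """Toy fixed-size memory: keeps only the most recent `capacity` frames.

    Hypothetical illustration; real object tokens would be feature tensors.
    """

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # deque with maxlen drops the oldest entry automatically when full.
        self._entries = deque(maxlen=capacity)

    def add(self, frame_index: int, object_token: str) -> None:
        """Store the token produced for one frame."""
        self._entries.append((frame_index, object_token))

    def frame_indices(self) -> list[int]:
        """Return the frame indices still remembered, oldest first."""
        return [index for index, _ in self._entries]


bank = FifoMemoryBank(capacity=3)
for frame_index in range(5):
    bank.add(frame_index, f"token-{frame_index}")

print(bank.frame_indices())  # [2, 3, 4]
```

In this toy version, an object that stays out of frame for longer than `capacity` frames has no remaining entries to match against, which is one way to picture why results vary with memory/cache settings.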

Licensing, openness, and CLA

Code and model weights are released under Apache 2.0; the dataset is released under a Creative Commons license.
The release is compared favorably with more restrictive LLM releases.
Presence of a Contributor License Agreement worries some, who see it as centralizing rights and signaling limited community ownership, though existing versions can’t be relicensed retroactively.

Demos, access, and browser issues

Official web demo praised for ease-of-use; examples include shoes, sports balls, everyday objects.
Demo is blocked for users in Texas and Illinois, which commenters attribute to local biometric privacy laws; some EU/Germany users also report geo-restrictions and see them as either a regulatory side effect or a lobbying tactic.
Firefox is not supported due to missing video APIs; users must use Chrome/Safari.
Some confusion/complaints about cookies and consent banners.

Technical details and performance

Training: 256 A100 GPUs for 108 hours, i.e. 256 × 108 = 27,648 GPU-hours (more than SAM1, but considered relatively cheap for video capability).
New SA-V dataset: ~50k videos, built via phased, SAM-assisted annotation that speeds up labeling dramatically.
Can run on CPU (slowly) and on non-NVIDIA GPUs; users report success running SAM1 on AMD and on Apple M1 (via MPS) and expect similar results for SAM2, though setup is non-trivial.
Questions about performance on Raspberry Pi, iPhone, and Metal remain largely unanswered; mobile readiness is unclear.
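
For the non-NVIDIA setups mentioned above, device selection in PyTorch might look roughly like the sketch below. The helper names `pick_device` and `detect_device` are hypothetical and not part of any SAM release; the sketch assumes PyTorch is the runtime and falls back to CPU if it is not installed. ROCm builds of PyTorch for AMD GPUs report through the same `cuda` device name.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Choose a device name: GPU via CUDA/ROCm first, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; fall back to CPU without it."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may lack the MPS backend module entirely.
    mps = getattr(torch.backends, "mps", None)
    mps_available = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_available)


print(detect_device())
```

The returned string can be passed to `torch.device(...)`; whether a given model actually runs well on that backend is a separate question, as the thread notes.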

Use cases and integrations

Reported uses of SAM1:
- Rapid meme/graphics creation (high-quality alpha masks).
- Industrial facilities: segmenting pipes/valves before classification.
- Massive acceleration of dataset annotation (millions of images; years of human time saved).
- Biology and microscopy, 3D stacks, and medical-like imagery.
- GUI element segmentation for automation tools.
- GIMP plugins and browser-based tools.
- Art projects that decompose personal photo streams into object databases.
SAM2 is quickly being integrated into labeling platforms and tools; people expect hosted APIs from third parties soon.

Limitations, open questions, and future directions

Known weaknesses:
- Struggles with fine structures and semi-transparent/complex boundaries (hair, fences, splashing liquids, snow, foliage).
- Challenged by multiple similar objects (e.g., juggling, similar balls) and fast motion or motion blur.
- Robust tracking may require combining SAM2's segmentation with dedicated trackers.
Conceptual questions arise about:
- How the memory mechanism could translate to LLMs.
- Whether SAM2 is a good base for frame-level classifiers.
- How to fine-tune officially, and whether guidance will be provided.
- Extending the idea to audio segmentation or to “segment anything” for long text (semantic chunking for RAG).
Some raise ethical/security concerns (military uses, adversarial attacks, biometric implications), but these aren’t deeply explored in the thread.
