SAM 2: Segment Anything in Images and Videos
Overall reaction
- Very positive response; many see SAM/SAM2 as among the most practically useful open models to date.
- Users report heavy real-world usage of SAM1 and are eager to adopt SAM2, especially for video and speed improvements.
- Some concern about licensing control and regulatory-driven access limits.
What SAM2 does and how it differs from SAM1
- Unified, promptable object segmentation for both images and videos, with real-time tracking of objects across frames.
- Images are treated as single-frame videos; video support adds “memory attention” with object tokens stored in a FIFO memory bank to track objects over time.
- The paper reports modestly better image segmentation quality (mIoU) than SAM1 and a speedup of roughly 6x on images, mainly from a more efficient image encoder.
- Can track objects even when they leave and re-enter frame, but performance depends on memory/cache settings and remains imperfect.
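To make the FIFO memory point concrete, here is a toy sketch of why re-entry tracking depends on memory size. It is not SAM2's implementation: the real model stores learned features, while this version stores only a boolean per frame. The names `FifoMemoryBank`, `FrameMemory`, and `has_object_memory` are hypothetical and exist only for this illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class FrameMemory:
    frame_index: int
    object_seen: bool  # stand-in for learned per-frame features (assumption)


class FifoMemoryBank:
    """Toy FIFO store that keeps only the most recent `capacity` frames."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # A deque with maxlen evicts the oldest entry automatically when full.
        self.entries = deque(maxlen=capacity)

    def add(self, frame_index: int, object_seen: bool) -> None:
        self.entries.append(FrameMemory(frame_index, object_seen))

    def frames(self) -> list:
        return [m.frame_index for m in self.entries]


def has_object_memory(visible: list, capacity: int) -> list:
    """For each frame, report whether the bank still holds an earlier frame
    in which the object was seen, checked before the frame is stored."""
    bank = FifoMemoryBank(capacity)
    result = []
    for index, seen in enumerate(visible):
        result.append(any(m.object_seen for m in bank.entries))
        bank.add(index, seen)
    return result
```

With the object visible in frames 0 and 1, absent for frames 2 to 4, and back in frame 5, a capacity of 2 has already evicted every sighting by the time the object returns, while a capacity of 4 still holds frame 1. That is the kind of dependence on memory settings that commenters describe.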
Licensing, openness, and CLA
- Code and model weights are released under Apache 2.0; the SA-V dataset is released under a Creative Commons license (CC BY 4.0).
- Commenters compare this licensing favorably with more restrictive LLM releases.
- The presence of a Contributor License Agreement (CLA) worries some, who see it as centralizing rights and signaling limited community ownership, though already-released versions cannot be relicensed retroactively.
Demos, access, and browser issues
- The official web demo is praised for ease of use; examples include shoes, sports balls, and everyday objects.
- The demo is blocked for users in Texas and Illinois, which commenters attribute to local biometric privacy laws; some EU/Germany users also report geo-restrictions and see them as either a regulatory side effect or a lobbying tactic.
- Firefox is not supported due to missing video APIs; users must use Chrome/Safari.
- Some confusion/complaints about cookies and consent banners.
Technical details and performance
- Training: 256 A100 GPUs for 108 hours, about 27,648 GPU-hours (more than SAM1, but considered relatively cheap for video capability).
- New SA-V dataset: ~50k videos, built via phased, SAM-assisted annotation that speeds up labeling dramatically.
- Can run on CPU (slow) and non-NVIDIA GPUs; users report success with AMD and Apple M1 (via MPS) for SAM1, expect similar for SAM2 though setup is non-trivial.
- Questions about performance on Raspberry Pi, iPhone, and Metal remain largely unanswered; mobile readiness is unclear.
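For the CPU and non-NVIDIA reports above, a minimal device-fallback sketch may help. It assumes PyTorch is the runtime and uses only `torch.cuda.is_available()` and `torch.backends.mps.is_available()`; the helper names `pick_device` and `detect_device` are hypothetical, and nothing here is taken from the SAM2 codebase.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer a CUDA-style GPU, then Apple MPS, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch if it is installed; otherwise assume CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    mps_available = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_available)
```

ROCm builds of PyTorch expose AMD GPUs through the same `"cuda"` device name, so the first branch is expected to cover the AMD setups mentioned in the thread. Whether every SAM2 operation runs on MPS is not established there.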
Use cases and integrations
- Reported uses of SAM1:
  - Rapid meme/graphics creation (high-quality alpha masks).
  - Industrial facilities: segmenting pipes/valves before classification.
  - Massive acceleration of dataset annotation (millions of images; years of human time saved).
  - Biology and microscopy, 3D stacks, and medical-like imagery.
  - GUI element segmentation for automation tools.
  - GIMP plugins and browser-based tools.
  - Art projects that decompose personal photo streams into object databases.
- SAM2 is quickly being integrated into labeling platforms and tools; people expect hosted APIs from third parties soon.
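The alpha-mask use case listed among the SAM1 uses amounts to turning a binary segmentation mask into a transparency channel. The sketch below shows the idea in plain Python on nested lists. The function name `cutout_rgba` and the data layout are assumptions for illustration; a real pipeline would more likely use an image library and a soft (non-binary) mask.

```python
def cutout_rgba(image, mask):
    """Combine an RGB image with a binary mask into RGBA pixels.

    `image` is a list of rows of (r, g, b) tuples; `mask` is a list of rows
    of booleans with the same shape. Masked-in pixels become fully opaque
    (alpha 255) and everything else becomes fully transparent (alpha 0).
    """
    same_height = len(image) == len(mask)
    if not same_height or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must have the same shape")
    return [
        [(r, g, b, 255 if keep else 0) for (r, g, b), keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
```

For example, a one-row image `[[(10, 20, 30), (40, 50, 60)]]` with mask `[[True, False]]` keeps the first pixel opaque and makes the second transparent.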
Limitations, open questions, and future directions
- Known weaknesses:
  - Struggles with fine structures and semi-transparent or complex boundaries (hair, fences, splashing liquids, snow, foliage).
  - Challenged by multiple similar objects (e.g., juggling, similar balls) and by fast motion or motion blur.
  - Robust tracking may require combining SAM2's segmentation with a dedicated tracker.
- Conceptual questions arise about:
  - How the memory mechanism could translate to LLMs.
  - Whether SAM2 is a good base for frame-level classifiers.
  - How to fine-tune officially, and whether guidance will be provided.
  - Extending the idea to audio segmentation or to “segment anything” for long text (semantic chunking for RAG).
- Some raise ethical/security concerns (military uses, adversarial attacks, biometric implications), but these aren’t deeply explored in the thread.
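On the suggestion of pairing segmentation with a dedicated tracker, one common baseline is to associate masks across consecutive frames by intersection-over-union (IoU), the same overlap measure that underlies mIoU. The sketch below is a generic greedy matcher, not something SAM2 or the thread prescribes. Masks are represented as sets of `(row, col)` pixels, and the names `iou`, `associate`, and the `min_iou` threshold are illustrative assumptions.

```python
def iou(a: set, b: set) -> float:
    """Intersection-over-union of two pixel sets; 0.0 if both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def associate(previous: dict, current: list, min_iou: float = 0.3) -> dict:
    """Greedily match current masks to previous track ids by highest IoU.

    `previous` maps track id -> pixel set from the last frame; `current` is
    a list of pixel sets from the new frame. Returns a mapping from the
    index of a current mask to its track id; unmatched masks are omitted.
    """
    candidates = sorted(
        (
            (iou(prev_mask, cur_mask), track_id, cur_index)
            for track_id, prev_mask in previous.items()
            for cur_index, cur_mask in enumerate(current)
        ),
        key=lambda item: item[0],
        reverse=True,
    )
    used_tracks, used_masks, matches = set(), set(), {}
    for score, track_id, cur_index in candidates:
        if score < min_iou:
            break  # remaining candidates overlap even less
        if track_id in used_tracks or cur_index in used_masks:
            continue
        matches[cur_index] = track_id
        used_tracks.add(track_id)
        used_masks.add(cur_index)
    return matches
```

Pure overlap matching like this is exactly what tends to fail in the cases named above (similar-looking objects, fast motion), which is why commenters suggest combining it with motion or appearance cues from a proper tracker.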