SAM 2: Segment Anything in Images and Videos

Overall reaction

  • Very positive response; many see SAM/SAM2 as among the most practically useful open models to date.
  • Users report heavy real-world usage of SAM1 and are eager to adopt SAM2, especially for video and speed improvements.
  • Some concern about licensing control and regulatory-driven access limits.

What SAM2 does and how it differs from SAM1

  • Unified, promptable object segmentation for both images and videos, with real-time tracking of objects across frames.
  • Images are treated as single-frame videos; for video, a “memory attention” step conditions each new frame on object tokens held in a FIFO memory bank, which is how objects are tracked over time.
  • The paper claims modestly better image segmentation quality (mIoU) than SAM1 and a roughly 6x speedup on images, mainly from a more efficient encoder.
  • Can track objects even when they leave and re-enter frame, but performance depends on memory/cache settings and remains imperfect.
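
To make the FIFO memory idea concrete, here is a minimal sketch of a fixed-size memory bank. It is a toy, not SAM2's implementation: the class name, the `capacity` parameter, and the string "tokens" are all hypothetical, and the real model stores more than this. It only illustrates one reason memory settings could matter for objects that leave the frame: once the bank fills with newer frames, older memories are evicted.

```python
from collections import deque


class FifoMemoryBank:
    """Toy memory bank: remembers only the most recent `capacity` frames."""

    def __init__(self, capacity):
        # A deque with maxlen evicts its oldest entry automatically when full.
        self._entries = deque(maxlen=capacity)

    def add(self, frame_index, object_token):
        """Store one frame's (hypothetical) object token."""
        self._entries.append((frame_index, object_token))

    def frame_indices(self):
        return [index for index, _ in self._entries]

    def tokens(self):
        return [token for _, token in self._entries]


bank = FifoMemoryBank(capacity=3)
# The object is visible in frames 0-1, then out of frame for frames 2-4.
for frame_index, token in enumerate(["dog", "dog", None, None, None]):
    bank.add(frame_index, token)

print(bank.frame_indices())    # [2, 3, 4]
print("dog" in bank.tokens())  # False: every memory of the object was evicted
```

With a larger `capacity` the same loop would still hold the frame 0-1 entries, which is the trade-off between memory size and robustness to long absences that this toy is meant to show.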

Licensing, openness, and CLA

  • Code and model weights are released under Apache 2.0; the SA-V dataset is released under a Creative Commons license (CC BY 4.0).
  • Compared favorably against more restrictive LLM releases.
  • The Contributor License Agreement (CLA) worries some, who see it as centralizing rights and signaling limited community ownership, though versions already released can’t be relicensed retroactively.

Demos, access, and browser issues

  • Official web demo praised for ease-of-use; examples include shoes, sports balls, everyday objects.
  • The demo is blocked for users in Texas and Illinois, which commenters attribute to local biometric privacy laws; some users in the EU (Germany in particular) also report geo-restrictions and see them as either a regulatory side effect or a lobbying tactic.
  • Firefox is not supported due to missing video APIs; users must use Chrome/Safari.
  • Some confusion/complaints about cookies and consent banners.

Technical details and performance

  • Training: 256 A100 GPUs for 108 hours, i.e. about 27,600 GPU-hours (more than SAM1, but considered relatively cheap for video capability).
  • New SA-V dataset: ~50k videos, built via phased, SAM-assisted annotation that speeds up labeling dramatically.
  • Can run on CPU (slowly) and on non-NVIDIA GPUs; users report success running SAM1 on AMD GPUs and on Apple M1 (via MPS), and expect similar results for SAM2, though setup is non-trivial.
  • Questions about performance on Raspberry Pi, iPhone, and Metal remain largely unanswered; mobile readiness is unclear.
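
For readers trying the non-NVIDIA route, the usual PyTorch pattern is to probe for an accelerator and fall back to CPU. The sketch below is generic PyTorch device selection, not code from the SAM2 repository; the helper name `pick_device` is hypothetical, and whether SAM2 itself runs well on each backend is exactly the open question raised above.

```python
from types import SimpleNamespace


def pick_device(torch_module=None):
    """Return 'cuda', 'mps', or 'cpu' based on what PyTorch reports."""
    if torch_module is None:
        try:
            import torch as torch_module
        except ImportError:
            return "cpu"  # PyTorch is not installed: nothing to accelerate
    # ROCm builds of PyTorch for AMD GPUs also report through torch.cuda.
    if torch_module.cuda.is_available():
        return "cuda"
    # The MPS backend (Apple silicon) is absent in older PyTorch builds.
    mps = getattr(torch_module.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


# Stand-in for torch on an Apple-silicon machine, so the logic can be
# exercised without PyTorch installed.
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: False),
    backends=SimpleNamespace(mps=SimpleNamespace(is_available=lambda: True)),
)
print(pick_device(fake_torch))  # mps
```

Calling `pick_device()` with no argument uses the installed PyTorch, if any; the resulting string can be passed to `torch.device(...)`.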

Use cases and integrations

  • Reported uses of SAM1:
    • Rapid meme/graphics creation (high-quality alpha masks).
    • Industrial facilities: segmenting pipes/valves before classification.
    • Massive acceleration of dataset annotation (millions of images; years of human time saved).
    • Biology and microscopy, 3D stacks, and medical-like imagery.
    • GUI element segmentation for automation tools.
    • GIMP plugins and browser-based tools.
    • Art projects that decompose personal photo streams into object databases.
  • SAM2 is quickly being integrated into labeling platforms and tools; people expect hosted APIs from third parties soon.

Limitations, open questions, and future directions

  • Known weaknesses:
    • Struggles with fine structures and semi-transparent/complex boundaries (hair, fences, splashing liquids, snow, foliage).
    • Challenged by multiple similar objects (e.g., juggling, similar balls) and fast motion or motion blur.
    • For robust tracking, SAM2's segmentation may need to be combined with a dedicated tracker.
  • Conceptual questions arise about:
    • How the memory mechanism could translate to LLMs.
    • Whether SAM2 is a good base for frame-level classifiers.
    • How to fine-tune the model, and whether official guidance will be provided.
    • Extending the idea to audio segmentation or to “segment anything” for long text (semantic chunking for RAG).
  • Some raise ethical/security concerns (military uses, adversarial attacks, biometric implications), but these aren’t deeply explored in the thread.
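
As an illustration of what pairing segmentation with a separate tracker can mean, here is a minimal sketch of the simplest baseline: matching masks between consecutive frames by intersection-over-union (IoU, the per-mask quantity that mIoU averages). This is not SAM2's mechanism or anything proposed in the thread; the function names, the pixel-set mask representation, and the 0.5 threshold are assumptions chosen for brevity.

```python
def iou(mask_a, mask_b):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 0.0
    return len(mask_a & mask_b) / len(union)


def match_masks(previous, current, threshold=0.5):
    """Greedily assign each known object to its best-overlapping new mask.

    previous: {object_id: mask} from the last frame
    current:  list of masks from the new frame
    Returns {object_id: index into current}; unmatched objects are omitted.
    """
    candidates = sorted(
        (
            (iou(old_mask, new_mask), object_id, index)
            for object_id, old_mask in previous.items()
            for index, new_mask in enumerate(current)
        ),
        reverse=True,  # highest overlap first
    )
    assigned, used = {}, set()
    for score, object_id, index in candidates:
        if score < threshold:
            break  # everything after this overlaps even less
        if object_id in assigned or index in used:
            continue
        assigned[object_id] = index
        used.add(index)
    return assigned


previous = {
    "ball_a": {(0, 0), (0, 1), (1, 0), (1, 1)},
    "ball_b": {(5, 5), (5, 6), (6, 5), (6, 6)},
}
# Both balls moved one pixel; the new masks arrive in arbitrary order.
current = [
    {(5, 6), (6, 5), (6, 6), (6, 7)},
    {(0, 1), (1, 0), (1, 1), (1, 2)},
]
print(match_masks(previous, current))
```

Overlap-based matching like this breaks down in the cases listed under known weaknesses: similar objects that cross paths, and fast motion where consecutive masks barely overlap. That is why dedicated trackers usually add motion or appearance cues.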