Show HN: Gemini can now natively embed video, so I built sub-second video search

Use Cases and Applications

  • Dashcam and home security footage highlighted as primary use cases (e.g., quickly finding specific incidents, pets escaping, or falls).
  • Proposed for home monitoring, trail and game cams (“find all deer encounters”), and broader surveillance (retail, workplace, state).
  • Suggested for social media/product monitoring (brand tracking across TikTok/Instagram), porn indexing, ad detection/removal, and automatic alerts as a “virtual security guard.”
  • Video editing ideas: search-and-cut features (“remove all scenes with X”) via EDL export or NLE plugins.
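
The search-and-cut idea can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the search step has already returned the (start, end) spans to remove, in seconds, and that the target editor accepts a CMX3600-style EDL. The function names, the 30 fps non-drop timecode, and the "AX" reel name are assumptions made for the example; the exact column layout should be checked against the NLE being used.

```python
def keep_segments(duration, remove):
    """Return the parts of [0, duration] not covered by any span in `remove`."""
    kept, cursor = [], 0.0
    for start, end in sorted(remove):
        # Clamp each span to the clip so out-of-range matches are harmless.
        start = min(max(start, 0.0), duration)
        end = min(max(end, 0.0), duration)
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        kept.append((cursor, duration))
    return kept


def timecode(seconds, fps=30):
    """Format seconds as HH:MM:SS:FF non-drop timecode (fps is an assumption)."""
    frames = round(seconds * fps)
    secs, ff = divmod(frames, fps)
    mins, ss = divmod(secs, 60)
    hh, mm = divmod(mins, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"


def to_edl(title, segments, fps=30):
    """Emit one video cut event per kept segment, laid back to back."""
    lines = [f"TITLE: {title}", "FCM: NON-DROP FRAME", ""]
    record = 0.0  # running position on the output timeline
    for n, (start, end) in enumerate(segments, 1):
        length = end - start
        lines.append(
            f"{n:03d}  AX       V     C        "
            f"{timecode(start, fps)} {timecode(end, fps)} "
            f"{timecode(record, fps)} {timecode(record + length, fps)}"
        )
        record += length
    return "\n".join(lines)


# Hypothetical search hits for "scenes with X" in a 60-second clip.
hits = [(10.0, 20.0), (15.0, 25.0)]
print(to_edl("without_x", keep_segments(60.0, hits)))
```

Overlapping hits are merged implicitly, which matters here because overlapping chunks can return the same event twice.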

Technical Behavior and Quality

  • Gemini Embedding 2 can embed video directly: no transcription or captions required.
  • Embeddings capture visible text (signs, captions) and audio features (e.g., “someone yelling”), though audio was not fully tested by the project author.
  • Temporal structure is respected: the embedding is not just a CLIP-style average of per-frame vectors.
  • Video is chunked with a configurable overlap (default 5s) so that events straddling a chunk boundary are not missed. No formal benchmarks have been run yet.
  • Retrieval quality is good but often requires specific queries; more detailed descriptions yield better matches.
  • There is currently no confidence thresholding: the system returns the “closest” clip even when no good match exists.
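
Two of the points above, overlapping chunks and the missing confidence threshold, can be illustrated with a short sketch. It is not the project's code: the 30-second chunk length and the 0.5 cutoff are placeholder assumptions (only the 5-second overlap default comes from the thread), the function names are hypothetical, and the vectors stand in for whatever the embedding model returns. A usable cutoff would have to be tuned on real footage.

```python
import math


def chunk_spans(duration, chunk_len=30.0, overlap=5.0):
    """Split [0, duration] into windows that overlap by `overlap` seconds."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be non-negative and shorter than chunk_len")
    step = chunk_len - overlap
    spans, start = [], 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        spans.append((start, end))
        if end >= duration:
            break
        start += step
    return spans


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query_vec, index, min_score=0.5):
    """Return (span, score) for the closest chunk, or None if below the cutoff.

    `index` is a list of (span, vector) pairs; `min_score` is a hypothetical
    threshold, not a value from the project.
    """
    if not index:
        return None
    span, score = max(
        ((s, cosine(query_vec, v)) for s, v in index), key=lambda pair: pair[1]
    )
    return (span, score) if score >= min_score else None


spans = chunk_spans(70.0)  # three windows, each sharing 5 s with its neighbor
index = list(zip(spans, [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]))  # toy vectors
print(best_match([1.0, 0.0], index))   # closest chunk clears the cutoff
print(best_match([-1.0, 0.0], index))  # nothing is close enough, so None
```

Returning None instead of the nearest clip is the behavior the thread says is currently missing.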

Cost, Scale, and Local Alternatives

  • Gemini pricing: video is sampled at ~1 frame/sec and billed at ~$0.00079/frame, so one hour (3,600 frames) costs ~$2.84 to index under default settings.
  • Some debate/misunderstanding in the thread about effective cost per hour; resolved by clarifying Gemini’s internal 1 fps tokenization.
  • Cost currently limits continuous real-time indexing, though it could be trivial for governments or wealthy organizations at scale.
  • Several participants seek open-weight or local video embedding models; CLIP-based and Qwen VL embedding mentioned, plus Intel/OpenVINO tooling, but none clearly match Gemini’s temporal video embedding out of the box.
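
The per-hour figure quoted above follows directly from the per-frame price. A small sketch of the arithmetic, using only the approximate numbers from the thread (the function name is hypothetical, and it does not model chunk overlap or any preprocessing):

```python
def indexing_cost(hours, fps=1.0, usd_per_frame=0.00079):
    """Estimated cost in USD to index `hours` of footage at the quoted rates."""
    frames = hours * 3600 * fps
    return frames * usd_per_frame


print(f"1 hour:   ${indexing_cost(1):.2f}")   # 3,600 frames
print(f"24 hours: ${indexing_cost(24):.2f}")  # one camera, one day
```

At these rates, a single camera recording around the clock comes to roughly $68 per day, which is consistent with the thread's point that continuous indexing is currently cost-limited.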

Privacy, Surveillance, and Dystopia Concerns

  • Strong concerns about panopticon-like surveillance once it is cheap to index every public and private camera feed.
  • Discussion of law-enforcement and municipal systems (ALPR, Fusus, civilian camera integration, facial recognition vans).
  • Embeddings enable searching for descriptions (“tall man in trench coat”) rather than just faces, raising tracking concerns.
  • Some see this as inevitable tech progression; others argue for regulation, pausing AI, or keeping processing local to mitigate risks.