Show HN: Gemini can now natively embed video, so I built sub-second video search
Use Cases and Applications
- Dashcam and home security footage highlighted as primary use cases (e.g., quickly finding specific incidents, pets escaping, or falls).
- Proposed for home monitoring, trail and game cams (“find all deer encounters”), and surveillance more broadly (retail, workplace, state).
- Suggested for social media/product monitoring (brand tracking across TikTok/Instagram), porn indexing, ad detection/removal, and automatic alerts as a “virtual security guard.”
- Video editing ideas: search-and-cut features (“remove all scenes with X”) via EDL (edit decision list) export or NLE (non-linear editor) plugins.
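The search-and-cut idea can be sketched as follows: given the time spans a search flagged for removal, compute the spans to keep and write them out as a simplified CMX3600-style EDL. This is a minimal sketch, not the project's implementation. The function names, the `AX` reel name, and the 30 fps non-drop timecode are all assumptions, and a real NLE may need different reel names or frame rates.

```python
def to_timecode(seconds: float, fps: int = 30) -> str:
    """Convert seconds to HH:MM:SS:FF non-drop timecode (fps is assumed)."""
    total_frames = round(seconds * fps)
    frames = total_frames % fps
    total_seconds = total_frames // fps
    hours, rem = divmod(total_seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}:{frames:02d}"


def keep_ranges(duration: float, remove: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the parts of [0, duration] not covered by any span in `remove`."""
    keep = []
    cursor = 0.0
    for start, end in sorted(remove):
        # Clamp each span to the clip so out-of-range matches are harmless.
        start = min(max(start, 0.0), duration)
        end = min(max(end, 0.0), duration)
        if start > cursor:
            keep.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        keep.append((cursor, duration))
    return keep


def build_edl(title: str, duration: float, remove: list[tuple[float, float]], fps: int = 30) -> str:
    """Emit one video cut event per kept range (simplified CMX3600 layout)."""
    lines = [f"TITLE: {title}", "FCM: NON-DROP FRAME", ""]
    record = 0.0  # running position on the output timeline
    for number, (start, end) in enumerate(keep_ranges(duration, remove), 1):
        length = end - start
        lines.append(
            f"{number:03d}  AX       V     C        "
            f"{to_timecode(start, fps)} {to_timecode(end, fps)} "
            f"{to_timecode(record, fps)} {to_timecode(record + length, fps)}"
        )
        record += length
    return "\n".join(lines)


# Hypothetical usage: drop the 10s-20s span from a 60s clip.
print(build_edl("demo", 60.0, [(10.0, 20.0)]))
```

Overlapping matches are merged implicitly, because the cursor only ever moves forward.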
Technical Behavior and Quality
- Gemini Embedding 2 can embed video directly: no transcription or captions required.
- Embeddings capture visible text (signs, captions) and audio features (e.g., “someone yelling”), though audio was not fully tested by the project author.
- Temporal structure is respected: the embedding is not just a CLIP-style average over individual frames.
- Video is chunked with a configurable overlap (default 5s) so that events at chunk boundaries are not missed. No formal benchmarks exist yet.
- Retrieval quality is good but often requires specific queries; more detailed descriptions yield better matches.
- There is currently no confidence thresholding: the system returns the “closest” clip even when no good match exists.
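Two of the points above lend themselves to a short sketch: overlapping chunk boundaries, and a similarity floor so that a poor “closest” match is rejected instead of returned. This is illustrative only. The 30s chunk length, the `min_score` value of 0.3, and all function names are assumptions (the source states only the 5s default overlap), and the embedding vectors are toy placeholders, not output from any real model.

```python
import math


def chunk_spans(duration: float, chunk_len: float = 30.0, overlap: float = 5.0) -> list[tuple[float, float]]:
    """Split [0, duration] into windows that share `overlap` seconds with their neighbour."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be non-negative and shorter than chunk_len")
    step = chunk_len - overlap
    spans = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        spans.append((start, end))
        if end >= duration:
            break
        start += step
    return spans


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, index, min_score: float = 0.3):
    """Return (span, score) for the best chunk, or None if nothing clears min_score.

    `index` is a list of (span, embedding) pairs; min_score is a hypothetical
    threshold that would need tuning against real data.
    """
    best = max(index, key=lambda item: cosine(query_vec, item[1]), default=None)
    if best is None:
        return None
    score = cosine(query_vec, best[1])
    return (best[0], score) if score >= min_score else None


# Toy 2-D vectors stand in for real video embeddings.
spans = chunk_spans(70.0)
index = list(zip(spans, [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]))
print(search([0.9, 0.1], index))    # best match is the first chunk
print(search([-1.0, 0.0], index))   # nothing similar enough, so None
```

A fixed threshold is the simplest option. Since no benchmarks exist yet, a suitable value would have to be found empirically.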
Cost, Scale, and Local Alternatives
- Gemini pricing: video is sampled at ~1 frame/sec and priced at ~$0.00079/frame, which works out to ~$2.84 per hour of indexed footage (3,600 frames) under default settings.
- The thread saw some confusion over the effective cost per hour, which was resolved by clarifying that Gemini tokenizes video internally at 1 fps.
- Cost currently limits continuous real-time indexing, though it could be trivial for governments or wealthy organizations operating at scale.
- Several participants seek open-weight or local video embedding models; CLIP-based and Qwen VL embedding mentioned, plus Intel/OpenVINO tooling, but none clearly match Gemini’s temporal video embedding out of the box.
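The per-hour figure quoted above follows directly from the per-frame price. The sketch below reproduces that arithmetic using the approximate numbers from the thread. The function name is hypothetical, and the calculation ignores any extra frames that overlapping chunks might add.

```python
FRAMES_PER_SECOND = 1.0        # approximate sampling rate cited in the thread
COST_PER_FRAME_USD = 0.00079   # approximate per-frame price cited in the thread


def indexing_cost_usd(hours: float,
                      fps: float = FRAMES_PER_SECOND,
                      cost_per_frame: float = COST_PER_FRAME_USD) -> float:
    """Estimate embedding cost for `hours` of footage at a flat per-frame price."""
    frames = hours * 3600 * fps
    return frames * cost_per_frame


print(f"1 hour:   ${indexing_cost_usd(1):.2f}")
print(f"24 hours: ${indexing_cost_usd(24):.2f}")
```

At these rates one camera recording around the clock costs roughly $68 per day to index, which illustrates why continuous indexing is considered expensive for individuals.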
Privacy, Surveillance, and Dystopia Concerns
- Strong concerns about panopticon-like surveillance once it is cheap to index every public and private camera feed.
- Discussion of law-enforcement and municipal systems (ALPR, Fusus, civilian camera integration, facial recognition vans).
- Embeddings enable searching for descriptions (“tall man in trench coat”) rather than just faces, raising tracking concerns.
- Some see this as inevitable tech progression; others argue for regulation, pausing AI, or keeping processing local to mitigate risks.