2024-05-15

PaliGemma: Open-Source Multimodal Model by Google

Model architecture and capabilities

Some commenters describe PaliGemma as “two models slapped together” (a vision encoder plus a language model) and question whether that design is outdated; others reply that most multimodal systems (e.g., CLIP-like ones) work this way and that even GPT‑4o likely uses an encoder.
According to the linked article (as quoted in the thread), PaliGemma is competitive with GPT‑4o, faster in some cases, and strong at OCR; it also supports object detection (bounding boxes) and segmentation, which major closed models reportedly don’t expose.
A few users are impressed with its OCR and segmentation, including its ability to output coordinates/masks, which they assumed transformers would struggle with.

Hardware requirements and model size

The model has only 3B parameters; people expect it to run on consumer GPUs (e.g., RTX 3060) and even on phones.
Debate over practicality: some say iPhones can run small LLMs on-device using their GPUs; others argue that while technically possible, it’s too slow or limited for “realistic” workloads.
Commenters note image models can be smaller than pure LLMs, and small models are attractive for high‑throughput or task‑specific deployments.
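
As a rough check on the consumer-GPU expectation, the weight footprint can be estimated from the parameter count alone. This is a back-of-the-envelope sketch, not a measurement: the 3e9 figure comes from the thread, the bytes-per-parameter values are the usual sizes for each precision, and the helper name `weight_memory_gib` is made up for illustration. Activations, the attention cache, and framework overhead come on top of these numbers.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the model weights only, in GiB."""
    return num_params * bytes_per_param / 2**30


# "3B parameters" as stated in the thread; the real count may differ slightly.
PARAMS = 3e9

# Typical storage sizes per parameter for common precisions.
PRECISIONS = [
    ("float32", 4.0),
    ("float16/bfloat16", 2.0),
    ("int8", 1.0),
    ("4-bit", 0.5),
]

for name, nbytes in PRECISIONS:
    print(f"{name:>17}: {weight_memory_gib(PARAMS, nbytes):.1f} GiB")
```

At 16-bit precision the weights come to roughly 5.6 GiB, which is consistent with the expectation that a 12 GB card such as the common desktop RTX 3060 could hold the model, and quantized variants are in the range that phones might plausibly fit.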

Licensing, “open source” debate, and usage restrictions

Multiple comments stress that PaliGemma/Gemma is not FOSS: the license isn’t OSI‑approved and includes an Acceptable Use Policy that can change over time.
Discussion of “Model Derivatives” versus outputs: the interpretation offered is that models trained on Gemma outputs are restricted, but the raw outputs themselves are not.
Strong pushback on using “open” / “open source” loosely; some insist OSI’s definition should be the standard, others argue the term is broader and not legally fixed.
Concerns over broad use bans (e.g., automated decisions in finance, legal, employment, healthcare) and the chilling effect of vague terms like “decisions.”

Fine‑tuning, alternatives, and benchmarks

Some see PaliGemma mainly as a fine‑tuneable, commercially usable base; others say there are stronger open‑source VLMs (e.g., LLaVA‑Mistral, Moondream) unless a tiny model is required.
Fine‑tuning recipes and tools (LLaVA scripts, XTuner, TinyLLaVA, specific papers/datasets) are referenced; smaller models can be tuned on a few rented GPUs.

Segmentation and output interpretation

Several users struggle to interpret segmentation tokens (e.g., <locXXXX>, <segXXX>); decoding is described as “tedious.”
Others point to official docs and example code (e.g., Hugging Face demo) that map tokens to bounding boxes and masks, with community tooling promised to simplify this.
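
To make the token format less opaque, here is a minimal sketch of decoding detection output into pixel boxes. It rests on assumptions that should be checked against the official docs: each box is four `<locNNNN>` tokens in the order y_min, x_min, y_max, x_max, each value is a bin index out of 1024, and multiple objects are separated by `;`. The function name `parse_detections` and the sample string are invented for illustration. Any `<segNNN>` tokens are only collected as integers here; turning them into a mask needs the learned decoder from the reference code, which this sketch does not reproduce.

```python
import re

# Four location tokens, optional segmentation tokens, then a free-text label.
ENTRY = re.compile(r"((?:<loc\d{4}>){4})((?:<seg\d{3}>)*)\s*([^;<]*)")


def parse_detections(text: str, width: int, height: int, bins: int = 1024):
    """Parse '<loc..>' x4 [+ '<seg..>'...] label entries into pixel-space boxes.

    Assumed token order: y_min, x_min, y_max, x_max (verify against the docs).
    """
    results = []
    for match in ENTRY.finditer(text):
        y0, x0, y1, x1 = (int(v) for v in re.findall(r"<loc(\d{4})>", match.group(1)))
        seg_ids = [int(v) for v in re.findall(r"<seg(\d{3})>", match.group(2))]
        results.append({
            "label": match.group(3).strip(),
            # Returned as (x_min, y_min, x_max, y_max) in pixels.
            "box": (
                x0 / bins * width,
                y0 / bins * height,
                x1 / bins * width,
                y1 / bins * height,
            ),
            "seg_ids": seg_ids,  # raw codebook indices, not a decoded mask
        })
    return results


# Hypothetical model output for a 640x480 image.
sample = "<loc0256><loc0128><loc0768><loc0896> cat ; <loc0000><loc0000><loc0512><loc0512> dog"
for det in parse_detections(sample, width=640, height=480):
    print(det["label"], det["box"])
```

Scaling by the image size at parse time keeps the function independent of whatever resolution the model saw internally, which is convenient when drawing boxes on the original image.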

OCR and safety concerns

Mixed feedback on OCR quality; one user reports JSON handling/OCR errors, while benchmarks in the article claim high accuracy.
Some express unease about using LLM‑style models for OCR due to prompt injection risks and safety filters that might censor or alter text rather than faithfully transcribe it.

Perceptions of Google’s AI strategy

Opinions split: some think Google is catching up and leveraging distribution; others remain skeptical, citing overhype around Gemini 1.5 and general mistrust of corporate “open” releases.
