PaliGemma 1 & 2
PaliGemma is a Google-built multimodal, multilingual vision-language model (VLM) with state-of-the-art performance on a wide range of vision-language tasks. It is a versatile and lightweight model inspired by PaLI-3 and based on open components such as the SigLIP vision model and the Gemma language model. It takes both image and text as input and generates text as output, supporting multiple languages, and is designed as a model for transfer to a wide range of vision-language tasks such as image and short-video captioning, visual question answering, text reading, object detection, and object segmentation.

PaliGemma 2, the latest iteration in the family, further improves on these capabilities with enhanced performance and a strong focus on ease of fine-tuning. Leveraging the core architecture of Google's Gemini models, PaliGemma 2 is open-weight, making it accessible for research and development across diverse fields. PaliGemma 2 mix checkpoints are fine-tuned on a diverse set of tasks and are ready to use out of the box, while pt checkpoints are pre-trained and intended for further fine-tuning. These tasks include short and long captioning, optical character recognition, question answering, object detection and segmentation, and more.
Open Vision Language Models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.
- Fine-tune on a specific vision-language task: the pre-trained models can be fine-tuned on a wide range of vision-language tasks such as image captioning, short-video captioning, visual question answering, text reading, object detection, and object segmentation.
- Vision-language research: the pre-trained and fine-tuned models can serve as a foundation for researchers to experiment with VLM techniques, develop algorithms, and contribute to the advancement of the field.
To copy the PaliGemma model artifacts to a local file path, use Cloud Shell or your local terminal to run the copy command below. This command is provided only for Customer's use of PaliGemma and should not be shared with third parties.
For PaliGemma 1:
For PaliGemma 2:
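The per-customer Cloud Storage URIs are not reproduced here. As an illustrative sketch only, assuming a hypothetical bucket path (substitute the URI provisioned for your project; `gsutil -m cp -r` in Cloud Shell is the usual command-line equivalent), the artifacts can be copied with the google-cloud-storage Python client:

```python
# Illustrative sketch only: BUCKET and PREFIX are hypothetical placeholders,
# not the real artifact locations provisioned for your project.
import os
from google.cloud import storage

BUCKET = "your-provisioned-bucket"       # hypothetical
PREFIX = "paligemma/pt-224-float32/"     # hypothetical
DEST = "./paligemma_artifacts"

client = storage.Client()
for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    if blob.name.endswith("/"):          # skip directory placeholder objects
        continue
    target = os.path.join(DEST, os.path.relpath(blob.name, PREFIX))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    blob.download_to_filename(target)    # download each object under DEST
```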
PaliGemma is the composition of a Transformer decoder and a Vision Transformer image encoder, with a total of 2.9 billion parameters. The text decoder is initialized from Gemma-2B, the image encoder is initialized from SigLIP-So400m/14, and PaliGemma is trained following the PaLI-3 recipes. PaliGemma 2, the latest iteration, builds upon this architecture, leveraging the core design of Google's Gemini models. It maintains the open-weight nature of the Gemma family, offering multiple model sizes and emphasizing ease of fine-tuning for specific downstream tasks.
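A minimal inference sketch using the Hugging Face transformers integration (assumptions: transformers v4.41 or later with PaliGemma support; the checkpoint ID, image URL, and prompt below are examples, not fixed requirements):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-224"  # any pt or mix checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# One image plus a task-prefix prompt; "caption en" requests an English caption.
url = "https://example.com/cat.jpg"  # placeholder URL
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text="caption en", images=image, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
# Drop the prompt tokens, then decode only the generated continuation.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```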
PaliGemma is pretrained on the following mixture of datasets.
- WebLI: WebLI (Web Language Image) is a web-scale multilingual image-text dataset built from the public web. A wide range of WebLI splits are used to acquire versatile model capabilities, such as visual-semantic understanding, object localization, visually situated text understanding, and multilinguality.
- CC3M-35L: curated English image-alt_text pairs from webpages (Sharma et al., 2018), translated into 34 additional languages using the GCP Translation API.
- VQ2A-CC3M-35L: a subset of VQ2A-CC3M (Changpinyo et al., 2022a), translated into the same additional 34 languages as CC3M-35L using the GCP Translation API.
- OpenImages: detection and object-aware questions and answers generated by handcrafted rules on the OpenImages dataset.
- WIT: images and texts collected from Wikipedia (Srinivasan et al., 2021).
The following filters are applied to WebLI, with the goal of training PaliGemma on clean data.
- Pornographic image filtering: removes images deemed to be of a pornographic nature.
- Text safety filtering: identifies and filters out images that are paired with unsafe text. Unsafe text is any text deemed to contain or be about CSAI, pornography, vulgarities, or otherwise offensive content.
- Text toxicity filtering: further uses the Perspective API to identify and filter out images that are paired with text deemed insulting, obscene, hateful, or otherwise toxic.
- Text PII filtering: PII (Personally Identifiable Information) filtering was performed using the Cloud Data Loss Prevention (DLP) API to protect the privacy of individuals. Identifiers such as social security numbers and other sensitive information types were removed.
- Additional methods: filtering based on content quality and safety, in line with our policies and practices.
In order to verify the transferability of PaliGemma to a wide variety of academic tasks, we fine-tune the pretrained models on each task. Additionally, we train the mix model with a mixture of the transfer tasks. We report results at different resolutions to give an impression of which tasks benefit from increased resolution; the first table below covers the mix checkpoints and the second the pretrained (pt) checkpoints. Importantly, none of these tasks or datasets are part of the pretraining data mixture, and their images are explicitly removed from the web-scale pretraining data.
| Benchmark | Metric (split) | mix_224 | mix_448 |
|---|---|---|---|
| MMVP | Paired Accuracy | 46.00 | 45.33 |
| POPE | Accuracy (random / popular / adversarial) | 88.00 / 86.63 / 85.67 | 89.37 / 88.40 / 87.47 |
| GQA | Accuracy (test) | 65.20 | 65.47 |
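Mix checkpoints respond to task-prefix prompts such as `caption en`, `answer en <question>`, `detect <object>`, and `segment <object>`. For detection, PaliGemma's documented output format encodes each bounding box as four `<locXXXX>` tokens (y_min, x_min, y_max, x_max, each quantized to a 0-1023 grid over the image) followed by the class label, with `;` separating multiple detections. A small parsing sketch (the helper name is ours, not part of any library):

```python
import re

# Parses PaliGemma detection output such as:
#   "<loc0123><loc0456><loc0789><loc0987> cat ; <loc...> dog"
# Each box is y_min, x_min, y_max, x_max on a 0..1023 grid over the image.
BOX_RE = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)")

def parse_detections(text, width, height):
    boxes = []
    for y0, x0, y1, x1, label in BOX_RE.findall(text):
        boxes.append({
            "label": label.strip(),
            # Rescale the quantized coordinates to pixel space.
            "box": (int(x0) / 1024 * width, int(y0) / 1024 * height,
                    int(x1) / 1024 * width, int(y1) / 1024 * height),
        })
    return boxes
```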
| Benchmark (train split) | Metric (split) | pt_224 | pt_448 | pt_896 |
|---|---|---|---|---|
| Captioning | | | | |
| COCO captions (train+restval) | CIDEr (val) | 141.92 | 144.60 | |
| NoCaps (eval of COCO captions transfer) | CIDEr (val) | 121.72 | 123.58 | |
| COCO-35L (train) | CIDEr dev (en / avg-34 / avg) | 139.2 / 115.8 / 116.4 | 141.2 / 118.0 / 118.6 | |
| XM3600 (eval of COCO-35L transfer) | CIDEr test (en / avg-35 / avg) | 78.1 / 41.3 / 42.4 | 80.0 / 41.9 / 42.9 | |
| TextCaps (train) | CIDEr (val) | 127.48 | 153.94 | |
| SciCap (first sentence, no subfigure) (train+val) | CIDEr / BLEU-4 (test) | 162.25 / 0.192 | 181.49 / 0.211 | |
| Screen2Words (train+dev) | CIDEr (test) | 117.57 | 119.59 | |
| Widget Captioning (train+dev) | CIDEr (test) | 136.07 | 148.36 | |
| Question answering | | | | |
| VQAv2 (train+validation) | Accuracy (Test server, std) | 83.19 | 85.64 | |
| MMVP (eval of VQAv2 transfer) | Paired Accuracy | 47.33 | 45.33 | |
| POPE (eval of VQAv2 transfer) | Accuracy (random / popular / adversarial) | 87.80 / 85.87 / 84.27 | 88.23 / 86.77 / 85.90 | |
| OKVQA (train) | Accuracy (val) | 63.54 | 63.15 | |
| A-OKVQA (MC) (train+val) | Accuracy (Test server) | 76.37 | 76.90 | |
| A-OKVQA (DA) (train+val) | Accuracy (Test server) | 61.85 | 63.22 | |
| GQA (train_balanced+val_balanced) | Accuracy (testdev balanced) | 65.61 | 67.03 | |
| xGQA (eval of GQA transfer) | Mean Accuracy (bn, de, en, id, ko, pt, ru, zh) | 58.37 | 59.07 | |
| NLVR2 (train+dev) | Accuracy (test) | 90.02 | 88.93 | |
| MaRVL (eval of NLVR2 transfer) | Mean Accuracy (test) (id, sw, ta, tr, zh) | 80.57 | 76.78 | |
| AI2D (train) | Accuracy (test) | 72.12 | 73.28 | |
| ScienceQA (Img subset, no CoT) (train+val) | Accuracy (test) | 95.39 | 95.93 | |
| RSVQA-LR (non-numeric) (train+val) | Mean Accuracy (test) | 92.65 | 93.11 | |
| RSVQA-HR (non-numeric) (train+val) | Mean Accuracy (test / test2) | 92.61 / 90.58 | 92.79 / 90.54 | |
| ChartQA (human+aug)x(train+val) | Mean Relaxed Accuracy (test_human, test_aug) | 57.08 | 71.36 | |
| VizWiz VQA (train+val) | Accuracy (Test server, std) | 73.7 | 75.52 | |
| TallyQA (train) | Accuracy (test_simple / test_complex) | 81.72 / 69.56 | 84.86 / 72.27 | |
| OCR-VQA (train+val) | Accuracy (test) | 72.32 | 74.61 | 74.93 |
| TextVQA (train+val) | Accuracy (Test server, std) | 55.47 | 73.15 | 76.48 |
| DocVQA (train+val) | ANLS (Test server) | 43.74 | 78.02 | 84.77 |
| Infographic VQA (train+val) | ANLS (Test server) | 28.46 | 40.47 | 47.75 |
| SceneText VQA (train+val) | ANLS (Test server) | 63.29 | 81.82 | 84.40 |
| Segmentation | | | | |
| RefCOCO (combined refcoco, refcoco+, refcocog, excluding val and test images) | MIoU (validation) (refcoco / refcoco+ / refcocog) | 73.40 / 68.32 / 67.65 | 75.57 / 69.76 / 70.17 | 76.94 / 72.18 / 72.22 |
| Video tasks (caption/QA) | | | | |
| MSR-VTT (Captioning) | CIDEr (test) | 70.54 | | |
| MSR-VTT (QA) | Accuracy (test) | 50.09 | | |
| ActivityNet (Captioning) | CIDEr (test) | 34.62 | | |
| ActivityNet (QA) | Accuracy (test) | 50.78 | | |
| VATEX (Captioning) | CIDEr (test) | 79.73 | | |
| MSVD (QA) | Accuracy (test) | 60.22 | | |
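The transfer results above come from fine-tuning the pretrained checkpoints on each task. A minimal captioning fine-tune sketch using the Hugging Face transformers integration (assumptions: transformers v4.41+, a map-style dataset with `image` and `caption` fields; the dataset, hyperparameters, and output path are illustrative placeholders, not the recipe behind the numbers above):

```python
import torch
from transformers import (AutoProcessor, PaliGemmaForConditionalGeneration,
                          Trainer, TrainingArguments)

model_id = "google/paligemma-3b-pt-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Optionally freeze the SigLIP image encoder and train only the language model.
for p in model.vision_tower.parameters():
    p.requires_grad = False

train_ds = ...  # placeholder: any map-style dataset with "image" (PIL) and "caption" fields

def collate_fn(examples):
    prompts = ["caption en" for _ in examples]              # task-prefix prompt
    targets = [ex["caption"] for ex in examples]            # target captions
    images = [ex["image"].convert("RGB") for ex in examples]
    # `suffix=` makes the processor build causal-LM labels for the targets.
    batch = processor(text=prompts, images=images, suffix=targets,
                      return_tensors="pt", padding="longest")
    batch["pixel_values"] = batch["pixel_values"].to(model.dtype)
    return batch

args = TrainingArguments(
    output_dir="paligemma-caption-ft",  # placeholder path
    per_device_train_batch_size=1,
    num_train_epochs=1,
    learning_rate=1e-5,
    bf16=True,
    remove_unused_columns=False,        # keep raw columns for collate_fn
)
Trainer(model=model, args=args, train_dataset=train_ds,
        data_collator=collate_fn).train()
```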
The model inherits the safety benefits and safety risks associated with large language models (Gemma) and vision-language models (PaLI-3). The model should not be used for downstream applications without prior assessment and mitigation of downstream application-specific safety and fairness concerns.
The known risks are:
- The model is trained on large, often noisy, image-text datasets that are known to contain biases regarding people of different backgrounds.
- Inherited risks from the underlying large language model, including hallucination.
- Text and image convey meaning in distinct ways and with distinct limitations. More research is needed to examine questions of efficacy and utility before image-to-text models can be used as communication aids, including for education.
Risks identified and mitigations:
- Perpetuation of biases: continuous monitoring (using evaluation metrics and human review) and the exploration of de-biasing techniques are encouraged during model training, fine-tuning, and other use cases.
- Generation of harmful content: mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
- Misuse for malicious purposes: technical limitations and developer and end-user education can help mitigate against malicious applications of LLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the Gemma Prohibited Use Policy.
- Privacy violations: models were trained on data filtered to remove PII (Personally Identifiable Information). Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.
Limitations:
Most limitations inherited from the underlying Gemma model still apply:
- VLMs are better at tasks that can be framed with clear prompts and instructions; open-ended or highly complex tasks might be challenging.
- Natural language is inherently complex, and VLMs might struggle to grasp subtle nuances, sarcasm, or figurative language.
- VLMs generate responses based on information they learned from their training datasets, but they are not knowledge bases; they may generate incorrect or outdated factual statements.
- VLMs rely on statistical patterns in language and images, and they might lack the ability to apply common-sense reasoning in certain situations.
- PaliGemma was designed first and foremost to serve as a general pretrained model for transfer to specialized tasks, so its "out of the box" or "zero-shot" performance might lag behind models designed specifically for those tasks.
- PaliGemma is not a multi-turn chatbot; it is designed for a single round of image-plus-text input.
| Resource ID | Release date | Release stage | Description |
|---|---|---|---|
| paligemma-224-float32 | 2024-04-08 | GA | Pretrained checkpoint that wasn't trained on any specific tasks. Recommended for fine-tuning. Resolution: 224, dtype: float32. |
| paligemma-224-float16 | 2024-04-08 | GA | Pretrained checkpoint that wasn't trained on any specific tasks. Resolution: 224, dtype: float16. |
| paligemma-224-bfloat16 | 2024-04-08 | GA | Pretrained checkpoint that wasn't trained on any specific tasks. Resolution: 224, dtype: bfloat16. |
| paligemma-448-float32 | 2024-04-08 | GA | Pretrained checkpoint that wasn't trained on any specific tasks. Recommended for fine-tuning. Resolution: 448, dtype: float32. |
| paligemma-448-float16 | 2024-04-08 | GA | Pretrained checkpoint that wasn't trained on any specific tasks. Resolution: 448, dtype: float16. |
| paligemma-448-bfloat16 | 2024-04-08 | GA | Pretrained checkpoint that wasn't trained on any specific tasks. Resolution: 448, dtype: bfloat16. |
| paligemma-896-float32 | 2024-04-08 | GA | Pretrained checkpoint that wasn't trained on any specific tasks. Recommended for fine-tuning. Resolution: 896, dtype: float32. |
| paligemma-896-float16 | 2024-04-08 | GA | Pretrained checkpoint that wasn't trained on any specific tasks. Resolution: 896, dtype: float16. |
| paligemma-896-bfloat16 | 2024-04-08 | GA | Pretrained checkpoint that wasn't trained on any specific tasks. Resolution: 896, dtype: bfloat16. |
| paligemma-mix-224-float32 | 2024-04-08 | GA | Mix fine-tuned checkpoint that accepts natural-language inputs. Recommended for serving. Resolution: 224, dtype: float32. |
| paligemma-mix-224-float16 | 2024-04-08 | GA | Mix fine-tuned checkpoint that accepts natural-language inputs. Recommended for serving. Resolution: 224, dtype: float16. |
| paligemma-mix-224-bfloat16 | 2024-04-08 | GA | Mix fine-tuned checkpoint that accepts natural-language inputs. Recommended for serving. Resolution: 224, dtype: bfloat16. |
| paligemma-mix-448-float32 | 2024-04-08 | GA | Mix fine-tuned checkpoint that accepts natural-language inputs. Recommended for serving. Resolution: 448, dtype: float32. |
| paligemma-mix-448-float16 | 2024-04-08 | GA | Mix fine-tuned checkpoint that accepts natural-language inputs. Recommended for serving. Resolution: 448, dtype: float16. |
| paligemma-mix-448-bfloat16 | 2024-04-08 | GA | Mix fine-tuned checkpoint that accepts natural-language inputs. Recommended for serving. Resolution: 448, dtype: bfloat16. |
| PaliGemma 2 models | | | |
| google/paligemma2-3b-pt-224 | 2024-12-05 | GA | PaliGemma 2, 3B parameters, pre-trained, resolution 224. |
| google/paligemma2-3b-pt-448 | 2024-12-05 | GA | PaliGemma 2, 3B parameters, pre-trained, resolution 448. |
| google/paligemma2-3b-pt-896 | 2024-12-05 | GA | PaliGemma 2, 3B parameters, pre-trained, resolution 896. |
| google/paligemma2-10b-pt-224 | 2024-12-05 | GA | PaliGemma 2, 10B parameters, pre-trained, resolution 224. |
| google/paligemma2-10b-pt-448 | 2024-12-05 | GA | PaliGemma 2, 10B parameters, pre-trained, resolution 448. |
| google/paligemma2-10b-pt-896 | 2024-12-05 | GA | PaliGemma 2, 10B parameters, pre-trained, resolution 896. |
| google/paligemma2-28b-pt-224 | 2024-12-05 | GA | PaliGemma 2, 28B parameters, pre-trained, resolution 224. |
| google/paligemma2-28b-pt-448 | 2024-12-05 | GA | PaliGemma 2, 28B parameters, pre-trained, resolution 448. |
| google/paligemma2-28b-pt-896 | 2024-12-05 | GA | PaliGemma 2, 28B parameters, pre-trained, resolution 896. |
| google/paligemma2-3b-ft-docci-448 | 2024-12-05 | GA | PaliGemma 2, 3B parameters, fine-tuned on DOCCI, resolution 448. |
| google/paligemma2-10b-ft-docci-448 | 2024-12-05 | GA | PaliGemma 2, 10B parameters, fine-tuned on DOCCI, resolution 448. |
| google/paligemma2-3b-pt-224-jax | 2024-12-05 | GA | PaliGemma 2, 3B parameters, pre-trained (JAX), resolution 224. |
| google/paligemma2-3b-pt-448-jax | 2024-12-05 | GA | PaliGemma 2, 3B parameters, pre-trained (JAX), resolution 448. |
| google/paligemma2-3b-pt-896-jax | 2024-12-05 | GA | PaliGemma 2, 3B parameters, pre-trained (JAX), resolution 896. |
| google/paligemma2-10b-pt-224-jax | 2024-12-05 | GA | PaliGemma 2, 10B parameters, pre-trained (JAX), resolution 224. |
| google/paligemma2-10b-pt-448-jax | 2024-12-05 | GA | PaliGemma 2, 10B parameters, pre-trained (JAX), resolution 448. |
| google/paligemma2-10b-pt-896-jax | 2024-12-05 | GA | PaliGemma 2, 10B parameters, pre-trained (JAX), resolution 896. |
| google/paligemma2-28b-pt-224-jax | 2024-12-05 | GA | PaliGemma 2, 28B parameters, pre-trained (JAX), resolution 224. |
| google/paligemma2-28b-pt-448-jax | 2024-12-05 | GA | PaliGemma 2, 28B parameters, pre-trained (JAX), resolution 448. |
| google/paligemma2-28b-pt-896-jax | 2024-12-05 | GA | PaliGemma 2, 28B parameters, pre-trained (JAX), resolution 896. |
| google/paligemma2-10b-ft-docci-448-jax | 2024-12-05 | GA | PaliGemma 2, 10B parameters, fine-tuned on DOCCI (JAX), resolution 448. |
| google/paligemma2-3b-ft-docci-448-jax | 2024-12-05 | GA | PaliGemma 2, 3B parameters, fine-tuned on DOCCI (JAX), resolution 448. |
| google/paligemma2-3b-mix-224 | 2025-02-19 | GA | PaliGemma 2 Mix, 3B parameters, resolution 224. |
| google/paligemma2-3b-mix-448 | 2025-02-19 | GA | PaliGemma 2 Mix, 3B parameters, resolution 448. |
| google/paligemma2-10b-mix-448 | 2025-02-19 | GA | PaliGemma 2 Mix, 10B parameters, resolution 448. |
| google/paligemma2-10b-mix-224 | 2025-02-19 | GA | PaliGemma 2 Mix, 10B parameters, resolution 224. |
| google/paligemma2-28b-mix-448 | 2025-02-19 | GA | PaliGemma 2 Mix, 28B parameters, resolution 448. |
| google/paligemma2-28b-mix-224 | 2025-02-19 | GA | PaliGemma 2 Mix, 28B parameters, resolution 224. |
By using, reproducing, modifying, distributing, performing or displaying any portion or element of Gemma, Model Derivatives including via any Hosted Service, (each as defined below) (collectively, the "Gemma Services") or otherwise accepting the terms of this Agreement, you agree to be bound by this Agreement.
Google reserves the right to update this Gemma Prohibited Use Policy from time to time.
You may not use nor allow others to use Gemma or Model Derivatives to: