Pic2Word Composed Image Retrieval
Pic2Word is a state-of-the-art composed image retrieval model. It was produced by a collaboration between Google Cloud AI and Boston University researchers and released as Composed Image Retrieval on GitHub.
Pic2Word was first described in the paper "Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval" by Saito et al. (2023).
The paper proposes a method for zero-shot composed image retrieval that can be trained using only image-caption pairs and unlabeled images, whereas existing training methods require labeled triplets consisting of a query image, a text specification, and the target image. Pic2Word leverages a pretrained vision-language model (CLIP): a mapping network transforms the input image's embedding into a pseudo language token, which is spliced into the text prompt so that the image and text are composed into a single query. Despite using no triplet supervision, this approach outperforms several existing supervised methods on standard benchmarks.
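A minimal sketch of this composition step in PyTorch, assuming a CLIP-style joint embedding of width 512. The stand-in encoders, vocabulary size, prompt slot position, and the `Pic2WordMapper` MLP below are illustrative placeholders, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # joint embedding width of CLIP ViT-B/32 (assumption)

class Pic2WordMapper(nn.Module):
    """Hypothetical mapping network: turns a frozen CLIP image embedding
    into a single pseudo word-token embedding."""
    def __init__(self, embed_dim: int = EMBED_DIM):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, image_embedding: torch.Tensor) -> torch.Tensor:
        return self.mlp(image_embedding)

# Placeholders for the frozen CLIP towers; in practice these are the
# pretrained vision and text encoders, which Pic2Word keeps frozen.
image_encoder = nn.Linear(3 * 224 * 224, EMBED_DIM)
token_embedder = nn.Embedding(49408, EMBED_DIM)  # CLIP BPE vocab size
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(EMBED_DIM, nhead=8, batch_first=True),
    num_layers=2,
)
mapper = Pic2WordMapper()

# 1. Encode the query image and map it to one pseudo token.
image = torch.randn(1, 3 * 224 * 224)           # stand-in for a real image
img_emb = image_encoder(image)                  # (1, EMBED_DIM)
pseudo_token = mapper(img_emb).unsqueeze(1)     # (1, 1, EMBED_DIM)

# 2. Embed a text modifier such as "a photo of [*] that is red" and
#    splice the pseudo token into the [*] slot.
prompt_ids = torch.randint(0, 49408, (1, 7))    # stand-in token ids
prompt_emb = token_embedder(prompt_ids)         # (1, 7, EMBED_DIM)
slot = 3                                        # position of the [*] slot
composed = torch.cat(
    [prompt_emb[:, :slot], pseudo_token, prompt_emb[:, slot + 1:]], dim=1
)

# 3. Run the composed sequence through the text tower and pool it into
#    the single query embedding used for retrieval.
query_emb = F.normalize(text_encoder(composed).mean(dim=1), dim=-1)
print(query_emb.shape)  # torch.Size([1, 512])
```

In the paper, only the mapping network is trained: a contrastive loss pushes the text embedding of a generic prompt such as "a photo of [*]" back toward the original image embedding, while the CLIP towers stay frozen.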
This model can be used in a notebook. Click Open notebook to use the model in Colab.
The model was pretrained on image URLs from the Conceptual Captions dataset.
Given an image and a text prompt as input, the model returns the set of images that most closely match the combined image-text query.
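The final step is a nearest-neighbor search in the shared embedding space. A minimal sketch, assuming candidate images have been pre-encoded with the same frozen CLIP image tower; `retrieve_top_k` and the random embeddings are illustrative:

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb: torch.Tensor,
                   gallery_embs: torch.Tensor,
                   k: int = 5) -> torch.Tensor:
    """Return indices of the k gallery images whose precomputed
    embeddings are most similar to the composed query."""
    query_emb = F.normalize(query_emb, dim=-1)
    gallery_embs = F.normalize(gallery_embs, dim=-1)
    scores = gallery_embs @ query_emb.squeeze(0)  # cosine similarities
    return scores.topk(k).indices

# Stand-in embeddings: 1,000 gallery images in a 512-dim CLIP space.
gallery = torch.randn(1000, 512)
query = torch.randn(1, 512)
print(retrieve_top_k(query, gallery, k=5))
```

At production scale the same dot-product search is typically served from an approximate nearest-neighbor index rather than a dense matrix multiply.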
Resource ID | Release date | Release stage | Description
---|---|---|---
google/pic2word | 2024-04-01 | General Availability | Serving