DITO
DITO is a simple yet effective open-vocabulary image object detection and segmentation model based on detection-oriented image-text pretraining. It bridges the gap between image-level pretraining and open-vocabulary object detection and segmentation by replacing the commonly used classification architecture with the detector architecture at the pretraining phase. This enables the detector heads to learn from noisy image-text pairs, which is crucial for open-vocabulary object detection. DITO pretraining uses only the standard image-text contrastive loss and no pseudo-labeling, making it an effective extension of contrastive learning for acquiring emergent object-semantic cues. In addition, it proposes a shifted-window learning approach, built on window attention, for detection and segmentation that makes the backbone representation more robust, more translation-invariant, and less biased by the window pattern.
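As an illustration of the "standard image-text contrastive loss" mentioned above, the following is a minimal sketch of the commonly used symmetric softmax (InfoNCE-style) form over a batch of paired embeddings. It is not taken from the DITO code: the function names, the temperature value of 0.07, and the use of plain Python lists are assumptions made for readability, and the actual implementation is in JAX and computes the image embeddings through the detector heads.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss (illustrative sketch).

    image_embs[k] and text_embs[k] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    The temperature default is an assumption, not a DITO setting.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Pairwise cosine similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same idea over the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)
```

With this form, a batch whose image and text embeddings are correctly paired yields a lower loss than the same batch with the captions shuffled, which is the signal that drives the pretraining.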
The method was introduced in the paper "Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection" by Kim et al. (2023). On the LVIS open-vocabulary detection benchmark, DITO sets a new state of the art of 40.4 mask APr using the common ViT-L backbone, outperforming the best existing approach by +6.5 mask APr at the system level. On the COCO benchmark, DITO achieves a competitive 40.8 novel AP without pseudo-labeling or weak supervision. DITO also outperforms the baseline significantly on the transfer detection setup.
This model card is based on the JAX implementation from the Google Research DITO GitHub repository.
This model can be used in a notebook. Click Open notebook to use the model in Colab.
The models are trained on the DataComp-1B and LVIS datasets.
The model outputs a detected list of objects with their bounding boxes and probability distribution over the predefined classes. You can also generate an instance segmentation mask per detected object.
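The sketch below shows one way such an output could be post-processed: pick the most probable class for each detected object and keep only confident detections. The dictionary keys `box` and `probs`, the `top_detections` helper, and the 0.5 threshold are hypothetical and chosen for illustration only; the actual response schema of the served model may differ.

```python
def top_detections(detections, class_names, score_threshold=0.5):
    """Keep the best class per detection and drop low-confidence ones.

    `detections` is assumed to be a list of dicts with:
      "box":   [x_min, y_min, x_max, y_max]  (hypothetical layout)
      "probs": one probability per entry in `class_names`
    """
    results = []
    for det in detections:
        probs = det["probs"]
        # Index of the highest-probability class for this object.
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= score_threshold:
            results.append(
                {
                    "box": det["box"],
                    "label": class_names[best],
                    "score": probs[best],
                }
            )
    # Most confident detections first.
    return sorted(results, key=lambda r: r["score"], reverse=True)


# Example with made-up values.
classes = ["cat", "dog", "bicycle"]
raw = [
    {"box": [10, 20, 110, 220], "probs": [0.7, 0.2, 0.1]},
    {"box": [50, 60, 90, 100], "probs": [0.4, 0.35, 0.25]},
    {"box": [0, 0, 30, 30], "probs": [0.05, 0.05, 0.9]},
]
kept = top_detections(raw, classes)
```

Because the class list is supplied by the caller, the same filtering works for any vocabulary the open-vocabulary model is queried with.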
| Resource ID | Release date | Release stage | Description |
|---|---|---|---|
| jax/dito | 2023-10-23 | GA | Model training and serving for open-vocabulary image object detection and segmentation. |