DITO – Vertex AI – Google Cloud Console

DITO

DITO is an open-vocabulary image object detection and segmentation model.

Overview

DITO is a simple yet effective open-vocabulary object detection and segmentation approach based on detection-oriented image-text pretraining. It bridges the gap between image-level pretraining and open-vocabulary object detection and segmentation by replacing the commonly used classification architecture with the detector architecture in the pretraining phase. This lets the detector heads learn from noisy image-text pairs, which is crucial for open-vocabulary object detection. DITO pretraining uses only the standard image-text contrastive loss and no pseudo-labeling, making it an effective extension of contrastive learning for acquiring emergent object-semantic cues. In addition, it proposes a shifted-window learning approach on top of window attention for detection and segmentation, which makes the backbone representation more robust, translation-invariant, and less biased by the window pattern.
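To make the "standard image-text contrastive loss" concrete, here is a minimal sketch of the generic symmetric formulation (InfoNCE) over a batch of matched image-text pairs. It is not taken from the DITO code: the function names, the plain-list embedding format, and the temperature value of 0.07 are assumptions chosen for illustration, and in DITO the image-side embeddings would come from the detector architecture rather than a classification backbone.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss.

    image_embs[k] and text_embs[k] are assumed to be a matched pair;
    every other combination in the batch acts as a negative.
    The temperature is an illustrative assumption, not a DITO setting.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each image embedding is closest to its own caption and large when the pairs are mismatched, which is the only training signal the pretraining stage described above relies on.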

The method was introduced in the paper "Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection" by Kim et al. (2023). On the popular LVIS open-vocabulary detection benchmark, DITO sets a new state of the art of 40.4 mask APr using the common ViT-L backbone, outperforming the best existing approach by +6.5 mask APr at system level. On the COCO benchmark, DITO achieves a very competitive 40.8 novel AP without pseudo-labeling or weak supervision. In addition, DITO significantly outperforms the baseline on the transfer detection setup.

This model card is based on the JAX implementation from the Google Research DITO GitHub repository.

Use cases

Open-vocabulary image object detection and segmentation: A key benefit of open-vocabulary detection and segmentation is the ability to handle out-of-distribution data with categories supplied by users on the fly. DITO achieves strong performance, surpassing the previous state of the art on the LVIS open-vocabulary benchmark by 6.5 mask APr at system level. On the COCO benchmark, DITO achieves a very competitive 40.8 novel AP without pseudo-labeling or weak supervision.

Documentation

Get started

This model can be used in a notebook. Click Open notebook to use the model in Colab.

Dataset and training

The models are trained on the DataComp-1B and LVIS datasets.

Model output

The model outputs a list of detected objects, each with a bounding box and a probability distribution over the predefined classes. You can also generate an instance segmentation mask for each detected object.
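As an illustration of how such an output might be consumed, the sketch below picks the most likely class for each detection and drops low-confidence results. The response layout used here (a list of dictionaries with `box` and `class_probs` keys), the helper name `top_detections`, and the 0.5 threshold are all hypothetical; the actual response format of the deployed model should be checked in the notebook.

```python
def top_detections(detections, class_names, score_threshold=0.5):
    """Keep the best class per detection and filter by confidence.

    `detections` is assumed (hypothetically) to be a list of dicts with:
      "box":         (x0, y0, x1, y1) coordinates
      "class_probs": one probability per entry in `class_names`
    """
    results = []
    for det in detections:
        probs = det["class_probs"]
        # Index of the highest-probability class for this object.
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= score_threshold:
            results.append({
                "label": class_names[best],
                "score": probs[best],
                "box": det["box"],
            })
    # Most confident detections first.
    results.sort(key=lambda r: r["score"], reverse=True)
    return results

# Made-up example data, for illustration only.
class_names = ["cat", "dog", "bicycle"]
detections = [
    {"box": (10, 20, 110, 220), "class_probs": [0.7, 0.2, 0.1]},
    {"box": (50, 60, 90, 100), "class_probs": [0.3, 0.4, 0.3]},
    {"box": (0, 0, 300, 200), "class_probs": [0.05, 0.05, 0.9]},
]
kept = top_detections(detections, class_names)
for item in kept:
    print(item["label"], item["score"], item["box"])
```

A higher threshold trades recall for precision; the right value depends on the categories and images in use.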

Best practices and limitations

Sensitivity to input resolution: This model checkpoint operates at a fixed input resolution and may struggle with images of varying sizes or aspect ratios. Resizing or cropping images to fit the required input resolution can result in loss of information or distortion. For optimal performance at a different resolution, the model should be trained at that resolution. To use the model at a different resolution without retraining, you need to re-export the model at that resolution.
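One common way to limit distortion when fitting images to a fixed square input is to scale the image uniformly and pad the remainder ("letterboxing"), then map predicted boxes back to the original image. The sketch below shows only the geometry. It is an assumption-laden example, not the preprocessing used by this checkpoint: the target size of 1024, the centered padding, the `(x0, y0, x1, y1)` pixel box format, and both function names are hypothetical.

```python
def letterbox_params(width, height, target=1024):
    """Compute a uniform scale and centered padding for a square target.

    `target` is a hypothetical input size; use the resolution the
    checkpoint was actually exported with.
    """
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return scale, new_w, new_h, pad_x, pad_y

def box_to_original(box, scale, pad_x, pad_y):
    """Map a box from letterboxed coordinates back to the original image."""
    x0, y0, x1, y1 = box
    return ((x0 - pad_x) / scale, (y0 - pad_y) / scale,
            (x1 - pad_x) / scale, (y1 - pad_y) / scale)

# A 2048x1024 image shrinks to 1024x512 and is padded by 256 rows top and bottom.
scale, new_w, new_h, pad_x, pad_y = letterbox_params(2048, 1024)
print(scale, new_w, new_h, pad_x, pad_y)

# A box covering the whole resized image maps back to the full original image.
print(box_to_original((0, 256, 1024, 768), scale, pad_x, pad_y))
```

Padding preserves the aspect ratio but still reduces the effective resolution of small objects, so it does not remove the limitation described above.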

Versions

Resource ID	Release date	Release stage	Description
jax/dito	2023-10-23	GA	Model training and serving for open vocabulary image object detection and segmentation.

Task

Open vocabulary detection

Open vocabulary segmentation

Skill level

Beginner


