ViT (JAX)
The Vision Transformer (ViT) is a Transformer-based architecture for image classification that processes an image as a sequence of patches. By relying on self-attention rather than convolution, ViT matches or exceeds state-of-the-art convolutional neural networks (CNNs) while requiring substantially fewer computational resources to pre-train. It was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. (2020).
The key idea behind ViT is to treat an image as a sequence of patches, so that the standard Transformer architecture can be applied directly to image classification. The model divides the input image into fixed-size, non-overlapping patches (16x16 pixels for the model on this page), flattens each patch, and linearly projects it into an embedding vector. These patch embeddings form the sequence of input tokens for the Transformer.
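As a concrete illustration, the sketch below patchifies a dummy 224x224 RGB image and projects each flattened patch with a single matrix. The names and shapes (`patchify`, `W`, `embed_dim`) are illustrative, not the repository's actual API.

```python
# Minimal sketch of ViT patch embedding in JAX (illustrative, not the repo's API).
import jax
import jax.numpy as jnp

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    image = image.reshape(h // patch_size, patch_size,
                          w // patch_size, patch_size, c)
    # -> (num_patches, patch_size * patch_size * C)
    return image.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * c)

key = jax.random.PRNGKey(0)
image = jax.random.normal(key, (224, 224, 3))     # dummy input image
patches = patchify(image)                         # (196, 768): 14x14 patches of 16x16x3

embed_dim = 768
w_key, _ = jax.random.split(key)
W = jax.random.normal(w_key, (patches.shape[-1], embed_dim)) * 0.02  # stand-in for a learned projection
tokens = patches @ W                              # (196, 768) input tokens
```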
The Transformer encoder consists of alternating layers of multi-head self-attention and feed-forward networks. Self-attention lets every patch attend to every other patch, so the model learns both local and global image structure and captures long-range dependencies from the very first layer, something that is difficult for CNNs because of their localized receptive fields.
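The sketch below shows single-head self-attention in `jax.numpy`, assuming the 196x768 token sequence from the previous example; the real model uses multi-head attention with trained weight matrices.

```python
# Single-head self-attention sketch: every patch token attends to every other token.
import jax
import jax.numpy as jnp

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv               # (seq, d) each
    scores = q @ k.T / jnp.sqrt(q.shape[-1])       # (seq, seq) pairwise scores
    weights = jax.nn.softmax(scores, axis=-1)      # each row sums to 1
    return weights @ v                             # every token mixes information from all tokens

key = jax.random.PRNGKey(0)
ks = jax.random.split(key, 4)
x = jax.random.normal(ks[0], (196, 768))           # 196 patch tokens
Wq, Wk, Wv = (jax.random.normal(k, (768, 768)) * 0.02 for k in ks[1:])  # stand-ins for learned weights
out = self_attention(x, Wq, Wk, Wv)                # (196, 768)
```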
Because self-attention is order-invariant, ViT adds learnable positional embeddings to the patch embeddings to give the model spatial information about where each patch came from. The combined embeddings are then processed by the Transformer layers, which learn relationships between patches and extract discriminative features.
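A minimal sketch of this step, assuming the same token shapes as above; in the real model `pos_embed` is a trained parameter rather than a random table.

```python
# Adding learnable positional embeddings to the patch tokens (illustrative).
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
num_patches, embed_dim = 196, 768
tokens = jax.random.normal(k1, (num_patches, embed_dim))            # stand-in patch embeddings
pos_embed = jax.random.normal(k2, (num_patches, embed_dim)) * 0.02  # learned in the real model
tokens_with_pos = tokens + pos_embed    # the same table is added for every input image
```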
During training, ViT minimizes a standard cross-entropy loss over the class labels. The model is typically pre-trained on a large-scale image dataset, with supervised or self-supervised objectives, and then fine-tuned on specific downstream tasks such as object recognition or segmentation to further improve performance.
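The loss computation might look like the following sketch, where `logits` stands in for the output of the classification head and the batch size and class count are arbitrary.

```python
# Minimal cross-entropy loss in JAX over integer class labels (illustrative shapes).
import jax
import jax.numpy as jnp

def cross_entropy(logits, labels):
    """Mean cross-entropy; labels are integer class ids."""
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    return -jnp.mean(jnp.take_along_axis(log_probs, labels[:, None], axis=-1))

key = jax.random.PRNGKey(0)
logits = jax.random.normal(key, (8, 1000))          # batch of 8, 1000 classes
labels = jnp.array([3, 1, 4, 1, 5, 9, 2, 6])
loss = cross_entropy(logits, labels)
grads = jax.grad(lambda l: cross_entropy(l, labels))(logits)  # gradients via autodiff
```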
The model featured on this page is based on the JAX implementation available in the Vision Transformer GitHub repository. The models are pretrained on the ImageNet and ImageNet-21k datasets.
This model can be used in a notebook. Click Open notebook to use the model in Colab.
The output of the ViT classification model is a probability distribution over the predefined classes: the likelihood that the input image belongs to each class. The class with the highest probability is typically taken as the prediction; the full distribution can also be used for post-processing or as input to other tasks.
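For example, converting a logit vector into that distribution and a predicted class is a single softmax and argmax (the logit values below are made up):

```python
# Interpreting the model output: softmax gives class probabilities, argmax the prediction.
import jax
import jax.numpy as jnp

logits = jnp.array([1.2, 0.3, 2.5, -0.7])   # illustrative 4-class output
probs = jax.nn.softmax(logits)               # probability distribution, sums to 1.0
predicted_class = jnp.argmax(probs)          # index of the most likely class
```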
| Resource ID | Release date | Release stage | Description |
|---|---|---|---|
| jax/vit_base_patch16_224 | 2024-04-01 | General Availability | Fine-tuning and serving |