Image Tokenization: How Vision Transformers See the World

Convolutional Neural Networks (CNNs) have long dominated computer vision because their learned kernels efficiently capture local visual patterns. However, their limited receptive fields make it hard to model relationships between distant parts of an image. This limitation motivated the development of token-based vision models.

To enable transformers to handle visual data, images must be represented in a form similar to how words are represented in natural language models. This process is known as image tokenization.

Instead of processing every pixel, the image is divided into patches that are converted into discrete tokens—compact numerical representations that encode local visual information. These tokens form the “vocabulary” that transformers use to understand relationships between parts of an image.
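
To make the patching step concrete, here is a minimal PyTorch sketch that splits an image into non-overlapping patches. The 224-pixel input and 16-pixel patch size are common ViT-style defaults assumed for illustration, not values mandated by the text above.

```python
import torch

# Dummy RGB image batch: (batch, channels, height, width).
images = torch.randn(1, 3, 224, 224)
patch_size = 16  # assumed ViT-style patch size

# Split height and width into non-overlapping 16x16 windows:
# (1, 3, 224, 224) -> (1, 3, 14, 14, 16, 16).
patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)

# Flatten the 14x14 grid into a sequence of patch vectors (3 * 16 * 16 = 768 values each).
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([1, 196, 768])
```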

What Is Image Tokenization?

Image tokenization converts an image into a sequence of tokens, where each token corresponds to an image patch. A tokenizer – often a discrete variational autoencoder (dVAE) – learns to map patches to entries in a visual codebook, similar to how word embeddings work in text. These visual tokens allow the model to reason about spatial relationships and texture at a higher level than raw pixels.
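
As a rough sketch of the codebook idea, the snippet below assigns each patch embedding the index of its nearest codebook entry. The 8192-entry vocabulary matches the dVAE codebook used by BEiT, but the random embeddings and nearest-neighbor lookup are simplifications for illustration; the real tokenizer is a learned convolutional encoder.

```python
import torch

vocab_size, embed_dim = 8192, 256               # 8192 matches BEiT's codebook; 256 is an assumed width
codebook = torch.randn(vocab_size, embed_dim)   # learned during dVAE training, random here
patch_embeddings = torch.randn(196, embed_dim)  # one embedding per patch of a 14x14 grid

# Assign each patch the id of its nearest codebook entry (Euclidean distance).
distances = torch.cdist(patch_embeddings, codebook)  # (196, 8192)
visual_tokens = distances.argmin(dim=1)              # (196,) integer token ids
print(visual_tokens.shape)                           # torch.Size([196])
```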

The approach bridges the gap between visual and textual transformers, making architectures like ViT and BEiT possible.

The tokens represent encodings of the image patches and are easier for the model to learn from than raw pixels. For BEiT, each image is tokenized into a 14×14 map, meaning the image is divided into a 14×14 grid of patches (196 in total), each becoming one token. The tokenizer itself is learned as part of a discrete variational autoencoder (dVAE), which consists of two parts:

  • Tokenizer: Converts image pixels into tokens.
  • Decoder: Reconstructs the image from those tokens.

The number of visual tokens matches the number of patches in the input image. BEiT reuses the publicly available tokenizer proposed in Zero-Shot Text-to-Image Generation (the DALL-E paper). The visual tokens it produces serve as prediction targets when pre-training the BEiT architecture on the masked image modeling task.
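
A shape-level sketch of those two components may help. The toy modules below map a 224×224 image to a 14×14 grid of token ids and back to pixels, matching the shapes described above, but they are hypothetical stand-ins rather than the actual DALL-E dVAE.

```python
import torch
import torch.nn as nn

class ToyTokenizer(nn.Module):
    """Maps pixels to a 14x14 grid of token ids from an 8192-entry codebook (shape-level mock)."""
    def __init__(self, vocab_size=8192):
        super().__init__()
        # One logit vector per 16x16 patch via a strided convolution.
        self.to_logits = nn.Conv2d(3, vocab_size, kernel_size=16, stride=16)

    def forward(self, images):           # (B, 3, 224, 224)
        logits = self.to_logits(images)  # (B, 8192, 14, 14)
        return logits.argmax(dim=1)      # (B, 14, 14) token ids

class ToyDecoder(nn.Module):
    """Reconstructs pixels from the token grid via an embedding and a transposed convolution."""
    def __init__(self, vocab_size=8192, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.to_pixels = nn.ConvTranspose2d(dim, 3, kernel_size=16, stride=16)

    def forward(self, tokens):                      # (B, 14, 14)
        x = self.embed(tokens).permute(0, 3, 1, 2)  # (B, 256, 14, 14)
        return self.to_pixels(x)                    # (B, 3, 224, 224)

tokens = ToyTokenizer()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                # torch.Size([2, 14, 14]) -> 196 tokens per image
print(ToyDecoder()(tokens).shape)  # torch.Size([2, 3, 224, 224])
```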

Masked Image Modeling

Self-supervised learning trains a model through a pretext task, one the model can solve using the data itself, without labels. That flexibility enables the use of vast amounts of publicly available unlabeled image data, reducing the need for expensive labeled datasets.

In BEiT, the pretext task is masked image modeling. The idea is to mask out random patches of the image from the input sequence, creating a corrupted version of the image. The transformer is then trained to predict the correct visual tokens for the masked patches.
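
A minimal sketch of the corruption step, assuming simple uniform masking of roughly 40% of the 196 patch positions (BEiT itself uses a blockwise masking strategy):

```python
import torch

num_patches, embed_dim = 196, 768
patch_embeddings = torch.randn(1, num_patches, embed_dim)  # embedded patches of one image
mask_embedding = torch.randn(embed_dim)                    # a learnable [MASK] vector in practice

# Choose ~40% of positions to corrupt (uniform sampling here; BEiT masks blocks of patches).
num_masked = int(0.4 * num_patches)
masked_positions = torch.randperm(num_patches)[:num_masked]

corrupted = patch_embeddings.clone()
corrupted[0, masked_positions] = mask_embedding  # the transformer must recover these positions
print(corrupted.shape, num_masked)               # torch.Size([1, 196, 768]) 78
```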

The tokens predicted by the transformer are compared with the visual tokens produced by the image tokenizer, and the mismatch between predicted and actual tokens is used to adjust the model weights. Over time, the model learns to reconstruct missing patches correctly, developing a strong understanding of image structure and content.
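
In practice this comparison amounts to a classification loss: at each masked position the transformer predicts a distribution over the codebook, and cross-entropy against the tokenizer's token id drives the weight update. A sketch under those assumptions:

```python
import torch
import torch.nn.functional as F

vocab_size, num_masked = 8192, 78

# Transformer logits over the visual vocabulary at the masked positions (random stand-ins).
predicted_logits = torch.randn(num_masked, vocab_size, requires_grad=True)

# Ground-truth visual tokens for the same positions, produced by the image tokenizer.
target_tokens = torch.randint(0, vocab_size, (num_masked,))

# Cross-entropy over the codebook; its gradient is what adjusts the model weights.
loss = F.cross_entropy(predicted_logits, target_tokens)
loss.backward()  # in real training, gradients flow back through the transformer
print(loss.item())
```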

Once trained, the model can accurately generate the correct visual tokens and even reconstruct masked-out regions. In doing so, it builds a deep understanding of the feature representations of the images it has seen. This knowledge is stored in the encoder of the BEiT model and can later be transferred to downstream tasks such as image classification, image segmentation, and object detection.

Fine-Tuning and Downstream Tasks

After pre-training, the model can be fine-tuned on specialized computer vision problems such as image classification, semantic segmentation, and object detection.
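
The sketch below illustrates the transfer step: a stand-in for the pre-trained encoder is paired with a freshly initialized linear head for classification. The encoder depth, 768-dimensional width, and 1000-class output are assumptions for the example, not details taken from the text.

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 768, 1000  # assumed ViT-Base width and an ImageNet-style label set

# Stand-in for the pre-trained BEiT encoder (in practice, 12+ layers with pre-trained weights).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
    num_layers=2,
)
classifier = nn.Linear(embed_dim, num_classes)  # new task-specific head, randomly initialized

features = encoder(torch.randn(8, 196, embed_dim))  # (batch, patches, dim)
logits = classifier(features.mean(dim=1))           # average-pool over patches, then classify
print(logits.shape)                                 # torch.Size([8, 1000])
```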

BEiT leverages this pre-training to outperform vision transformers that are randomly initialized and trained from scratch on datasets such as ImageNet-1K, and it also surpasses other self-supervised models.

For semantic segmentation, BEiT even outperforms networks that use supervised pre-training with labeled data, despite not using any labels during its own pre-training phase.

Conclusion

Vision Transformers can replicate the success of transformers in NLP, but they typically require massive datasets to achieve competitive performance. Acquiring and labeling such large image datasets is costly and time-consuming.

BEiT addresses this limitation by using self-supervised pre-training with unlabeled data. The model learns by predicting masked image tokens rather than requiring explicit labels. This method enables it to learn robust visual representations from large amounts of freely available data.

By converting images into tokens through image tokenization, BEiT bridges the gap between visual and textual transformers. It uses this process to outperform existing models on image classification and semantic segmentation tasks, proving that self-supervised learning with image tokenization is a powerful path forward for computer vision.