CLIP Prefix for Image Captioning

Model Details

What is CLIP Prefix Captioning?

CLIP prefix captioning is a technique that uses Contrastive Language-Image Pre-Training (CLIP), a vision-language model, together with a language model (GPT2) to generate captions for images. The "prefix" in the name refers to how the image is passed to the language model: the CLIP encoding of the image is converted by a mapping network into a prefix that is placed in front of the caption, and the language model continues from that prefix to produce the caption text. CLIP is trained on a large corpus of paired images and text, which enables it to capture the relationship between words and visual content. Conditioned on the prefix, the language model generates a caption that describes the image. The resulting captions can be used in various applications, including image search, content recommendation, and accessibility for visually impaired users.


Example captions: "The men of tv drama." "A motorcycle parked in the desert."


Model Detail

In this paper, we present a simple approach to image captioning. We use the CLIP encoding as a prefix to the caption, by employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features that were trained with textual context, making it well suited to vision-language perception. Our key idea is that together with a pre-trained language model (GPT2), we obtain a wide understanding of both visual and textual data. Hence, our approach requires only rather quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate that our model achieves results comparable to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while being simpler, faster, and lighter.
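
The prefix idea described above can be illustrated with a minimal shape-level sketch. It is not the released implementation: it uses NumPy with random weights in place of trained networks, and random arrays stand in for the CLIP image encoding and the GPT2 token embeddings. The sizes (512, 768, prefix length 10, hidden width 256), the two-layer MLP form of the mapping network, and all function names are assumptions chosen for illustration.

```python
import numpy as np

# Assumed sizes, for illustration only.
CLIP_DIM = 512      # width of the CLIP image encoding
GPT_DIM = 768       # width of the language model's token embeddings
PREFIX_LEN = 10     # number of prefix positions fed to the language model
HIDDEN_DIM = 256    # hidden width of the hypothetical mapping MLP


def make_mapping_params(rng):
    """Random weights for a two-layer MLP mapping network (untrained)."""
    return {
        "w1": rng.standard_normal((CLIP_DIM, HIDDEN_DIM)) * 0.02,
        "b1": np.zeros(HIDDEN_DIM),
        "w2": rng.standard_normal((HIDDEN_DIM, PREFIX_LEN * GPT_DIM)) * 0.02,
        "b2": np.zeros(PREFIX_LEN * GPT_DIM),
    }


def map_to_prefix(clip_encoding, params):
    """Map (batch, CLIP_DIM) encodings to (batch, PREFIX_LEN, GPT_DIM)."""
    hidden = np.tanh(clip_encoding @ params["w1"] + params["b1"])
    flat = hidden @ params["w2"] + params["b2"]
    return flat.reshape(-1, PREFIX_LEN, GPT_DIM)


def build_decoder_input(prefix, caption_embeddings):
    """Place the prefix in front of the caption embeddings (sequence axis)."""
    return np.concatenate([prefix, caption_embeddings], axis=1)


rng = np.random.default_rng(0)
params = make_mapping_params(rng)

# Stand-ins: a batch of 2 image encodings and 7-token caption embeddings.
clip_encoding = rng.standard_normal((2, CLIP_DIM))
caption_embeddings = rng.standard_normal((2, 7, GPT_DIM))

prefix = map_to_prefix(clip_encoding, params)
decoder_input = build_decoder_input(prefix, caption_embeddings)

print(prefix.shape)         # (2, 10, 768)
print(decoder_input.shape)  # (2, 17, 768)
```

In the frozen variant described above, only the mapping parameters (here `params`) would be updated during training, while the CLIP encoder and the language model keep their pre-trained weights.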

