ClipCap
CLIP Prefix for Image Captioning
Try the model interactively in the demo.
Model Details
What is CLIP prefix captioning?
CLIP prefix captioning is a technique that generates image captions by pairing CLIP (Contrastive Language-Image Pre-Training) with a language model. CLIP is trained on a large corpus of image-text pairs, which enables it to relate visual content to words. The "prefix" in the name refers to how the image is presented to the language model: the CLIP image embedding is mapped into a short sequence of embeddings that is prepended to the caption tokens, acting as a prompt that conditions the language model on the image. Given this prefix, the language model generates a caption that describes the image. The resulting captions can be used in various applications, including image search, content recommendation, and accessibility for visually impaired users.
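To make the mapping concrete, here is a minimal PyTorch sketch of the prefix idea, assuming a small MLP mapper, a prefix length of 10, and 512-dimensional CLIP / 768-dimensional GPT-2 embedding sizes; these choices are illustrative and not necessarily the configuration used in the paper, and a random vector stands in for a real CLIP image embedding.

```python
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    """Hypothetical MLP that maps one CLIP embedding to `prefix_len` GPT-2-sized embeddings."""
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, (gpt_dim * prefix_len) // 2),
            nn.Tanh(),
            nn.Linear((gpt_dim * prefix_len) // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_embedding):            # clip_embedding: (batch, clip_dim)
        prefix = self.mlp(clip_embedding)         # (batch, gpt_dim * prefix_len)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)

# A random vector stands in for clip_model.encode_image(image) from a real CLIP model.
mapper = PrefixMapper()
fake_clip_embedding = torch.randn(1, 512)
prefix_embeddings = mapper(fake_clip_embedding)   # shape (1, 10, 768): ready to prepend to GPT-2 inputs
print(prefix_embeddings.shape)
```

The resulting prefix embeddings play the role of a learned prompt: they occupy the first positions of the language model's input and steer generation toward a caption that matches the image.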
Example
Two example captions generated by the model (sample images omitted): "The men of tv drama." and "A motorcycle parked in the desert."
Abstract
In this paper, we present a simple approach to address this task. We use the CLIP encoding as a prefix to the caption, by employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features which were trained with textual context, making it well suited for vision-language perception. Our key idea is that together with a pre-trained language model (GPT-2), we obtain a broad understanding of both visual and textual data. Hence, our approach only requires rather quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate that our model achieves results comparable to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while being simpler, faster, and lighter.
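As a rough illustration of the frozen-backbone variant described above, the sketch below optimizes only a mapping network (reduced here to a single linear layer for brevity) while GPT-2 from the Hugging Face `transformers` library stays frozen; the `gpt2` checkpoint, prefix length, learning rate, and the stand-in random CLIP embedding are all assumptions made for illustration, not the authors' released training code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
for p in gpt2.parameters():                       # the language model stays frozen
    p.requires_grad_(False)

prefix_len, clip_dim, gpt_dim = 10, 512, 768      # illustrative sizes (ViT-B/32 CLIP, GPT-2 small)
mapper = torch.nn.Linear(clip_dim, prefix_len * gpt_dim)      # simplest possible mapping network
optimizer = torch.optim.AdamW(mapper.parameters(), lr=2e-5)   # only the mapper is optimized

# One illustrative training step with a fake CLIP embedding and a ground-truth caption.
clip_embedding = torch.randn(1, clip_dim)         # stand-in for clip_model.encode_image(image)
caption_ids = tokenizer("A motorcycle parked in the desert.", return_tensors="pt").input_ids

prefix_embeds = mapper(clip_embedding).view(1, prefix_len, gpt_dim)
caption_embeds = gpt2.transformer.wte(caption_ids)            # GPT-2 word embeddings
inputs_embeds = torch.cat([prefix_embeds, caption_embeds], dim=1)

# Loss is computed only on the caption tokens; -100 masks out the prefix positions.
labels = torch.cat(
    [torch.full((1, prefix_len), -100, dtype=torch.long), caption_ids], dim=1
)
loss = gpt2(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()
optimizer.step()
print(f"single-step loss: {loss.item():.3f}")
```

At inference time, the same prefix embeddings are fed to the frozen language model, which then decodes the caption autoregressively (e.g., with greedy or beam search).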
@article{mokady2021clipcap,
  title={ClipCap: CLIP Prefix for Image Captioning},
  author={Mokady, Ron and Hertz, Amir and Bermano, Amit H},
  journal={arXiv preprint arXiv:2111.09734},
  year={2021}
}