Oftentimes, machine learning (ML) model developers begin their design using a generic backbone model that is trained at scale and with capabilities transferable to a wide range of downstream tasks. In natural language processing, a number of popular backbone models, including BERT, T5, and GPT-3 (sometimes also referred to as "foundation models"), are pre-trained on web-scale data and have demonstrated generic multi-tasking capabilities through zero-shot, few-shot, or transfer learning. Compared with training over-specialized individual models, pre-training backbone models for a large number of downstream tasks can amortize the training costs, allowing one to overcome resource limitations when building large-scale models.
In computer vision, pioneering work has shown the effectiveness of single-encoder models pre-trained for image classification to capture generic visual representations that are effective for other downstream tasks. More recently, contrastive dual-encoder (CLIP, ALIGN, Florence) and generative encoder-decoder (SimVLM) approaches trained using web-scale noisy image-text pairs have been explored. Dual-encoder models exhibit remarkable zero-shot image classification capabilities but are less effective for joint vision-language understanding. On the other hand, encoder-decoder methods are good at image captioning and visual question answering but cannot perform retrieval-style tasks.
In "CoCa: Contrastive Captioners are Image-Text Foundation Models", we present a unified vision backbone model called Contrastive Captioner (CoCa). Our model is a novel encoder-decoder approach that simultaneously produces aligned unimodal image and text embeddings and joint multimodal representations, making it flexible enough to be directly applicable to all types of downstream tasks. Specifically, CoCa achieves state-of-the-art results on a series of vision and vision-language tasks spanning vision recognition, cross-modal alignment, and multimodal understanding. Furthermore, it learns highly generic representations so that it can perform as well as or better than fully fine-tuned models with zero-shot learning or frozen encoders.
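The zero-shot classification ability of dual-encoder models comes from comparing an image embedding against text embeddings of the class names in a shared space. The sketch below illustrates that mechanism with toy NumPy stand-ins; the random "embeddings" and the `zero_shot_classify` helper are illustrative assumptions, not CLIP's or CoCa's actual code.

```python
import numpy as np

# Toy sketch of dual-encoder zero-shot classification. In a real model, an
# image encoder and a text encoder would produce these embeddings; here we
# use fixed random vectors as stand-ins (assumption for illustration only).
rng = np.random.default_rng(0)

def normalize(x):
    # Project onto the unit sphere so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is most
    cosine-similar to the image embedding."""
    sims = normalize(image_emb) @ normalize(class_text_embs).T
    return int(np.argmax(sims))

# Stand-in text embeddings for 3 classes; the "image" is near class 1.
class_embs = rng.normal(size=(3, 8))
image = class_embs[1] + 0.05 * rng.normal(size=8)
print(zero_shot_classify(image, class_embs))  # → 1
```

No labeled training data for the target classes is needed: the class set is defined entirely by the text prompts at inference time.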
|Overview of Contrastive Captioners (CoCa) compared to single-encoder, dual-encoder and encoder-decoder models.|
We propose CoCa, a unified training framework that combines contrastive loss and captioning loss on a single training data stream consisting of image annotations and noisy image-text pairs, effectively merging single-encoder, dual-encoder and encoder-decoder paradigms.
To this end, we present a novel encoder-decoder architecture where the encoder is a vision transformer (ViT), and the text decoder transformer is decoupled into two parts, a unimodal text decoder and a multimodal text decoder. We skip cross-attention in unimodal decoder layers to encode text-only representations for the contrastive loss, and cascade multimodal decoder layers with cross-attention to image encoder outputs to learn multimodal image-text representations for the captioning loss. This design maximizes the model's flexibility and universality in accommodating a wide spectrum of tasks, and at the same time it can be efficiently trained with a single forward and backward propagation for both training objectives, resulting in minimal computational overhead. Thus, the model can be trained end-to-end from scratch with training costs comparable to a naïve encoder-decoder model.
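How one forward pass can feed both objectives can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the paper's implementation: the embeddings and logits are taken as given (as if produced by the image encoder, the unimodal decoder, and the multimodal decoder), and the `temperature` value is an assumption.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def coca_losses(img_emb, txt_emb, caption_logits, caption_tokens,
                temperature=0.07):
    """Compute both CoCa training objectives from one forward pass.
    img_emb, txt_emb: (B, D) unimodal embeddings from the image encoder
        and the unimodal text decoder.
    caption_logits: (B, T, V) multimodal decoder outputs over a vocabulary.
    caption_tokens: (B, T) target caption token ids."""
    # Contrastive loss: symmetric InfoNCE over the in-batch similarity
    # matrix, with matched image-text pairs on the diagonal.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    diag = np.arange(len(img))
    contrastive = -(log_softmax(logits, 1)[diag, diag].mean()
                    + log_softmax(logits.T, 1)[diag, diag].mean()) / 2
    # Captioning loss: per-token cross-entropy from the multimodal decoder.
    lp = log_softmax(caption_logits, -1)
    B, T = caption_tokens.shape
    captioning = -lp[np.arange(B)[:, None],
                     np.arange(T)[None, :],
                     caption_tokens].mean()
    return contrastive, captioning  # trained on a weighted sum of the two
```

Because both losses consume outputs of the same forward pass, a single backward pass updates the encoder and both decoder halves jointly.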
|Illustration of forward propagation used by CoCa for both contrastive and captioning losses.|
The CoCa model can be directly fine-tuned on many tasks with minimal adaptation. By doing so, our model achieves a series of state-of-the-art results on popular vision and multimodal benchmarks, including (1) visual recognition: ImageNet, Kinetics-400/600/700, and MiT; (2) cross-modal alignment: MS-COCO, Flickr30K, and MSR-VTT; and (3) multimodal understanding: VQA, SNLI-VE, NLVR2, and NoCaps.
|Comparison of CoCa with other image-text backbone models (without task-specific customization) and multiple state-of-the-art task-specialized models.|
It is noteworthy that CoCa attains these results as a single model adapted to all tasks while often being lighter than prior top-performing specialized models. For example, CoCa obtains 91.0% ImageNet top-1 accuracy while using less than half the parameters of prior state-of-the-art models. In addition, CoCa also demonstrates a strong generative capability, producing high-quality image captions.
|Image classification scaling performance comparing fine-tuned ImageNet top-1 accuracy versus model size.|
|Text captions generated by CoCa with NoCaps images as input.|
In addition to achieving excellent performance with fine-tuning, CoCa also outperforms previous state-of-the-art models on zero-shot learning tasks, including image classification and cross-modal retrieval. CoCa obtains 86.3% zero-shot accuracy on ImageNet while also robustly outperforming prior models on challenging variant benchmarks, such as ImageNet-A, ImageNet-R, ImageNet-V2, and ImageNet-Sketch. As shown in the figure below, CoCa obtains better zero-shot accuracy with smaller model sizes compared to prior methods.
|Image classification scaling performance comparing zero-shot ImageNet top-1 accuracy versus model size.|
Frozen Encoder Representation
One particularly exciting observation is that CoCa achieves results comparable to the best fine-tuned models using only a frozen visual encoder, in which features extracted after model training are used to train a classifier, rather than the more computationally intensive effort of fine-tuning the model. On ImageNet, a frozen CoCa encoder with a learned classification head obtains 90.6% top-1 accuracy, which is better than the fully fine-tuned performance of existing backbone models (90.1%). We also find this setup to work extremely well for video recognition. We feed sampled video frames into the frozen CoCa image encoder individually, and fuse the output features by attentional pooling before applying a learned classifier. This simple approach using a frozen CoCa image encoder achieves video action recognition top-1 accuracy of 88.0% on the Kinetics-400 dataset, and demonstrates that CoCa learns a highly generic visual representation with the combined training objectives.
|Comparison of the frozen CoCa visual encoder with (multiple) best-performing fine-tuned models.|
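The attentional-pooling step for video can be sketched as a single cross-attention read-out over per-frame features. This is a minimal NumPy sketch under assumptions: a single learned query and key/value projections (real attentional poolers typically use multiple heads and queries), with the frozen encoder's frame features taken as given.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentional_pool(frame_feats, query, Wk, Wv):
    """Fuse per-frame features from a frozen image encoder into one
    clip-level vector via a single-query cross-attention read-out.
    frame_feats: (T, D) features, one per sampled video frame.
    query: (D,) learned query; Wk, Wv: (D, D) learned key/value maps."""
    keys = frame_feats @ Wk            # (T, D)
    values = frame_feats @ Wv          # (T, D)
    attn = softmax(query @ keys.T / np.sqrt(keys.shape[1]))  # (T,) weights
    return attn @ values               # (D,) pooled clip representation
```

Only `query`, `Wk`, `Wv`, and the downstream classifier are trained for the video task; the image encoder's weights stay frozen throughout.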
We present Contrastive Captioner (CoCa), a novel pre-training paradigm for image-text backbone models. This simple method is widely applicable to many types of vision and vision-language downstream tasks, and obtains state-of-the-art performance with minimal or even no task-specific adaptations.
We would like to thank our co-authors Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu, who have been involved in all aspects of the project. We would also like to thank Yi-Ting Chen, Kaifeng Chen, Ye Xia, Zhen Li, Chao Jia, Yinfei Yang, Zhengdong Zhang, Wei Han, Yuan Cao, Tao Zhu, Futang Peng, Soham Ghosh, Zihang Dai, Xin Li, Anelia Angelova, Jason Baldridge, Izhak Shafran, Shengyang Dai, Abhijit Ogale, Zhifeng Chen, Claire Cui, Paul Natsev, and Tom Duerig for helpful discussions, Andrew Dai for help with contrastive models, Christopher Fifty and Bowen Zhang for help with video models, Yuanzhong Xu for help with model scaling, Lucas Beyer for help with data preparation, Andy Zeng for help with MSR-VTT evaluation, Hieu Pham and Simon Kornblith for help with zero-shot evaluations, Erica Moreira and Victor Gomes for help with resource coordination, Liangliang Cao for proofreading, Tom Small for creating the animations used in this blogpost, and others in the Google Brain team for support throughout this project.