uCLIP: Parameter‑Efficient Multilingual Extension of Vision–Language Models with Unpaired Data

AAAI 2026
1 KAIST AI, 2 Korea University
*Indicates Equal Contribution
uCLIP architecture overview

uCLIP is a parameter‑efficient framework that extends vision–language models to under‑represented languages without requiring any image–text (I–T) pairs—multilingual or English—or multilingual–English text–text (T–T) pairs during training.

Abstract

Contrastive Language–Image Pre‑training (CLIP) demonstrates strong generalization across visual tasks by leveraging large‑scale English image–text pairs. However, its extension to low‑resource languages remains limited due to the scarcity of high‑quality multilingual image–text data. Existing multilingual vision–language models show consistently low retrieval performance in under‑represented languages—such as Czech, Finnish, Croatian, Hungarian, and Romanian—on the Crossmodal‑3600 (XM3600) benchmark.

We propose a lightweight, data‑efficient framework for multilingual vision–language alignment. Our approach requires no image–text or text–text pairs and freezes both the pretrained image encoder and the multilingual text encoder. Only a compact 1.7M‑parameter projection module is trained, using a contrastive loss with English representations as semantic anchors. This minimal setup enables robust multilingual alignment even for languages with limited supervision.
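As a rough illustration of the training setup described above, the PyTorch sketch below shows a compact projection head and a symmetric InfoNCE-style contrastive loss in which English embeddings serve as anchors. The module sizes, names, and the way anchor batches are formed without paired supervision are illustrative assumptions, not the exact recipe.

```python
# Illustrative sketch only: a compact projection head and a contrastive loss
# with English embeddings as anchors. Dimensions and batch construction are
# assumptions; the frozen encoders and the pivoting strategy follow the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Small MLP mapping frozen-encoder features into the shared space."""
    def __init__(self, in_dim: int, out_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def anchored_contrastive_loss(z: torch.Tensor, anchors: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between projected embeddings `z` and English anchor
    embeddings `anchors`; rows are assumed to correspond within the batch."""
    logits = z @ anchors.t() / temperature
    targets = torch.arange(z.size(0), device=z.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```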

Extensive evaluation across multiple multilingual retrieval benchmarks confirms the effectiveness of our method, with significant gains in the five under‑represented languages where existing models typically underperform. These findings highlight the effectiveness of our pivot‑based, parameter‑efficient alignment strategy for inclusive multimodal learning.

Main Architecture

uCLIP architecture overview
We propose a lightweight alignment framework that bridges multilingual text and image embeddings via English, without requiring paired I–T or T–T data or encoder fine-tuning. uCLIP employs frozen encoders together with compact projection heads that map inputs into a shared embedding space. At inference time, only the multilingual text encoder, the image encoder, and the projection heads are used: multilingual text and image inputs are encoded by the frozen encoders and then projected into the shared space.
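The sketch below illustrates this inference path under assumed CLIP-style and HuggingFace-style encoder interfaces; `image_encoder`, `text_encoder`, `proj_img`, and `proj_txt` are placeholders for the frozen encoders and trained projection heads, not the released API.

```python
# Hedged inference sketch: frozen encoders plus trained projectors.
# Encoder interfaces (encode_image, pooler_output) are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed_images(images, image_encoder, proj_img):
    feats = image_encoder.encode_image(images)         # frozen image encoder
    return F.normalize(proj_img(feats), dim=-1)        # project into shared space

@torch.no_grad()
def embed_texts(token_batch, text_encoder, proj_txt):
    feats = text_encoder(**token_batch).pooler_output  # frozen multilingual encoder
    return F.normalize(proj_txt(feats), dim=-1)

def retrieve(query_embs, image_embs, k=10):
    sims = query_embs @ image_embs.t()                 # cosine similarities
    return sims.topk(k, dim=-1).indices                # top-k image indices per query
```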

Multilingual Image–Text Retrieval

We evaluate our approach on multilingual image-to-text retrieval, text-to-image retrieval, and zero-shot classification across five low-resource languages. On image-to-text retrieval (I→T), uCLIP achieves average R@10 scores of 53.2%, 60.0%, and 71.8% on MSCOCO, Flickr30k, and XM3600, respectively, outperforming the other baselines. For text-to-image retrieval (T→I), it records 53.3%, 59.7%, and 70.4%, which is competitive with the strongest baselines. Unlike baselines that rely on extensive multilingual pretraining or on direct image–multilingual-text or English–multilingual-text supervision, uCLIP operates without any explicit paired supervision and still achieves superior results, highlighting the effectiveness of its lightweight cross-modal alignment with English as the semantic pivot.
[Figure: multilingual image–text retrieval results]
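For reference, the R@10 numbers above can be computed from a query–candidate similarity matrix as in the short sketch below, which assumes a simplified setting with exactly one ground-truth candidate per query (the diagonal); the actual benchmarks pair each image with multiple captions.

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int = 10) -> float:
    """sim[i, j]: similarity of query i to candidate j, with the ground-truth
    match for query i located at candidate i (diagonal)."""
    ranks = sim.argsort(dim=-1, descending=True)                    # sorted candidate indices
    gt = torch.arange(sim.size(0), device=sim.device).unsqueeze(1)  # ground-truth index per query
    hits = (ranks[:, :k] == gt).any(dim=-1)                         # ground truth within top-k?
    return hits.float().mean().item()
```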

Zero-shot Classification

We assess zero-shot classification on multilingually translated benchmarks: CIFAR-10 and STL-10. uCLIP maintains strong class discrimination across languages without using any paired data, unlike baselines such as M-CLIP and SigLIP2, which depend on direct T–T or I–T supervision. For example, on CIFAR-10, uCLIP achieves 90.5 in Finnish and 91.9 in Croatian, matching or surpassing M-CLIP and SigLIP2, which rely on translated captions and massive multilingual image–text pairs (22B training examples in the case of SigLIP2), respectively.
[Figure: zero-shot classification results]
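As a hedged sketch of this protocol: class names translated into the target language are embedded with the frozen multilingual text encoder and the trained projector, and each image is assigned to its most similar class embedding. The helper names below are hypothetical.

```python
# Illustrative zero-shot classification with projected multilingual text
# embeddings; embed_text_fn is a hypothetical helper wrapping the frozen
# multilingual text encoder and the trained projector.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_embs: torch.Tensor, class_texts: list, embed_text_fn):
    """image_embs: (N, D) L2-normalized projected image embeddings.
    class_texts: class names already translated into the target language.
    embed_text_fn: maps a list of strings to (C, D) projected embeddings."""
    class_embs = F.normalize(embed_text_fn(class_texts), dim=-1)
    logits = image_embs @ class_embs.t()   # cosine similarity to each class
    return logits.argmax(dim=-1)           # predicted class index per image
```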

Qualitative Results

We present qualitative comparisons of multilingual image-text retrieval. Each query is a translated sentence from one of five low-resource languages, and we compare the top-1 retrieved images from uCLIP and four baselines. Compared to existing models, uCLIP consistently retrieves more semantically aligned images, demonstrating robust comprehension of not only key objects but also relational and contextual elements in the sentence. For example, given the query “Two young men playing a game of soccer,” other models often retrieve images with only one person, scenes with more than two individuals, or even incorrect activities (e.g., frisbee), suggesting failure to capture compositional structure. In contrast, uCLIP retrieves a correctly grounded image matching both the subject count and activity. Similarly, for the query “A red fire hydrant in front of a shopping center,” most models detect the hydrant but miss the “shopping center” context, indicating over-reliance on salient object keywords. uCLIP, however, correctly captures both the foreground and background, illustrating strong visual grounding and full-sentence understanding. These results suggest that uCLIP is able to go beyond surface-level keyword matching and instead align multimodal features based on deeper cross-lingual and cross-modal semantic structures.
[Figure: qualitative retrieval comparisons]

BibTeX

TBA