Contrastive Language–Image Pre‑training (CLIP) demonstrates strong generalization across visual tasks by leveraging large‑scale English image–text pairs. However, its extension to low‑resource languages remains limited due to the scarcity of high‑quality multilingual image–text data. On the Crossmodal‑3600 (XM3600) benchmark, existing multilingual vision–language models show consistently low retrieval performance in under‑represented languages such as Czech, Finnish, Croatian, Hungarian, and Romanian.
We propose a lightweight, data‑efficient framework for multilingual vision–language alignment. Our approach requires neither image–text nor text–text pairs and keeps both the pretrained image encoder and the multilingual text encoder frozen. Only a compact 1.7M‑parameter projection module is trained, with a contrastive loss that uses English representations as semantic anchors. This minimal setup enables robust multilingual alignment even for languages with limited supervision.
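To make the setup concrete, the following is a minimal sketch in PyTorch under one plausible reading of the abstract; the class name `ProjectionHead`, the embedding dimensions, and the stand-in tensors for the two frozen encoders are illustrative assumptions, not the authors' released code. Only the projection head carries trainable parameters, and the frozen English (CLIP) text embeddings act as the anchors in a symmetric InfoNCE objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Compact trainable projection mapping multilingual sentence embeddings
    into the frozen CLIP text-embedding space (on the order of a million
    parameters, depending on the chosen dimensions)."""

    def __init__(self, in_dim: int = 768, out_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so the dot products below are cosine similarities.
        return F.normalize(self.net(x), dim=-1)


def contrastive_alignment_loss(projected: torch.Tensor,
                               anchors: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: the i-th projected multilingual embedding is
    pulled toward the i-th frozen English anchor and pushed from the rest."""
    logits = projected @ anchors.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Toy usage: random tensors stand in for the two frozen encoders' outputs
# on the same batch of English pivot sentences.
proj = ProjectionHead()
multilingual_emb = torch.randn(8, 768)                      # frozen multilingual encoder
english_anchors = F.normalize(torch.randn(8, 512), dim=-1)  # frozen CLIP text encoder
loss = contrastive_alignment_loss(proj(multilingual_emb), english_anchors)
loss.backward()  # gradients flow only into the projection head
```

Under this reading, no images are needed during alignment; at retrieval time, projected text embeddings for any language handled by the multilingual encoder can be compared directly against frozen CLIP image embeddings.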
Extensive evaluation across multiple multilingual retrieval benchmarks confirms the effectiveness of our method, with significant gains in the five under‑represented languages where existing models typically underperform. These findings underscore the value of our pivot‑based, parameter‑efficient alignment strategy for inclusive multimodal learning.