<p dir="ltr">Vision-language models (VLMs) have demonstrated strong zero-shot inference capabilities, but they can also exhibit stereotypical biases toward certain demographic groups. Consequently, downstream tasks that leverage these models may yield unbalanced performance across target social groups, potentially reinforcing harmful stereotypes. Mitigating such biases is critical for ensuring fairness in practical applications. Existing debiasing approaches typically rely on curated face-centric datasets for fine-tuning or retraining, which risks overfitting and limits generalizability. To address this limitation, we propose a novel framework, CABIN (Causal Adjustment Based INtervention), which adopts a causal view that treats sensitive attributes in images as confounding factors. Using a learned mapper trained on general large-scale image-text pairs rather than face-centric datasets, CABIN adjusts the sensitive attributes encoded in an image embedding via text, making the embedding independent of those attributes. This independence enables a backdoor adjustment for unbiased inference without additional fine-tuning or retraining on narrowly tailored datasets. Through comprehensive experiments and analyses, we demonstrate that CABIN effectively mitigates bias and improves fairness metrics while preserving the zero-shot strengths of VLMs.</p>
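<p dir="ltr">As a rough illustration of the backdoor-adjustment idea described above (not the authors' implementation), the sketch below marginalizes CLIP-style class predictions over text-described sensitive-attribute values. The embeddings, the <code>mapper</code> function, and the uniform attribute prior are all hypothetical placeholders; in CABIN the mapper is learned from general image-text pairs and the embeddings come from a frozen VLM encoder.</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

# Hypothetical sizes: D-dim embeddings, K target classes, A sensitive-attribute values.
D, K, A = 512, 3, 2

# Placeholder embeddings standing in for a frozen VLM's outputs (e.g., CLIP encoders).
image_emb = F.normalize(torch.randn(D), dim=0)             # image embedding
class_text_embs = F.normalize(torch.randn(K, D), dim=-1)   # text embeddings of class prompts
attr_text_embs = F.normalize(torch.randn(A, D), dim=-1)    # text embeddings describing attribute values

def mapper(image_emb, attr_emb):
    """Hypothetical stand-in for the learned mapper: shifts the image embedding
    toward a text-described sensitive-attribute value, producing an
    attribute-conditioned embedding."""
    return F.normalize(image_emb + 0.5 * attr_emb, dim=0)

# Backdoor adjustment: P(y | do(x)) = sum_a P(y | x, a) P(a),
# approximated here with a uniform prior over attribute values.
prior = torch.full((A,), 1.0 / A)
adjusted_probs = torch.zeros(K)
for a in range(A):
    z_a = mapper(image_emb, attr_text_embs[a])       # intervene on the sensitive attribute
    logits = 100.0 * z_a @ class_text_embs.T         # CLIP-style cosine-similarity logits
    adjusted_probs += prior[a] * logits.softmax(-1)  # marginalize over the confounder

print(adjusted_probs)  # debiased class distribution
</code></pre>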