Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning

Guoxin Wang1†, Jun Zhao1, Xinyi Liu1, Enyou Liu1, Xuyang Cao1, Chao Li1, Zhuoyun Liu1, Qintian Sun1, Fangru Zhou1, Haoqiang Xing1, Zhenhong Yang1
1JDH Algo, JD Health International Inc.    †Project Leader

Framework. Model architecture of Citrus-V. The framework consists of three components: (1) an MLLM—including the LLM, tokenizer, and a vision encoder—for high-level visual-textual reasoning such as report generation, VQA, and grounding; (2) a segmentation projector that maps the "[SEG]" token produced by the MLLM into latent segmentation prompts; and (3) a segmentation model that decodes the latent segmentation prompts together with semantic image features into pixel-level masks. Separate image encoders are employed to decouple low-level details for segmentation from high-level semantics for other tasks, ensuring both types of tasks are optimized without semantic conflict.
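
For readers who want a concrete picture of the data flow, the following is a minimal PyTorch-style sketch of the three components described in the caption; module names, dimensions, and call signatures are illustrative assumptions rather than the released implementation (in the released system the segmentation model is SAM2-based, as described under Training Strategy).

```python
import torch.nn as nn


class CitrusVSketch(nn.Module):
    """Minimal sketch of the three-component design described above.
    Module names, dimensions, and interfaces are illustrative assumptions."""

    def __init__(self, mllm, seg_encoder, seg_decoder, hidden_dim=4096, prompt_dim=256):
        super().__init__()
        self.mllm = mllm                # LLM + tokenizer + high-level (semantic) vision encoder
        self.seg_encoder = seg_encoder  # separate low-level image encoder for segmentation
        self.seg_decoder = seg_decoder  # mask decoder conditioned on latent segmentation prompts
        # segmentation projector: maps the "[SEG]" hidden state into latent segmentation prompts
        self.seg_projector = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, prompt_dim),
        )

    def forward(self, images, input_ids, seg_token_id):
        # 1) high-level visual-textual reasoning (report generation, VQA, grounding)
        out = self.mllm(images=images, input_ids=input_ids, output_hidden_states=True)
        hidden = out.hidden_states[-1]                            # (B, T, hidden_dim)

        # 2) project the hidden states at "[SEG]" positions into latent segmentation prompts
        seg_positions = input_ids == seg_token_id                 # (B, T) boolean mask
        seg_prompts = self.seg_projector(hidden[seg_positions])   # (N_seg, prompt_dim)

        # 3) decode pixel-level masks from low-level image features plus latent prompts
        low_level_feats = self.seg_encoder(images)
        masks = self.seg_decoder(low_level_feats, seg_prompts)
        return out.logits, masks
```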

Abstract

Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.

Contribution

1. We construct a unified model that integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference within a single model.


2. To facilitate reproducibility and support the research community, we release Citrus-V along with a curated open-source data suite, including a multimodal chain-of-thought reasoning dataset for report generation, a refined detection and segmentation benchmark with corrected labels, and a medical document understanding benchmark with graded difficulty levels.


3. We further design a novel multimodal training paradigm to accelerate convergence and enhance generalization across diverse imaging and reasoning tasks.

Comparison with Other Methods

The comparison below groups models by column into open-source models under 10B parameters, open-source models over 10B parameters, and commercial models; rows are organized by ability (task category), evaluation benchmark, and sub-category.
Task Sub-Category MedGemma 4B Qwen2.5-VL 7B HuatuoGPT-V 7B Lingshu 7B Citrus-V 8B HealthGPT 14B MedPLIB 14B Qwen2.5-VL 32B Lingshu 32B Citrus-V 33B HuatuoGPT-V 34B MedGemma 27B Qwen2.5-VL 72B Citrus-V 73B Doubao 1.6 GPT 4.1 GPT 5
Medical visual question answering (Med-VQA) VQA-RAD - 72.06 66.30 67.85 68.74 64.08 45.45 72.28 75.39 63.64 63.86 79.16 33.49 62.53 68.37
MedXpertQA MM 22.05 20.75 22.30 26.90 24.55 - 25.30 31.00 22.65 33.10 28.15 45.75 43.35 51.48
SLAKE - 78.32 67.86 69.39 82.90 67.43 38.54 76.36 87.68 73.02 76.17 79.47 67.28 72.54 65.82
PATH-VQA - 48.64 42.30 44.29 60.23 58.67 40.02 41.58 64.76 44.92 47.60 42.55 47.58 54.97 31.74
PMC-VQA - 48.02 50.86 53.84 55.77 56.90 44.40 53.58 57.23 56.79 45.35 57.11 49.94 38.76 36.10
Medical text question answering (Med-TQA) PubMedQA - 73.00 75.80 73.60 75.40 69.40 49.40 68.60 78.20 71.00 79.00 70.00 76.00 76.00 78.00
MedMCQA - 52.26 53.40 51.95 56.13 63.33 1.63 62.71 65.05 55.08 63.23 69.57 75.06 77.07 62.99
MedQA USMLE 55.54 57.50 52.95 63.39 66.93 7.38 71.33 74.94 58.52 81.15 77.77 93.48 87.98 76.96
MCMLE 41.10 73.93 73.09 75.98 52.83 12.03 88.18 86.98 76.09 64.89 90.40 94.02 81.73 74.00
MedXpertQA Text 13.10 12.40 10.33 16.45 12.45 0.45 15.88 22.86 12.20 22.01 16.78 30.67 30.82 40.75
CMMLU - 43.96 68.80 71.12 69.02 55.36 15.53 82.60 82.37 77.64 60.24 87.47 91.67 81.02 82.93
Medbullets op4 48.05 47.08 43.51 62.66 53.57 3.90 59.74 68.51 44.81 67.86 62.66 82.79 78.90 88.93
op5 42.53 36.69 37.66 52.92 50.00 1.30 50.65 63.31 39.29 65.58 56.17 76.62 73.38 87.30
SuperGPQA - 21.52 26.39 22.11 27.51 25.59 0.22 38.26 40.80 28.06 33.18 45.15 55.19 50.60 49.54
Medical image detection and segmentation (Med-IDAS) MeCOVQA-G (segmentation) DER - - - - - 79.84 - - - - - - - -
CT - - - - - 57.58 - - - - - - - -
PET - - - - - 64.25 - - - - - - - -
X-RAY - - - - - 8.47 - - - - - - - -
END - - - - - 44.35 - - - - - - - -
MR - - - - - 27.38 - - - - - - - -
US - - - - - 34.22 - - - - - - - -
FP - - - - - 4.82 - - - - - - - -
MedSAM2 (eval) - 20.90
Medical Document Understanding (Med-Doc) Laboratory test report Full extract (hard) 27.61 73.23 23.10 60.06 22.22 - 71.40 63.46 22.78 26.15 74.90 80.00 66.86 69.05
Simple QA (hard) 19.87 83.62 10.26 60.48 17.90 - 83.41 73.36 8.52 32.97 88.21 84.72 71.40 78.60
Complex QA (hard) 9.56 31.22 4.56 23.45 3.17 - 30.54 32.38 10.88 5.50 44.11 70.90 36.66 56.78
All categories Simple QA (hard) 29.75 71.92 25.17 53.92 24.67 - 73.29 61.25 22.50 29.50 74.46 81.00 53.17 82.92
Complex QA (hard) 25.92 70.92 14.21 52.28 16.00 - 73.38 62.83 16.12 22.21 76.46 70.92 55.04 64.50
Medical Image Report Generation (Med-IRG) CheXpert Plus ROUGE-L 26.01 22.59 21.40 26.68 21.29 0.07 17.45 25.29 23.97 17.65 20.06 22.67 24.50 31.72
CIDEr 85.86 61.72 65.00 76.49 68.24 0.04 52.48 77.42 66.07 48.08 57.31 92.72 78.80 131.44
RaTE 51.23 43.79 46.58 46.93 47.82 21.05 46.70 46.18 45.51 48.73 44.76 53.76 45.50 56.64
IU Xray ROUGE-L 39.51 30.41 29.96 44.52 23.89 - 20.86 45.06 29.12 32.59 26.44 22.67 32.63 31.72
CIDEr 151.86 96.30 110.51 200.47 86.89 - 83.94 206.68 124.67 115.65 95.11 92.72 124.42 131.44
RaTE 61.99 54.03 54.91 60.30 52.33 - 55.44 65.22 57.87 58.76 56.00 53.76 50.91 56.64

Model Access

| Model | Size | Ability | Download |
| --- | --- | --- | --- |
| Citrus-V | 8B | Med-VQA, Med-TQA, Med-Doc, Med-IRG, and Multimodal Chain-of-Thought Reasoning (MM-CoT) | Huggingface |
| Citrus-V | 33B | Med-VQA, Med-TQA, Med-Doc, and MM-CoT | Huggingface |
| Citrus-V | 73B | Med-VQA, Med-TQA, Med-Doc, and MM-CoT | Huggingface |
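
All checkpoints are distributed via Hugging Face. As a hedged sketch, loading with the standard transformers auto classes might look like the snippet below; the repository id is a placeholder, and the exact model and processor classes depend on the released checkpoint format.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

# NOTE: placeholder repository id; use the actual id from the Huggingface links above.
repo_id = "JDH-Algo/Citrus-V-8B"

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
```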

Dataset Access

| Dataset | Usage | Description | Download |
| --- | --- | --- | --- |
| MeCoVQA-G-Plus | Benchmark | MeCOVQA-G+ evaluates the medical image segmentation capability of medical multimodal language models across imaging modalities; it is built on the open-source MeCOVQA-G dataset through expert correction and review. | Huggingface |
| MedXray-CoT | Train Data | MedXray-CoT trains large medical multimodal models to reason over different anatomical regions in X-ray images, in a manner similar to physicians. | Huggingface |
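
Assuming the datasets are published in standard Hugging Face format, they could be loaded with the datasets library as sketched below; the repository ids and split names are placeholders.

```python
from datasets import load_dataset

# NOTE: placeholder repository ids and splits; use the actual ids from the Huggingface links above.
mecovqa_g_plus = load_dataset("JDH-Algo/MeCoVQA-G-Plus", split="test")
medxray_cot = load_dataset("JDH-Algo/MedXray-CoT", split="train")
```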

Training Strategy



Concept Alignment: In this stage, most model parameters are frozen, and only the MLLM projector together with the vision encoder are updated. Training primarily relies on image-caption pairs, which establish a stable mapping from visual features into the language space without disrupting the pretrained reasoning ability of the LLM. This step provides a lightweight but essential initialization for subsequent multimodal alignment.
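
As a hedged sketch of the stage-1 trainability setup (attribute names such as mllm.vision_encoder and mllm.mm_projector are assumptions, following the module layout of the framework sketch above):

```python
def configure_concept_alignment(model):
    """Stage 1: freeze everything, then unfreeze only the MLLM vision encoder
    and the MLLM projector (attribute names are illustrative)."""
    for p in model.parameters():
        p.requires_grad = False
    for module in (model.mllm.vision_encoder, model.mllm.mm_projector):
        for p in module.parameters():
            p.requires_grad = True
    # Only image-caption pairs are used at this stage, so the LLM's pretrained
    # reasoning ability is left untouched while visual features are aligned
    # to the language space.
    return model
```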


Comprehension Enhancement: All MLLM parameters, including the projector, vision encoder, and LLM, are trainable, while the segmentation projector, encoder, and decoder remain frozen. Training covers a broader range of tasks: detailed and structured report generation and medical image interpretation are included to establish correlations between visual features, medical concepts, radiological findings, and imaging diagnoses. Scientific document comprehension data, such as charts and diagrams, are also incorporated, since interpreting scientific illustrations and graphs is essential for handling clinical documentation, laboratory reports, diagnostic image annotations, and radiographic markers. This stage strengthens the MLLM's multimodal comprehension capacity, while freezing the segmentation modules prevents premature interference from segmentation supervision.


Instruction Fine-Tuning: In this stage, the MLLM is trained on the most diverse set of instruction-following data, including report generation, text-only instructions, OCR, grounding, medical VQA, chain-of-thought reasoning tasks, and segmentation instructions. All MLLM parameters are updated, whereas the segmentation modules remain frozen. Notably, our empirical results show that directly combining the segmentation loss with the other tasks significantly degrades VQA performance. Therefore, for segmentation samples, supervision is restricted to textual outputs containing the special "[SEG]" token, without applying mask-level losses. This design allows the MLLM to acquire the discourse patterns needed for segmentation queries and to encode segmentation intent into the hidden state of the "[SEG]" token.
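
The stage-3 objective can be sketched as follows; the point is that segmentation samples, like all other samples, contribute only a language-modeling loss on target text containing the "[SEG]" token (function and argument names are illustrative).

```python
import torch.nn.functional as F


def instruction_ft_loss(lm_logits, labels):
    """Stage 3: every sample, including segmentation instructions, is trained with
    next-token prediction only; prompt and padding positions carry the label -100."""
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    # Mask-level losses are deliberately omitted here; segmentation intent is
    # encoded only in the hidden state of the "[SEG]" token.
    return lm_loss
```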


Segmentation Fine-Tuning: In this stage, all MLLM parameters are frozen, and optimization focuses exclusively on segmentation components, including the segmentation projector, the segmentation encoder, and the segmentation decoder. Unlike prior methods that freeze most of the segmentation backbone, the full SAM2 architecture, including the vision encoder, prompt encoder, and mask decoder, is fine-tuned to adapt effectively to medical imaging. With the MLLM frozen, training is computationally efficient, while full adaptation of SAM2 ensures precise and domain-specific pixel-level segmentation.
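
A hedged sketch of the stage-4 setup follows: the MLLM is frozen while the segmentation projector and the full SAM2-style segmentation model are trained with a pixel-level mask loss. The BCE plus soft-Dice combination below is an assumed, commonly used choice; the text does not specify the exact loss.

```python
import torch.nn.functional as F


def configure_segmentation_ft(model):
    """Stage 4: freeze the MLLM; train the segmentation projector, encoder, and decoder."""
    for p in model.mllm.parameters():
        p.requires_grad = False
    for module in (model.seg_projector, model.seg_encoder, model.seg_decoder):
        for p in module.parameters():
            p.requires_grad = True
    return model


def mask_loss(pred_logits, gt_masks):
    """Pixel-level loss on predicted mask logits; gt_masks are binary float tensors."""
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt_masks)
    probs = pred_logits.sigmoid()
    inter = (probs * gt_masks).sum(dim=(-2, -1))
    union = probs.sum(dim=(-2, -1)) + gt_masks.sum(dim=(-2, -1))
    dice = 1.0 - (2.0 * inter + 1.0) / (union + 1.0)
    return bce + dice.mean()
```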

Case Study