Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support

Citrus Team, JD Health International Inc.
* Project Lead
Work done during an internship with the Citrus Team.

Abstract

Large language models (LLMs), particularly those with reasoning capabilities, have rapidly advanced in recent years, demonstrating significant potential across a wide range of applications. However, their deployment in healthcare, especially in disease reasoning tasks, is hindered by the challenge of acquiring expert-level cognitive data. In this paper, we introduce Citrus, a medical language model that bridges the gap between clinical expertise and AI reasoning by emulating the cognitive processes of medical experts. The model is trained on a large corpus of simulated expert disease reasoning data, synthesized using a novel approach that accurately captures the decision-making pathways of clinicians. This approach enables Citrus to better simulate the complex reasoning processes involved in diagnosing and treating medical conditions. To further address the lack of publicly available datasets for medical reasoning tasks, we release the last-stage training data, including a custom-built medical diagnostic dialogue dataset. This open-source contribution aims to support further research and development in the field. Evaluations on authoritative benchmarks such as MedQA, covering medical reasoning and language understanding tasks, show that Citrus outperforms other models of similar size. These results highlight Citrus's potential to significantly enhance medical decision support systems, providing a more accurate and efficient tool for clinical decision-making.

Contributions

1. We propose a training-free reasoning approach that emulates the cognitive processes of medical experts, enabling large language models to enhance their medical capabilities in clinical diagnosis and treatment.

2. In conjunction with the data construction method, we introduce a multi-stage post-training approach to further improve the model’s medical performance.

3. We have made the Citrus model and its training data publicly available as open-source resources to advance research in AI-driven medical decision-making.

4. We have developed and open-sourced a large-scale, updatable clinical practice evaluation dataset built from real-world data, accurately reflecting the patient distribution seen in clinical settings.


Model Access

| Model Name          | Backbone  | Link       |
|---------------------|-----------|------------|
| Citrus1.0-Llama-70B | Llama-70B | Model Link |
| Citrus1.0-Qwen-72B  | Qwen-72B  | Model Link |
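
As a usage sketch (not official loading code), the released checkpoints should load through the standard Hugging Face transformers API. The repository ID below is hypothetical, standing in for the table's "Model Link" entries.

```python
# Loading sketch using the Hugging Face transformers API. The repo ID
# is hypothetical -- substitute the actual ID behind "Model Link".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jdh-algo/Citrus1.0-Llama-70B"  # hypothetical repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype stored with the weights
    device_map="auto",    # shard the 70B model across available GPUs
)

messages = [{
    "role": "user",
    "content": "A 54-year-old presents with chest pain radiating to the "
               "left arm and diaphoresis. What are the leading differentials?",
}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```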

Data Access

| Dataset   | Usage      | Description | Link |
|-----------|------------|-------------|------|
| Citrus_S3 | Train Data | A 20k-record subset of the model's training data. | Data Link |
| JMED      | Test Data  | Anonymized doctor-patient dialogues from JD Health Internet Hospital, filtered to retain consultations that follow standardized diagnostic workflows. The initial release contains 1,000 high-quality clinical records spanning all age groups (0-90 years) and multiple specialties. | Data Link |
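
A minimal loading sketch with the Hugging Face datasets library; the hub paths and split names below are assumptions, since the table links rather than names the hosted repositories.

```python
# Loading sketch with the `datasets` library. The hub paths and split
# names are assumptions -- use the IDs behind the "Data Link" entries.
from datasets import load_dataset

citrus_s3 = load_dataset("jdh-algo/Citrus_S3", split="train")  # hypothetical path
jmed = load_dataset("jdh-algo/JMED", split="test")             # hypothetical path

print(citrus_s3[0])  # inspect one of the 20k training records
print(jmed[0])       # inspect one of the 1,000 clinical consultations
```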

Method

1. Main approaches

LLMs can follow cognitive pathways similar to those of medical experts: CPT enables them to acquire medical knowledge and perform pattern recognition as doctors do, while SFT and RL train them to carry out hypothetical-deductive reasoning as a sequence of explicit reasoning steps, sketched below.
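
The following Python sketch illustrates the hypothetical-deductive loop described above. It is an interpretation, not the Citrus inference code; `ask_llm`, the prompts, and the round count are all placeholders.

```python
# Illustrative sketch of the hypothetical-deductive reasoning loop,
# not the actual Citrus inference code. `ask_llm` is a placeholder
# for any chat-completion call (e.g., to a Citrus checkpoint).
def ask_llm(prompt: str) -> str:
    """Stand-in for a chat-completion request to the model."""
    raise NotImplementedError("wire this to your inference backend")


def hypothetical_deductive_diagnosis(case: str, max_rounds: int = 3) -> str:
    # Step 1: pattern recognition -- draft an initial differential,
    # drawing on the medical knowledge acquired during CPT.
    differential = ask_llm(
        f"Patient presentation:\n{case}\n"
        "List the most likely diagnoses with a brief rationale for each."
    )
    # Step 2: hypothesis testing -- deduce which evidence would best
    # discriminate between hypotheses, then revise the differential.
    for _ in range(max_rounds):
        evidence_plan = ask_llm(
            f"Current differential:\n{differential}\n"
            "Which findings, tests, or history items would best confirm "
            "or refute each hypothesis?"
        )
        differential = ask_llm(
            f"Case:\n{case}\nCurrent differential:\n{differential}\n"
            f"Discriminating evidence to weigh:\n{evidence_plan}\n"
            "Revise and re-rank the differential accordingly."
        )
    # Step 3: commit to a leading diagnosis with next steps.
    return ask_llm(
        f"Case:\n{case}\nFinal differential:\n{differential}\n"
        "State the leading diagnosis and the recommended next steps."
    )
```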



2. Overview of training stages and training data pipeline

The training process consists of three stages: CPT, SFT, and RL. For each stage we describe the training objective and dataset scale, together with the data pipeline used to construct that stage's data; a schematic sketch follows.
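
A schematic view of the three-stage recipe, in line with the description above. The stage purposes paraphrase the text; the dataset descriptions (aside from the released Citrus_S3, 20k records) and the `train_stage` entry point are placeholders, not the actual training harness.

```python
# Schematic view of the three-stage post-training recipe. Stage names
# follow the text; dataset descriptions (other than Citrus_S3) and
# train_stage() are placeholders, not the actual training harness.
TRAINING_STAGES = [
    {"stage": "CPT",
     "purpose": "inject medical knowledge and doctor-like pattern recognition",
     "data": "large-scale medical corpus (scale given in the paper)"},
    {"stage": "SFT",
     "purpose": "teach explicit expert reasoning steps",
     "data": "synthesized expert reasoning data, incl. Citrus_S3 (20k records)"},
    {"stage": "RL",
     "purpose": "reinforce sound hypothetical-deductive conclusions",
     "data": "reward/preference signals over model outputs"},
]


def train_stage(model, cfg):
    """Placeholder: dispatch to the CPT/SFT/RL trainer for one stage."""
    print(f"[{cfg['stage']}] {cfg['purpose']} -- data: {cfg['data']}")
    return model


def run_pipeline(base_model):
    """Apply the stages in order, each consuming its own data pipeline."""
    model = base_model
    for cfg in TRAINING_STAGES:
        model = train_stage(model, cfg)
    return model
```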


Results

1. Main Results on Medical Benchmarks



2. Experiments on Citrus1.0-Llama-70B



BibTeX

        
@misc{wang2025citrus,
    title={Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support}, 
    author={Guoxin Wang and Minyu Gao and Shuai Yang and Ya Zhang and Lizhi He and Liang Huang and Hanlin Xiao and Yexuan Zhang and Wanyue Li and Lu Chen and Jintao Fei and Xin Li},
    year={2025},
    eprint={2502.18274},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2502.18274},
}