Chinese Medical Dataset
Comprehensive collection of Chinese medical datasets for AI research
README
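[Updating] A Detailed Compilation of Chinese Medical Datasets
Author: mzcai@ir.hit.edu.cn

Chinese Medical Datasets
1. [Classification & Medical QA] CMB Chinese-Medical-Benchmark
   1.1. CMB dataset summary
   1.2. Data examples
      1.2.1. CMB-Exam data example
      1.2.2. CMB-Clin data example
2. [Medical QA] Huatuo-26M
   2.1. Huatuo dataset summary
   2.2. Data examples
      2.2.1. Online medical encyclopedia data example
      2.2.2. Medical knowledge graph
      2.2.3. Public online medical Q&A forums
      2.2.4. Huatuo test set data example
3. [Entity Recognition & Attribute Extraction] Yidu-S4K
   3.1. Dataset summary
   3.2. Task description
   3.3. Data examples
      3.3.1. Medical named entity recognition data example
      3.3.2. Medical entity and attribute extraction (cross-hospital transfer) data example
4. [Terminology Normalization] Yidu-N7K
   4.1. Dataset summary
   4.2. Task description
   4.3. Data examples
5. [Medical QA] cMedQA2
   5.1. Dataset summary
   5.2. Data examples
      5.2.1. questions.csv data example
      5.2.2. answer.csv data example…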
Source attribution
- Awesome AI for Science — github.com/mengqi97/chinese-medical-dataset
- GitHub — github.com/mengqi97/chinese-medical-dataset
Related resources
- Equivariant graph attention Transformer (ICLR 2023)
- Therapeutics Data Commons: 66 AI-ready datasets across 22 drug discovery tasks with 29 leaderboards, covering target identification, molecular generation, ADMET prediction, and clinical trial outcomes (Harvard MIMS, NeurIPS 2021/2024)
- Large-scale benchmark suite for protein fitness prediction and design, aggregating 200+ deep mutational scanning assays and clinical variant datasets across diverse protein families and taxa, with standardized zero-shot and supervised leaderboards for variant effect prediction, mutation effect prediction, and protein language model evaluation (OATML & Marks Lab, NeurIPS 2023 Spotlight, Datasets & Benchmarks)
- Curated open dataset collection of 602M+ observational and perturbational single-cell profiles for accelerating virtual cell model creation, integrating Tahoe-100M and scBaseCount data with Google Cloud Marketplace distribution (Arc Institute, 2025-2026)