Faculty & Researchers
Conghui He (何聪辉)
Young Scientist and Project Investigator (PI) at Shanghai Artificial Intelligence Laboratory
Previously Senior Researcher at WeChat (developed Plato).
Ph.D. from Tsinghua University (2013-2018), B.S. from Sun Yat-sen University (2009-2013).
Email: heconghui@pjlab.org.cn
Research Interests
High-Performance Computing, Computer Vision, Large Language Models, Data-Centric AI, Pre-training Data Preparation, Multimodal Learning
Awards
- 2023 SenseTime Award (1st of 100 teams)
- 2021 SenseTime Outstanding Team Award (top 10 of 200 teams)
- 2019 Tencent Technology Breakthrough Award, Gold (1st of 50 teams)
- 2018 Outstanding Doctoral Graduate Award
- 2017 ACM Gordon Bell Prize (Highest honor in HPC applications)
- 2013 IEEE-IBM Smarter Planet Challenge Global Winner (Team Leader, 1/54)
Key Research, Projects & Reports
- OpenDataLab: An open platform with 7700+ datasets, serving 40k+ developers.
- MinerU: A one-stop, open-source, high-quality data extraction tool for PDF, web, and e-books.
- InternLM: Series of 7B and 20B foundation and chat models.
- PDF-Extract-Kit: A comprehensive library for high-quality PDF content extraction.
- Report: "A Conversation with Conghui He of Shanghai AI Lab: The Importance of Data as Seen from DeepSeek, Achieving Outsized Results at Low Cost"
- Report: "China's Supercomputer Wins the Gordon Bell Prize Again; the Results Offer Lessons for Earthquake Prediction Research"
Selected Publications
- Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching, CVPR 2025
- OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations, CVPR 2025
- GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training, ICLR 2025
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text, ICLR 2025
- MMBench: Is Your Multi-Modal Model an All-Around Player? ECCV 2024
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions, ECCV 2024
- 18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight: enabling depiction of 18-Hz and 8-meter scenarios, SC 2017
Lijun Wu (吴郦军)
Young Scientist, Shanghai Artificial Intelligence Laboratory.
Formerly Research Scientist at ByteDance and Senior Researcher at Microsoft Research Asia (MSRA).
Email: lijun_wu@outlook.com
Research Interests
LLM (post-training, RLHF), Synthetic Data Optimization, AI4Science (LLM4Science, Drug Discovery)
Awards
- 2013 IEEE-IBM Smarter Planet Challenge Global Winner
- 2018 MSRA Ph.D. Fellowship
- 2019 WMT Global Machine Translation Competition - 8 Track Championships
- 2021 OGB-LSC@KDD Cup - Runner up
- 2024 ACL Language + Molecule - 1st and 2nd place in two tracks
Key Research, Projects & Reports
- Report: "2018: The World's First Chinese-English Machine Translation System to Achieve Human Parity"
- Report: "A study of reinforcement learning for neural machine translation"
- Report: "WMT 2019 International Machine Translation Competition: Microsoft Research Asia Takes the Championship with 8 First-Place Finishes"
- Report: "R-Drop: A simple and effective regularization method to correct the defects of Dropout"
- Report: "The Latest Survey on Non-Autoregressive Generation: Nearly 200 References Reveal Challenges and Future Directions"
- Report: "A 230-Page Study Covering 5 Major Scientific Fields: The Microsoft Team Uses GPT-4 to Explore the Impact of LLMs on Scientific Discovery"
- Report: "Accelerating Drug Discovery: TamGen, a Target-Aware Molecule Generator Based on Generative AI"
- Report: "NatureLM: A Cross-Domain AI Foundation Model Driving Scientific Discovery and Innovation"
Selected Publications
- Nature Language Model: Deciphering the Language of Nature for Scientific Discovery, arXiv 2025
- 3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization, ICLR 2025
- FABind+: Enhancing Molecular Docking through Improved Pocket Prediction and Pose Generation, KDD 2025
- Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey, arXiv 2024
- Target-aware Molecule Generation for Drug Design Using a Chemical Language Model, Nature Communications 2024
- Randomness Regularization with Simple Consistency Training for Neural Networks, TPAMI 2024
- A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond, TPAMI 2024
- BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations, EMNLP 2023
- FABind: Fast and Accurate Protein-Ligand Binding, NeurIPS 2023
- Unified 2D and 3D Pre-Training of Molecular Representations, KDD 2022
- R-Drop: Regularized Dropout for Neural Networks, NeurIPS 2021
- Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection, NeurIPS 2020
- Incorporating BERT into Neural Machine Translation, ICLR 2020
- Exploiting Monolingual Data at Scale for Neural Machine Translation, EMNLP 2019
- A Study of Reinforcement Learning for Neural Machine Translation, EMNLP 2018
Bin Wang (王斌)
Young Scientist, Shanghai Artificial Intelligence Laboratory.
Ph.D. from University of Chinese Academy of Sciences (UCAS Scholar).
Algorithm Lead for MinerU project.
Email: wangbin@pjlab.org.cn
Research Interests
Intelligent Document Parsing and Understanding, Data Autonomous Iteration Agents, Multimodal Large Models
Awards
- ImageNet Large Scale Visual Recognition Challenge (ILSVRC2016 VID) - 3rd Place Globally
- UCAS Zhu Li Yuehua Excellent Doctoral Scholarship
Key Research & Projects
- MinerU: https://github.com/opendatalab/MinerU
- PDF-Extract-Kit: https://github.com/opendatalab/PDF-Extract-Kit
- DocLayout-YOLO: https://github.com/opendatalab/DocLayout-YOLO
- OmniDocBench: https://github.com/opendatalab/OmniDocBench
Research Focus Areas
- Intelligent Document Parsing & Understanding: Developing practical algorithms for layout detection, table recognition, chemical element recognition, geometric parsing, etc., for RAG and AI4S.
- Multimodal Large Models: Focusing on vertical domain multimodal large models using data-centric algorithms, generative models, and reinforcement learning to address OOD problems.
- Data Autonomous Iteration Agents: Using agent technology to automate data iteration processes (quality improvement, distribution balancing, safety validation) for efficient AI model training.
Selected Publications
- Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching, CVPR 2025
- OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations, CVPR 2025
- GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training, ICLR 2025
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text, ICLR 2025
- MinerU: An Open-Source Solution for Precise Document Content Extraction, arXiv 2024
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD, NeurIPS 2024
- Parrot Captions Teach CLIP to Spot Text, ECCV 2024
- VIGC: Visual Instruction Generation and Correction, AAAI 2024
Jiang Wu (吴江)
Young Scientist, Shanghai Artificial Intelligence Laboratory.
B.S. and Ph.D. from Tsinghua University.
Email: wujiang@pjlab.org.cn
Research Interests
Large Language Models, Multimodal Large Models, Intelligent Document Parsing and Understanding
Awards
- Led the development of an industry-leading satellite imagery analysis system, setting new technical benchmarks, and deployed it in multiple satellite and surveying centers.
Selected Publications
- Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations, ACL 2024
- VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis, AAAI 2025
- Utilize the Flow Before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning, AAAI 2025
- GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation, NAACL 2025 Findings
- OpenHuEval: Evaluating Large Language Model on Hungarian Specifics, arXiv 2025
- PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model, arXiv 2025
Jiantao Qiu (邱剑涛)
Young Researcher, Shanghai Artificial Intelligence Laboratory.
B.S. and Ph.D. in Electronic Engineering from Tsinghua University.
Email: qiujiantao@pjlab.org.cn
Research Interests
Large Language Model Datasets, HTML Document Understanding, Energy-Efficient Neural Network Accelerator Design, Multi-Machine Collaborative System Design
Awards
- AI 2000 Most Influential Scholar Award, Honorable Mention in AAAI/IJCAI (2023, for work on FPGAs); top 3 rising star in FPGA research
Selected Publications
- Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA 2016
Wentao Zhang (张文涛)
Assistant Professor, Researcher, and Doctoral Supervisor at the International Machine Learning Research Center, Peking University.
Research Consultant at Shanghai Artificial Intelligence Laboratory.
Formerly at Tencent Machine Learning Platform Department, Apple AIML, and Mila - Quebec AI Institute.
Email: zhangwentao1@pjlab.org.cn
Research Interests
Data-centric machine learning and large model data governance.
Awards
- WWW'22 Best Student Paper Award (1/1822)
- AP-Web'23 Best Paper Runner Up Award
- CIKM'24 Best Student Full Paper Award (1/1496)
- Apple Scholar (2021, sole recipient in Asia-Pacific)
- World Artificial Intelligence Conference (WAIC) Yunfan Award (1 of 15 globally)
- Peking University/Beijing Municipal/Chinese Association for Artificial Intelligence Excellent Doctoral Dissertation Award, 2023
- Peking University "Weiming Young Scholar", 2024
- World Internet Conference Leading Scientific and Technological Achievement Award, 2024
- Huawei Spark Award, 2024
- Chinese Institute of Electronics Science and Technology Progress Award (First Prize), 2023
Key Research, Projects & Reports
- Angel: a high-performance distributed machine learning and graph computing platform, jointly designed by Tencent and PKU.
- SGL: a scalable graph learning toolkit for extremely large graph datasets.
- MindWare: a powerful AutoML system, which automates feature engineering, algorithm selection and hyperparameter tuning.
- OpenBox: an efficient open-source system designed for solving generalized black-box optimization (BBO) problems.
Selected Publications
- PAS: Data-Efficient Plug-and-Play Prompt Augmentation System, ICDE 2025
- DataSculpt: Crafting Data Landscapes for Long-Context LLMs through Multi-Objective Partitioning, ICDE 2025
- Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning, ICLR 2025
- Towards Precise Scaling Laws for Video Diffusion Transformers, CVPR 2025
- Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models, NeurIPS 2024
- Physics-guided Active Sample Reweighting for Urban Flow Prediction, CIKM 2024 Best Student Full Paper
- PaSca: a Graph Neural Architecture Search System under the Scalable Paradigm, WWW 2022 Best Student Paper
- RIM: Reliable Influence-based Active Learning on Graphs, NeurIPS 2021 Spotlight
Weijia Li (李唯嘉)
Associate Professor ("Hundred Talents Program"), Sun Yat-sen University.
Research Consultant, Shanghai Artificial Intelligence Laboratory.
B.S. from Sun Yat-sen University, Ph.D. from Tsinghua University, Postdoc at MMLab, CUHK.
Email: liweijia@pjlab.org.cn
Research Interests
Multimodal Large Models, Image Generation, Synthetic Data Detection, AI4Earth
Key Research, Projects & Reports
- Report: "Has GPT-4o's Image Generation Architecture Been 'Cracked'? An Autoregressive Backbone with a Diffusion Decoder, Plus a Comprehensive Benchmark for 4o Image Generation"
- Report: "ICLR 2025 Spotlight | Synthetic Data Camouflage vs. the Sharp Eyes of Large Models: SYSU & Shanghai AI Lab Propose LOKI, a Benchmark for Synthetic Data Detection"
- Report: "Benchmarking Embodied Large Models in Complex Urban Environments! UrBench: A Comprehensive Benchmark for Evaluating Multimodal Large Models in Multi-View Urban Scenarios"
- Report: "Paper Review | CVPR 2024 | Combining Satellite and Street-View Imagery for Fine-Grained Building Attribute Segmentation, Selected as a Highlight!"
Selected Publications
- LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models (ICLR 2025, Spotlight)
- UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios (AAAI 2025)
- GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT-4o in Image Generation (arXiv 2025)
- LEGION: Learning to Ground and Explain for Synthetic Image Detection (arXiv 2025)
- Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation (arXiv 2025)
- SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation (CVPR 2024 Highlight)
- 3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions (CVPR 2024)
- Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network (ECCV 2024)
- OmniCity: Omnipotent City Understanding with Multi-Level and Multi-View Images (CVPR 2023)
Joint Training Students
Hengrui Kang (康恒锐)
Joint Ph.D. Program: Shanghai Jiao Tong University (SJTU) & Shanghai AI Laboratory
Undergraduate university: University of Electronic Science and Technology of China (UESTC)
Grade: 2nd-year Ph.D. student (Starting Sep 2024)
Research Interests: Synthetic data detection, intelligent document parsing and generation
Jiahe Song (宋家和)
Joint Ph.D. Program: School of AI, Shanghai Jiao Tong University (SJTU) & Shanghai AI Laboratory
Undergraduate university: Peking University
Grade: 1st-year Ph.D. student (Starting Sep 2025)
Research Interests: Multimodal large models, AI for science
Honglin Lin (林泓霖)
Joint Ph.D. Program: School of AI, Shanghai Jiao Tong University (SJTU) & Shanghai AI Laboratory
Undergraduate university: Beijing University of Posts and Telecommunications (BUPT)
Grade: 1st-year Ph.D. student (Starting Sep 2025)
Research Interests: Mathematical reasoning in large models, data synthesis, etc.
Junbo Niu (牛俊博)
Joint Ph.D. Program: Peking University (PKU) & Shanghai AI Laboratory
Undergraduate university: Beihang University
Grade: 1st-year Ph.D. student (Starting Sep 2025)
Research Interests: Multimodal Understanding (Video Understanding, OCR) & Data-Centric Machine Learning
Xin Gao (高鑫)
Joint Ph.D. Program: Shanghai Jiao Tong University (SJTU) & Shanghai AI Laboratory
Undergraduate university: University of Electronic Science and Technology of China (UESTC)
Grade: 1st-year Ph.D. student (Starting Sep 2025)
Research Interests: Data synthesis, evaluation, and filtering for large models, etc.
Yu Li (李宇)
Joint Ph.D. Program: University of Science and Technology of China (USTC) & Shanghai AI Laboratory
Undergraduate university: Wuhan University
Grade: 1st-year Ph.D. student (Starting Sep 2025)
Research Interests: Logical reasoning in large models, data synthesis, etc.
Zichen Wen (温子辰)
Joint Ph.D. Program: Shanghai AI Laboratory & Shanghai Jiao Tong University (SJTU)
Undergraduate university: University of Electronic Science and Technology of China (UESTC)
Grade: 1st-year Ph.D. student (Starting Sep 2025)
Research Interests: Efficient AI (including Lightweight and Efficient Large Models for Language/Multimodality, and Data-Efficient Artificial Intelligence)
Zhanping Zhong (钟展平)
Joint Ph.D. Program: Shanghai Jiao Tong University (SJTU) & Shanghai AI Laboratory
Undergraduate university: Beihang University
Grade: 1st-year Ph.D. student (Starting Sep 2025)
Research Interests: LLM agent, data synthesis, data selection, etc.
Xiaoran Shang (尚萧然)
Joint Ph.D. Program: University of Science and Technology of China (USTC) & Shanghai AI Laboratory
Undergraduate university: Wuhan University
Grade: Incoming Ph.D. student (Starting Sep 2026)
Research Interests: Multimodal Large Models, data synthesis, data selection, etc.
Shoupeng Wang (王首鹏)
Joint Ph.D. Program: University of Science and Technology of China (USTC) & Shanghai AI Laboratory
Undergraduate university: Wuhan University
Grade: Incoming Ph.D. student (Starting Sep 2026)
Hejun Dong (董和军)
Joint Ph.D. Program: The Chinese University of Hong Kong & Shanghai AI Laboratory
Undergraduate university: Beihang University
Grade: Incoming Ph.D. student (Starting Sep 2026)
Jie Yang (杨杰)
Joint Ph.D. Program: Shanghai Jiao Tong University (SJTU) & Shanghai AI Laboratory
Undergraduate university: Wuhan University
Grade: Incoming Ph.D. student (Starting Sep 2026)
Chuang Wang (王闯)
Joint Ph.D. Program: Shanghai Jiao Tong University (SJTU) & Shanghai AI Laboratory
Undergraduate university: Beihang University
Grade: Incoming Ph.D. student (Starting Sep 2026)
Research Interests: Multimodal large models, AI for science
Profile: https://chuangwang123.github.io/
Jutao Xiao (肖举涛)
Joint Ph.D. Program: Shanghai Jiao Tong University (SJTU) & Shanghai AI Laboratory
Undergraduate university: Northeastern University
Grade: Incoming Ph.D. student (Starting Sep 2026)