OpenDataLab provides a comprehensive suite of resources for researchers, practitioners, and students interested in data-centric AI. Below are the datasets, platforms, and tools we offer to support AI research and development.

Datasets

OmniDocBench

A high-quality multi-source evaluation benchmark that pioneers a new paradigm for document parsing assessment

Access Dataset

Wanjuan Silkroad

The first large-scale multilingual corpus covering mainstream modalities

Access Dataset

Shusheng Wanjuan

Shanghai AI Lab's first open-source high-quality multimodal pretraining corpus for large models

Access Dataset

Platforms & Tools

OpenDataLab

China's most influential data platform for large models in terms of volume and data scale

Visit Platform

MinerU

A document corpus production engine for the large model era

Explore Details

LabelU

A flexible annotation tool that supports multiple data formats and can be freely configured and combined

Explore Details

LabelLLM

A well-known open-source data annotation platform

Explore Details

More Resources

For additional resources and tools, please visit OpenDataLab's Open Source Tools platform, which provides a comprehensive collection of AI development resources.