Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations

Jiaxing Sun1*, Weiquan Huang2*, Jiang Wu3*†, Chenya Gu3, Wei Li3,
Songyang Zhang3, Hang Yan3, Conghui He3‡
1 Wuhan University  2 Tongji University  3 Shanghai AI Laboratory
* Equal Contribution  † Project Lead  ‡ Corresponding Author


Figure: Construction of CHARM.

Comparison of commonsense reasoning benchmarks.

| Benchmarks | CN-Lang | CSR | CN-specific | Dual-Domain | Rea-Mem |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Most benchmarks in the survey of Davis (2023) | ✘ | ✔ | ✘ | ✘ | ✘ |
| XNLI, XCOPA, XStoryCloze | ✔ | ✔ | ✘ | ✘ | ✘ |
| LogiQA, CLUE, CMMLU | ✔ | ✘ | ✔ | ✘ | ✘ |
| CORECODE | ✔ | ✔ | ✘ | ✘ | ✘ |
| CHARM (ours) | ✔ | ✔ | ✔ | ✔ | ✔ |

"CN-Lang" indicates the benchmark is presented in Chinese language. "CSR" means the benchmark is designed to focus on CommonSense Reasoning. "CN-specific" indicates the benchmark includes elements that are unique to Chinese culture, language, regional characteristics, history, etc. "Dual-Domain" indicates the benchmark encompasses both Chinese-specific and global domain tasks, with questions presented in the similar style and format. "Rea-Mem" indicates the benchmark includes closely-interconnected reasoning and memorization tasks.

✨ CHARM

CHARM is the first benchmark for comprehensively and deeply evaluating the commonsense reasoning ability of large language models (LLMs) in Chinese, covering both globally known and Chinese-specific commonsense. In addition, CHARM can evaluate LLMs' memorization-independent reasoning abilities and analyze their typical errors.
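CHARM's reasoning questions are multiple-choice (2-, 3-, or 4-option MCQs; see the task list below), so a basic evaluation loop only needs to prompt a model and compare the selected option letter against the gold answer. The Python sketch below illustrates that loop under an assumed item schema (`question`, `choices`, `answer`) and a `dummy_model` stand-in; it is an illustration, not the benchmark's official evaluation pipeline.

```python
import re

# Assumed item schema for illustration: {"question": str, "choices": [str, ...], "answer": "A"/"B"/...}
LETTERS = "ABCD"  # CHARM reasoning questions have 2 to 4 options


def build_prompt(item):
    """Format an MCQ item into a single prompt string."""
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(item["choices"]))
    # Instruction means "Please answer with the option letter only."
    return f"{item['question']}\n{options}\n请直接回答选项字母。"


def extract_choice(response):
    """Pull the first standalone option letter out of a free-form model response."""
    match = re.search(r"\b([A-D])\b", response.upper())
    return match.group(1) if match else None


def evaluate(items, query_model):
    """Accuracy of `query_model` (any callable: prompt str -> response str) on MCQ items."""
    correct = sum(extract_choice(query_model(build_prompt(it))) == it["answer"] for it in items)
    return correct / len(items)


if __name__ == "__main__":
    # Toy item in the assumed schema; real CHARM items would be loaded from the released data.
    toy = {
        "question": "下列哪位人物生活在唐朝？",  # "Which of these figures lived in the Tang dynasty?"
        "choices": ["李白", "苏轼", "曹雪芹", "鲁迅"],
        "answer": "A",
    }
    dummy_model = lambda prompt: "答案是 A"  # stand-in for a real LLM call
    print(f"accuracy: {evaluate([toy], dummy_model):.3f}")  # 1.000
```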

📖 Commonsense Domain

🌐 Global commonsense domain

The global commonsense domain consists of universally understood commonsense: objects and aspects of modern life that an individual should be aware of, and foundational knowledge that anyone with a basic modern education is expected to know. Any individuals it involves are globally recognized figures.

🚩 Chinese commonsense domain

The Chinese commonsense domain encompasses Chinese-specific elements, which we categorize into 7 aspects (abbreviations are used in the task list below):

- History (H)
- Traditional Culture and Arts (CA)
- Daily Life and Customs (LC)
- Entertainment (E)
- Public Figures (F)
- Geography (G)
- Chinese Language (L)

📋 Task List

Overview of CHARM

| Task Type | Task | Domain | Chinese Aspects | Question Type | # Questions |
| :--- | :--- | :--- | :--- | :--- | ---: |
| Reasoning | Anachronisms Judgment (AJ) | Chinese | H, CA, LC, F | 2-option MCQ | 150 |
| Reasoning | Anachronisms Judgment (AJ) | global | - | 2-option MCQ | 150 |
| Reasoning | Time Understanding (TU) | Chinese | H, CA, LC | 4-option MCQ | 100 |
| Reasoning | Sequence Understanding (SqU) | Chinese | H, CA, LC, G, L | 4-option MCQ | 100 |
| Reasoning | Sequence Understanding (SqU) | global | - | 4-option MCQ | 100 |
| Reasoning | Movie and Music Recommendation (MMR) | Chinese | E | 4-option MCQ | 50 |
| Reasoning | Movie and Music Recommendation (MMR) | global | - | 4-option MCQ | 50 |
| Reasoning | Sport Understanding (SpU) | Chinese | F | 2-option MCQ | 200 |
| Reasoning | Sport Understanding (SpU) | global | - | 2-option MCQ | 200 |
| Reasoning | Natural Language Inference (NLI) | Chinese | G, E, L | 3-option MCQ | 100 |
| Reasoning | Natural Language Inference (NLI) | global | - | 3-option MCQ | 100 |
| Reasoning | Reading Comprehension (RC) | Chinese | all 7 aspects | 4-option MCQ | 200 |
| Reasoning | Reading Comprehension (RC) | global | - | 4-option MCQ | 200 |
| Memorization | Anachronisms Judgment (AJ) | Chinese | H, CA, LC, F | Free-form QA | 150 |
| Memorization | Time Understanding (TU) | Chinese | H, CA, LC | Free-form QA | 83 |
| Memorization | Movie and Music Recommendation (MMR) | Chinese | E | Free-form QA | 399 |
| Memorization | Sport Understanding (SpU) | Chinese | F | Free-form QA | 127 |
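Because the memorization tasks probe the same Chinese-specific knowledge (the same AJ, TU, MMR, and SpU families) that the corresponding Chinese reasoning tasks rely on, reasoning failures can be separated from knowledge gaps: a reasoning question is only scored once the model has answered the related memorization questions correctly. The sketch below illustrates this filtering idea; the result dictionaries and the `links` mapping are a hypothetical schema, and the function is a simplified illustration rather than CHARM's exact memorization-independent metric.

```python
def memorization_independent_accuracy(reasoning_results, memorization_results, links):
    """
    Reasoning accuracy restricted to questions whose prerequisite knowledge
    the model has demonstrably memorized.

    reasoning_results:    {reasoning_qid: bool}      correctness on reasoning questions
    memorization_results: {memorization_qid: bool}   correctness on memorization questions
    links:                {reasoning_qid: [memorization_qid, ...]}  which memorization
                          questions probe the knowledge each reasoning question needs
                          (hypothetical linking schema)
    """
    kept = [
        qid for qid, mem_ids in links.items()
        if mem_ids and all(memorization_results.get(m, False) for m in mem_ids)
    ]
    if not kept:
        return None  # the model memorized none of the required knowledge
    return sum(reasoning_results[qid] for qid in kept) / len(kept)


# Toy example: the model knows the facts behind q1 but not q2,
# so only q1 counts toward memorization-independent reasoning.
reasoning = {"q1": True, "q2": False}
memorization = {"m1": True, "m2": True, "m3": False}
links = {"q1": ["m1", "m2"], "q2": ["m3"]}
print(memorization_independent_accuracy(reasoning, memorization, links))  # 1.0
```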

🖊️ Citation

@misc{sun2024benchmarking,
      title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations}, 
      author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
      year={2024},
      eprint={2403.14112},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}