"CN-Lang" indicates the benchmark is presented in the Chinese
language. "CSR" means the benchmark is designed to focus on CommonSense
Reasoning. "CN-specific" indicates the benchmark includes elements that are unique to
Chinese culture, language, regional characteristics, history, etc. "Dual-Domain" indicates the benchmark
encompasses both Chinese-specific and global domain tasks, with questions presented in a similar style and
format. "Rea-Mem" indicates the benchmark includes closely-interconnected reasoning and
memorization tasks.
✨CHARM
CHARM is the first benchmark for comprehensive, in-depth evaluation of the commonsense
reasoning ability of large language models (LLMs) in Chinese, covering both globally known and
Chinese-specific commonsense. In addition, CHARM can evaluate LLMs' memorization-independent
reasoning abilities and analyze their typical errors.
📖 Commonsense Domain
🌐 Global commonsense domain
The global commonsense domain consists of universally understood commonsense. It covers objects and aspects
of modern life that an individual should be aware of. It includes foundational knowledge that someone
with a basic modern education is expected to know. When it involves individuals, they are globally
recognized figures.
🚩 Chinese commonsense domain
The Chinese commonsense domain encompasses Chinese-specific elements. We categorize them into 7 aspects:
History (H): includes important events and figures in Chinese history, China's
dynasties, and other basic facts and shared knowledge about the history of China.
Traditional Culture and Arts (CA): encompasses Chinese traditional cultural arts,
literary works, and traditional lifestyles.
Daily Life and Customs (LC): includes modern Chinese daily routines, clothing,
food, housing, transportation, festivals, and so on.
Entertainment (E): includes movies, television programs, music, and other
forms of entertainment in modern Chinese daily life.
Public Figures (F): encompasses the public figures well-known in Chinese society.
Geography (G): includes China's geographical distribution, natural landscapes, and
characteristic regional cultures.
Chinese Language (L): includes the fundamentals of the Chinese language, such as
Chinese characters, idioms, and so on.
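As a rough illustration, the seven aspect codes above could be kept in a small lookup table and used to tally questions per aspect. This is only a sketch: the record layout (a dict with an `aspect` field) and the function name `count_by_aspect` are hypothetical and not part of CHARM's released data format.

```python
from collections import Counter

# Aspect codes and names as listed above.
CHINESE_ASPECTS = {
    "H": "History",
    "CA": "Traditional Culture and Arts",
    "LC": "Daily Life and Customs",
    "E": "Entertainment",
    "F": "Public Figures",
    "G": "Geography",
    "L": "Chinese Language",
}


def count_by_aspect(questions):
    """Count questions per aspect name.

    `questions` is a hypothetical list of dicts with an "aspect" key
    holding one of the codes above; unknown codes raise KeyError.
    """
    return Counter(CHINESE_ASPECTS[q["aspect"]] for q in questions)


# Made-up records, for illustration only.
sample = [{"aspect": "H"}, {"aspect": "G"}, {"aspect": "H"}]
counts = count_by_aspect(sample)
```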
📋 Task List
Reasoning Tasks: CHARM consists of 7 reasoning tasks:
Anachronisms Judgment (AJ), Time Understanding (TU), Sequence Understanding (SqU), Movie
and Music Recommendation (MMR), Sport Understanding (SpU), Natural Language Inference (NLI),
and Reading Comprehension (RC).
Memorization Tasks: We selected the reasoning tasks that can be readily associated with
memorization questions, namely AJ, TU, MMR, and SpU. We refer to these as the
Memorization-Reasoning-Interconnected (MRI) tasks and built the related memorization questions for them.
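The task list can be summarized in a few lines of Python. This is a minimal sketch for orientation only: the constant and function names are hypothetical and do not reflect how the CHARM repository organizes its tasks.

```python
# The 7 reasoning tasks, keyed by the abbreviations used above.
REASONING_TASKS = {
    "AJ": "Anachronisms Judgment",
    "TU": "Time Understanding",
    "SqU": "Sequence Understanding",
    "MMR": "Movie and Music Recommendation",
    "SpU": "Sport Understanding",
    "NLI": "Natural Language Inference",
    "RC": "Reading Comprehension",
}

# Memorization-Reasoning-Interconnected (MRI) tasks: the reasoning tasks
# that also have related memorization questions.
MRI_TASKS = ("AJ", "TU", "MMR", "SpU")


def has_memorization_questions(task):
    """Return True if the task is one of the MRI tasks."""
    return task in MRI_TASKS


def reasoning_only_tasks():
    """Return the abbreviations of tasks without memorization questions."""
    return [t for t in REASONING_TASKS if not has_memorization_questions(t)]
```

Under this layout, `reasoning_only_tasks()` returns the three tasks (SqU, NLI, RC) that are evaluated on reasoning alone.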