🏆 CHARM Leaderboard 🏆

Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations.

Evaluation results of LLMs on the CHARM reasoning tasks (accuracy, %). In the table below, the first group of columns (AJ–Avg.) covers the Chinese commonsense domain and the second group covers the global commonsense domain.

| LLM | AJ | TU | SqU | MMR | SpU | NLI | RC | Avg. | AJ | TU | SqU | MMR | SpU | NLI | RC | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5-0125 | 80.0 | 45 | 65 | 40 | 82.0 | 65 | 53.0 | 61.43 | 90.0 | 94 | 80 | 58 | 89.5 | 56 | 48.0 | 73.64 |
| GPT-4o-240513 | 98.67 | 77 | 91 | 80 | 93.5 | 74 | 60.5 | 82.1 | 96.0 | 98 | 98 | 74 | 94.5 | 72 | 65.0 | 85.36 |
| Gemini-1.5-flash | 94.0 | 52 | 84 | 68 | 78.0 | 79 | 67.0 | 74.57 | 89.33 | 98 | 96 | 60 | 90.5 | 70 | 68.5 | 81.76 |
| LLaMA-3-8B | 78.67 | 29 | 58 | 38 | 65.5 | 58 | 47.0 | 53.45 | 83.33 | 85 | 84 | 68 | 78.5 | 63 | 39.5 | 71.62 |
| LLaMA-3-70B | 92.67 | 44 | 82 | 52 | 77.0 | 74 | 65.5 | 69.6 | 84.0 | 97 | 94 | 64 | 83.0 | 66 | 64.5 | 78.93 |
| InternLM2-1.8B | 46.67 | 31 | 33 | 28 | 53.0 | 59 | 26.5 | 39.6 | 37.33 | 51 | 43 | 22 | 52.5 | 65 | 23.5 | 42.05 |
| InternLM2-7B | 79.33 | 43 | 62 | 52 | 78.0 | 76 | 27.0 | 59.62 | 70.67 | 77 | 65 | 48 | 77.0 | 77 | 36.5 | 64.45 |
| InternLM2-20B | 90.67 | 51 | 58 | 46 | 75.0 | 76 | 26.0 | 60.38 | 82.67 | 82 | 74 | 30 | 77.5 | 75 | 27.0 | 64.02 |
| Yi1.5-6B | 88.0 | 35 | 64 | 48 | 75.5 | 70 | 39.0 | 59.93 | 81.33 | 71 | 74 | 60 | 75.0 | 59 | 42.0 | 66.05 |
| Yi1.5-34B | 96.0 | 49 | 87 | 80 | 85.5 | 79 | 44.5 | 74.43 | 86.67 | 91 | 89 | 54 | 88.0 | 73 | 49.5 | 75.88 |
| Qwen1.5-1.8B | 41.33 | 37 | 39 | 40 | 56.0 | 47 | 36.5 | 42.4 | 42.67 | 42 | 45 | 26 | 60.5 | 53 | 32.0 | 43.02 |
| Qwen1.5-7B | 82.0 | 32 | 58 | 56 | 76.0 | 66 | 43.0 | 59.0 | 74.0 | 74 | 74 | 36 | 71.5 | 66 | 40.0 | 62.21 |
| Qwen1.5-14B | 95.33 | 48 | 74 | 60 | 78.5 | 80 | 51.0 | 69.55 | 86.0 | 81 | 84 | 34 | 83.5 | 78 | 50.5 | 71.0 |
| Qwen1.5-72B | 96.67 | 51 | 91 | 78 | 87.5 | 86 | 66.0 | 79.45 | 93.33 | 91 | 95 | 52 | 91.0 | 76 | 72.5 | 81.55 |

We use the prompting strategy that was empirically verified to work best for each model: XLT for English LLMs and ZH-CoT for Chinese-oriented LLMs. The table above reports the accuracy of each LLM on the CHARM reasoning tasks; please see the paper for detailed experimental results.
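
Each Avg. column is the unweighted mean of the seven per-task accuracies in that domain (e.g., GPT-3.5-0125, Chinese domain: (80.0 + 45 + 65 + 40 + 82.0 + 65 + 53.0) / 7 ≈ 61.43). A minimal sketch that reproduces this cell; the helper name `reasoning_average` and the dict layout are illustrative, not part of the CHARM codebase:

    # Sketch: reproduce an Avg. cell as the unweighted mean of the seven task accuracies.
    def reasoning_average(scores):
        """Average accuracy over the seven CHARM reasoning tasks, rounded to 2 decimals."""
        return round(sum(scores) / len(scores), 2)

    # GPT-3.5-0125, Chinese commonsense domain (values copied from the table above)
    gpt35_cn = {"AJ": 80.0, "TU": 45, "SqU": 65, "MMR": 40, "SpU": 82.0, "NLI": 65, "RC": 53.0}
    print(reasoning_average(gpt35_cn.values()))  # 61.43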

🖊️ Citation

    @misc{sun2024benchmarking,
          title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations}, 
          author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
          year={2024},
          eprint={2403.14112},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }