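Columns are grouped by domain: the Chinese commonsense domain (Chinese) and the global commonsense domain (Global).

LLM | AJ (Chinese) | TU (Chinese) | SqU (Chinese) | MMR (Chinese) | SpU (Chinese) | NLI (Chinese) | RC (Chinese) | Avg. (Chinese) | AJ (Global) | TU (Global) | SqU (Global) | MMR (Global) | SpU (Global) | NLI (Global) | RC (Global) | Avg. (Global) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|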
GPT-3.5-0125 | 80.0 | 45 | 65 | 40 | 82.0 | 65 | 53.0 | 61.43 | 90.0 | 94 | 80 | 58 | 89.5 | 56 | 48.0 | 73.64 |
GPT-4o-240513 | 98.67 | 77 | 91 | 80 | 93.5 | 74 | 60.5 | 82.1 | 96.0 | 98 | 98 | 74 | 94.5 | 72 | 65.0 | 85.36 |
Gemini-1.5-flash | 94.0 | 52 | 84 | 68 | 78.0 | 79 | 67.0 | 74.57 | 89.33 | 98 | 96 | 60 | 90.5 | 70 | 68.5 | 81.76 |
LLaMA-3-8B | 78.67 | 29 | 58 | 38 | 65.5 | 58 | 47.0 | 53.45 | 83.33 | 85 | 84 | 68 | 78.5 | 63 | 39.5 | 71.62 |
LLaMA-3-70B | 92.67 | 44 | 82 | 52 | 77.0 | 74 | 65.5 | 69.6 | 84.0 | 97 | 94 | 64 | 83.0 | 66 | 64.5 | 78.93 |
InternLM2-1.8B | 46.67 | 31 | 33 | 28 | 53.0 | 59 | 26.5 | 39.6 | 37.33 | 51 | 43 | 22 | 52.5 | 65 | 23.5 | 42.05 |
InternLM2-7B | 79.33 | 43 | 62 | 52 | 78.0 | 76 | 27.0 | 59.62 | 70.67 | 77 | 65 | 48 | 77.0 | 77 | 36.5 | 64.45 |
InternLM2-20B | 90.67 | 51 | 58 | 46 | 75.0 | 76 | 26.0 | 60.38 | 82.67 | 82 | 74 | 30 | 77.5 | 75 | 27.0 | 64.02 |
Yi1.5-6B | 88.0 | 35 | 64 | 48 | 75.5 | 70 | 39.0 | 59.93 | 81.33 | 71 | 74 | 60 | 75.0 | 59 | 42.0 | 66.05 |
Yi1.5-34B | 96.0 | 49 | 87 | 80 | 85.5 | 79 | 44.5 | 74.43 | 86.67 | 91 | 89 | 54 | 88.0 | 73 | 49.5 | 75.88 |
Qwen1.5-1.8B | 41.33 | 37 | 39 | 40 | 56.0 | 47 | 36.5 | 42.4 | 42.67 | 42 | 45 | 26 | 60.5 | 53 | 32.0 | 43.02 |
Qwen1.5-7B | 82.0 | 32 | 58 | 56 | 76.0 | 66 | 43.0 | 59.0 | 74.0 | 74 | 74 | 36 | 71.5 | 66 | 40.0 | 62.21 |
Qwen1.5-14B | 95.33 | 48 | 74 | 60 | 78.5 | 80 | 51.0 | 69.55 | 86.0 | 81 | 84 | 34 | 83.5 | 78 | 50.5 | 71.0 |
Qwen1.5-72B | 96.67 | 51 | 91 | 78 | 87.5 | 86 | 66.0 | 79.45 | 93.33 | 91 | 95 | 52 | 91.0 | 76 | 72.5 | 81.55 |

We selected the empirically optimal prompt strategy for each model group: XLT for English LLMs and ZH-CoT for Chinese-oriented LLMs. The table above reports the accuracy (%) of each LLM on the CHARM reasoning tasks in the Chinese and global commonsense domains. For detailed experimental results, please refer to the paper.
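
The Avg. columns appear to be the unweighted mean of the seven task accuracies in each domain. The sketch below checks that reading against three rows copied from the Chinese-domain half of the table; the names `chinese_scores`, `reported_avg`, and `domain_average` are illustrative and not part of the CHARM codebase.

```python
# Per-task accuracies copied from the table (Chinese commonsense domain),
# in column order: AJ, TU, SqU, MMR, SpU, NLI, RC.
chinese_scores = {
    "GPT-3.5-0125": [80.0, 45, 65, 40, 82.0, 65, 53.0],
    "GPT-4o-240513": [98.67, 77, 91, 80, 93.5, 74, 60.5],
    "Qwen1.5-72B": [96.67, 51, 91, 78, 87.5, 86, 66.0],
}

# Avg. values as reported in the same rows of the table.
reported_avg = {
    "GPT-3.5-0125": 61.43,
    "GPT-4o-240513": 82.1,
    "Qwen1.5-72B": 79.45,
}


def domain_average(task_scores):
    """Unweighted mean of the task accuracies, rounded to two decimals."""
    return round(sum(task_scores) / len(task_scores), 2)


for model, scores in chinese_scores.items():
    computed = domain_average(scores)
    # Assumption being checked: Avg. is a plain mean over the seven tasks.
    matches = abs(computed - reported_avg[model]) < 0.01
    print(f"{model}: computed={computed}, reported={reported_avg[model]}, match={matches}")
```

The same check can be applied to the global-domain columns by swapping in the last seven task scores and the final Avg. value of each row.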

@misc{sun2024benchmarking,
  title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
  author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
  year={2024},
  eprint={2403.14112},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}