CHARM Leaderboard

Evaluation results on the CHARM reasoning tasks

LLM	Chinese commonsense domain								Global commonsense domain
LLM	AJ	TU	SqU	MMR	SpU	NLI	RC	Avg.	AJ	TU	SqU	MMR	SpU	NLI	RC	Avg.
GPT-3.5-0125	80.0	45	65	40	82.0	65	53.0	61.43	90.0	94	80	58	89.5	56	48.0	73.64
GPT-4o-240513	98.67	77	91	80	93.5	74	60.5	82.1	96.0	98	98	74	94.5	72	65.0	85.36
Gemini-1.5-flash	94.0	52	84	68	78.0	79	67.0	74.57	89.33	98	96	60	90.5	70	68.5	81.76
LLaMA-3-8B	78.67	29	58	38	65.5	58	47.0	53.45	83.33	85	84	68	78.5	63	39.5	71.62
LLaMA-3-70B	92.67	44	82	52	77.0	74	65.5	69.6	84.0	97	94	64	83.0	66	64.5	78.93
InternLM2-1.8B	46.67	31	33	28	53.0	59	26.5	39.6	37.33	51	43	22	52.5	65	23.5	42.05
InternLM2-7B	79.33	43	62	52	78.0	76	27.0	59.62	70.67	77	65	48	77.0	77	36.5	64.45
InternLM2-20B	90.67	51	58	46	75.0	76	26.0	60.38	82.67	82	74	30	77.5	75	27.0	64.02
Yi1.5-6B	88.0	35	64	48	75.5	70	39.0	59.93	81.33	71	74	60	75.0	59	42.0	66.05
Yi1.5-34B	96.0	49	87	80	85.5	79	44.5	74.43	86.67	91	89	54	88.0	73	49.5	75.88
Qwen1.5-1.8B	41.33	37	39	40	56.0	47	36.5	42.4	42.67	42	45	26	60.5	53	32.0	43.02
Qwen1.5-7B	82.0	32	58	56	76.0	66	43.0	59.0	74.0	74	74	36	71.5	66	40.0	62.21
Qwen1.5-14B	95.33	48	74	60	78.5	80	51.0	69.55	86.0	81	84	34	83.5	78	50.5	71.0
Qwen1.5-72B	96.67	51	91	78	87.5	86	66.0	79.45	93.33	91	95	52	91.0	76	72.5	81.55

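For each model we used the prompting strategy that performed best in our experiments: XLT for English LLMs and ZH-CoT for Chinese-oriented LLMs. The table above reports the accuracy (%) of each LLM on the CHARM reasoning tasks. See the paper for detailed experimental results.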
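As a quick sanity check on the table, the sketch below recomputes the Avg. column for two rows copied from it. It assumes that Avg. is the unweighted mean of the seven task accuracies in each domain, rounded to two decimals; that reading is inferred from the numbers rather than stated on this page. The helper names `parse_row` and `recomputed_avg` are hypothetical and exist only for this illustration.

```python
from statistics import mean

# Two rows copied from the table above, tab-separated:
# model name, then 7 task scores + Avg. for the Chinese domain,
# then 7 task scores + Avg. for the global domain.
ROWS = [
    "GPT-3.5-0125\t80.0\t45\t65\t40\t82.0\t65\t53.0\t61.43"
    "\t90.0\t94\t80\t58\t89.5\t56\t48.0\t73.64",
    "Qwen1.5-72B\t96.67\t51\t91\t78\t87.5\t86\t66.0\t79.45"
    "\t93.33\t91\t95\t52\t91.0\t76\t72.5\t81.55",
]


def parse_row(line):
    """Split one table row into (name, chinese_block, global_block).

    Each block is a list of 8 floats: 7 task scores followed by the
    reported Avg. value.
    """
    name, *cells = line.split("\t")
    values = [float(cell) for cell in cells]
    return name, values[:8], values[8:]


def recomputed_avg(block):
    """Unweighted mean of the 7 task scores (assumed definition of Avg.)."""
    return round(mean(block[:7]), 2)


for row in ROWS:
    name, chinese, global_ = parse_row(row)
    print(name,
          "Chinese:", recomputed_avg(chinese), "reported", chinese[7],
          "| Global:", recomputed_avg(global_), "reported", global_[7])
```

For these two rows the recomputed values match the reported Avg. entries to two decimal places, which is consistent with the assumption above.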

🖊️ Citation

    @misc{sun2024benchmarking,
          title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations}, 
          author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
          year={2024},
          eprint={2403.14112},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }
