CHARM Findings

💡 此页面将展示基于CHARM的一些结论和发现，包括提示策略、综合推理与记忆对比，以及独立于记忆的推理。

提示策略

我们选择了5种常用的提示策略，并评估了19种LLMs在CHARM推理任务上的表现。我们评估了当前常用的LLMs，这些可以分为两类：

(1) 7 个国外大语言模型, 包括 GPT-3.5, GPT4, LLaMA-2 (7B,13B,70B), and Vicuna (7B,13B).
(2) 12 国内大语言模型, 包括 ChatGLM3 (6B), Baichuan2 (7B,13B), InternLM2 (7B,20B), Yi (6B,34B), DeepSeek (7B,67B) and Qwen (7B,14B,72B).

提示策略示例

提示策略	描述	示例
Direct	LLM不进行中间推理，而是直接预测答案。	Q：以下陈述是否包含时代错误，请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择？李白用钢笔写诗。选项：(A) 是 (B) 否 A:(A)
ZH-CoT	LLM在给出答案之前会用中文进行中间推理。	Q：以下陈述是否包含时代错误，请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择？李白用钢笔写诗。选项：(A) 是 (B) 否 A：让我们一步一步来思考。这个陈述提到了“李白”，他是中国唐朝时期的诗人。而陈述中提到的“钢笔”是现代设备，因此李白不可能使用钢笔写诗，该陈述包含时代错误。所以答案是(A)。
EN-CoT	针对中文问题，CoT的推理过程是用英语进行的。	Q：以下陈述是否包含时代错误，请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择？李白用钢笔写诗。选项：(A) 是 (B) 否 A：Let's think step by step.This statement mentions "Li Bai", a poet from the Tang Dynasty in China. The "pen" mentioned in the statement is a modern device, so it is impossible for Li Bai to write poetry with a pen. This statement contains errors from the times. So the answer is (A).
Translate-EN	我们使用了DeepL api将我们的基准测试翻译成英文，然后使用英文的CoT进行推理。	Q: Choose the correct option if the following statement contains an anachronism. How would a person with compulsory education and basic common sense choose?Li Bai wrote poetry with a fountain pen. Options:(A) Yes (B) No A: Let's think step by step.The statement mentions "Li Bai", a Chinese poet from the Tang Dynasty. The "fountain pen" mentioned in the statement is a modern device, so Li Bai could not have used a fountain pen to write his poems, and the statement contains an anachronism. The answer is (A).
XLT	模板提示XLT被用来将原始问题转换成英文请求，逐步解决它，并最终格式化输出答案。	I want you to act as a commonsense reasoning expert for Chinese.Request：以下陈述是否包含时代错误，请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择？李白用钢笔写诗。选项：(A) 是 (B) 否 You should retell the request in English. You should do the answer step by step to choose the right answer.You should step-by-step answer the request. You should tell me the answer in this format 'So the answer is'. Request: How would a typical person answer each of the following statements whether it contains an anachronism? Li Bai writes poetry with a pen. Option:(A) Yes (B) No Step-by-step answer: 1.This statement mentions "Li Bai", a poet from the Tang Dynasty in China. 2.The pen mentioned in the statement is a modern device. 3. so, it is impossible for Li Bai to write poetry with a pen. This statement contains errors from the times. So the answer is (A).

我们在CHARM推理任务中测试了19种LLMs和5种提示策略的组合。结果显示在下面的表格中。

在CHARM推理任务中的平均准确率

	提示策略	所有模型平均	中文大模型平均	英文大模型平均
所有常识领域平均	Direct	46.28	48.41	42.64
	ZH-CoT	56.66	62.40	46.81
	EN-CoT	54.46	58.19	48.06
	Translate-EN	53.88	55.51	51.07
	XLT	56.81	59.09	52.90
中国常识领域平均	Direct	45.43	47.76	41.44
	ZH-CoT	56.35	62.23	46.26
	EN-CoT	52.06	56.36	44.68
	Translate-EN	47.25	47.82	46.27
	XLT	53.80	56.63	48.96
全球常识领域平均	Direct	47.13	49.05	43.85
	ZH-CoT	56.96	62.57	47.35
	EN-CoT	56.85	60.01	51.44
	Translate-EN	60.50	63.20	55.87
	XLT	59.82	61.56	56.84

💡 结果显示，LLMs的语言取向和任务的知识领域会影响提示策略的表现，这进一步丰富了先前的研究发现。

从 LLM 维度来看, 不同的 LLMs 显然偏好不同的提示策略：在5种策略中，XLT 对于英文 LLMs 始终表现优秀，而对于面向中文的 LLMs，尽管存在一些复杂性，但 ZH-CoT 通常表现最好。

从常识领域维度来看, 使用英文进行推理的策略（如 XLT、Translate-EN 等）适用于全球常识领域；然而，ZH-CoT 在中国常识领域中的表现通常更好。

这里的结论与前面的研究有所不同 (Huang et al., 2023a, Zhang et al., 2023a, Shi et al., 2022),之前的研究提出，在处理非英文的推理任务时，使用英文比使用题目本身的语言更为有效。

综合推理与记忆的对比

我们评估了在 MRI 任务上，综合推理与记忆之间的相关性。下面是 LLMs 在 4 个 MRI 任务上的平均表现。

💡 如图所示，这19个 LLMs 大致可以被分为三种类型：

类型 I：低记忆能力和低综合推理能力。 我们发现，除了 OpenAI 的 GPT 系列外，所有其他的英文 LLMs 都属于这个类型。
类型 II：高记忆能力和中等综合推理能力。 GPT3.5 和所有规模在 30B 以下的面向中文的 LLMs 都属于这个类型。值得注意的是，一些 LLMs 具有高度的记忆性能，但相对较差的综合推理能力。
类型 III：超高记忆能力和高综合推理能力。 个类别包括 GPT4 和三个规模超过 30B 的面向中文的 LLMs。

独立于记忆的推理

我们提出了两种方法来比较 LLMs 在 MRI 任务上独立于记忆的推理能力： Mono-LLM-Memorization (FRMM) and Memorization-Independent Battles among LLMs (MIB). （关于 FRMM 和 MIB 方法的具体细节，请参阅论文。以下是 FRMM 和 MIB 方法的结果。

Leaderboard on the MRI tasks.

排行	综合推理	独立于记忆的推理
排行	综合推理	FRMM	MIB
1	DeepSeek-67B	Yi-34B (↑3)	GPT-4 (↑2)
2	Qwen-72B	DeepSeek-67B (↓1)	Yi-34B (↑2)
3	GPT-4	GPT-4 (-)	Qwen-72B (↓1)
4	Yi-34B	Qwen-72B (↓2)	DeepSeek-67B (↓3)
5	Qwen-14B	GPT-3.5 (↑2)	GPT-3.5 (↑2)
6	InternLM2-20B	Qwen-14B (↓1)	Qwen-14B (↓1)
7	GPT-3.5	InternLM2-20B (↓1)	InternLM2-20B (↓1)
8	InternLM2-7B	InternLM2-7B (-)	InternLM2-7B (-)
9	DeepSeek-7B	Baichuan2-13B (↑1)	Baichuan2-13B (↑1)
10	Baichuan2-13B	DeepSeek-7B (↓1)	DeepSeek-7B (↓1)
11	Baichuan2-7B	Yi-6B (↑3)	Baichuan2-7B (-)
12	ChatGLM3-6B	ChatGLM3-6B (-)	ChatGLM3-6B (-)
13	Qwen-7B	Baichuan2-7B (↓2)	Qwen-7B (-)
14	Yi-6B	Qwen-7B (↓1)	Yi-6B (-)
15	LLaMA-2-70B	LLaMA-2-13B (↑1)	LLaMA-2-13B (↑1)
16	LLaMA-2-13B	LLaMA-2-70B (↓1)	LLaMA-2-70B (↓1)
17	Vicuna-13B-v1.5	Vicuna-13B-v1.5 (-)	Vicuna-13B-v1.5 (-)
18	Vicuna-7B-v1.5	LLaMA-2-7B (↑1)	Vicuna-7B-v1.5 (-)
19	LLaMA-2-7B	Vicuna-7B-v1.5 (↓1)	LLaMA-2-7B (-)

💡 如果大型语言模型（LLMs）对保留的推理问题提供了错误的回答，这些可以被称为独立于记忆的推理错误。我们通过手动审查推理过程来分析这些错误，并将它们归类为四个主要类型

理解错误：在这种情况下，LLM 无法准确理解问题，包括误解内容，忽视甚至修改前提中的重要信息，以及未能把握问题的核心查询。
知识错误： LLM 在推理过程中融入了不准确的知识。值得强调的是，与推理问题相关的知识片段在之前的相关记忆问题中已经被检查过，而 LLM 正确地回答了这些问题。然而，在推理阶段，LLM 输出的信息是不正确的。
逻辑错误： LLM 犯了逻辑推理错误，比如数学推理错误，无法基于足够的信息得出正确的结论，或者达到正确的结论但输出了错误的选项。
其他错误： 其他零星、罕见的错误。

💡 独立于记忆的推理错误的分布

💡 LLMs 的三种类型的独立于记忆的推理错误示例

🖊️ 引用

    @misc{sun2024benchmarking,
          title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations}, 
          author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
          year={2024},
          eprint={2403.14112},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }

✨ CHARM 的发现 ✨

Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations.