Model | HuWildBench WBScore | HuSimpleQA Acc. | HuProverbRea Acc. (OE) | HuProverbRea Acc. (2CQ) | HuMatchingFIB B acc. | HuMatchingFIB Q acc. | HuStandardFIB B acc. | HuStandardFIB Q acc. |
---|---|---|---|---|---|---|---|---|
GPT-4o | 81.09 | 50.3 | 89.16 | 95.51 | 77.78 | 43.88 | 57.36 | 15.05 |
GPT-4o-mini | 74.19 | 25.56 | 84.67 | 92.16 | 55.68 | 19.78 | 35.08 | 7.53 |
QwQ | 58.02 | 9.09 | 67.49 | 84.23 | 38.65 | 12.23 | 6.05 | 0 |
Deepseek-R1 | 82.96 | 34.58 | 82.29 | 91.72 | 80.87 | 47.12 | 61.76 | 17.2 |
Deepseek-V3 | 78.42 | 32.71 | 83.26 | 92.51 | 68.87 | 39.93 | 51.44 | 9.68 |
Llama-3.1-Instruct-70B | 61.78 | 35.99 | 80.18 | 93.83 | 59.56 | 24.46 | 40.99 | 6.45 |
Llama-3.1-Instruct-8B | 53.62 | 15.2 | 63.35 | 73.48 | 5.74 | 0.72 | 16.64 | 1.08 |
o1-mini | 76.43 | 15.8 | 77.44 | 87.67 | 60.83 | 17.63 | 45.25 | 13.98 |
Qwen2.5-Instruct-72B | 74.05 | 14.9 | 77.8 | 90.22 | 63.8 | 24.1 | 32.32 | 8.6 |
Qwen2.5-Instruct-7B | 42.01 | 5.22 | 50.48 | 67.05 | 31.88 | 1.08 | 7.43 | 0 |
The first-, second-, and third-place results for each metric are marked with red, green, and blue text, respectively. In the FIB task metrics, B denotes blank-level accuracy and Q denotes question-level accuracy; see the sketch below for how the two differ.
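To make the two FIB metrics concrete, here is a minimal Python sketch of one way to compute them, assuming B acc. counts each blank independently while Q acc. credits a question only when every blank in it is correct. The function name `fib_accuracies` and the case-insensitive exact-match rule are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Tuple

def fib_accuracies(predictions: List[List[str]],
                   references: List[List[str]]) -> Tuple[float, float]:
    """Compute blank-level (B acc.) and question-level (Q acc.) accuracy
    for a fill-in-the-blank task. Each question is a list of blanks."""
    total_blanks = correct_blanks = correct_questions = 0
    for pred, ref in zip(predictions, references):
        # Compare each predicted blank to its reference answer.
        hits = [p.strip().lower() == r.strip().lower()
                for p, r in zip(pred, ref)]
        correct_blanks += sum(hits)
        total_blanks += len(ref)  # missing predictions count as wrong
        # A question counts only if every one of its blanks is correct.
        if len(pred) == len(ref) and all(hits):
            correct_questions += 1
    return correct_blanks / total_blanks, correct_questions / len(references)

# Example: 2 questions, 3 blanks total; 2 blanks correct -> B acc. = 2/3,
# but only the first question is fully correct -> Q acc. = 1/2.
preds = [["alma"], ["piros", "zold"]]
refs  = [["alma"], ["piros", "zöld"]]
print(fib_accuracies(preds, refs))  # (0.666..., 0.5)
```

Under this reading, Q acc. is always at most B acc., which matches the table: every model's question-level score is well below its blank-level score.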
```bibtex
@misc{yang2025openhuevalevaluatinglargelanguage,
  title={OpenHuEval: Evaluating Large Language Model on Hungarian Specifics},
  author={Haote Yang and Xingjian Wei and Jiang Wu and Noémi Ligeti-Nagy and Jiaxing Sun and Yinfan Wang and Zijian Győző Yang and Junyuan Gao and Jingchao Wang and Bowen Jiang and Shasha Wang and Nanjun Yu and Zihao Zhang and Shixin Hong and Hongwei Liu and Wei Li and Songyang Zhang and Dahua Lin and Lijun Wu and Gábor Prószéky and Conghui He},
  year={2025},
  eprint={2503.21500},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.21500},
}
```