🏆 OpenHuEval Leaderboard 🏆

Evaluating Large Language Model on Hungarian Specifics

Overall performance of 10 LLMs on OpenHuEval.

| Model | HuWildBench<br>WBScore | HuSimpleQA<br>Acc. | HuProverbRea<br>Acc. (OE) | HuProverbRea<br>Acc. (2CQ) | HuMatchingFIB<br>B acc. | HuMatchingFIB<br>Q acc. | HuStandardFIB<br>B acc. | HuStandardFIB<br>Q acc. |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | *81.09* | **50.3** | **89.16** | **95.51** | *77.78* | *43.88* | *57.36* | *15.05* |
| GPT-4o-mini | 74.19 | 25.56 | *84.67* | 92.16 | 55.68 | 19.78 | 35.08 | 7.53 |
| QwQ | 58.02 | 9.09 | 67.49 | 84.23 | 38.65 | 12.23 | 6.05 | 0 |
| Deepseek-R1 | **82.96** | <ins>34.58</ins> | 82.29 | 91.72 | **80.87** | **47.12** | **61.76** | **17.2** |
| Deepseek-V3 | <ins>78.42</ins> | 32.71 | <ins>83.26</ins> | <ins>92.51</ins> | <ins>68.87</ins> | <ins>39.93</ins> | <ins>51.44</ins> | 9.68 |
| Llama-3.1-Instruct-70B | 61.78 | *35.99* | 80.18 | *93.83* | 59.56 | 24.46 | 40.99 | 6.45 |
| Llama-3.1-Instruct-8B | 53.62 | 15.2 | 63.35 | 73.48 | 5.74 | 0.72 | 16.64 | 1.08 |
| o1-mini | 76.43 | 15.8 | 77.44 | 87.67 | 60.83 | 17.63 | 45.25 | <ins>13.98</ins> |
| Qwen2.5-Instruct-72B | 74.05 | 14.9 | 77.8 | 90.22 | 63.8 | 24.1 | 32.32 | 8.6 |
| Qwen2.5-Instruct-7B | 42.01 | 5.22 | 50.48 | 67.05 | 31.88 | 1.08 | 7.43 | 0 |

The first, second, and third place in each metric are marked in **bold**, *italics*, and <ins>underline</ins>, respectively. In the FIB task metrics, B denotes blank-level accuracy and Q denotes question-level accuracy.
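To make the B/Q distinction concrete, here is a minimal sketch of how the two FIB accuracy levels relate. It assumes exact string matching per blank; the function and sample data are hypothetical illustrations, not the official OpenHuEval scoring code:

```python
def fib_accuracies(questions):
    """Compute blank-level (B) and question-level (Q) accuracy for a
    fill-in-the-blank task. `questions` is a list of (gold, pred) pairs,
    each an equal-length list of per-blank answer strings.
    NOTE: hypothetical illustration, not the official OpenHuEval scorer."""
    total_blanks = correct_blanks = 0
    total_questions = correct_questions = 0
    for gold, pred in questions:
        matches = [g == p for g, p in zip(gold, pred)]
        total_blanks += len(gold)
        correct_blanks += sum(matches)       # B: each blank scored on its own
        total_questions += 1
        correct_questions += all(matches)    # Q: credit only if every blank is right
    return (100 * correct_blanks / total_blanks,
            100 * correct_questions / total_questions)

# Two questions with two blanks each: 3 of 4 blanks correct, but only
# 1 of 2 questions fully correct -> B acc. = 75.0, Q acc. = 50.0.
print(fib_accuracies([
    (["alma", "körte"], ["alma", "körte"]),   # both blanks right
    (["piros", "fehér"], ["piros", "zöld"]),  # second blank wrong
]))
```

Because a question earns Q credit only when every one of its blanks is correct, Q acc. typically falls far below B acc., as every row of the table above shows.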

🖊️ Citation

    @misc{yang2025openhuevalevaluatinglargelanguage,
      title={OpenHuEval: Evaluating Large Language Model on Hungarian Specifics},
      author={Haote Yang and Xingjian Wei and Jiang Wu and Noémi Ligeti-Nagy and Jiaxing Sun and Yinfan Wang and Zijian Győző Yang and Junyuan Gao and Jingchao Wang and Bowen Jiang and Shasha Wang and Nanjun Yu and Zihao Zhang and Shixin Hong and Hongwei Liu and Wei Li and Songyang Zhang and Dahua Lin and Lijun Wu and Gábor Prószéky and Conghui He},
      year={2025},
      eprint={2503.21500},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.21500},
    }