Model | HuWildBench WBScore | HuSimpleQA Acc. | HuProverbRea Acc. (OE) | HuProverbRea Acc. (2CQ) | HuMatchingFIB B acc. | HuMatchingFIB Q acc. | HuStandardFIB B acc. | HuStandardFIB Q acc. |
---|---|---|---|---|---|---|---|---|
GPT-4o | 81.09 | 50.3 | 89.16 | 95.51 | 77.78 | 43.88 | 57.36 | 15.05 |
GPT-4o-mini | 74.19 | 25.56 | 84.67 | 92.16 | 55.68 | 19.78 | 35.08 | 7.53 |
QwQ | 58.02 | 9.09 | 67.49 | 84.23 | 38.65 | 12.23 | 6.05 | 0 |
Deepseek-R1 | 82.96 | 34.58 | 82.29 | 91.72 | 80.87 | 47.12 | 61.76 | 17.2 |
Deepseek-V3 | 78.42 | 32.71 | 83.26 | 92.51 | 68.87 | 39.93 | 51.44 | 9.68 |
Llama-3.1-Instruct-70B | 61.78 | 35.99 | 80.18 | 93.83 | 59.56 | 24.46 | 40.99 | 6.45 |
Llama-3.1-Instruct-8B | 53.62 | 15.2 | 63.35 | 73.48 | 5.74 | 0.72 | 16.64 | 1.08 |
o1-mini | 76.43 | 15.8 | 77.44 | 87.67 | 60.83 | 17.63 | 45.25 | 13.98 |
Qwen2.5-Instruct-72B | 74.05 | 14.9 | 77.8 | 90.22 | 63.8 | 24.1 | 32.32 | 8.6 |
Qwen2.5-Instruct-7B | 42.01 | 5.22 | 50.48 | 67.05 | 31.88 | 1.08 | 7.43 | 0 |
The first-, second-, and third-place results for each metric are marked with red, green, and blue text, respectively. In the FIB task metrics, B denotes blank-level accuracy and Q denotes question-level accuracy; see the sketch below for how the two differ.
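To make the two FIB metrics concrete, here is a minimal Python sketch of one way to compute them, assuming B acc. counts each blank independently while Q acc. credits a question only when every blank in it is correct. The function name `fib_accuracies` and the case-insensitive exact-match rule are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Tuple

def fib_accuracies(predictions: List[List[str]],
                   references: List[List[str]]) -> Tuple[float, float]:
    """Compute blank-level (B acc.) and question-level (Q acc.) accuracy
    for a fill-in-the-blank task. Each question is a list of blanks."""
    total_blanks = correct_blanks = correct_questions = 0
    for pred, ref in zip(predictions, references):
        # Compare each predicted blank to its reference answer.
        hits = [p.strip().lower() == r.strip().lower()
                for p, r in zip(pred, ref)]
        correct_blanks += sum(hits)
        total_blanks += len(ref)  # missing predictions count as wrong
        # A question counts only if every one of its blanks is correct.
        if len(pred) == len(ref) and all(hits):
            correct_questions += 1
    return correct_blanks / total_blanks, correct_questions / len(references)

# Example: 2 questions, 3 blanks total; 2 blanks correct -> B acc. = 2/3,
# but only the first question is fully correct -> Q acc. = 1/2.
preds = [["alma"], ["piros", "zold"]]
refs  = [["alma"], ["piros", "zöld"]]
print(fib_accuracies(preds, refs))  # (0.666..., 0.5)
```

Under this reading, Q acc. is always at most B acc., which matches the table: every model's question-level score is well below its blank-level score.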
```bibtex
@misc{yang2025openhuevalevaluatinglargelanguage,
  title={OpenHuEval: Evaluating Large Language Model on Hungarian Specifics},
  author={Haote Yang and Xingjian Wei and Jiang Wu and Noémi Ligeti-Nagy and Jiaxing Sun and Yinfan Wang and Zijian Győző Yang and Junyuan Gao and Jingchao Wang and Bowen Jiang and Shasha Wang and Nanjun Yu and Zihao Zhang and Shixin Hong and Hongwei Liu and Wei Li and Songyang Zhang and Dahua Lin and Lijun Wu and Gábor Prószéky and Conghui He},
  year={2025},
  eprint={2503.21500},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.21500},
}
```