✨ Framework for Analyzing the Thinking Process of LRM ✨

Evaluating Large Language Model on Hungarian Specifics

Analysis on HuSimpleQA
Examples of OpenHuEval
Method for analyzing the LRM's thinking process on HuSimpleQA
Examples of OpenHuEval
Average number of tokens and thoughts generated per response on HuSimpleQA
Examples of OpenHuEval
The distribution of thought correctness ratio in INCORRECT and NOT_ATTEMPTED responses. Red lines for HuSimpleQA and blue lines for MATH500-Hard from Wang et al..
Analysis on HuMatchingFIB
Examples of OpenHuEval
Method for analyzing the LRM's thinking process on HuMatchingFIB
Examples of OpenHuEval
Example for splitting DeepSeek-R1's thinking process into segments and categorizing these segments on HuMatchingFIB
Examples of OpenHuEval
Tagging results of LRM’s reasoning segments in the correctness and complexity dimensions.

🖊️ Citation

  @misc{yang2025openhuevalevaluatinglargelanguage,
    title={OpenHuEval: Evaluating Large Language Model on Hungarian Specifics}, 
    author={Haote Yang and Xingjian Wei and Jiang Wu and Noémi Ligeti-Nagy and Jiaxing Sun and Yinfan Wang and Zijian Győző Yang and Junyuan Gao and Jingchao Wang and Bowen Jiang and Shasha Wang and Nanjun Yu and Zihao Zhang and Shixin Hong and Hongwei Liu and Wei Li and Songyang Zhang and Dahua Lin and Lijun Wu and Gábor Prószéky and Conghui He},
    year={2025},
    eprint={2503.21500},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2503.21500}, 
}