REST
Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
Introduction
Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in two critical limitations: (1) vulnerability to data contamination and diminishing difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), which forces the costly, perpetual creation of new questions with substantial human effort, and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment.
To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously.
Beyond basic reasoning, REST specifically evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management.
Our evaluation of numerous advanced reasoning models on 7 reasoning benchmarks reveals several striking findings:
- Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing, challenging the prevailing assumption that "LLMs are multi-problem solvers".
- Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations.
- Several key mechanistic insights emerge from our analysis: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) models trained with the "long2short" technique preserve higher accuracy under REST, outperforming standard-trained counterparts.
Evaluation Framework
REST (Reasoning Evaluation through Simultaneous Testing) concatenates multiple questions into a single prompt and evaluates them at once, repurposing existing benchmarks into more challenging variants.
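As a minimal illustration, this packing can be as simple as joining several benchmark questions into one prompt and instructing the model to label each answer. The sketch below follows our own assumptions about the instruction wording and answer numbering; it is not necessarily the exact prompt template used in REST.

```python
# Minimal sketch: turn k single questions into one REST-style multi-question prompt.
# The instruction wording and "Answer k:" numbering scheme are illustrative assumptions.
def build_rest_prompt(questions: list[str]) -> str:
    header = (
        "Solve the following problems. Answer each one separately, "
        "labeling your answers as 'Answer 1:', 'Answer 2:', and so on.\n\n"
    )
    body = "\n\n".join(f"Problem {i + 1}: {q}" for i, q in enumerate(questions))
    return header + body
```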

Overview of REST
We reconstruct 7 representative reasoning benchmarks (e.g., MATH500, AIME24) into a multi-question format. For benchmark construction and evaluation details, please refer to our paper.
3 Domains
- MATH
- Science
- Code
7 Benchmarks
Built from 7 popular reasoning benchmarks: GSM8K, MATH500, AMC23, AIME24, AIME25, GPQA, and LiveCodeBench.
5 Stress Levels
The number of questions in a prompt is:
- \(\{1, 3, 6, 9, 12\}\) for the easy task (GSM8K)
- \(\{1, 3, 5, 7, 9\}\) for medium tasks (MATH500, AMC23)
- \(\{1, 2, 3, 4, 5\}\) for hard tasks (AIME24, AIME25, GPQA, LiveCodeBench)
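Putting the benchmarks and stress levels together, a simple construction loop might look like the sketch below. The STRESS_LEVELS mapping mirrors the lists above, but the random grouping of questions is our own assumption; see the paper for the actual construction details.

```python
import random

# Stress levels from the list above: number of questions packed into one prompt.
STRESS_LEVELS = {
    "GSM8K": [1, 3, 6, 9, 12],                              # easy
    "MATH500": [1, 3, 5, 7, 9], "AMC23": [1, 3, 5, 7, 9],   # medium
    "AIME24": [1, 2, 3, 4, 5], "AIME25": [1, 2, 3, 4, 5],   # hard
    "GPQA": [1, 2, 3, 4, 5], "LiveCodeBench": [1, 2, 3, 4, 5],
}

def make_stressed_prompts(benchmark, questions, level, seed=0):
    """Group a benchmark's questions into chunks of `level` questions and build
    one multi-question prompt per chunk (grouping strategy is illustrative)."""
    if level not in STRESS_LEVELS[benchmark]:
        raise ValueError(f"{level} is not a listed stress level for {benchmark}")
    qs = list(questions)
    random.Random(seed).shuffle(qs)
    chunks = [qs[i:i + level] for i in range(0, len(qs), level)]
    return [build_rest_prompt(chunk) for chunk in chunks]  # sketch defined above
```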
Evaluation Results
We evaluate more than 30 LRMs, spanning parameter sizes from 1.5B to 671B. Even SOTA LRMs like DeepSeek-R1 exhibit significant performance degradation under REST, such as a 29.17% accuracy drop on AIME24, revealing a critical limitation in their reasoning robustness. For more findings, please see the "Key Findings" section below.
| Rank | Model | Single Score (Avg.) | Stress Score (Avg.) | GSM8K (Single) | GSM8K (Stress) | MATH500 (Single) | MATH500 (Stress) | AMC23 (Single) | AMC23 (Stress) | AIME24 (Single) | AIME24 (Stress) | AIME25 (Single) | AIME25 (Stress) | GPQA (Single) | GPQA (Stress) | LiveCodeBench (Single) | LiveCodeBench (Stress) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
1 | O4-mini | 82.88 | 69.36 | 93.71 | 93.07 | 90.00 | 82.40 | 96.25 | 82.79 | 73.33 | 49.69 | 80.00 | 41.42 | 76.26 | 73.11 | 70.61 | 63.07 |
2 | Deepseek-R1 | 81.57 | 66.45 | 96.20 | 96.16 | 97.00 | 92.09 | 93.75 | 81.80 | 81.66 | 52.49 | 68.75 | 37.17 | 70.20 | 64.63 | 63.44 | 40.83 |
3 | Qwen-QwQ-32B | 80.20 | 65.67 | 95.83 | 95.78 | 96.20 | 92.49 | 95.00 | 82.89 | 78.75 | 54.79 | 69.58 | 41.53 | 63.64 | 60.03 | 62.37 | 32.16 |
4 | OpenThinker2-32B | 75.27 | 64.68 | 96.44 | 95.17 | 96.20 | 90.10 | 95.00 | 81.00 | 68.33 | 53.01 | 52.50 | 38.20 | 62.12 | 57.79 | 56.27 | 37.53 |
5 | Qwen3-32B | 81.28 | 62.62 | 95.91 | 93.81 | 96.60 | 83.64 | 93.75 | 80.62 | 85.00 | 56.67 | 71.67 | 44.53 | 63.33 | 48.13 | 62.72 | 30.98 |
6 | Deepseek-R1-Distill-Qwen-32B | 75.02 | 62.50 | 95.54 | 95.50 | 94.60 | 88.97 | 94.75 | 86.24 | 72.92 | 52.51 | 51.67 | 33.83 | 60.10 | 53.73 | 55.56 | 26.71 |
7 | AReaL-boba-SFT-32B | 78.44 | 60.61 | 95.01 | 94.75 | 95.00 | 88.92 | 97.50 | 78.96 | 77.50 | 45.79 | 60.00 | 33.55 | 63.13 | 50.59 | 60.93 | 31.74 |
8 | O3-mini | 80.44 | 58.58 | 95.83 | 93.85 | 95.00 | 86.62 | 90.00 | 59.17 | 79.16 | 34.07 | 71.66 | 20.63 | 71.21 | 67.39 | 60.21 | 48.36 |
9 | Light-R1-32B-DS | 78.82 | 57.23 | 95.83 | 94.79 | 95.60 | 83.66 | 96.25 | 68.80 | 77.50 | 41.26 | 60.00 | 33.80 | 65.66 | 50.11 | 60.93 | 28.22 |
10 | Qwen3-8B | 77.03 | 56.26 | 95.30 | 93.90 | 95.40 | 80.28 | 92.50 | 53.59 | 74.58 | 44.28 | 63.33 | 32.06 | 61.11 | 58.60 | 56.99 | 31.13 |
11 | Gemini-2.5-flash-thinking | 81.82 | 52.61 | 89.23 | 91.28 | 97.20 | 69.92 | 97.50 | 47.63 | 76.67 | 26.60 | 71.67 | 16.54 | 78.79 | 68.00 | 61.65 | 48.34 |
12 | Open-Reasoner-Zero-32B | 64.31 | 51.80 | 95.83 | 91.80 | 92.00 | 82.90 | 83.75 | 70.04 | 46.67 | 31.65 | 36.67 | 23.63 | 60.10 | 49.57 | 35.13 | 13.01 |
13 | OpenThinker2-7B | 65.48 | 50.25 | 94.39 | 91.99 | 93.80 | 83.30 | 85.00 | 63.23 | 54.58 | 34.50 | 41.67 | 23.66 | 49.49 | 40.60 | 39.43 | 14.51 |
14 | DeepSeek-R1-Distill-Llama-8B | 62.28 | 48.06 | 90.45 | 85.18 | 89.80 | 81.34 | 87.50 | 70.75 | 50.42 | 31.23 | 28.33 | 22.66 | 50.00 | 33.99 | 39.43 | 11.25 |
15 | Llama-3.1-Nemotron-Nano-8B-v1 | 70.14 | 47.72 | 91.36 | 53.32 | 94.40 | 86.04 | 90.00 | 76.24 | 63.33 | 43.55 | 50.00 | 32.28 | 51.01 | 34.06 | 50.90 | 8.56 |
16 | Qwen-2.5-32B-SimpleRL-Zoo | 51.94 | 46.46 | 96.06 | 93.49 | 83.20 | 78.90 | 67.50 | 57.02 | 27.20 | 16.80 | 16.67 | 8.87 | 46.46 | 46.21 | 26.52 | 23.95 |
17 | OpenR1-Qwen-7B | 56.43 | 44.46 | 95.60 | 90.22 | 92.20 | 81.64 | 83.75 | 54.11 | 47.50 | 26.77 | 32.92 | 21.19 | 38.38 | 36.04 | 4.66 | 1.27 |
18 | Qwen2.5-32B-Instruct | 49.17 | 42.46 | 95.53 | 93.77 | 82.20 | 73.39 | 60.00 | 49.72 | 20.00 | 9.61 | 16.67 | 6.73 | 42.93 | 40.04 | 26.88 | 24.00 |
19 | Efficient-R1-7B(\(\alpha=0.2\)) | 61.71 | 40.15 | 87.95 | 80.38 | 88.20 | 76.41 | 85.00 | 48.05 | 50.42 | 22.12 | 33.75 | 17.25 | 47.97 | 34.37 | 38.71 | 2.50 |
20 | Efficient-R1-7B(\(\alpha=0.1\)) | 63.45 | 40.00 | 88.63 | 84.76 | 90.00 | 74.99 | 87.50 | 44.25 | 54.58 | 21.45 | 35.42 | 15.75 | 48.98 | 35.90 | 39.07 | 2.87 |
21 | S1.1-32B | 65.51 | 38.54 | 89.84 | 61.10 | 90.40 | 53.85 | 90.00 | 32.26 | 55.83 | 24.42 | 45.42 | 19.13 | 61.62 | 54.54 | 25.45 | 24.46 |
22 | Qwen3-1.7B | 59.44 | 38.20 | 90.22 | 85.26 | 90.20 | 62.13 | 85.00 | 40.43 | 46.67 | 23.61 | 35.83 | 17.13 | 37.37 | 32.21 | 30.82 | 6.65 |
23 | L1-Qwen-1.5B-Max | 49.17 | 37.71 | 84.17 | 77.79 | 83.40 | 73.23 | 77.50 | 48.37 | 20.00 | 15.13 | 22.92 | 14.95 | 36.87 | 32.03 | 19.35 | 2.45 |
24 | L1-Qwen-1.5B-Exact | 47.40 | 36.75 | 84.87 | 78.51 | 84.00 | 72.07 | 71.25 | 47.37 | 21.25 | 12.62 | 18.33 | 12.96 | 33.84 | 31.01 | 18.28 | 2.70 |
25 | Deepseek-R1-Distill-Qwen-7B | 64.03 | 36.33 | 89.49 | 89.06 | 93.00 | 66.75 | 87.50 | 36.06 | 54.17 | 16.53 | 35.42 | 11.37 | 51.01 | 31.67 | 37.63 | 2.89 |
26 | Eurus-2-7B-PRIME | 46.57 | 35.06 | 92.72 | 88.01 | 81.40 | 64.69 | 62.50 | 38.58 | 20.83 | 10.84 | 14.58 | 4.49 | 38.89 | 29.51 | 15.05 | 9.28 |
27 | Light-R1-7B-DS | 64.84 | 34.58 | 88.05 | 82.69 | 93.20 | 61.73 | 90.00 | 34.91 | 55.83 | 16.63 | 45.83 | 12.96 | 41.91 | 30.32 | 39.07 | 2.85 |
28 | Qwen2.5-7B-Instruct | 39.42 | 34.43 | 92.27 | 85.12 | 77.60 | 65.78 | 42.50 | 34.46 | 10.00 | 7.02 | 3.75 | 3.32 | 35.86 | 35.15 | 13.98 | 10.19 |
29 | AReaL-boba-RL-7B | 67.42 | 34.16 | 91.66 | 77.80 | 95.00 | 60.77 | 91.25 | 32.94 | 61.25 | 21.43 | 45.83 | 12.33 | 48.98 | 29.13 | 37.99 | 4.73 |
30 | DeepScaleR-1.5B-Preview | 53.09 | 30.74 | 84.84 | 66.58 | 87.60 | 59.77 | 76.25 | 32.05 | 38.75 | 12.82 | 31.25 | 14.23 | 31.82 | 27.90 | 21.15 | 1.83 |
31 | Qwen-2.5-Math-7B-SimpleRL-Zoo | 44.72 | 30.32 | 90.52 | 84.01 | 77.80 | 62.41 | 68.50 | 16.16 | 26.67 | 7.55 | 10.00 | 6.47 | 33.84 | 35.49 | 5.73 | 0.16 |
32 | Qwen2.5-Math-7B-Instruct | 43.52 | 30.10 | 95.53 | 78.53 | 83.60 | 56.59 | 60.00 | 28.46 | 14.17 | 6.40 | 11.67 | 5.33 | 35.35 | 33.96 | 4.30 | 1.45 |
33 | Marco-o1 | 38.56 | 26.93 | 89.08 | 79.56 | 72.40 | 48.19 | 47.50 | 17.23 | 10.00 | 4.35 | 10.83 | 3.64 | 30.81 | 28.32 | 9.32 | 7.24 |
34 | Qwen2.5-Math-1.5B-Instruct | 38.01 | 25.37 | 85.37 | 67.49 | 73.00 | 53.94 | 57.50 | 22.22 | 10.83 | 6.17 | 10.83 | 2.83 | 26.77 | 24.58 | 1.79 | 0.37 |
35 | Open-Reasoner-Zero-7B | 46.22 | 24.82 | 92.87 | 65.14 | 83.00 | 32.51 | 60.00 | 31.23 | 17.92 | 6.13 | 16.25 | 3.89 | 37.37 | 34.25 | 16.13 | 0.64 |
36 | Deepseek-R1-Distill-Qwen-1.5B | 48.16 | 22.88 | 84.62 | 70.21 | 83.40 | 42.47 | 62.50 | 13.98 | 29.17 | 4.97 | 25.00 | 5.91 | 37.37 | 22.11 | 15.05 | 0.48 |
37 | Llama-3.1-8B-Instruct | 29.37 | 21.97 | 85.29 | 76.65 | 49.40 | 30.23 | 27.50 | 9.34 | 0.00 | 2.31 | 0.00 | 0.00 | 33.33 | 28.54 | 10.04 | 6.77 |
38 | Qwen2.5-1.5B-Instruct | 25.58 | 10.73 | 65.13 | 13.08 | 53.40 | 27.05 | 30.00 | 10.65 | 2.50 | 2.04 | 0.00 | 0.42 | 26.26 | 21.52 | 1.79 | 0.37 |
Table: Detailed Model Performance, Single Accuracy vs. Stress Accuracy
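One way to read the table is to compute, for each model, the share of its single-question accuracy that survives under stress. The short sketch below does this for a few rows of the table; this "retention" ratio is our own derived quantity, not a score reported by REST itself.

```python
# Sketch: relative accuracy retained under stress, from the table's (single, stress)
# average scores. Values copied from the table above.
scores = {
    "O4-mini": (82.88, 69.36),
    "Deepseek-R1": (81.57, 66.45),
    "Qwen-QwQ-32B": (80.20, 65.67),
    "Gemini-2.5-flash-thinking": (81.82, 52.61),
}

for model, (single, stress) in sorted(
    scores.items(), key=lambda kv: kv[1][1] / kv[1][0], reverse=True
):
    retention = 100 * stress / single
    print(f"{model:<28} retains {retention:5.1f}% of its single-question accuracy")
```

Models with nearly identical single-question averages (around 80-82 here) separate clearly once retention under stress is considered, which is the discriminative power discussed in the findings below.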
Key Findings
(1) Even SOTA LRMs like DeepSeek-R1 exhibit significant performance degradation under REST, such as a 29.17% accuracy drop on AIME24, revealing a critical limitation in their reasoning robustness.
(2) Although models of different sizes demonstrate comparable, near-ceiling performance in traditional single-question evaluations, REST reveals significant disparities among them, thereby enhancing the discriminative power of existing benchmarks.

Figure: Performance comparison of LRMs of different sizes under various stress levels.
(3) Models trained with the "long2short" technique preserve more of their single-problem accuracy under REST, outperforming standard-trained counterparts.

Figure: The effect of Long2Short training on REST.
(4) Models that perform well on REST (e.g., Llama-3.1-Nemotron-Nano-8B and DeepSeek-R1) tend to dynamically allocate fewer reasoning tokens to the first question when the stress level exceeds 1. In contrast, models with lower REST performance, such as DS-R1-Distill-Qwen-7B, often spend a large share of their reasoning tokens on the first question.

Figure: The reasoning token count for questions at different positions on AIME24.
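For readers who want to reproduce this kind of positional analysis, the sketch below segments a reasoning trace at labeled answer markers and counts the tokens spent before each one. The "Answer k:" delimiters (matching the prompt sketch earlier) and the tokenizer interface (anything exposing an `encode` method, e.g., a Hugging Face tokenizer) are assumptions for illustration; the paper's exact segmentation procedure may differ.

```python
import re

def tokens_per_question(reasoning_trace: str, num_questions: int, tokenizer) -> list[int]:
    """Rough sketch: count reasoning tokens spent before each labeled answer.

    Assumes the model emits 'Answer k:' markers; real traces may require a
    different segmentation heuristic.
    """
    counts = []
    prev_end = 0
    for k in range(1, num_questions + 1):
        match = re.search(rf"Answer\s+{k}\s*:", reasoning_trace[prev_end:])
        if match is None:
            counts.append(0)  # question never explicitly answered
            continue
        segment = reasoning_trace[prev_end:prev_end + match.start()]
        counts.append(len(tokenizer.encode(segment)))
        prev_end += match.end()
    return counts
```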
(5) LRMs generally achieve higher accuracy for earlier questions, while their performance declines for subsequent ones.

Figure: The effect of question position under stress tests.
(6) Presenting questions from easy to hard consistently yields better overall accuracy compared to the reverse order.

Figure: The effect of question order on overall performance under stress tests.
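This ordering effect suggests a simple packing heuristic: sort questions by estimated difficulty before concatenating them into a prompt. In the sketch below, the per-question difficulty score (e.g., an independently measured solve rate) is a hypothetical input rather than a field provided by the benchmarks.

```python
# Sketch: order questions from easy to hard before building a REST prompt.
# `difficulty` maps each question to a score where lower means easier; this
# score is an illustrative assumption, not part of the benchmark data.
def order_easy_to_hard(questions: list[str], difficulty: dict[str, float]) -> list[str]:
    return sorted(questions, key=lambda q: difficulty.get(q, 0.5))

# Usage (hypothetical inputs), reusing the prompt sketch from earlier:
# prompt = build_rest_prompt(order_easy_to_hard(question_batch, difficulty_scores))
```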
BibTeX
@misc{pan2025REST,
  title={REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once},
  author={Zhuoshi Pan and Qizhi Pei and Yu Li and Qiyao Sun and Zinan Tang and H. Vicky Zhao and Conghui He and Lijun Wu},
  year={2025},
  eprint={2507.10541},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.10541},
}