REST

Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

Zhuoshi Pan1,2, Qizhi Pei2,3, Yu Li2, Qiyao Sun2, Zinan Tang2,
H. Vicky Zhao1, Conghui He2, Lijun Wu2

1Tsinghua University, 2Shanghai AI Laboratory, 3Renmin University of China

Introduction

Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, which leads to two critical limitations: (1) vulnerability to data contamination and diminishing difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing the costly and perpetual creation of new questions with substantial human effort; (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment.

To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST specifically evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management.

Our evaluation of numerous advanced reasoning models on 7 reasoning benchmarks reveals several striking findings, detailed in the Key Findings section below.

These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation.

Evaluation Framework

REST (Reasoning Evaluation through Simultaneous Testing) transforms existing benchmarks by concatenating multiple questions into a single prompt and evaluating them at once, repurposing established benchmarks into more challenging variants.


Figure: Overview of REST

We reconstruct 7 representative reasoning benchmarks (e.g., MATH500, AIME24) into a multi-question format. For benchmark construction and evaluation details, please refer to our paper.
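
As an illustration, below is a minimal sketch of how a multi-question REST prompt can be assembled, assuming a simple header-plus-numbered-problems template; the exact instruction wording and answer format used in the paper may differ.

```python
def build_rest_prompt(questions: list[str]) -> str:
    """Concatenate several benchmark questions into one stress-test prompt.

    Illustrative template; the paper's actual instruction and
    answer-extraction format may differ.
    """
    header = (
        f"Please solve the following {len(questions)} problems. "
        "Answer each one separately."
    )
    body = "\n\n".join(
        f"Problem {i + 1}: {q}" for i, q in enumerate(questions)
    )
    return f"{header}\n\n{body}"

# Example: a stress level of 3 packs three questions into a single prompt.
prompt = build_rest_prompt([
    "What is 2 + 2?",
    "Solve x^2 - 5x + 6 = 0.",
    "How many primes are below 20?",
])
```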


3 Domains

  • MATH
  • Science
  • Code

7 Benchmarks

Based on 7 popular reasoning benchmarks:

MATH: GSM8K, MATH500, AMC23, AIME24, AIME25
Science: GPQA
Code: LiveCodeBench

5 Stress Levels

The number of questions in a prompt is (expressed as a configuration sketch after this list):

  • \(\{1, 3, 6, 9, 12\}\) for the easy task (GSM8K)
  • \(\{1, 3, 5, 7, 9\}\) for medium tasks (MATH500, AMC23)
  • \(\{1, 2, 3, 4, 5\}\) for hard tasks (AIME24, AIME25, GPQA, LiveCodeBench)
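
This schedule can be written as a simple benchmark-to-levels lookup (a sketch; the dictionary name is ours):

```python
# Stress levels (questions per prompt) by task difficulty, as listed above.
STRESS_LEVELS = {
    # easy
    "GSM8K": [1, 3, 6, 9, 12],
    # medium
    "MATH500": [1, 3, 5, 7, 9],
    "AMC23": [1, 3, 5, 7, 9],
    # hard
    "AIME24": [1, 2, 3, 4, 5],
    "AIME25": [1, 2, 3, 4, 5],
    "GPQA": [1, 2, 3, 4, 5],
    "LiveCodeBench": [1, 2, 3, 4, 5],
}
```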

Evaluation Results

We evaluate more than 30 LRMs spanning parameter sizes from 1.5B to 671B. Even SOTA LRMs like DeepSeek-R1 exhibit significant performance degradation under REST, such as a 29.17% accuracy drop on AIME24, revealing a critical limitation in their reasoning robustness. For more findings, see the "Key Findings" section below.
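
The degradation figures quoted here are simple differences, in percentage points, between single-question and stress accuracy; for example, using DeepSeek-R1's AIME24 numbers from the table below:

```python
# Accuracy drop = single-question accuracy - stress accuracy (percentage points).
single_acc = 81.66   # DeepSeek-R1 on AIME24, one question per prompt
stress_acc = 52.49   # DeepSeek-R1 on AIME24 under REST
drop = single_acc - stress_acc
print(f"Accuracy drop: {drop:.2f}")  # -> 29.17
```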




Figure: Model Performance on All Domains: Single Accuracy vs. Stress Accuracy

Figure: Model Performance on the Math Domain: Single Accuracy vs. Stress Accuracy

Figure: Model Performance on the Science Domain: Single Accuracy vs. Stress Accuracy

Figure: Model Performance on the Code Domain: Single Accuracy vs. Stress Accuracy

| Rank | Model | Single Score (Avg.) | Stress Score (Avg.) | GSM8K | MATH500 | AMC23 | AIME24 | AIME25 | GPQA | LiveCodeBench |
|------|-------|---------------------|---------------------|-------|---------|-------|--------|--------|------|---------------|
| 1 | O4-mini | 82.88 | 69.36 | 93.71 / 93.07 | 90.00 / 82.40 | 96.25 / 82.79 | 73.33 / 49.69 | 80.00 / 41.42 | 76.26 / 73.11 | 70.61 / 63.07 |
| 2 | Deepseek-R1 | 81.57 | 66.45 | 96.20 / 96.16 | 97.00 / 92.09 | 93.75 / 81.80 | 81.66 / 52.49 | 68.75 / 37.17 | 70.20 / 64.63 | 63.44 / 40.83 |
| 3 | Qwen-QwQ-32B | 80.20 | 65.67 | 95.83 / 95.78 | 96.20 / 92.49 | 95.00 / 82.89 | 78.75 / 54.79 | 69.58 / 41.53 | 63.64 / 60.03 | 62.37 / 32.16 |
| 4 | OpenThinker2-32B | 75.27 | 64.68 | 96.44 / 95.17 | 96.20 / 90.10 | 95.00 / 81.00 | 68.33 / 53.01 | 52.50 / 38.20 | 62.12 / 57.79 | 56.27 / 37.53 |
| 5 | Qwen3-32B | 81.28 | 62.62 | 95.91 / 93.81 | 96.60 / 83.64 | 93.75 / 80.62 | 85.00 / 56.67 | 71.67 / 44.53 | 63.33 / 48.13 | 62.72 / 30.98 |
| 6 | Deepseek-R1-Distill-Qwen-32B | 75.02 | 62.50 | 95.54 / 95.50 | 94.60 / 88.97 | 94.75 / 86.24 | 72.92 / 52.51 | 51.67 / 33.83 | 60.10 / 53.73 | 55.56 / 26.71 |
| 7 | AReaL-boba-SFT-32B | 78.44 | 60.61 | 95.01 / 94.75 | 95.00 / 88.92 | 97.50 / 78.96 | 77.50 / 45.79 | 60.00 / 33.55 | 63.13 / 50.59 | 60.93 / 31.74 |
| 8 | O3-mini | 80.44 | 58.58 | 95.83 / 93.85 | 95.00 / 86.62 | 90.00 / 59.17 | 79.16 / 34.07 | 71.66 / 20.63 | 71.21 / 67.39 | 60.21 / 48.36 |
| 9 | Light-R1-32B-DS | 78.82 | 57.23 | 95.83 / 94.79 | 95.60 / 83.66 | 96.25 / 68.80 | 77.50 / 41.26 | 60.00 / 33.80 | 65.66 / 50.11 | 60.93 / 28.22 |
| 10 | Qwen3-8B | 77.03 | 56.26 | 95.30 / 93.90 | 95.40 / 80.28 | 92.50 / 53.59 | 74.58 / 44.28 | 63.33 / 32.06 | 61.11 / 58.60 | 56.99 / 31.13 |
| 11 | Gemini-2.5-flash-thinking | 81.82 | 52.61 | 89.23 / 91.28 | 97.20 / 69.92 | 97.50 / 47.63 | 76.67 / 26.60 | 71.67 / 16.54 | 78.79 / 68.00 | 61.65 / 48.34 |
| 12 | Open-Reasoner-Zero-32B | 64.31 | 51.80 | 95.83 / 91.80 | 92.00 / 82.90 | 83.75 / 70.04 | 46.67 / 31.65 | 36.67 / 23.63 | 60.10 / 49.57 | 35.13 / 13.01 |
| 13 | OpenThinker2-7B | 65.48 | 50.25 | 94.39 / 91.99 | 93.80 / 83.30 | 85.00 / 63.23 | 54.58 / 34.50 | 41.67 / 23.66 | 49.49 / 40.60 | 39.43 / 14.51 |
| 14 | DeepSeek-R1-Distill-Llama-8B | 62.28 | 48.06 | 90.45 / 85.18 | 89.80 / 81.34 | 87.50 / 70.75 | 50.42 / 31.23 | 28.33 / 22.66 | 50.00 / 33.99 | 39.43 / 11.25 |
| 15 | Llama-3.1-Nemotron-Nano-8B-v1 | 70.14 | 47.72 | 91.36 / 53.32 | 94.40 / 86.04 | 90.00 / 76.24 | 63.33 / 43.55 | 50.00 / 32.28 | 51.01 / 34.06 | 50.90 / 8.56 |
| 16 | Qwen-2.5-32B-SimpleRL-Zoo | 51.94 | 46.46 | 96.06 / 93.49 | 83.20 / 78.90 | 67.50 / 57.02 | 27.20 / 16.80 | 16.67 / 8.87 | 46.46 / 46.21 | 26.52 / 23.95 |
| 17 | OpenR1-Qwen-7B | 56.43 | 44.46 | 95.60 / 90.22 | 92.20 / 81.64 | 83.75 / 54.11 | 47.50 / 26.77 | 32.92 / 21.19 | 38.38 / 36.04 | 4.66 / 1.27 |
| 18 | Qwen2.5-32B-Instruct | 49.17 | 42.46 | 95.53 / 93.77 | 82.20 / 73.39 | 60.00 / 49.72 | 20.00 / 9.61 | 16.67 / 6.73 | 42.93 / 40.04 | 26.88 / 24.00 |
| 19 | Efficient-R1-7B (\(\alpha=0.2\)) | 61.71 | 40.15 | 87.95 / 80.38 | 88.20 / 76.41 | 85.00 / 48.05 | 50.42 / 22.12 | 33.75 / 17.25 | 47.97 / 34.37 | 38.71 / 2.50 |
| 20 | Efficient-R1-7B (\(\alpha=0.1\)) | 63.45 | 40.00 | 88.63 / 84.76 | 90.00 / 74.99 | 87.50 / 44.25 | 54.58 / 21.45 | 35.42 / 15.75 | 48.98 / 35.90 | 39.07 / 2.87 |
| 21 | S1.1-32B | 65.51 | 38.54 | 89.84 / 61.10 | 90.40 / 53.85 | 90.00 / 32.26 | 55.83 / 24.42 | 45.42 / 19.13 | 61.62 / 54.54 | 25.45 / 24.46 |
| 22 | Qwen3-1.7B | 59.44 | 38.20 | 90.22 / 85.26 | 90.20 / 62.13 | 85.00 / 40.43 | 46.67 / 23.61 | 35.83 / 17.13 | 37.37 / 32.21 | 30.82 / 6.65 |
| 23 | L1-Qwen-1.5B-Max | 49.17 | 37.71 | 84.17 / 77.79 | 83.40 / 73.23 | 77.50 / 48.37 | 20.00 / 15.13 | 22.92 / 14.95 | 36.87 / 32.03 | 19.35 / 2.45 |
| 24 | L1-Qwen-1.5B-Exact | 47.40 | 36.75 | 84.87 / 78.51 | 84.00 / 72.07 | 71.25 / 47.37 | 21.25 / 12.62 | 18.33 / 12.96 | 33.84 / 31.01 | 18.28 / 2.70 |
| 25 | Deepseek-R1-Distill-Qwen-7B | 64.03 | 36.33 | 89.49 / 89.06 | 93.00 / 66.75 | 87.50 / 36.06 | 54.17 / 16.53 | 35.42 / 11.37 | 51.01 / 31.67 | 37.63 / 2.89 |
| 26 | Eurus-2-7B-PRIME | 46.57 | 35.06 | 92.72 / 88.01 | 81.40 / 64.69 | 62.50 / 38.58 | 20.83 / 10.84 | 14.58 / 4.49 | 38.89 / 29.51 | 15.05 / 9.28 |
| 27 | Light-R1-7B-DS | 64.84 | 34.58 | 88.05 / 82.69 | 93.20 / 61.73 | 90.00 / 34.91 | 55.83 / 16.63 | 45.83 / 12.96 | 41.91 / 30.32 | 39.07 / 2.85 |
| 28 | Qwen2.5-7B-Instruct | 39.42 | 34.43 | 92.27 / 85.12 | 77.60 / 65.78 | 42.50 / 34.46 | 10.00 / 7.02 | 3.75 / 3.32 | 35.86 / 35.15 | 13.98 / 10.19 |
| 29 | AReaL-boba-RL-7B | 67.42 | 34.16 | 91.66 / 77.80 | 95.00 / 60.77 | 91.25 / 32.94 | 61.25 / 21.43 | 45.83 / 12.33 | 48.98 / 29.13 | 37.99 / 4.73 |
| 30 | DeepScaleR-1.5B-Preview | 53.09 | 30.74 | 84.84 / 66.58 | 87.60 / 59.77 | 76.25 / 32.05 | 38.75 / 12.82 | 31.25 / 14.23 | 31.82 / 27.90 | 21.15 / 1.83 |
| 31 | Qwen-2.5-Math-7B-SimpleRL-Zoo | 44.72 | 30.32 | 90.52 / 84.01 | 77.80 / 62.41 | 68.50 / 16.16 | 26.67 / 7.55 | 10.00 / 6.47 | 33.84 / 35.49 | 5.73 / 0.16 |
| 32 | Qwen2.5-Math-7B-Instruct | 43.52 | 30.10 | 95.53 / 78.53 | 83.60 / 56.59 | 60.00 / 28.46 | 14.17 / 6.40 | 11.67 / 5.33 | 35.35 / 33.96 | 4.30 / 1.45 |
| 33 | Marco-o1 | 38.56 | 26.93 | 89.08 / 79.56 | 72.40 / 48.19 | 47.50 / 17.23 | 10.00 / 4.35 | 10.83 / 3.64 | 30.81 / 28.32 | 9.32 / 7.24 |
| 34 | Qwen2.5-Math-1.5B-Instruct | 38.01 | 25.37 | 85.37 / 67.49 | 73.00 / 53.94 | 57.50 / 22.22 | 10.83 / 6.17 | 10.83 / 2.83 | 26.77 / 24.58 | 1.79 / 0.37 |
| 35 | Open-Reasoner-Zero-7B | 46.22 | 24.82 | 92.87 / 65.14 | 83.00 / 32.51 | 60.00 / 31.23 | 17.92 / 6.13 | 16.25 / 3.89 | 37.37 / 34.25 | 16.13 / 0.64 |
| 36 | Deepseek-R1-Distill-Qwen-1.5B | 48.16 | 22.88 | 84.62 / 70.21 | 83.40 / 42.47 | 62.50 / 13.98 | 29.17 / 4.97 | 25.00 / 5.91 | 37.37 / 22.11 | 15.05 / 0.48 |
| 37 | Llama-3.1-8B-Instruct | 29.37 | 21.97 | 85.29 / 76.65 | 49.40 / 30.23 | 27.50 / 9.34 | 0.00 / 2.31 | 0.00 / 0.00 | 33.33 / 28.54 | 10.04 / 6.77 |
| 38 | Qwen2.5-1.5B-Instruct | 25.58 | 10.73 | 65.13 / 13.08 | 53.40 / 27.05 | 30.00 / 10.65 | 2.50 / 2.04 | 0.00 / 0.42 | 26.26 / 21.52 | 1.79 / 0.37 |

Table: Detailed Model Performance, Single Accuracy vs. Stress Accuracy. Each benchmark column lists Single / Stress accuracy.

Key Findings

(1) Even SOTA LRMs like DeepSeek-R1 exhibit significant performance degradation under REST, such as a 29.17% accuracy drop on AIME24, revealing a critical limitation in their reasoning robustness.


(2) Although models of different sizes demonstrate comparable, near-ceiling performance in traditional single-question evaluations, REST reveals significant disparities among them, thereby enhancing the discriminative power of existing benchmarks.


Figure: Performance comparison of LRMs of different sizes under various stress levels.

(3) Models trained with the "long2short" technique preserve more of their single-problem accuracy under REST, outperforming their standard-trained counterparts.


Figure: The effect of Long2Short training on REST.

(4) High-performing models on REST (e.g., Nemotron-nano-7B and DeepSeek-R1) tend to dynamically allocate fewer reasoning tokens to the first question when the stress level exceeds 1. In contrast, models with lower REST performance, such as DS-R1-Distill-Qwen-7B, often expend a disproportionate number of reasoning tokens on the first question.
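
One rough way to measure this allocation is sketched below, under our own assumptions: the reasoning trace is segmented on the "Problem k" markers used in the prompt, and whitespace splitting stands in for real tokenization.

```python
import re

def tokens_per_question(reasoning_trace: str, num_questions: int) -> list[int]:
    """Approximate reasoning tokens spent on each question position.

    Assumes the trace references questions with markers like "Problem 1:";
    uses whitespace splitting as a crude stand-in for a real tokenizer.
    """
    # Split the trace at each "Problem k" marker (illustrative heuristic).
    parts = re.split(r"Problem\s+\d+", reasoning_trace)[1:]
    counts = [len(part.split()) for part in parts]
    # Pad with zeros in case the model never reached the later questions.
    counts += [0] * (num_questions - len(counts))
    return counts[:num_questions]
```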


Figure: Reasoning token counts for questions at different positions on AIME24.

(5) LRMs generally achieve higher accuracy on earlier questions, while their performance declines on later ones.


Figure: The effect of question position under stress tests.

(6) Presenting questions from easy to hard consistently yields better overall accuracy than the reverse order (a sketch of the ordering step follows).
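
A minimal sketch of this ordering experiment, assuming each question record carries an illustrative "difficulty" field (e.g., its single-question error rate, or any other difficulty proxy):

```python
def order_easy_to_hard(questions: list[dict]) -> list[str]:
    """Sort questions by ascending difficulty before concatenation.

    The "difficulty" and "text" field names are illustrative stand-ins.
    """
    return [q["text"] for q in sorted(questions, key=lambda q: q["difficulty"])]
```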


Figure: The effect of question order on overall performance under stress tests.

BibTeX

@misc{pan2025REST,
    title={REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once}, 
    author={Zhuoshi Pan and Qizhi Pei and Yu Li and Qiyao Sun and Zinan Tang and H. Vicky Zhao and Conghui He and Lijun Wu},
    year={2025},
    eprint={2507.10541},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2507.10541}, 
}