Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Pith reviewed 2026-05-10 14:09 UTC · model grok-4.3
The pith
Allocating test-time compute adaptively per prompt lets a smaller LLM outperform a 14x larger model on problems where its base success rate is non-trivial.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that scaling test-time computation with a difficulty-aware adaptive strategy, using either verifier-guided search or distribution updates, yields higher performance per unit of compute than fixed strategies. In FLOPs-matched comparisons, it also allows a smaller model to surpass a much larger one on tasks the smaller model can already solve with non-trivial probability.
What carries the argument
A compute-optimal scaling strategy that selects and allocates test-time compute per prompt according to its difficulty, switching between verifier-guided search and adaptive distribution updates to maximize output quality for the given inference budget.
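A minimal Python sketch of the per-prompt selection described above, under stated assumptions. The strategy names, the five equal-width difficulty bins, and the validation table are hypothetical illustrations, not the paper's implementation: difficulty is approximated by an estimated pass rate, and each bin is mapped to whichever strategy scored best on held-out data at a fixed budget.

```python
from typing import Dict, Tuple


def difficulty_bin(est_pass_rate: float, n_bins: int = 5) -> int:
    """Map an estimated pass rate in [0, 1] to a bin index; 0 is easiest.

    Equal-width bins are an assumption made for illustration.
    """
    if not 0.0 <= est_pass_rate <= 1.0:
        raise ValueError("est_pass_rate must lie in [0, 1]")
    return min(n_bins - 1, int((1.0 - est_pass_rate) * n_bins))


def best_strategy_per_bin(
    val_accuracy: Dict[Tuple[int, str], float]
) -> Dict[int, str]:
    """For each difficulty bin, keep the strategy with the highest
    validation accuracy (all measured at the same compute budget)."""
    best: Dict[int, Tuple[str, float]] = {}
    for (bin_id, strategy), acc in val_accuracy.items():
        if bin_id not in best or acc > best[bin_id][1]:
            best[bin_id] = (strategy, acc)
    return {bin_id: strategy for bin_id, (strategy, _) in best.items()}


# Hypothetical validation numbers, not results from the paper.
val_accuracy = {
    (0, "sequential_revisions"): 0.90,
    (0, "verifier_search"): 0.85,
    (4, "sequential_revisions"): 0.10,
    (4, "verifier_search"): 0.20,
}
policy = best_strategy_per_bin(val_accuracy)
chosen = policy[difficulty_bin(0.95)]  # an easy prompt falls in bin 0
```

The lookup table keeps the allocation rule cheap at inference time; the quality of the rule then rests entirely on how well the pass-rate estimate tracks true difficulty.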
If this is right
- Test-time compute can be traded against pre-training compute to achieve higher performance at lower total resource cost.
- Adaptive allocation per prompt is required to obtain the reported efficiency gains over non-adaptive baselines.
- On tasks where a base model already succeeds with some probability, extra inference compute can substitute for increases in model size.
- The tradeoff between inference-time and pre-training compute shifts in favor of the former when the right adaptive method is used.
Where Pith is reading between the lines
- This result suggests that model training objectives could be redesigned to better support subsequent test-time search and adaptation.
- Resource allocation in large-scale AI systems may move toward lighter pretrained models paired with strong inference-time engines.
- Extending the adaptive allocation idea to longer-horizon or multi-step tasks could support iterative self-improvement loops without further pretraining.
Load-bearing premise
The effectiveness of different test-time scaling methods varies predictably with prompt difficulty in a manner that permits reliable adaptive allocation without introducing new errors or overhead.
What would settle it
Direct measurement on a held-out set of prompts showing that the adaptive per-prompt allocation fails to deliver any efficiency gain over a fixed best-of-N strategy or fails to let the smaller model exceed the 14x larger model in FLOPs-matched runs.
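One way to set up the FLOPs-matched side of such a test is sketched below. It assumes the common approximations of 6·N·D FLOPs for pretraining and 2·N·D FLOPs for inference (N parameters, D tokens) and that both models see the same number of pretraining tokens. The function name and the token counts are hypothetical, and the paper's exact accounting may differ.

```python
def matched_inference_multiplier(
    n_small: float,
    n_large: float,
    pretrain_tokens: float,
    inference_tokens: float,
) -> float:
    """Factor by which the small model may scale its inference tokens so that
    its total FLOPs (pretraining + inference) equal the large model's.

    Assumes pretraining FLOPs ~ 6*N*D and inference FLOPs ~ 2*N*D.
    """
    large_total = 6 * n_large * pretrain_tokens + 2 * n_large * inference_tokens
    small_pretrain = 6 * n_small * pretrain_tokens
    small_inference_per_unit = 2 * n_small * inference_tokens
    return (large_total - small_pretrain) / small_inference_per_unit


# Hypothetical setting: 14x parameter gap, equal pretraining and inference tokens.
k = matched_inference_multiplier(1e9, 14e9, 1e12, 1e12)
```

Under these assumptions and a 14x parameter gap, the multiplier simplifies to 14 + 39 · (pretraining tokens / inference tokens), so the headroom for test-time scaling shrinks toward 14x as inference tokens come to dominate.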
Original abstract
Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should tradeoff inference-time and pre-training compute. Despite its importance, little research attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
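For reference, a best-of-N baseline of the kind the abstract compares against can be sketched as follows. This is a minimal illustration assuming a verifier-weighted variant, in which verifier scores are summed over samples that share a final answer; the function name and the scores are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def best_of_n_weighted(samples: Iterable[Tuple[str, float]]) -> str:
    """Pick the final answer with the largest total verifier score.

    `samples` holds (final_answer, verifier_score) pairs for N sampled
    responses. Raises ValueError if `samples` is empty.
    """
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    if not totals:
        raise ValueError("need at least one sample")
    return max(totals, key=totals.get)


# Hypothetical scores: "42" wins on total score (1.1) even though
# the single highest-scoring sample answered "41" (0.9).
picked = best_of_n_weighted([("42", 0.6), ("41", 0.9), ("42", 0.5)])
```

Because compute here is spent uniformly (N samples for every prompt), this baseline is the natural point of comparison for any per-prompt adaptive scheme.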
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies scaling of test-time computation in LLMs via two mechanisms: searching with process-based verifier reward models and adaptive updates to the response distribution. It finds that effectiveness varies with prompt difficulty, motivating a compute-optimal adaptive allocation strategy. This strategy is claimed to improve efficiency by more than 4x over best-of-N and enable a smaller model to outperform a 14x larger model in FLOPs-matched settings on suitable prompts.
Significance. Should the results prove robust, the work is significant in demonstrating that test-time compute scaling can be more effective than parameter scaling for LLMs. It offers insights into optimal compute allocation and has implications for building self-improving AI agents and rethinking pretraining vs inference tradeoffs. The empirical demonstration of difficulty-dependent performance is a key contribution.
major comments (2)
- [Compute-optimal strategy section (likely §4.3)] The paper motivates the adaptive allocation from observed variation in method effectiveness with difficulty but does not account for the compute cost or error introduced by the difficulty estimator. This is load-bearing for the 4x efficiency claim and the 14x model outperformance, as misallocations or added FLOPs could invalidate the FLOPs-matched comparisons.
- [Experimental results (likely §5)] The results lack sufficient details on experimental setup, including specific benchmarks, baselines, statistical significance, error bars, and exact FLOPs calculation methodology for the adaptive methods. This hinders verification of the central empirical claims.
minor comments (2)
- [Abstract] The abstract could specify the base models and datasets used to provide context for the 14x larger model comparison.
- [Methods] Clarify the distinction between process-based and outcome-based verifiers in the methods section to avoid potential confusion.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments help clarify the presentation of our adaptive compute-optimal strategy and improve the experimental details. We respond to each major comment below and indicate the revisions we will make.
-
Referee: [Compute-optimal strategy section (likely §4.3)] The paper motivates the adaptive allocation from observed variation in method effectiveness with difficulty but does not account for the compute cost or error introduced by the difficulty estimator. This is load-bearing for the 4x efficiency claim and the 14x model outperformance, as misallocations or added FLOPs could invalidate the FLOPs-matched comparisons.
Authors: We agree that a thorough accounting of the difficulty estimator is necessary to support the efficiency claims. In the revised manuscript, we will add a dedicated analysis in the compute-optimal strategy section. This will include the computational overhead of the estimator (which is a small fraction of the total FLOPs), its prediction accuracy, and sensitivity analysis showing that the reported 4x efficiency improvement and the outperformance results remain valid even when including estimator costs and accounting for potential errors in difficulty assessment. revision: yes
-
Referee: [Experimental results (likely §5)] The results lack sufficient details on experimental setup, including specific benchmarks, baselines, statistical significance, error bars, and exact FLOPs calculation methodology for the adaptive methods. This hinders verification of the central empirical claims.
Authors: We acknowledge the need for greater experimental transparency. The updated manuscript will provide comprehensive details on the experimental setup in §5, including the specific benchmarks employed, all baselines considered, results with error bars from multiple independent runs to establish statistical significance, and a clear, reproducible methodology for calculating FLOPs for both fixed and adaptive test-time compute strategies. revision: yes
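The first exchange above turns on simple accounting, which can be sketched as follows. The sketch assumes all costs are expressed in one unit (for example FLOPs per prompt) and that the baseline and adaptive budgets are those needed to reach the same accuracy; the function name and the numbers are hypothetical.

```python
def effective_gain(
    baseline_cost: float, adaptive_cost: float, estimator_cost: float
) -> float:
    """Efficiency gain of the adaptive strategy over the baseline once the
    difficulty estimator's own compute is charged to the adaptive side."""
    total_adaptive = adaptive_cost + estimator_cost
    if total_adaptive <= 0:
        raise ValueError("adaptive cost plus estimator cost must be positive")
    return baseline_cost / total_adaptive


# Hypothetical budgets, not figures from the paper.
gain_without_estimator = effective_gain(64.0, 16.0, 0.0)
gain_with_estimator = effective_gain(64.0, 16.0, 4.0)
```

With these illustrative numbers the gain falls from 4x to 3.2x once the estimator is charged, which is why the referee asks for the estimator's cost to be reported explicitly.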
Circularity Check
No circularity: claims rest on direct experimental comparisons
The paper presents empirical results on test-time compute scaling for LLMs, comparing methods like search against verifiers and adaptive distribution updates. The central finding—that a compute-optimal adaptive strategy yields >4x efficiency gains and allows a smaller model to outperform a 14x larger one in FLOPs-matched settings—is supported by reported experiments on prompt difficulty variation, not by any self-referential definitions, fitted inputs renamed as predictions, or load-bearing self-citations. No equations reduce to tautologies, and the adaptive allocation is described as motivated by observations then validated experimentally rather than derived by construction from prior author work.
Forward citations
Cited by 60 Pith papers
-
Efficiently Representing Algorithms With Chain-of-Thought Transformers
CoT transformers simulate any Word RAM algorithm with poly-logarithmic overhead in three architectures, improving on quadratic TM overhead.
-
Entropy-Gated Latent Recursion
EGLR adds a deterministic layer-recursion axis gated by entropy that is complementary to temperature sampling, raising joint oracle accuracy on MATH-500 from 83.4% to 91.6% for a 3B model.
-
UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL
UniQL is a human-verified benchmark providing aligned natural language questions and dialect-specific SQL queries for 16 SQL systems to evaluate cross-dialect generalization.
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
-
MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....
-
Test-Time Training with KV Binding Is Secretly Linear Attention
Test-time training with KV binding reduces to learned linear attention.
-
MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.
-
Do generative video models understand physical principles?
Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.
-
Agentic generation of verifiable rules for deterministic, self-expanding reaction classification
Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.
-
MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark
MSQA benchmark shows LLMs exhibit cultural degradation and a locality effect where competence tracks pre-training exposure more than reasoning, and common inference-time fixes do not resolve it.
-
Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models
Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.
-
The Inference-Compute Frontier and a Latency-Efficient Architecture for Limit Order Book Prediction
Empirical power-law frontier between predictive loss and structural forward work in LOB models extrapolates to held-out high-compute architectures with R²=0.941, motivating FastBiNLOB which exceeds SOTA macro-F1 at lo...
-
Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing
LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.
-
Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing
Local Branch Routing (LBR) is a token-level framework for test-time scaling in language models that uses local branch hidden states for routing and supports end-to-end RL, showing gains in Pass@1 and Pass@32 on math r...
-
SPIRAL: Learning to Search and Aggregate
SPIRAL is a reinforcement learning framework that jointly optimizes sequential reasoning, parallel trace generation, and aggregation in language models for improved test-time performance.
-
SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs
SPOT-E uses entropy shaping on answer predictions with low-entropy anchors to optimize visual spotlights at test time via GRPO for better VLM performance on evidence-intensive tasks.
-
Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning
SWITCH uses explicit <swi> and </swi> boundary tokens to make latent chain-of-thought compatible with on-policy RL (GRPO) and open to causal mechanistic probing, outperforming prior hidden-state recurrence methods.
-
MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling
MARS is a margin-adversarial stopping rule for parallel LLM test-time scaling that saves 25-47% tokens while matching full-budget majority-vote accuracy by learning trace switch probabilities and applying adversarial bounds.
-
Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning
QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without ...
-
KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty
KCSAT-ML benchmark supplies human error rates for math problems and DRG metric exposes that model accuracy collapses on high-human-error items while test-time scaling shows non-monotonic gains and alignment failures.
-
The Hidden Bias of Process Reward Models: PRISM for Rewarding the Right Reasoning
PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable re...
-
Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)
Three problem-level trajectory features derived from the distributional signature of failed LLM rollouts enable failure clustering at 84.3% accuracy and a training-free routing rule that improves rescue by 12.2% on ha...
-
Alpha-RTL: Test-Time Training for RTL Hardware Optimization
TTT-RTL performs per-design test-time RL on an LLM policy with EDA-derived PPA rewards and an adaptive KL controller, reducing geometric-mean PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial FPU unit.
-
Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?
LLMs achieve up to 78.8% accuracy and r=0.590 correlation mimicking individual SOEP respondents using cumulative microdata, with gains from more information but diminishing returns past the 75% entropy point.
-
Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation
Consequence-aware scheduler using an issue-text predictor routes more compute to high-cost failures and cuts cost-weighted loss by 22-33% versus difficulty-based allocation on SWE-bench tasks.
-
Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning
Rotate2Think estimates an orthogonal rotation from input to thinking embeddings via Procrustes analysis on a few examples and injects the resulting vector to prime reasoning traces, raising accuracy in 30 of 32 model-...
-
VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
VLMs act as teachers by deriving differentiable rewards from task rules to adapt VGMs via test-time LoRA optimization, delivering 16.7-point average gains on symbolic and general video reasoning benchmarks.
-
VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver an...
-
ATLAS: Agentic Test-time Learning-to-Allocate Scaling
ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.
-
Unlocking the Working Memory of Large Language Models for Latent Reasoning
RiM trains LLMs to perform latent reasoning via fixed memory blocks processed in one forward pass using a two-stage curriculum, matching or exceeding prior latent methods on benchmarks.
-
The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure
The paper identifies unfaithful capitulation, a failure mode where chain-of-thought remains correct but the emitted answer flips wrong under sustained adversarial pressure in multi-turn dialogue.
-
LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation
LaneRoPE adds an inter-sequence attention mask and extended RoPE to enable collaborative parallel sequence generation in LLMs, yielding accuracy gains on math reasoning under length limits.
-
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
-
HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection
HIDBench unifies DARPA-E3, DARPA-E5, and NodLink datasets with a data pipeline to benchmark LLMs for host-based intrusion detection, showing high precision on simple logs but sharp drops in MCC and rises in false posi...
-
Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation
Anchored Tree Sampling converts horizon-compounding drift into anchor-bounded drift by organizing video generation as a sparse-to-dense tree of imputations instead of left-to-right autoregressive rollout.
-
Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management
Autonomous AI agents outperform humans in supply chain simulations but exhibit an inherent agent bullwhip effect of amplified decision unreliability, mitigated by GRPO reinforcement learning post-training.
-
Learning How to Cube
A neuro-symbolic post-training pipeline lets a 4B transformer learn cubing heuristics that reach pass@5 of 53 on 100 SAT competition instances, matching the strongest symbolic baseline.
-
CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning
CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verif...
-
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
-
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
-
Test-Time Learning with an Evolving Library
EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
-
Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces
Language models produce overcomplete reasoning traces where on average 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.
-
Query-Conditioned Test-Time Self-Training for Large Language Models
QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
-
Query-Conditioned Test-Time Self-Training for Large Language Models
QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.
-
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
-
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents
VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habita...
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
Test-Time Compute for Frozen Embedding Models through Agentic Program Search
Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...
-
Test-Time Compute for Frozen Embedding Models through Agentic Program Search
A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.
-
Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities
Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
-
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
-
Active Testing of Large Language Models via Approximate Neyman Allocation
Proposes surrogate semantic entropy stratification followed by approximate Neyman allocation for active testing of LLMs on generative benchmarks, reporting up to 28% MSE reduction and 22.9% average budget savings vers...
-
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
RubricRefine is a training-free pre-execution method that creates rubrics to score and fix inter-tool contract violations in agent code, reaching 0.86 average on M3ToolEval across seven models with zero executions and...
-
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
RubricRefine improves tool-use agent reliability to 0.86 on M3ToolEval by generating rubrics for pre-execution contract checking and iterative repair, outperforming baselines at 2.6X lower latency while showing no gai...
-
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.
-
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
-
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.
-
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.
-
CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candida...
Reference graph
Works this paper leans on
-
[1]
Training revision models with synthetic data. Coming soon, 2024.
work page 2024
-
[2]
C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan. An introduction to mcmc for machine learning. 2003
work page 2003
-
[3]
R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, E. Chu, J. H. Clark, L. E. Shafey, Y. Huang, K. Meier-Hellstern, G. Mishra, E. Moreira, M. Omernick, K. Robinson, S. Ruder, Y. Tay, K. Xiao, Y. Xu, Y. Zhang, G. H. Abrego, J. Ahn, J. Austin, P. Barham, J. Botha, J. Bradbury, S. Brahma, K. Brooks, M...
work page 2023
-
[4]
Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. ...
work page 2022
-
[5]
C. Blakeney, M. Paul, B. W. Larsen, S. Owen, and J. Frankle. Does your data spark joy? performance gains from domain upsampling at the end of training, 2024. URL https://arxiv.org/abs/2406.03476
-
[6]
G. Chen, M. Liao, C. Li, and K. Fan. Alphamath almost zero: process supervision without process, 2024
work page 2024
- [7]
-
[8]
Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch. Improving factuality and reasoning in language models through multiagent debate, 2023
work page 2023
-
[9]
J. S. B. T. Evans. Heuristic and analytic processes in reasoning. British Journal of Psychology, 75(4): 451–468, 1984
work page 1984
-
[10]
X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024
work page 2024
-
[11]
L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. Pal: Program-aided language models, 2023. URL https://arxiv.org/abs/2211.10435
work page Pith review arXiv 2023
- [12]
-
[13]
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset, 2021
work page 2021
-
[14]
J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre. Training compute-optimal large language models, 2022
work page 2022
- [15]
-
[16]
A. L. Jones. Scaling scaling laws with board games, 2021. URL https://arxiv.org/abs/2104.03113
work page 2021
- [17]
-
[18]
Thinking, fast and slow
D. Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, New York, first paperback edition, 2013
work page 2013
-
[19]
L. Kocsis and C. Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006
work page 2006
-
[20]
A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra. Solving quantitative reasoning problems with language models, 2022
work page 2022
-
[21]
Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen. Making large language models better reasoners with step-aware verifier, 2023
work page 2023
-
[22]
H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step, 2023
work page 2023
- [23]
-
[24]
N. McAleese, R. Pokorny, J. F. Cerón Uribe, E. Nitishinskaya, M. Trębacz, and J. Leike. Llm critics help catch llm bugs. OpenAI, 2024
work page 2024
- [25]
-
[26]
Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023. URL https://arxiv.org/abs/2307.16789
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [27]
-
[28]
Y. Qu, T. Zhang, N. Garg, and A. Kumar. Recursive introspection: Teaching foundation models how to self-improve. 2024.
work page 2024
-
[29]
N. Sardana and J. Frankle. Beyond chinchilla-optimal: Accounting for inference in language model scaling laws, 2023
work page 2023
-
[30]
W. Saunders, C. Yeh, J. Wu, S. Bills, L. Ouyang, J. Ward, and J. Leike. Self-critiquing models for assisting human evaluators, 2022
work page 2022
-
[31]
Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold
A. Setlur, S. Garg, X. Geng, N. Garg, V. Smith, and A. Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold. arXiv preprint arXiv:2406.14532, 2024
-
[32]
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024
work page 2024
- [33]
- [34]
-
[35]
A. Singh, J. D. Co-Reyes, R. Agarwal, A. Anand, P. Patil, X. Garcia, P. J. Liu, J. Harrison, J. Lee, K. Xu, A. Parisi, A. Kumar, A. Alemi, A. Rizkowsky, A. Nova, B. Adlam, B. Bohnet, G. Elsayed, H. Sedghi, I. Mordatch, I. Simpson, I. Gur, J. Snoek, J. Pennington, J. Hron, K. Kenealy, K. Swersky, K. Mahajan, L. Culp, L. Xiao, M. L. Bileschi, N. Constant, R...
work page 2024
- [36]
-
[37]
K. Stechly, M. Marquez, and S. Kambhampati. Gpt-4 doesn’t know it’s wrong: An analysis of iterative prompting for reasoning problems, 2023
work page 2023
-
[38]
R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. Second edition, 2018
work page 2018
-
[39]
G. Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024
work page 2024
-
[40]
Y. Tian, B. Peng, L. Song, L. Jin, D. Yu, H. Mi, and D. Yu. Toward self-improvement of llms via imagination, searching, and criticizing, 2024
work page 2024
-
[41]
Llama 2: Open Foundation and Fine-Tuned Chat Models
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Ko...
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [42]
-
[43]
K. Valmeekam, M. Marquez, and S. Kambhampati. Can large language models really improve by self-critiquing their own plans?, 2023
work page 2023
-
[44]
P. Villalobos and D. Atkinson. Trading off compute in training and inference, 2023. URL https://epochai.org/blog/trading-off-compute-in-training-and-inference. Accessed: 2024-07-03
work page 2023
-
[45]
P. Wang, L. Li, Z. Shao, R. X. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2023
work page 2023
- [46]
-
[47]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023
work page 2023
-
[48]
S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023
work page 2023
-
[49]
Z. Yuan, H. Yuan, C. Li, G. Dong, K. Lu, C. Tan, C. Zhou, and J. Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023
work page 2023
-
[50]
E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman. Star: Bootstrapping reasoning with reasoning, 2022
work page 2022
-
[51]
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
E. Zelikman, G. Harik, Y. Shao, V. Jayasiri, N. Haber, and N. D. Goodman. Quiet-star: Language models can teach themselves to think before speaking, 2024. URL https://arxiv.org/abs/2403.09629
work page internal anchor Pith review arXiv 2024
Appendices
A. Related Work
Language model reasoning. Language model performance on challenging mathematical reasoning tasks has rapidly improved in recent years [...
-
[52]
improving the LLM proposal distribution by either applying targeted optimization on specific reasoning tasks by finetuning with RL [32, 35, 49, 50] enabling models to critique and revise their answers iteratively [4, 8, 23, 30]; 3) enabling LLMs to benefit from additional test-time computation by finetuning verifiers [6, 7, 10, 22, 40, 42, 45, 48]. Our wo...