Grounded Optimization: A Layered Engineering Framework for Reducing LLM Hallucination in Automated Personal Document Rewriting
Pith reviewed 2026-07-03 21:02 UTC · model grok-4.3
The pith
A five-layer framework reduces detected hallucinations in LLM resume rewriting from 2.48-5.36 to 0.04-0.24 per resume.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that the Grounded Optimization framework lowers the overall detected hallucination rate to 0.04-0.24 per resume across three LLMs and four temperature settings, down from 2.48-5.36 in undefended baselines. The framework consists of five layers: temporal context validation, deterministic contamination detection, structural invariant enforcement, prompt-level grounding, and an evaluator agent. Prompt-level grounding alone reaches zero detected hallucinations at low temperature with a capable model.
What carries the argument
The five-layer Grounded Optimization framework that enforces temporal validation, contamination detection, structural invariants, prompt grounding, and agent evaluation to block anachronistic injection, terminology contamination, structural mutation, and fabrication.
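As a rough illustration of what a temporal context validation layer might do, the sketch below flags a technology listed under a job that ended before the technology was publicly released. The `RELEASE_YEAR` table, the `Job` type, and `find_anachronisms` are hypothetical names chosen for this example; the paper's actual timeline construction is not reproduced here.

```python
from dataclasses import dataclass

# Illustrative release years only; a real system would need a curated table.
RELEASE_YEAR = {"Docker": 2013, "Kubernetes": 2014}


@dataclass
class Job:
    title: str
    start_year: int
    end_year: int
    bullets: list[str]


def find_anachronisms(job: Job) -> list[str]:
    """Return technologies mentioned in a job that ended before their release."""
    flagged = []
    text = " ".join(job.bullets).lower()
    for tech, year in RELEASE_YEAR.items():
        # Simple substring check; assumed sufficient for this toy example.
        if tech.lower() in text and job.end_year < year:
            flagged.append(tech)
    return flagged


job = Job("Systems Engineer", 2008, 2011, ["Deployed services with Kubernetes"])
print(find_anachronisms(job))  # ['Kubernetes']
```

A deterministic check of this kind does not depend on the model following instructions, which is presumably why the framework pairs it with prompt-level grounding.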
If this is right
- Temporal hallucinations fall by 50-95 percent across all tested conditions when the layers are active.
- Prompt-level grounding by itself produces zero detected hallucinations at low temperature with a capable instruction-following model.
- Higher temperatures and weaker models require the deterministic layers to maintain low hallucination rates.
- The approach was tested on 25 synthetic resumes spanning 14 industries using three LLMs and six layer configurations.
Where Pith is reading between the lines
- The released contamination taxonomy could be applied to measure similar hallucination patterns in LLM rewriting of other personal documents such as cover letters.
- The layered structure offers a template for adding deterministic checks to other high-stakes LLM document tasks where synthetic test results need real-world validation.
Load-bearing premise
The synthetic resumes and the independent hallucination detectors used in the ablation experiments accurately capture and measure the real-world hallucination types without significant false positives or missed cases.
What would settle it
Running the framework on a collection of real user resumes, then having domain experts compare each output against the original facts for the four hallucination types, would confirm or refute whether the measured reductions occur outside synthetic test data.
Original abstract
Large language models (LLMs) are increasingly applied to resume optimization for applicant tracking systems, introducing hallucination failures distinct from general text generation: anachronistic technology injection, cross-domain terminology contamination, structural mutation, and content fabrication. We present Grounded Optimization, a five-layer framework combining temporal context validation, deterministic contamination detection, structural invariant enforcement, prompt-level grounding, and an evaluator agent. In ablation experiments across three LLMs, four temperature settings, and six layer configurations on 25 synthetic resumes spanning 14 industries, undefended baselines produce 2.48-5.36 detected hallucinations per resume. Among detectors independent of the active defenses, temporal hallucinations are reduced by 50-95% across all conditions; overall detected hallucination rate falls to 0.04-0.24. Prompt-level grounding alone achieves zero detected hallucinations at low temperature with a capable instruction-following model; higher temperatures and weaker models reveal the need for the deterministic layers as a complement. We release the contamination taxonomy, evaluation code, and raw data.
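The per-resume rates are easier to interpret as raw counts. Assuming each rate is a mean over the 25 synthetic resumes in a single condition (an assumption; the abstract does not say how runs are aggregated), the sketch below converts the reported rates to the detection counts they imply. `implied_count` is a hypothetical helper.

```python
N_RESUMES = 25  # synthetic resumes per condition, from the abstract


def implied_count(rate_per_resume: float, n: int = N_RESUMES) -> int:
    """Convert a mean per-resume rate to the total detections it implies."""
    return round(rate_per_resume * n)


for rate in (2.48, 5.36, 0.04, 0.24):
    print(f"{rate:.2f} per resume -> {implied_count(rate)} detections")
```

Under that assumption the defended range corresponds to between 1 and 6 detections across 25 resumes, so a single false positive or missed case would shift a reported rate by 0.04.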
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Grounded Optimization, a five-layer framework (temporal context validation, deterministic contamination detection, structural invariant enforcement, prompt-level grounding, and evaluator agent) to reduce four specific hallucination types in LLM-based resume rewriting. Ablation experiments across three LLMs, four temperatures, and six layer configurations on 25 synthetic resumes report baseline rates of 2.48-5.36 detected hallucinations per resume falling to 0.04-0.24 overall, with prompt-level grounding alone reaching zero at low temperature on capable models. The contamination taxonomy, evaluation code, and raw data are released.
Significance. If the empirical reductions hold under validated measurement, the work supplies a practical, composable engineering framework for constraining hallucinations in domain-specific LLM applications such as HR document processing. The explicit release of the taxonomy, code, and raw data is a clear strength that enables direct reproduction and extension by others.
major comments (2)
- [Ablation experiments] The central quantitative claims (baseline 2.48-5.36 hallucinations/resume reduced to 0.04-0.24) rest on custom detectors for anachronistic injection, contamination, structural mutation, and fabrication applied to the 25 synthetic resumes. The manuscript provides no validation of these detectors (e.g., inter-annotator agreement, precision/recall against human labels on real resumes, or comparison to established hallucination benchmarks), which is load-bearing for the reported layer-wise reductions.
- [Data and experimental setup] The construction and representativeness of the 25 synthetic resumes spanning 14 industries are not described in sufficient detail to determine whether they faithfully elicit the targeted real-world hallucination modes or whether detector outputs could be artifacts of the synthetic generation process itself.
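The validation requested in the first major comment could take the form of precision and recall against human labels. A minimal sketch, assuming each hallucination is identified by a (resume id, term) pair; the identifiers and the `precision_recall` helper are hypothetical and are not part of the released evaluation code.

```python
def precision_recall(detected: set, labeled: set) -> tuple[float, float]:
    """Score detector output against human-labeled hallucinations."""
    true_pos = len(detected & labeled)
    # Convention assumed here: an empty set scores 1.0 rather than dividing by zero.
    precision = true_pos / len(detected) if detected else 1.0
    recall = true_pos / len(labeled) if labeled else 1.0
    return precision, recall


detected = {("r01", "Kubernetes"), ("r01", "HIPAA"), ("r02", "Docker")}
labeled = {("r01", "Kubernetes"), ("r02", "Docker"), ("r03", "PMP")}
p, r = precision_recall(detected, labeled)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```

Low precision would inflate the baseline counts, while low recall would understate the residual rate after the defenses are applied; both matter for the layer-wise comparison.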
minor comments (1)
- [Abstract and §4] The abstract states that 'detectors independent of the active defenses' were used, but the main text does not specify the exact protocol ensuring this independence across all six layer configurations.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. The feedback highlights important considerations for the validation of our hallucination detectors and the description of our synthetic dataset. We respond to each major comment below and commit to revisions that address the concerns raised.
Point-by-point responses
-
Referee: The central quantitative claims (baseline 2.48-5.36 hallucinations/resume reduced to 0.04-0.24) rest on custom detectors for anachronistic injection, contamination, structural mutation, and fabrication applied to the 25 synthetic resumes. The manuscript provides no validation of these detectors (e.g., inter-annotator agreement, precision/recall against human labels on real resumes, or comparison to established hallucination benchmarks), which is load-bearing for the reported layer-wise reductions.
Authors: We acknowledge the importance of validating the custom detectors. These detectors combine deterministic rules for temporal validation, contamination detection, and structural enforcement with an evaluator agent for fabrication. The full implementation is released with the paper to support reproducibility and external verification. We did not conduct inter-annotator agreement or human evaluation on real resumes due to challenges in accessing privacy-sensitive professional documents with reliable ground truth labels. We will add a new subsection in the discussion to explicitly address the detector design, its limitations, and the rationale for using synthetic data. Additionally, we will include a comparison to existing hallucination benchmarks where applicable to contextualize our approach.
revision: partial
-
Referee: The construction and representativeness of the 25 synthetic resumes spanning 14 industries are not described in sufficient detail to determine whether they faithfully elicit the targeted real-world hallucination modes or whether detector outputs could be artifacts of the synthetic generation process itself.
Authors: We agree that the synthetic resume construction requires more detailed exposition. The 25 resumes were generated to cover 14 industries with controlled variations designed to provoke each of the four hallucination types (e.g., inclusion of future-dated technologies for anachronisms). We will revise the experimental setup section to include the complete generation protocol, including the base templates, industry-specific adaptations, and the specific prompts or rules used to introduce hallucination-prone elements. This will allow readers to assess the fidelity to real-world scenarios and rule out generation artifacts.
revision: yes
Circularity Check
No circularity: empirical ablation results on synthetic data
The paper presents an engineering framework evaluated via ablation experiments on 25 synthetic resumes using custom detectors and a contamination taxonomy. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. Results are reported as measured outcomes (e.g., hallucination rates dropping from 2.48-5.36 to 0.04-0.24) rather than quantities defined by construction from the authors' inputs or prior work. The measurement approach relies on author-defined detectors, but this is an empirical limitation, not a circular reduction in any derivation chain. The paper is self-contained as an applied study without load-bearing self-referential steps.
A Temporal Context Validation Details
A.1 Timeline Construction
Given a resume $R$ with professional experience entries $E = \{e_1, \ldots, e_n\}$, where ...
... “S3” matches the AWS service but not substrings like “MS365” ...
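The “S3” versus “MS365” remark suggests token-boundary matching rather than plain substring search. A minimal sketch using Python's `re` module; the pattern construction and the `mentions_term` name are assumptions for illustration, not the paper's implementation.

```python
import re


def mentions_term(term: str, text: str) -> bool:
    """True if `term` appears as a standalone token, not inside a longer one."""
    # Lookarounds reject matches that have a letter or digit on either side.
    pattern = rf"(?<![A-Za-z0-9]){re.escape(term)}(?![A-Za-z0-9])"
    return re.search(pattern, text) is not None


print(mentions_term("S3", "Stored logs in S3 buckets"))   # True
print(mentions_term("S3", "Administered MS365 tenants"))  # False
```

Escaping the term with `re.escape` keeps names that contain regex metacharacters, such as `C++`, from being misread as patterns.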
[22]
Parse: LLM-based PDF-to-JSON conversion, producing structured resume data with typed fields (contact info, experience entries with dates and bullet points, education, skills, projects, certifications)
-
[23]
3.Rewrite: Multi-agent parallel optimization with all five defense layers active
Score: ATS scoring against the target job description, producing section-level feedback and an aggregate score. 3.Rewrite: Multi-agent parallel optimization with all five defense layers active
-
[24]
Re-Score: The optimized resume is scored again; if the score has not improved sufficiently, the rewrite stage is repeated (up to 5 cycles). E.2 Agent Specialization Five specialized agents run in parallel: •Summary Agent: Optimizes the professional summary •Skills Agent: Aligns skills with job requirements •Experience Agent: Rewrites professional experien...
work page 2017
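The Score, Rewrite, and Re-Score stages describe a bounded refinement loop. The sketch below shows one way to express that control flow, assuming caller-supplied `score` and `rewrite` callables, a hypothetical `min_gain` threshold standing in for "improved sufficiently", and comparison against the initial score; the source states only the 5-cycle cap.

```python
from typing import Callable

MAX_CYCLES = 5  # cap stated in the source


def optimize(resume: str,
             score: Callable[[str], float],
             rewrite: Callable[[str], str],
             min_gain: float = 5.0) -> tuple[str, int]:
    """Rewrite until the score gain is sufficient or the cycle cap is hit."""
    baseline = score(resume)
    current = resume
    for cycle in range(1, MAX_CYCLES + 1):
        current = rewrite(current)
        if score(current) - baseline >= min_gain:
            return current, cycle
    return current, MAX_CYCLES


# Toy stand-ins: each rewrite appends a marker and the score counts markers.
result, cycles = optimize("resume", lambda r: r.count("+") * 2.0, lambda r: r + "+")
print(cycles)  # 3
```

The hard cap bounds cost and also limits how many times a rewrite can drift from the source document, which matters when each pass is a fresh opportunity for fabrication.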