
arxiv: 2607.01457 · v1 · pith:76EHTETYnew · submitted 2026-07-01 · 💻 cs.CL · cs.AI

Grounded Optimization: A Layered Engineering Framework for Reducing LLM Hallucination in Automated Personal Document Rewriting

Pith reviewed 2026-07-03 21:02 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM hallucination · resume optimization · grounded optimization · prompt grounding · hallucination mitigation · document rewriting · AI reliability

The pith

A five-layer framework reduces detected hallucinations in LLM resume rewriting from 2.48-5.36 to 0.04-0.24 per resume.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Grounded Optimization as a five-layer framework to cut specific hallucination failures when LLMs rewrite resumes for applicant tracking systems. The failures include anachronistic technology claims, cross-domain term mixing, structural changes, and outright fabrications. Ablation tests on 25 synthetic resumes across 14 industries show that baselines produce multiple hallucinations while the full framework and even its prompt-grounding layer alone bring rates near zero under suitable conditions. A reader would care because these tools now handle job-critical documents where errors can directly affect hiring outcomes.

Core claim

The authors report that the Grounded Optimization framework lowers the overall detected hallucination rate to 0.04-0.24 per resume across three LLMs and four temperature settings, down from 2.48-5.36 in undefended baselines. The framework consists of temporal context validation, deterministic contamination detection, structural invariant enforcement, prompt-level grounding, and an evaluator agent. Prompt-level grounding alone reaches zero detected hallucinations at low temperature with a capable model.

What carries the argument

The argument rests on the five-layer Grounded Optimization framework, which applies temporal validation, contamination detection, structural invariants, prompt grounding, and agent evaluation to block anachronistic injection, terminology contamination, structural mutation, and fabrication.
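To make the temporal validation layer concrete, here is a minimal sketch of one way an anachronism check could work: flag any technology mentioned under a job entry whose first release postdates that job's end year. This is an illustration under stated assumptions, not the authors' implementation. The `RELEASE_YEAR` table, the `ExperienceEntry` type, and the function name are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Hypothetical release-year table for illustration only.
# A real system would need a curated, maintained source.
RELEASE_YEAR = {
    "Docker": 2013,
    "Kubernetes": 2014,
    "ChatGPT": 2022,
}

@dataclass
class ExperienceEntry:
    # Hypothetical schema: one job on the resume.
    title: str
    start_year: int
    end_year: int
    bullets: list = field(default_factory=list)

def find_anachronisms(entry):
    """Return (technology, release_year) pairs that postdate the entry's end year."""
    text = " ".join(entry.bullets)
    flagged = []
    for tech, year in RELEASE_YEAR.items():
        # Match the term only as a whole token, not inside a longer word.
        pattern = rf"(?<!\w){re.escape(tech)}(?!\w)"
        if re.search(pattern, text, re.IGNORECASE) and year > entry.end_year:
            flagged.append((tech, year))
    return flagged
```

A check like this is deterministic, so it behaves the same at any sampling temperature, which is consistent with the page's point that the deterministic layers matter most when the model is weaker or the temperature is higher.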

If this is right

  • Temporal hallucinations fall by 50-95 percent across all tested conditions when the layers are active.
  • Prompt-level grounding by itself produces zero detected hallucinations at low temperature with a capable instruction-following model.
  • Higher temperatures and weaker models require the deterministic layers to maintain low hallucination rates.
  • The approach was tested on 25 synthetic resumes spanning 14 industries using three LLMs and six layer configurations.
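The headline numbers can be unpacked with some simple arithmetic. The sketch below assumes a full factorial design with one rewrite per resume per condition, and assumes each reported rate is a plain mean over the 25 resumes. Neither assumption is confirmed by the abstract, so the derived counts are indicative only.

```python
RESUMES = 25
LLMS, TEMPERATURES, CONFIGS = 3, 4, 6

# Assumption: every LLM x temperature x configuration combination was run.
conditions = LLMS * TEMPERATURES * CONFIGS
# Assumption: one rewrite per resume per condition.
rewrites = conditions * RESUMES

def implied_count(rate, n=RESUMES):
    """Total detections implied by a per-resume mean rate over n resumes."""
    return round(rate * n)

def relative_reduction(baseline, defended):
    """Fractional drop from a baseline rate to a defended rate."""
    return 1 - defended / baseline

# Pairing the range endpoints gives loose bounds, not matched-condition results.
weakest_pairing = relative_reduction(2.48, 0.24)
strongest_pairing = relative_reduction(5.36, 0.04)
```

Under these assumptions, a rate of 0.04 corresponds to a single detected hallucination across all 25 resumes in a condition, so the low end of the defended range is sensitive to one detector hit or miss.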

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The released contamination taxonomy could be applied to measure similar hallucination patterns in LLM rewriting of other personal documents such as cover letters.
  • The layered structure offers a template for adding deterministic checks to other high-stakes LLM document tasks where synthetic test results need real-world validation.

Load-bearing premise

The synthetic resumes and the independent hallucination detectors used in the ablation experiments accurately capture and measure the real-world hallucination types without significant false positives or missed cases.

What would settle it

Running the framework on a collection of real user resumes, then having domain experts compare each output against the original facts for the four hallucination types, would confirm or refute whether the measured reductions occur outside synthetic test data.

Figures

Figures reproduced from arXiv: 2607.01457 by Adarsh Agrawal, Shashank Indukuri.

Figure 1: Five-layer defense-in-depth architecture. Each layer addresses a distinct hallucination …
Original abstract

Large language models (LLMs) are increasingly applied to resume optimization for applicant tracking systems, introducing hallucination failures distinct from general text generation: anachronistic technology injection, cross-domain terminology contamination, structural mutation, and content fabrication. We present Grounded Optimization, a five-layer framework combining temporal context validation, deterministic contamination detection, structural invariant enforcement, prompt-level grounding, and an evaluator agent. In ablation experiments across three LLMs, four temperature settings, and six layer configurations on 25 synthetic resumes spanning 14 industries, undefended baselines produce 2.48-5.36 detected hallucinations per resume. Among detectors independent of the active defenses, temporal hallucinations are reduced by 50-95% across all conditions; overall detected hallucination rate falls to 0.04-0.24. Prompt-level grounding alone achieves zero detected hallucinations at low temperature with a capable instruction-following model; higher temperatures and weaker models reveal the need for the deterministic layers as a complement. We release the contamination taxonomy, evaluation code, and raw data.
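The deterministic contamination detection mentioned in the abstract can be pictured as a dictionary lookup with strict token boundaries. The sketch below flags terms from other domains that appear in the rewrite but not in the source resume. The `TAXONOMY` contents, the domain labels, and the function names are hypothetical; the released taxonomy is not reproduced here.

```python
import re

# Hypothetical two-domain taxonomy for illustration only.
TAXONOMY = {
    "cloud": ["S3", "Kubernetes", "Terraform"],
    "nursing": ["triage", "patient charting"],
}

def contains_term(text, term):
    """True if term appears as a standalone token (not inside a longer one)."""
    pattern = rf"(?<![A-Za-z0-9]){re.escape(term)}(?![A-Za-z0-9])"
    return re.search(pattern, text, re.IGNORECASE) is not None

def detect_contamination(source, rewrite, resume_domain):
    """Return (domain, term) pairs introduced by the rewrite from other domains."""
    hits = []
    for domain, terms in TAXONOMY.items():
        if domain == resume_domain:
            continue  # terms from the resume's own domain are allowed
        for term in terms:
            # Only flag terms the rewrite added; terms already in the source stay.
            if contains_term(rewrite, term) and not contains_term(source, term):
                hits.append((domain, term))
    return hits
```

The boundary check matters for short terms: with this pattern, "S3" is matched as a standalone token but not inside a string such as "MS365".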

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Grounded Optimization, a five-layer framework (temporal context validation, deterministic contamination detection, structural invariant enforcement, prompt-level grounding, and evaluator agent) to reduce four specific hallucination types in LLM-based resume rewriting. Ablation experiments across three LLMs, four temperatures, and six layer configurations on 25 synthetic resumes report baseline rates of 2.48-5.36 detected hallucinations per resume falling to 0.04-0.24 overall, with prompt-level grounding alone reaching zero at low temperature on capable models. The contamination taxonomy, evaluation code, and raw data are released.

Significance. If the empirical reductions hold under validated measurement, the work supplies a practical, composable engineering framework for constraining hallucinations in domain-specific LLM applications such as HR document processing. The explicit release of the taxonomy, code, and raw data is a clear strength that enables direct reproduction and extension by others.

major comments (2)
  1. [Ablation experiments] The central quantitative claims (baseline 2.48-5.36 hallucinations/resume reduced to 0.04-0.24) rest on custom detectors for anachronistic injection, contamination, structural mutation, and fabrication applied to the 25 synthetic resumes. The manuscript provides no validation of these detectors (e.g., inter-annotator agreement, precision/recall against human labels on real resumes, or comparison to established hallucination benchmarks), which is load-bearing for the reported layer-wise reductions.
  2. [Data and experimental setup] The construction and representativeness of the 25 synthetic resumes spanning 14 industries are not described in sufficient detail to determine whether they faithfully elicit the targeted real-world hallucination modes or whether detector outputs could be artifacts of the synthetic generation process itself.
minor comments (1)
  1. [Abstract and §4] The abstract states that 'detectors independent of the active defenses' were used, but the main text does not specify the exact protocol ensuring this independence across all six layer configurations.
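The detector validation requested in the first major comment could take the form of precision and recall against human labels. The following is a minimal sketch assuming both the detector output and the human annotations are available as sets of `(resume_id, flagged_term)` pairs. That representation, and the convention of returning 1.0 for an empty denominator, are assumptions made for illustration.

```python
def precision_recall(detected, labeled):
    """Compare detector output with human labels.

    Both arguments are sets of (resume_id, flagged_term) pairs.
    Returns (precision, recall); an empty denominator yields 1.0 by convention.
    """
    true_positives = len(detected & labeled)
    precision = true_positives / len(detected) if detected else 1.0
    recall = true_positives / len(labeled) if labeled else 1.0
    return precision, recall

# Invented example data, not taken from the paper.
detected = {(1, "Kubernetes"), (1, "S3"), (2, "PMP")}
labeled = {(1, "Kubernetes"), (2, "PMP"), (3, "Six Sigma")}
precision, recall = precision_recall(detected, labeled)
```

Low precision would mean the baseline counts are inflated by false positives, while low recall would mean the near-zero defended rates partly reflect missed cases. Both possibilities bear directly on the reported reductions.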

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. The feedback highlights important considerations for the validation of our hallucination detectors and the description of our synthetic dataset. We respond to each major comment below and commit to revisions that address the concerns raised.

Point-by-point responses
  1. Referee: The central quantitative claims (baseline 2.48-5.36 hallucinations/resume reduced to 0.04-0.24) rest on custom detectors for anachronistic injection, contamination, structural mutation, and fabrication applied to the 25 synthetic resumes. The manuscript provides no validation of these detectors (e.g., inter-annotator agreement, precision/recall against human labels on real resumes, or comparison to established hallucination benchmarks), which is load-bearing for the reported layer-wise reductions.

    Authors: We acknowledge the importance of validating the custom detectors. These detectors combine deterministic rules for temporal validation, contamination detection, and structural enforcement with an evaluator agent for fabrication. The full implementation is released with the paper to support reproducibility and external verification. We did not conduct inter-annotator agreement or human evaluation on real resumes due to challenges in accessing privacy-sensitive professional documents with reliable ground truth labels. We will add a new subsection in the discussion to explicitly address the detector design, its limitations, and the rationale for using synthetic data. Additionally, we will include a comparison to existing hallucination benchmarks where applicable to contextualize our approach. revision: partial

  2. Referee: The construction and representativeness of the 25 synthetic resumes spanning 14 industries are not described in sufficient detail to determine whether they faithfully elicit the targeted real-world hallucination modes or whether detector outputs could be artifacts of the synthetic generation process itself.

    Authors: We agree that the synthetic resume construction requires more detailed exposition. The 25 resumes were generated to cover 14 industries with controlled variations designed to provoke each of the four hallucination types (e.g., inclusion of future-dated technologies for anachronisms). We will revise the experimental setup section to include the complete generation protocol, including the base templates, industry-specific adaptations, and the specific prompts or rules used to introduce hallucination-prone elements. This will allow readers to assess the fidelity to real-world scenarios and rule out generation artifacts. revision: yes
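The first response describes deterministic rules for structural enforcement. One plausible form of such a rule is a field-by-field comparison between the parsed original and the rewritten resume, sketched below. The dictionary schema and the choice of which fields count as invariant (`employer`, `title`, `start`, `end`) are assumptions, not details given on this page.

```python
# Assumption: these fields must never change during a rewrite.
INVARIANT_FIELDS = ("employer", "title", "start", "end")

def structural_violations(original, rewritten):
    """List structural changes between two parsed resumes (hypothetical schema)."""
    old_entries = original["experience"]
    new_entries = rewritten["experience"]
    violations = []
    if len(old_entries) != len(new_entries):
        # Entries were added or dropped; a field-level comparison is not meaningful.
        violations.append(
            f"entry count changed: {len(old_entries)} -> {len(new_entries)}"
        )
        return violations
    for index, (old, new) in enumerate(zip(old_entries, new_entries)):
        for name in INVARIANT_FIELDS:
            if old[name] != new[name]:
                violations.append(
                    f"entry {index}: {name} changed from {old[name]!r} to {new[name]!r}"
                )
    return violations
```

Because the check compares structured fields rather than free text, it leaves bullet wording free to change while still catching a rewritten job title or shifted date.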

Circularity Check

0 steps flagged

No circularity: empirical ablation results on synthetic data

Full rationale

The paper presents an engineering framework evaluated via ablation experiments on 25 synthetic resumes using custom detectors and a contamination taxonomy. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. Results are reported as measured outcomes (e.g., hallucination rates dropping from 2.48-5.36 to 0.04-0.24) rather than quantities defined by construction from the authors' inputs or prior work. The measurement approach relies on author-defined detectors, but this is an empirical limitation, not a circular reduction in any derivation chain. The paper is self-contained as an applied study without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the validity of the hallucination taxonomy and the synthetic dataset construction, neither of which is detailed in the abstract; no free parameters, axioms, or invented entities are explicitly introduced in the provided text.

pith-pipeline@v0.9.1-grok · 5718 in / 1171 out tokens · 17310 ms · 2026-07-03T21:02:59.835656+00:00 · methodology

