Beyond Supervised Clarification: Input Rewriting with LLMs for Dialogue Discourse Parsing
Pith reviewed 2026-07-03 14:58 UTC · model grok-4.3
The pith
Unsupervised LLM rewriting for dialogue discourse parsing introduces more regressions than repairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across three SDRT datasets and multiple parsers, last-utterance clarification via zero-shot prompting or frozen-parser feedback is far less reliable than supervised settings suggest. Parser-agnostic rewriting introduces more regressions than repairs because edits that resolve ellipsis or references also disrupt existing discourse relations. Best-of-8 rewriting reveals a practical ceiling where a large fraction of errors cannot be repaired through input rewriting alone. A parser-aware clarifier trained with GRPO reduces regressions by up to 37 percent through conservative abstention yet still fails to deliver consistent parsing gains. These results recast clarification as a selective intervention problem.
What carries the argument
Rewritability prediction, the decision of whether an utterance is repairable before any rewriting intervention is applied.
If this is right
- Clarification must be applied selectively rather than as a default step in the pipeline.
- Zero-shot LLM rewriting cannot be assumed to improve discourse parsing accuracy.
- Training clarifiers to abstain reduces regressions but does not guarantee net parsing gains.
- Agentic systems that rely on input rewriting need an explicit rewritability decision mechanism.
Where Pith is reading between the lines
- If rewritability prediction proves learnable, the same gating approach could be tested on other frozen downstream models beyond discourse parsing.
- The observed ceiling on repairable errors suggests that input-side optimization alone may need to be combined with model-side adaptations for further gains.
- One could measure whether rewritability predictors transfer across different parsers or datasets without retraining.
Load-bearing premise
Zero-shot prompting or feedback from a frozen parser can produce clarifications that fix errors without introducing new ones under realistic deployment conditions.
What would settle it
Train a rewritability predictor on held-out data and measure whether gating rewriting attempts behind its predictions produces higher overall parsing accuracy than applying rewriting to every utterance.
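A minimal sketch of that comparison, assuming per-utterance parser F1 has already been computed for both the original and the rewritten input. The `Example` record, its field names, and the `gate` flag (the output of a hypothetical rewritability predictor) are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Example:
    f1_original: float   # parser F1 on the unmodified last utterance
    f1_rewritten: float  # parser F1 after rewriting the last utterance
    gate: bool           # hypothetical predictor output: True means "rewrite"


def compare_policies(examples: Sequence[Example]) -> Dict[str, float]:
    """Mean parser F1 under three policies: never, always, and gated rewriting."""
    if not examples:
        raise ValueError("need at least one example")
    n = len(examples)
    never = sum(e.f1_original for e in examples) / n
    always = sum(e.f1_rewritten for e in examples) / n
    # Gated policy: use the rewrite only where the predictor allows it.
    gated = sum(e.f1_rewritten if e.gate else e.f1_original for e in examples) / n
    return {"never": never, "always": always, "gated": gated}
```

Under this framing, the test is passed only if `gated` exceeds both `never` and `always` on held-out data.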
Original abstract
Rewriting inputs to improve frozen downstream models has become a common strategy in modern NLP pipelines. Prior work on incremental dialogue discourse parsing (DDP) shows that supervised clarification models can rewrite fragmentary or underspecified utterances, such as resolving ellipsis or references, to improve parsing accuracy. In this work, we revisit this idea under realistic deployment conditions, where no clarification supervision is available and the clarifier must rely on zero-shot prompting or feedback from a frozen parser. Across three Segmented Discourse Representation Theory (SDRT) datasets and multiple parsers, we find that last-utterance clarification is far less reliable than suggested by supervised settings. Parser-agnostic rewriting often introduces more regressions than repairs, as edits that enable fixes also disrupt discourse cues relied upon by the parser. A best-of-8 rewriting analysis further reveals a practical ceiling: a large fraction of errors are not repairable through input rewriting alone. A parser-aware clarifier trained with GRPO reduces regressions by up to 37% by learning conservative abstention, yet still fails to produce selectivity-aware clarifications that consistently improve parsing. Together, these findings recast clarification as a selective intervention problem. We identify rewritability prediction, deciding whether an utterance is repairable before intervention, as the key missing capability for input-side optimization of frozen discourse parsers, and a critical direction for improving agentic pipelines more broadly.
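The best-of-8 ceiling can be illustrated with a short sketch. It assumes oracle selection over the sampled rewrites and treats an error as unrepairable when no candidate raises parser F1 above the original; the function names and that definition are assumptions made for illustration, not the paper's code.

```python
from typing import Sequence, Tuple

# One record per utterance: (F1 on the original, F1 for each sampled rewrite).
Record = Tuple[float, Sequence[float]]


def oracle_best_of_n(f1_original: float, f1_candidates: Sequence[float]) -> float:
    """Best F1 reachable if an oracle picks among the original and all rewrites."""
    return max([f1_original, *f1_candidates])


def unrepairable_fraction(records: Sequence[Record], perfect: float = 1.0) -> float:
    """Among utterances parsed imperfectly, the share no candidate improves."""
    errors = [(orig, cands) for orig, cands in records if orig < perfect]
    if not errors:
        return 0.0
    stuck = sum(1 for orig, cands in errors if oracle_best_of_n(orig, cands) <= orig)
    return stuck / len(errors)
```

Because the oracle sees the gold score, the resulting fraction is an upper bound on what any selection strategy could repair from the same candidates.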
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that under realistic unsupervised conditions, LLM-based input rewriting for improving frozen dialogue discourse parsers (using zero-shot prompting or frozen parser feedback) is unreliable: parser-agnostic rewriting introduces more regressions than repairs across three SDRT datasets because edits disrupt discourse cues, a best-of-8 analysis shows a substantial fraction of errors are unrepairable by rewriting alone, and even a GRPO-trained parser-aware clarifier reduces regressions by up to 37% via conservative abstention but fails to yield consistent parsing improvements. The work reframes clarification as a selective intervention problem and identifies rewritability prediction as the key missing capability.
Significance. If the empirical results hold under the tested conditions, the paper makes a useful contribution by providing concrete evidence of the limitations of unsupervised rewriting strategies in discourse parsing pipelines. The dataset- and parser-spanning nature of the regression findings, the identification of a practical ceiling via best-of-8 sampling, and the constructive reframing toward rewritability prediction are strengths. The GRPO approach for learning from parser feedback is noted as a reproducible element.
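For context, GRPO (Group Relative Policy Optimization) scores each sampled output relative to the other samples drawn for the same prompt. The sketch below shows only that group-relative normalization, assuming one scalar reward per sampled rewrite (for example, parser F1). The function name, the use of the population standard deviation, and the `eps` constant are illustrative choices, not details taken from the paper.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    if not rewards:
        raise ValueError("need at least one reward")
    mean = fmean(rewards)
    std = pstdev(rewards)
    # eps avoids division by zero when every sample earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal under this normalization.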
major comments (3)
- [Abstract / §5 Results] Abstract and experimental results: the central claim that parser-agnostic rewriting 'often introduces more regressions than repairs' is load-bearing, yet the manuscript must specify the exact operationalization (e.g., delta in parsing F1, error-type counts, or utterance-level success/failure) and report per-dataset/per-parser breakdowns with confidence intervals to substantiate the 'net regressions' conclusion.
- [§5.2 Best-of-8 analysis] Best-of-8 rewriting analysis: the claim of a 'practical ceiling' with 'a large fraction of errors not repairable' directly supports the paper's recasting of the problem; the manuscript should detail the selection criterion (oracle vs. parser-score based), the precise fraction, and whether additional samples beyond 8 were tested, as these details determine whether the ceiling is methodological or fundamental.
- [§6 GRPO clarifier] GRPO-trained clarifier: the reported 'up to 37%' regression reduction is presented as partial mitigation, but the manuscript must clarify whether this yields net parsing gains on any dataset or merely reduced harm, and whether the training loop introduces any feedback leakage from the frozen parser; this distinction is required to support the claim that it 'still fails to produce selectivity-aware clarifications that consistently improve parsing'.
minor comments (2)
- [Abstract] The acronym GRPO should be expanded on first use.
- [Abstract] Clarify whether 'last-utterance clarification' operates on the final utterance in isolation or incorporates full dialogue history.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the contribution. We agree that additional precision on operational definitions, selection criteria, and outcome distinctions will strengthen the manuscript and will incorporate these clarifications in the revision.
Point-by-point responses
- Referee: [Abstract / §5 Results] Abstract and experimental results: the central claim that parser-agnostic rewriting 'often introduces more regressions than repairs' is load-bearing, yet the manuscript must specify the exact operationalization (e.g., delta in parsing F1, error-type counts, or utterance-level success/failure) and report per-dataset/per-parser breakdowns with confidence intervals to substantiate the 'net regressions' conclusion.
  Authors: We operationalize net regressions via utterance-level changes in parsing F1: a rewrite counts as a regression when post-rewrite F1 is lower than the original input F1 (and a repair when higher). We will add per-dataset and per-parser tables reporting mean F1 deltas together with 95% bootstrap confidence intervals in the revised §5 to make this explicit and to substantiate the aggregate claim.
  Revision: yes
- Referee: [§5.2 Best-of-8 analysis] Best-of-8 rewriting analysis: the claim of a 'practical ceiling' with 'a large fraction of errors not repairable' directly supports the paper's recasting of the problem; the manuscript should detail the selection criterion (oracle vs. parser-score based), the precise fraction, and whether additional samples beyond 8 were tested, as these details determine whether the ceiling is methodological or fundamental.
  Authors: The best-of-8 analysis employs oracle selection (the rewrite among the eight that produces the largest F1 gain relative to the original). We will report the exact fraction of errors that remain unrepairable even under this oracle selection and confirm that sampling was limited to eight candidates; no further samples were evaluated because the analysis was intended only to establish an empirical upper bound rather than to optimize sampling strategy.
  Revision: yes
- Referee: [§6 GRPO clarifier] GRPO-trained clarifier: the reported 'up to 37%' regression reduction is presented as partial mitigation, but the manuscript must clarify whether this yields net parsing gains on any dataset or merely reduced harm, and whether the training loop introduces any feedback leakage from the frozen parser; this distinction is required to support the claim that it 'still fails to produce selectivity-aware clarifications that consistently improve parsing'.
  Authors: The GRPO clarifier reduces the number of regressions (by up to 37%) through learned abstention but produces no net parsing F1 gains on any of the three datasets; overall performance remains statistically indistinguishable from the no-rewriting baseline. The training loop uses only scalar F1 feedback from the frozen parser as a reward signal and performs no updates to the parser itself, so no parameter leakage occurs. We will add these explicit statements to §6.
  Revision: yes
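The utterance-level operationalization described in the first response, together with a percentile bootstrap for the mean F1 delta, could be computed along these lines. This is a sketch under assumptions: the function names, the resample count, and the fixed seed are illustrative, and the authors' actual bootstrap procedure is not specified.

```python
import random
from typing import Dict, Sequence, Tuple


def classify(f1_before: float, f1_after: float) -> str:
    """Label one rewrite by the direction of its utterance-level F1 change."""
    if f1_after > f1_before:
        return "repair"
    if f1_after < f1_before:
        return "regression"
    return "unchanged"


def outcome_counts(pairs: Sequence[Tuple[float, float]]) -> Dict[str, int]:
    """Count repairs, regressions, and unchanged cases over (before, after) pairs."""
    counts = {"repair": 0, "regression": 0, "unchanged": 0}
    for before, after in pairs:
        counts[classify(before, after)] += 1
    return counts


def bootstrap_ci(
    deltas: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-utterance F1 deltas."""
    if not deltas:
        raise ValueError("need at least one delta")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(deltas)
    # Resample utterances with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

An interval for the mean delta that lies entirely below zero would support the net-regression reading for that dataset and parser; an interval spanning zero would not.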
Circularity Check
No significant circularity detected
The paper presents purely empirical findings from experiments on three SDRT datasets and multiple parsers, comparing zero-shot LLM rewriting, best-of-8 sampling, and a GRPO-trained clarifier. Claims about regressions, repair ceilings, and the need for rewritability prediction rest on direct measurement of parsing accuracy changes rather than any derivation, equation, or self-referential construction. No self-citations are used to justify uniqueness or load-bearing premises, and no fitted parameters are relabeled as predictions. The work is self-contained against external benchmarks.
Reference graph
Works this paper leans on
- [1] Dialogue discourse parsing as generation: A sequence-to-sequence LLM-based approach. In Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 1–14, Kyoto, Japan. Association for Computational Linguistics. Jiaqi Li, Ming Liu, Min-Yen Kan, Zihao Zheng, Zekun Wang, Wenqiang Lei, Ting Liu, and Bing Qin. 2020. M...
- [2] Improving contextual query rewrite for conversational AI agents through user-preference feedback learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 432–439, Singapore. Association for Computational Linguistics. Kate Thompson, Akshay Chaturvedi, Julie Hunter, and Nicholas Asher. 202...
- [3] BERTScore: Evaluating Text Generation with BERT
  Q-PRM: Adaptive query rewriting for retrieval-augmented generation via step-level process supervision. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15113–15128, Suzhou, China. Association for Computational Linguistics. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Ev...
- [4] Identify if the last utterance has the following form-only issue:
  - A high-confidence typo/misspelling with a single obvious correction (no guessing)
- [5] Then choose one of the following actions:
  - If there is an above issue and the fix is safe, output the minimal allowed rewrite to fix only those issues.
  - Otherwise, output the last utterance exactly unchanged.
  Allowed edits:
  - Fix high-confidence typos with one obvious correction.
  Important:
  - If you are not fully confident the edit is meaning-preserving...
- [6] Identify if the last utterance has the following normalization issue:
  - It contains informal chat shorthand/abbreviation/slang that has ONE clear, widely accepted expansion in this context (no plausible alternative meanings)
- [7] Then choose one of the following actions:
  - If there is an above issue and the fix is safe, output the minimal allowed rewrite to fix only those issues.
  - Otherwise, output the last utterance exactly unchanged.
  Allowed edits:
  - Expand the shorthand token(s) to their single canonical expansion.
  - Make the minimum number of expansions needed.
  Important:
  - I...
- [8] Identify if the last utterance has the following explicitness issue caused by ellipsis/fragmentation:
  - The last utterance is NOT a complete, stand-alone clause on its own (e.g., a fragment, short answer, or bare reply), AND
  - The missing material (predicate / object / complement) is uniquely and explicitly recoverable from the dialogue context, AND
  - Ad...
- [9] Then choose one of the following actions:
  - If there is an above issue and the fix is safe, output the minimal allowed rewrite to fix only those issues.
  - Otherwise, output the last utterance exactly unchanged.
  Allowed edits:
  - Complete short answers/fragments by copying ONLY explicitly stated information from the context.
  - YES/NO answers: If the previou...
- [10] Identify if the last utterance has the following coreference issue:
  - The last utterance contains a pronoun or deictic (e.g., it/this/that/these/those/he/she/they/him/her/them), AND
  - The dialogue context contains exactly one unambiguous antecedent for it that is explicitly mentioned
- [11] Then choose one of the following actions:
  - If there is an above issue and the fix is safe, output the minimal allowed rewrite to fix only those issues.
  - Otherwise, output the last utterance exactly unchanged.
  Allowed edits:
  - Replace the pronoun/deictic with its antecedent ONLY when the antecedent is explicit and unambiguous.
  - Make the minimum number o...
- [12] Identify if the last utterance has the following discourse-marker issue:
  - The last utterance's intended discourse move is clear from the context, but the relation to the immediately preceding utterance is left implicit (e.g., topic shift, continuation/addition, contrast, consequence/next-step), AND
  - Adding a very short discourse marker would make this r...
- [13] Then choose one of the following actions:
  - If there is an above issue and the fix is safe, output the minimal allowed rewrite to fix only those issues.
  - Otherwise, output the last utterance exactly unchanged.
  Allowed edits:
  - Add AT MOST ONE short discourse marker phrase, typically at the beginning of the last utterance, you may choose from the followin...
- [14] Identify if the last utterance has ANY of the following issues:
  - Form-only: a high-confidence typo/misspelling with one obvious correction.
  - Normalization: informal shorthand/abbreviation/slang with ONE clear expansion in this context.
  - Explicitness: ellipsis/fragment/underspecified answer where the missing words are uniquely recoverable from context a...
- [15] Then choose one of the following actions:
  - If there is an above issue and the fix is safe, output the minimal allowed rewrite to fix only those issues.
  - Otherwise, output the last utterance exactly unchanged.
  Allowed edits:
  - Fix high-confidence typos (one obvious correction).
  - Expand shorthand tokens to their single canonical expansion.
  - Complete ell...
- [16] Identify if the last utterance has ambiguity or implicitness that could cause misunderstanding (e.g., typo, abbreviations, slang, vague references, incomplete phrasing, etc.)
- [17] Then choose one of the following actions:
  - If yes and you can safely preserve meaning, rewrite the last utterance to be clearer.
  - Otherwise, output the last utterance exactly unchanged.
  Important:
  - Do NOT add new facts or information not explicitly in the context.
  - Do NOT change meaning and intent.
  - Do NOT change stance/tone/sentiment.
  - The rewrite ...