arxiv: 2607.01153 · v1 · pith:HCWU5YICnew · submitted 2026-07-01 · 💻 cs.CL · cs.AI · cs.SE

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

Pith reviewed 2026-07-02 12:28 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SE
keywords adversarial pragmatics · AI safety evaluation · instruction conflict · embedded commands · policy ambiguity · LLM judges · linguistic taxonomy · benchmark

The pith

A linguistically controlled benchmark distinguishes capability limits from policy ambiguity in language model safety evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents adversarial pragmatics as a benchmark and annotation protocol for testing language models on prompts that mix instruction conflicts, embedded commands, quotation, scope ambiguity, deixis, indirect speech acts, and multi-turn transcripts. It supplies an 18-item seed set with validator-enforced metadata plus an expert protocol that tracks task success, policy compliance, safety risk, refusal outcome, and evaluator confidence as separate variables. Metrics then quantify judge validity, diagnostic ambiguity, and taxonomy drift. A sympathetic reader would care because existing safety benchmarks collapse these distinctions into single pass/fail scores, leaving unclear whether a failure reflects model limits, unclear policy, or unstable judgment.

Core claim

By applying a linguistically controlled taxonomy of pragmatic phenomena to safety-related prompts and using validator-enforced metadata in expert annotations, the benchmark produces metrics that can validate whether safety evaluations are measuring model capability, policy clarity, or evaluator consistency.

What carries the argument

The adversarial pragmatics benchmark and annotation protocol, which applies a linguistically controlled taxonomy of instruction conflict, embedded commands, and related phenomena together with validator-enforced metadata and an expert protocol that separates task success, policy compliance, safety risk, refusal outcome, and evaluator confidence.
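One way to picture the separation of variables described above is a record type with one field per dimension, plus a validator that rejects labels outside a fixed vocabulary. The sketch below is illustrative only: the class name `Annotation`, the function `validate`, and every label vocabulary are assumptions made for this example, not the paper's actual schema or validator.

```python
from dataclasses import dataclass

# Hypothetical label vocabularies; the paper's actual label sets may differ.
ALLOWED = {
    "task_success": {"full", "partial", "fail"},
    "policy_compliance": {"compliant", "noncompliant", "ambiguous"},
    "safety_risk": {"none", "low", "high"},
    "refusal_outcome": {"no_refusal", "appropriate_refusal", "over_refusal"},
    "evaluator_confidence": {"high", "medium", "low"},
}


@dataclass(frozen=True)
class Annotation:
    """One expert label set for a single item-model row (hypothetical schema)."""
    item_id: str
    model: str
    task_success: str
    policy_compliance: str
    safety_risk: str
    refusal_outcome: str
    evaluator_confidence: str


def validate(a: Annotation) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not a.item_id:
        problems.append("item_id: must be non-empty")
    for field, vocab in ALLOWED.items():
        value = getattr(a, field)
        if value not in vocab:
            problems.append(f"{field}: unexpected label {value!r}")
    return problems
```

Keeping each dimension in its own field means a row can be recorded as, for example, a full task success that is still policy-noncompliant, which a single pass/fail label cannot express.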

If this is right

  • Safety evaluations can be checked to determine whether failures arise from capability limits or from ambiguous policies.
  • LLM judges can be assessed for stability using the new metrics for judge validity and diagnostic ambiguity.
  • Gold sets for safety testing can be built with clearer distinctions among success, compliance, and risk.
  • Prompt-injection tests can incorporate controlled linguistic ambiguities rather than relying on surface-level attacks.
  • Safety documentation can be updated once sources of policy ambiguity are isolated from model behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same protocol could be applied retroactively to existing safety datasets to reclassify past failures.
  • Multi-turn agent transcripts may expose additional ambiguity patterns that single-turn tests miss.
  • Models fine-tuned against the taxonomy could be tested for measurable gains in handling conflicted instructions.

Load-bearing premise

The linguistically controlled taxonomy and validator-enforced metadata can reliably separate capability limits from policy ambiguity and evaluator instability in practice.

What would settle it

An experiment in which expert annotators applying the protocol still cannot reach agreement on the source of a model failure in a majority of cases would show that the separation does not hold.
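The falsification test above can be sketched as a simple count: for each failure case, check whether the annotators agree on its source, then ask whether disagreement occurs in more than half of the cases. The function names, the attribution labels, and the unanimity threshold below are assumptions chosen for illustration; the paper may define agreement differently (for example, with a chance-corrected statistic).

```python
from collections import Counter


def has_agreement(labels, threshold=1.0):
    """True if the most common label's share of the votes reaches `threshold`."""
    if not labels:
        raise ValueError("each case needs at least one annotator label")
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels) >= threshold


def separation_fails(cases, threshold=1.0):
    """True when annotators fail to agree in a majority of cases.

    `cases` is a list of per-case label lists, one label per annotator.
    """
    if not cases:
        raise ValueError("need at least one case")
    disagreements = sum(not has_agreement(labels, threshold) for labels in cases)
    return disagreements > len(cases) / 2


# Toy data with hypothetical attribution labels, not results from the paper.
toy_cases = [
    ["capability", "capability"],
    ["capability", "policy_ambiguity"],
    ["policy_ambiguity", "evaluator_instability"],
]
print(separation_fails(toy_cases))
```

With the default threshold of 1.0, only unanimous cases count as agreement; lowering it to, say, 2/3 would model majority agreement among three annotators.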

Figures

Figures reproduced from arXiv: 2607.01153 by Brett Reynolds.

Figure 1
Figure 1: Evaluation pipeline for the benchmark artifact. The diagram summarizes the planned data flow from pre-specified item metadata through model output, rule-aided triage, expert labels, LLM-judge labels, adjudication, and metric-driven item revision. No performance quantity is encoded in the figure.
    Adjacent text: … or real-world agent-security robustness. Those claims require realistic wrappers and independent policy review. …
Figure 2
Figure 2: Model-level adjudicated pilot outcomes. Source: sanitized summaries for run local-pilot-20260630-185417; N = 54 item–model rows. The task and policy panels share a 0–18 row scale for each model, and the strict-pair panel reports pass counts over nine pair–model cells.
    Adjacent text: Across the pilot, 36 of 54 outputs were full task successes, 11 were partial successes, and 7 were failures. Policy compliance was higher th…
Figure 3
Figure 3: Strict pair passes by phenomenon family. Source: sanitized summaries for run local-pilot-20260630-185417; N = 27 pair–model cells. Each horizontal bar uses a 0–3 scale because each minimal pair was run against three local models.
    Adjacent text: The automatic diagnostic pass was useful as triage but not as a substitute for adjudication. All seven noncompliant rows were high-priority diagnostic rows. But low-priority rows …
Figure 4
Figure 4: Rule-aided diagnostic priority compared with adjudicated labels. Source: sanitized summaries for run local-pilot-20260630-185417; N = 54 item–model rows. All bars use a common row-count scale, with colour and hatch distinguishing diagnostic row totals, non-success rows, noncompliant rows, and policy-ambiguous rows.
Figure 1
Figure 1: Failure-attribution labels by minimal pair. Source: sanitized summaries for run local-pilot-20260630-185417; N = 54 item–model rows, with six rows per minimal pair. All rows use the same 0–6 scale.
    Adjacent text: 7 LLM-judge validation. The first judge-validation pass used glm-4.7-flash:q4_K_M as a local judge over all 54 adjudicated item–model rows. The judge prompt required separate labels for task success, policy comp…
Figure 2
Figure 2: Adjudicator confidence labels by model. Source: sanitized summaries for run local-pilot-20260630-185417; N = 54 item–model rows, with 18 rows per model. The figure shows high- and medium-confidence labels; no row received a low-confidence label.
    Adjacent text: … and risk labels. The result supports using LLM judges as validated triage tools rather than as unvalidated substitutes for expert adjudication. 8 Representative a…
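The judge-validation step mentioned in the figure text (an LLM judge labelling the same rows that experts adjudicated, with separate labels per dimension) can be sketched as a per-dimension agreement rate. The dimension names, the function name `per_dimension_agreement`, and the toy rows below are assumptions for illustration; they are not the paper's metric definitions or pilot data.

```python
# Hypothetical dimension names, loosely following the labels named in the text.
DIMENSIONS = ("task_success", "policy_compliance", "safety_risk")


def per_dimension_agreement(judge_rows, gold_rows):
    """Fraction of rows where the judge label equals the adjudicated label,
    computed separately for each dimension.

    Both arguments are lists of dicts keyed by dimension name, in the same
    row order.
    """
    if len(judge_rows) != len(gold_rows):
        raise ValueError("judge and adjudicated rows must align one-to-one")
    if not gold_rows:
        raise ValueError("need at least one row")
    n = len(gold_rows)
    return {
        dim: sum(j[dim] == g[dim] for j, g in zip(judge_rows, gold_rows)) / n
        for dim in DIMENSIONS
    }


# Toy rows, not pilot data.
judge = [
    {"task_success": "full", "policy_compliance": "compliant", "safety_risk": "none"},
    {"task_success": "partial", "policy_compliance": "compliant", "safety_risk": "none"},
]
gold = [
    {"task_success": "full", "policy_compliance": "compliant", "safety_risk": "none"},
    {"task_success": "partial", "policy_compliance": "noncompliant", "safety_risk": "none"},
]
print(per_dimension_agreement(judge, gold))
```

Reporting agreement per dimension, rather than as one pooled number, shows where a judge is usable for triage and where it diverges from adjudication.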
Abstract

Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded command, or misreported progress in an agentic task. Existing benchmarks often compress these distinctions into pass/fail labels, obscuring whether failures arise from capability limits, policy ambiguity, instruction conflict, scaffold failure, or unstable evaluator judgments. This paper introduces adversarial pragmatics as a benchmark and annotation protocol for evaluating model behaviour under instruction conflict, embedded commands, quotation, scope ambiguity, deixis, indirect speech acts, and multi-turn agent transcripts. The contribution is empirical and methodological: a linguistically controlled taxonomy, an 18-item seed benchmark with validator-enforced metadata, a 54-row local seed pilot, an expert-evaluation protocol distinguishing task success, policy compliance, safety risk, refusal outcome, and evaluator confidence, and metrics for judge validity, diagnostic ambiguity, and taxonomy drift. The framework turns linguistic judgment methodology into a practical tool for validating safety evals, LLM judges, gold-set construction, prompt-injection tests, and safety documentation.
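The figure captions describe the pilot as 54 item–model rows (18 items run against three local models) and 27 pair–model cells (nine minimal pairs, each run against three models). A strict pair pass can be sketched as a cell-level aggregation over those rows. The definition used here, that a cell passes only when both members of the minimal pair pass for that model, is an assumption made for this example, as are the function name and the tuple layout.

```python
from collections import defaultdict


def strict_pair_passes(rows):
    """Count strict pair passes per model.

    `rows` is an iterable of (pair_id, model, passed) tuples, one per
    item-model row. A pair-model cell passes only if it contains exactly
    two rows and both passed (assumed definition).
    """
    cells = defaultdict(list)
    for pair_id, model, passed in rows:
        cells[(pair_id, model)].append(passed)

    counts = defaultdict(int)
    for (_pair_id, model), results in cells.items():
        counts[model] += int(len(results) == 2 and all(results))
    return dict(counts)


# Toy rows, not pilot data: one model, two minimal pairs.
toy_rows = [
    ("pair-1", "model-a", True),
    ("pair-1", "model-a", True),
    ("pair-2", "model-a", True),
    ("pair-2", "model-a", False),
]
print(strict_pair_passes(toy_rows))
```

Scoring at the pair level is stricter than scoring at the row level: a model that handles one member of a minimal pair but not its counterpart earns no credit for that cell.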

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces 'adversarial pragmatics' as a benchmark and annotation protocol for evaluating language model behavior under instruction conflict, embedded commands, quotation, scope ambiguity, deixis, indirect speech acts, and multi-turn agent transcripts in safety contexts. It presents a linguistically controlled taxonomy, an 18-item seed benchmark with validator-enforced metadata, a 54-row local seed pilot, an expert-evaluation protocol distinguishing task success, policy compliance, safety risk, refusal outcome, and evaluator confidence, plus metrics for judge validity, diagnostic ambiguity, and taxonomy drift. The central claim is that this framework provides a practical tool for validating safety evaluations, LLM judges, gold-set construction, prompt-injection tests, and safety documentation by separating capability limits from policy ambiguity and evaluator instability.

Significance. If the taxonomy and protocol can be shown to reliably make the claimed distinctions in practice, the work would offer a methodologically grounded approach to improving the granularity and reliability of AI safety benchmarks, drawing on linguistic pragmatics to address limitations in existing pass/fail evaluations. The validator-enforced metadata and multi-dimensional expert protocol represent a strength in addressing evaluator instability, provided the 54-row pilot supplies supporting data.

major comments (1)
  1. Abstract: The abstract describes the taxonomy, 18-item seed, 54-row pilot, and metrics but supplies no results, error analysis, or validation data showing the protocol achieves the claimed distinctions between capability limits, policy ambiguity, and evaluator instability; the central empirical claim therefore lacks demonstrated support in the provided description.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for the constructive feedback. We address the single major comment below.

  1. Referee: Abstract: The abstract describes the taxonomy, 18-item seed, 54-row pilot, and metrics but supplies no results, error analysis, or validation data showing the protocol achieves the claimed distinctions between capability limits, policy ambiguity, and evaluator instability; the central empirical claim therefore lacks demonstrated support in the provided description.

    Authors: We agree that the abstract, as currently written, does not include any quantitative or qualitative results from the 54-row pilot and therefore does not itself demonstrate the claimed distinctions. The body of the manuscript contains the pilot data, error analysis, and validation metrics (Sections 4 and 5), but the abstract is limited to a description of the framework. To address this, we will revise the abstract to incorporate a concise statement of key pilot findings (e.g., observed rates of diagnostic ambiguity and inter-evaluator agreement on policy vs. capability failures) that directly support the central claim. This change will be reflected in the next version of the manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper proposes a benchmark and annotation protocol without derivations, equations, fitted parameters, or predictions. It presents a taxonomy, 18-item seed benchmark, 54-row pilot, expert protocol, and metrics as empirical starting points for validating safety evaluations. No load-bearing self-citations, uniqueness theorems, or reductions of claims to inputs by construction appear. The contribution is methodological and self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The contribution rests on the utility of a newly introduced taxonomy and protocol whose effectiveness is asserted without supporting data or external validation in the abstract.

invented entities (1)
  • adversarial pragmatics (no independent evidence)
    purpose: Benchmark and annotation protocol for instruction conflict, embedded commands, and policy ambiguity in AI safety
    Newly coined term and framework presented as the core contribution.


