arxiv: 2607.01934 · v1 · pith:AW5DSG2Anew · submitted 2026-07-02 · 💻 cs.CL · cs.AI · cs.DB

AIriskEval-edu: New Dataset for Risk Assessment in AI-mediated K-12 Educational Explanations

Pith reviewed 2026-07-03 15:04 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.DB
keywords AI risk assessment · educational explanations · K-12 education · LLM fine-tuning · pedagogical risks · explainable AI · dataset

The pith

A new dataset lets fine-tuned local LLMs match frontier models at detecting risks in K-12 teaching explanations while keeping data private.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AIriskEval-edu-db2, a collection of 1,639 explanations drawn from 170 ScienceQA items across science, language arts, and social sciences. Each item pairs a human teacher explanation with eleven LLM-simulated teacher outputs that embed distinct pedagogical risks, annotated across five rubric dimensions and with structured risk localization and description labels. Experiments compare proprietary frontier models against a fine-tuned Llama 3.1 8B model on both risk detection and the generation of explainable risk assessments. A sympathetic reader would care because the work tests whether supervised fine-tuning on this data can deliver competitive performance without transmitting sensitive educational content to external services.

Core claim

Supervised fine-tuning on AIriskEval-edu-db2 enables a locally deployable Llama 3.1 8B model to approach or outperform stronger frontier models on pedagogical risk detection and explainability assessment tasks for K-12 instructional content while preserving privacy.

What carries the argument

The AIriskEval-edu-db2 dataset together with its five-dimension risk rubric (factual precision, depth and completeness, focus and relevance, student-level appropriateness, ideological bias) and the 785 semi-automatically annotated explanations that supply risk localization and risk description labels.
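
To make that structure concrete, here is a minimal sketch of how one annotated explanation might be represented. The class names, field names, and the character-offset form of risk localization are assumptions made for illustration; the dataset's actual schema is not described on this page.

```python
from dataclasses import dataclass, field

# The five rubric dimensions named in the paper; the identifiers are illustrative.
RUBRIC_DIMENSIONS = (
    "factual_precision",
    "depth_and_completeness",
    "focus_and_relevance",
    "student_level_appropriateness",
    "ideological_bias",
)


@dataclass
class RiskSpan:
    """One localized risk: where it occurs and why it is a risk (hypothetical layout)."""
    dimension: str
    start: int  # character offset into the explanation (assumed representation)
    end: int
    description: str

    def __post_init__(self):
        if self.dimension not in RUBRIC_DIMENSIONS:
            raise ValueError(f"unknown rubric dimension: {self.dimension}")
        if not 0 <= self.start < self.end:
            raise ValueError("span must satisfy 0 <= start < end")


@dataclass
class ExplanationRecord:
    """One explanation for one question, with zero or more risk annotations."""
    question_id: str
    profile: str  # e.g. "human_teacher" or a simulated teacher profile
    explanation: str
    risk_spans: list[RiskSpan] = field(default_factory=list)

    def flagged_dimensions(self) -> set[str]:
        """Rubric dimensions with at least one localized risk."""
        return {span.dimension for span in self.risk_spans}

    def localized_text(self, span: RiskSpan) -> str:
        """The exact substring of the explanation that a span points at."""
        return self.explanation[span.start:span.end]


# Invented example text, not taken from the dataset.
example = ExplanationRecord(
    question_id="q-001",
    profile="sarcastic_teacher",
    explanation="Obviously plants eat sunlight.",
    risk_spans=[RiskSpan("factual_precision", 10, 29, "Oversimplifies photosynthesis.")],
)
```

Keeping localization as offsets rather than copied text lets a validator check that every annotation still points inside the explanation it belongs to.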

If this is right

  • Educational institutions can run risk audits on AI-generated materials locally without sending content to third-party APIs.
  • The five-dimension rubric supplies a reusable standard for evaluating pedagogical quality of explanations.
  • Fine-tuned local models can generate both risk scores and human-readable justifications for those scores.
  • The dataset format supports repeated evaluation cycles as new LLM teacher profiles or curriculum topics appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same annotation workflow could be applied to create risk datasets for higher education or vocational training materials.
  • Local models trained this way might be combined with student performance data to study whether risk-flagged explanations actually improve learning outcomes.
  • The approach opens a path to domain-specific safety layers that schools could maintain and update without vendor lock-in.

Load-bearing premise

The semi-automatic risk localization and description annotations, even after expert teacher validation, accurately and consistently capture the intended pedagogical risks across the five rubric dimensions.

What would settle it

An independent test set of new K-12 explanations where the fine-tuned local model shows substantially lower accuracy than frontier models on risk detection or produces risk explanations that independent teachers rate as less useful or less faithful.
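
A comparison of this kind could be scored roughly as follows. This is a sketch under stated assumptions: risk detection is treated as a binary label per explanation, F1 is used as the metric, and a paired bootstrap over test items gives an interval for the gap between the two models. The function names and the choice of metric are illustrative and are not taken from the paper.

```python
import random


def f1_score(gold, pred):
    """Binary F1 for the positive ("risk present") class."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bootstrap_f1_gap(gold, pred_local, pred_frontier, n_resamples=1000, seed=0):
    """Paired bootstrap of F1(local) - F1(frontier).

    Returns (observed_gap, lower, upper), where lower/upper are the
    approximate 2.5th and 97.5th percentiles of the resampled gaps.
    """
    rng = random.Random(seed)
    n = len(gold)
    observed = f1_score(gold, pred_local) - f1_score(gold, pred_frontier)
    gaps = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        local = [pred_local[i] for i in idx]
        frontier = [pred_frontier[i] for i in idx]
        gaps.append(f1_score(g, local) - f1_score(g, frontier))
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    return observed, lower, upper
```

If the whole interval sits well below zero on new explanations, the local model is substantially worse than the frontier model on detection; an interval that straddles zero would not settle the question either way.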

Figures

Figures reproduced from arXiv: 2607.01934 by Alvaro Ortigosa, Aythami Morales, Enrique Blas, Francisco Jurado, Javier Irigoyen, Julian Fierrez, Roberto Daza, Ruben Tolosana.

Figure 1: Overview of the framework used in this work for the construction of AIriskEval-edu-db2 and explainable pedagogical risk evaluation. The diagram …
Figure 2: Example from AIriskEval-edu-db2 showing a Sarcastic Teacher explanation and its explainable pedagogical risk evaluation, including the detected …
Original abstract

This work introduces AIriskEval-edu-db2, a new dataset designed to train and evaluate auditors based on LLMs for an explainable pedagogical risk assessment in instructional content for grades K-12. The dataset comprises 1,639 explanations from 170 curated ScienceQA questions, covering science, language arts, and social sciences. For each question, the dataset includes an explanation written by a human teacher alongside 11 explanations generated by LLM-simulated teacher profiles associated with distinct pedagogical risks. We propose a comprehensive risk rubric aligned with established educational standards that covers five complementary dimensions: factual precision, depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. A key contribution is the addition of 785 explanations with structured explainability annotations, including risk localization and risk description. The annotations are produced through a semi-automatic process with expert teacher validation. Finally, we present validation experiments comparing state-of-the-art proprietary models with a lightweight local Llama 3.1 8B model in both the pedagogical risk detection and the explainability assessment. These experiments evaluate whether supervised fine-tuning on AIriskEval-edu-db2 enables a locally deployable model to approach or outperform stronger frontier models while preserving privacy in educational auditing and assessment tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces AIriskEval-edu-db2, a dataset of 1,639 explanations (human teacher plus 11 LLM-simulated profiles) for 170 ScienceQA questions across science, language arts, and social sciences. It defines a five-dimension risk rubric (factual precision, depth/completeness, focus/relevance, student-level appropriateness, ideological bias) and supplies structured explainability annotations (risk localization + description) for 785 items via a semi-automatic process followed by expert teacher validation. Validation experiments compare frontier proprietary models against a supervised fine-tuned Llama 3.1 8B model on pedagogical risk detection and explainability tasks, with the central claim that the fine-tuned local model can approach or outperform stronger systems while preserving privacy.

Significance. If the annotation quality and experimental results hold, the dataset would provide a useful resource for training privacy-preserving, locally deployable auditors of AI-generated K-12 instructional content, addressing a practical need at the intersection of NLP and educational technology.

major comments (3)
  1. [Abstract] Abstract (paragraph on annotations): the semi-automatic risk localization and description annotations for the 785 explanations are presented as the foundation for supervised fine-tuning, yet no inter-annotator agreement, consistency metrics, or error analysis against fully manual labels is reported. This directly affects the reliability of the training signal for the five rubric dimensions and therefore the attribution of any performance gains to dataset quality.
  2. [Abstract] Abstract (paragraph on LLM-simulated profiles): the 11 distinct pedagogical-risk profiles are introduced without any description of the prompting strategy, temperature settings, or few-shot examples used to induce the targeted risks. This information is required to evaluate whether the generated explanations actually span the intended dimensions of the rubric and to assess potential confounds in the comparative experiments.
  3. [Abstract] Abstract (final paragraph on validation experiments): the central claim that supervised fine-tuning on the dataset enables the Llama 3.1 8B model to approach or outperform frontier models is stated without any quantitative metrics, baselines, error bars, or statistical tests. Because the soundness of this performance claim is load-bearing for the paper's contribution, the absence of these details prevents assessment of whether the result is defensible.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., F1 or accuracy delta) from the validation experiments so readers can immediately gauge the magnitude of the reported gains.
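
Major comment 1 asks for inter-annotator agreement. One standard statistic for two annotators assigning categorical labels is Cohen's kappa, sketched below. Whether the authors use this statistic, and how their labels are encoded, is not stated on this page; the function is an illustration only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Computed per rubric dimension, a statistic like this would show whether the validated labels are consistent enough to serve as a training signal.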

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript to enhance the abstract with additional details on annotation quality, profile generation, and experimental results.

Point-by-point responses
  1. Referee: [Abstract] Abstract (paragraph on annotations): the semi-automatic risk localization and description annotations for the 785 explanations are presented as the foundation for supervised fine-tuning, yet no inter-annotator agreement, consistency metrics, or error analysis against fully manual labels is reported. This directly affects the reliability of the training signal for the five rubric dimensions and therefore the attribution of any performance gains to dataset quality.

    Authors: We agree that inter-annotator agreement and error analysis metrics would strengthen claims about annotation reliability. The manuscript describes the semi-automatic process with expert teacher validation but does not report quantitative agreement statistics. We will add consistency metrics for the validation step and an error analysis on a subset of items compared to fully manual labels in the revised manuscript.
    revision: yes

  2. Referee: [Abstract] Abstract (paragraph on LLM-simulated profiles): the 11 distinct pedagogical-risk profiles are introduced without any description of the prompting strategy, temperature settings, or few-shot examples used to induce the targeted risks. This information is required to evaluate whether the generated explanations actually span the intended dimensions of the rubric and to assess potential confounds in the comparative experiments.

    Authors: The prompting strategy, temperature settings, and few-shot examples for generating the 11 profiles are detailed in the dataset construction section of the full manuscript. To improve self-containment of the abstract, we will incorporate a concise description of the generation approach in the revised abstract.
    revision: yes

  3. Referee: [Abstract] Abstract (final paragraph on validation experiments): the central claim that supervised fine-tuning on the dataset enables the Llama 3.1 8B model to approach or outperform frontier models is stated without any quantitative metrics, baselines, error bars, or statistical tests. Because the soundness of this performance claim is load-bearing for the paper's contribution, the absence of these details prevents assessment of whether the result is defensible.

    Authors: The abstract summarizes the experimental outcome at a high level. The full manuscript reports the quantitative metrics, baselines, error bars, and statistical tests in the validation experiments section. We will revise the abstract to include key performance numbers, baselines, and significance indicators to make the central claim more concrete and directly assessable.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces a new dataset (AIriskEval-edu-db2) constructed from the public ScienceQA benchmark plus newly generated semi-automatic annotations with expert validation. The central claims rest on empirical validation experiments comparing a fine-tuned Llama 3.1 8B model against frontier systems; no equations, fitted parameters renamed as predictions, or self-citation chains reduce any reported result to prior inputs by construction. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The contribution rests on the domain assumption that the proposed five-dimension rubric comprehensively covers pedagogical risks in K-12 explanations and that the semi-automatic annotations faithfully reflect expert judgment; no free parameters or new physical entities are introduced.

axioms (1)
  • Domain assumption: the five dimensions (factual precision, depth and completeness, focus and relevance, student-level appropriateness, ideological bias) together form a sufficient rubric for pedagogical risk assessment.
    Invoked when the authors state the rubric is aligned with established educational standards and use it to label all explanations.
invented entities (1)
  • LLM-simulated teacher profiles (no independent evidence)
    Purpose: generate explanations carrying distinct pedagogical risks for dataset construction.
    Eleven distinct profiles are created to produce the 11 LLM explanations per question; no external evidence of their realism is provided in the abstract.

pith-pipeline@v0.9.1-grok · 5784 in / 1407 out tokens · 28945 ms · 2026-07-03T15:04:35.860734+00:00 · methodology
