AIriskEval-edu: New Dataset for Risk Assessment in AI-mediated K-12 Educational Explanations
Pith reviewed 2026-07-03 15:04 UTC · model grok-4.3
The pith
A new dataset lets fine-tuned local LLMs match frontier models at detecting risks in K-12 teaching explanations while keeping data private.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Supervised fine-tuning on AIriskEval-edu-db2 enables a locally deployable Llama 3.1 8B model to approach or outperform stronger frontier models on pedagogical risk detection and explainability assessment tasks for K-12 instructional content while preserving privacy.
What carries the argument
The AIriskEval-edu-db2 dataset together with its five-dimension risk rubric (factual precision, depth and completeness, focus and relevance, student-level appropriateness, ideological bias) and the 785 semi-automatically annotated explanations that supply risk localization and risk description labels.
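To make the shape of these labels concrete, here is a minimal sketch of how one annotated explanation could be represented. It is an illustration under stated assumptions, not the dataset's actual schema: the class names, the field names, and the use of character offsets for risk localization are all hypothetical. Only the five dimension names come from the paper.

```python
from dataclasses import dataclass, field

# The five rubric dimensions named in the paper; the identifiers are our own.
RUBRIC_DIMENSIONS = (
    "factual_precision",
    "depth_and_completeness",
    "focus_and_relevance",
    "student_level_appropriateness",
    "ideological_bias",
)


@dataclass
class RiskAnnotation:
    dimension: str    # one of RUBRIC_DIMENSIONS
    start: int        # hypothetical: character offset where the risky span begins
    end: int          # hypothetical: character offset where the span ends (exclusive)
    description: str  # free-text risk description

    def __post_init__(self):
        if self.dimension not in RUBRIC_DIMENSIONS:
            raise ValueError(f"unknown rubric dimension: {self.dimension}")
        if not 0 <= self.start < self.end:
            raise ValueError("span must satisfy 0 <= start < end")


@dataclass
class Explanation:
    question_id: str     # hypothetical identifier of the source question
    author_profile: str  # hypothetical: "human_teacher" or a simulated profile name
    text: str
    annotations: list[RiskAnnotation] = field(default_factory=list)

    def localized_spans(self):
        """Return (dimension, risky text) pairs for every annotation."""
        return [(a.dimension, self.text[a.start:a.end]) for a in self.annotations]


example = Explanation(
    question_id="q1",
    author_profile="human_teacher",
    text="Plants eat soil to grow.",
    annotations=[
        RiskAnnotation("factual_precision", 7, 15, "Plants do not eat soil.")
    ],
)
print(example.localized_spans())
```

Keeping localization as offsets into the explanation text lets a reviewer check each risk description against the exact passage it refers to.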
If this is right
- Educational institutions can run risk audits on AI-generated materials locally without sending content to third-party APIs.
- The five-dimension rubric supplies a reusable standard for evaluating pedagogical quality of explanations.
- Fine-tuned local models can generate both risk scores and human-readable justifications for those scores.
- The dataset format supports repeated evaluation cycles as new LLM teacher profiles or curriculum topics appear.
Where Pith is reading between the lines
- The same annotation workflow could be applied to create risk datasets for higher education or vocational training materials.
- Local models trained this way might be combined with student performance data to study whether removing or revising risk-flagged explanations actually improves learning outcomes.
- The approach opens a path to domain-specific safety layers that schools could maintain and update without vendor lock-in.
Load-bearing premise
The semi-automatic risk localization and description annotations, even after expert teacher validation, accurately and consistently capture the intended pedagogical risks across the five rubric dimensions.
What would settle it
An independent test set of new K-12 explanations where the fine-tuned local model shows substantially lower accuracy than frontier models on risk detection or produces risk explanations that independent teachers rate as less useful or less faithful.
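One way such a comparison could be scored is a paired bootstrap over the shared test items, which puts an interval around the accuracy gap between the two models. The sketch below is an assumption about how one might run the check, not the paper's evaluation protocol. The function names, the resample count, and the percentile interval are our own choices.

```python
import random


def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def paired_bootstrap_diff(gold, preds_a, preds_b,
                          n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy(A) - accuracy(B), (lower, upper)) on a shared test set.

    The interval is a percentile bootstrap CI obtained by resampling
    test items with replacement, keeping each item's A/B pair together.
    """
    if not gold or not (len(gold) == len(preds_a) == len(preds_b)):
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(gold)
    # Per-item difference: +1 if only A is right, -1 if only B is right, else 0.
    deltas = [(a == g) - (b == g) for g, a, b in zip(gold, preds_a, preds_b)]
    observed = sum(deltas) / n
    diffs = sorted(
        sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples)
    )
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, (lower, upper)
```

If the interval for "frontier minus local" lies clearly above zero on the independent set, that would count against the core claim. An interval that straddles zero would be consistent with it.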
Original abstract
This work introduces AIriskEval-edu-db2, a new dataset designed to train and evaluate auditors based on LLMs for an explainable pedagogical risk assessment in instructional content for grades K-12. The dataset comprises 1,639 explanations from 170 curated ScienceQA questions, covering science, language arts, and social sciences. For each question, the dataset includes an explanation written by a human teacher alongside 11 explanations generated by LLM-simulated teacher profiles associated with distinct pedagogical risks. We propose a comprehensive risk rubric aligned with established educational standards that covers five complementary dimensions: factual precision, depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. A key contribution is the addition of 785 explanations with structured explainability annotations, including risk localization and risk description. The annotations are produced through a semi-automatic process with expert teacher validation. Finally, we present validation experiments comparing state-of-the-art proprietary models with a lightweight local Llama 3.1 8B model in both the pedagogical risk detection and the explainability assessment. These experiments evaluate whether supervised fine-tuning on AIriskEval-edu-db2 enables a locally deployable model to approach or outperform stronger frontier models while preserving privacy in educational auditing and assessment tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AIriskEval-edu-db2, a dataset of 1,639 explanations (human teacher plus 11 LLM-simulated profiles) for 170 ScienceQA questions across science, language arts, and social sciences. It defines a five-dimension risk rubric (factual precision, depth/completeness, focus/relevance, student-level appropriateness, ideological bias) and supplies structured explainability annotations (risk localization + description) for 785 items via a semi-automatic process followed by expert teacher validation. Validation experiments compare frontier proprietary models against a supervised fine-tuned Llama 3.1 8B model on pedagogical risk detection and explainability tasks, with the central claim that the fine-tuned local model can approach or outperform stronger systems while preserving privacy.
Significance. If the annotation quality and experimental results hold, the dataset would provide a useful resource for training privacy-preserving, locally deployable auditors of AI-generated K-12 instructional content, addressing a practical need at the intersection of NLP and educational technology.
major comments (3)
- [Abstract] Abstract (paragraph on annotations): the semi-automatic risk localization and description annotations for the 785 explanations are presented as the foundation for supervised fine-tuning, yet no inter-annotator agreement, consistency metrics, or error analysis against fully manual labels is reported. This directly affects the reliability of the training signal for the five rubric dimensions and therefore the attribution of any performance gains to dataset quality.
- [Abstract] Abstract (paragraph on LLM-simulated profiles): the 11 distinct pedagogical-risk profiles are introduced without any description of the prompting strategy, temperature settings, or few-shot examples used to induce the targeted risks. This information is required to evaluate whether the generated explanations actually span the intended dimensions of the rubric and to assess potential confounds in the comparative experiments.
- [Abstract] Abstract (final paragraph on validation experiments): the central claim that supervised fine-tuning on the dataset enables the Llama 3.1 8B model to approach or outperform frontier models is stated without any quantitative metrics, baselines, error bars, or statistical tests. Because the soundness of this performance claim is load-bearing for the paper's contribution, the absence of these details prevents assessment of whether the result is defensible.
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., F1 or accuracy delta) from the validation experiments so readers can immediately gauge the magnitude of the reported gains.
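The first major comment asks for inter-annotator agreement. As a rough illustration of the kind of statistic meant, the sketch below computes Cohen's kappa for two annotators who label the same items. It is a generic textbook formulation, not a procedure taken from the paper. The function name and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention we assume.
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["risk", "risk", "ok", "ok"]
annotator_2 = ["risk", "ok", "ok", "ok"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

In the example, observed agreement is 0.75 and chance agreement is 0.5, which gives a kappa of 0.5. Reporting one such value per rubric dimension would show where the semi-automatic labels are least stable.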
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript to enhance the abstract with additional details on annotation quality, profile generation, and experimental results.
-
Referee: [Abstract] Abstract (paragraph on annotations): the semi-automatic risk localization and description annotations for the 785 explanations are presented as the foundation for supervised fine-tuning, yet no inter-annotator agreement, consistency metrics, or error analysis against fully manual labels is reported. This directly affects the reliability of the training signal for the five rubric dimensions and therefore the attribution of any performance gains to dataset quality.
Authors: We agree that inter-annotator agreement and error analysis metrics would strengthen claims about annotation reliability. The manuscript describes the semi-automatic process with expert teacher validation but does not report quantitative agreement statistics. We will add consistency metrics for the validation step and an error analysis on a subset of items compared to fully manual labels in the revised manuscript. revision: yes
-
Referee: [Abstract] Abstract (paragraph on LLM-simulated profiles): the 11 distinct pedagogical-risk profiles are introduced without any description of the prompting strategy, temperature settings, or few-shot examples used to induce the targeted risks. This information is required to evaluate whether the generated explanations actually span the intended dimensions of the rubric and to assess potential confounds in the comparative experiments.
Authors: The prompting strategy, temperature settings, and few-shot examples for generating the 11 profiles are detailed in the dataset construction section of the full manuscript. To improve self-containment of the abstract, we will incorporate a concise description of the generation approach in the revised abstract. revision: yes
-
Referee: [Abstract] Abstract (final paragraph on validation experiments): the central claim that supervised fine-tuning on the dataset enables the Llama 3.1 8B model to approach or outperform frontier models is stated without any quantitative metrics, baselines, error bars, or statistical tests. Because the soundness of this performance claim is load-bearing for the paper's contribution, the absence of these details prevents assessment of whether the result is defensible.
Authors: The abstract summarizes the experimental outcome at a high level. The full manuscript reports the quantitative metrics, baselines, error bars, and statistical tests in the validation experiments section. We will revise the abstract to include key performance numbers, baselines, and significance indicators to make the central claim more concrete and directly assessable. revision: yes
Circularity Check
No significant circularity detected
The paper introduces a new dataset (AIriskEval-edu-db2) constructed from the public ScienceQA benchmark plus newly generated semi-automatic annotations with expert validation. The central claims rest on empirical validation experiments comparing a fine-tuned Llama 3.1 8B model against frontier systems; no equations, fitted parameters renamed as predictions, or self-citation chains reduce any reported result to prior inputs by construction. The derivation chain is therefore self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the five dimensions (factual precision, depth and completeness, focus and relevance, student-level appropriateness, ideological bias) together form a sufficient rubric for pedagogical risk assessment.
invented entities (1)
- LLM-simulated teacher profiles: no independent evidence