KRCA: An Efficient Root Cause Analysis System in Hyper-Scale Microservice Systems via Agentic AI

Jiamin Jiang, Jingfei Feng, Yu Luo, Qingliang Zhang, Yongqian Sun, Wenwei Gu, Shenglin Zhang, Tianyu Cui, Yao Wu, Jielong Huang, Nan Qi, Dan Pei

arxiv: 2607.01788 · v1 · pith:LYPWAJOZnew · submitted 2026-07-02 · 💻 cs.SE

Pith reviewed 2026-07-03 09:00 UTC · model grok-4.3

classification 💻 cs.SE

keywords root cause analysis · microservice systems · causal graphs · multi-agent systems · failure diagnosis · production deployment · anomaly metrics · service localization


The pith

KRCA localizes root causes in hyper-scale microservices at 0.88 accuracy by using a skeleton causal graph prior and memory-augmented multi-agent verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces KRCA as an end-to-end system for root cause analysis in massive, rapidly changing microservice architectures. It processes failures through API-level drilldown to narrow suspects, builds a skeleton causal graph from anomalous metrics as a structural prior, and applies a memory-augmented multi-agent setup to confirm causes and classify failure types. This combination is presented as necessary because existing methods cannot keep up with the scale and independence of services. A sympathetic reader would care because faster, more accurate diagnosis directly shortens downtime in systems that serve large user bases. The reported results include AC@1 scores of 0.88 and 0.79 plus a 77.3 percent reduction in average diagnosis time after six months of production use.

Core claim

KRCA manages the vast search space in hyper-scale microservice systems through a multi-stage pipeline that begins with an API-level drilldown to isolate suspicious services, instantiates a skeleton-based causal graph from anomalous metrics to serve as a high-recall structural prior, and then utilizes a memory-augmented multi-agent framework to verify causality and generate the final failure report. By combining structured causal constraints with multi-agent reasoning, KRCA balances diagnostic accuracy with the efficiency requirements of real-time production use, achieving AC@1 scores of 0.88 for root cause service localization and 0.79 for failure type classification while outperforming the

What carries the argument

The skeleton-based causal graph instantiated from anomalous metrics, which acts as a high-recall structural prior that the memory-augmented multi-agent framework uses to verify causality.
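
One way to picture this structural prior is sketched below in Python. The construction used by KRCA is not described in the text reviewed here, so everything in the sketch is an assumption made for illustration: the `SKELETON` edge set, the metric names, the `instantiate_graph` helper, and the rule that a callee's anomaly may explain the same metric on its caller. It only shows how a fixed template, applied to whichever metrics are currently anomalous, could yield a small candidate graph for later verification.

```python
from collections import defaultdict

# Hypothetical skeleton: directed edges between metric *types*, read as
# "cause type -> effect type". The real skeleton is not given in the source.
SKELETON = {
    ("cpu_usage", "latency"),
    ("latency", "error_rate"),
    ("gc_pause", "latency"),
}


def instantiate_graph(skeleton, anomalous, calls):
    """Build candidate causal edges between anomalous (service, metric) nodes.

    skeleton:  set of (cause_type, effect_type) pairs
    anomalous: set of (service, metric_type) pairs flagged as anomalous
    calls:     set of (caller, callee) service dependencies
    """
    by_service = defaultdict(set)
    for service, metric in anomalous:
        by_service[service].add(metric)

    edges = set()
    # Intra-service edges: apply the skeleton to metrics of one service.
    for service, metrics in by_service.items():
        for cause, effect in skeleton:
            if cause in metrics and effect in metrics:
                edges.add(((service, cause), (service, effect)))
    # Cross-service edges (assumed rule): an anomalous metric on a callee may
    # explain the same anomalous metric type on its caller.
    for caller, callee in calls:
        for metric in by_service.get(callee, ()):
            if metric in by_service.get(caller, ()):
                edges.add(((callee, metric), (caller, metric)))
    return edges


if __name__ == "__main__":
    anomalous = {("cart", "latency"), ("cart", "cpu_usage"), ("checkout", "latency")}
    calls = {("checkout", "cart")}
    for edge in sorted(instantiate_graph(SKELETON, anomalous, calls)):
        print(edge)
```

Because the sketch keeps every edge the template allows, it errs toward recall, which is the role the review attributes to the prior; pruning false edges is left to the verification stage.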

If this is right

The system outperforms the strongest baseline by at least 31 percent in absolute gains on AC@1 for both service localization and failure classification.
Average diagnosis time drops by 77.3 percent after six months of live production deployment.
The multi-stage pipeline keeps computational cost low enough for real-time use while maintaining high accuracy.
The approach handles independent evolution of services through continuous deployment without requiring full system retraining.
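
For readers unfamiliar with the metric, AC@k is commonly computed as the fraction of cases whose true label appears among the top k ranked candidates. The Python sketch below assumes that standard definition; the function names and the toy rankings are hypothetical and are not taken from the paper. It also separates an absolute gain (a difference in score points) from a relative one, since the claim above is stated in absolute terms.

```python
def ac_at_k(ranked_predictions, ground_truth, k=1):
    """Fraction of cases whose true label appears in the top-k predictions."""
    if len(ranked_predictions) != len(ground_truth):
        raise ValueError("one ranking per case is required")
    if not ground_truth:
        raise ValueError("at least one case is required")
    hits = sum(
        truth in ranking[:k]
        for ranking, truth in zip(ranked_predictions, ground_truth)
    )
    return hits / len(ground_truth)


def absolute_gain(score, baseline_score):
    """Difference in score points, e.g. 0.88 vs 0.57 is a gain of 0.31."""
    return score - baseline_score


def relative_gain(score, baseline_score):
    """Gain expressed as a fraction of the baseline score."""
    return (score - baseline_score) / baseline_score


if __name__ == "__main__":
    # Toy data: four incidents, each with a ranked list of suspect services.
    rankings = [["a", "b"], ["b", "a"], ["c", "a"], ["a", "c"]]
    truth = ["a", "a", "c", "a"]
    print(ac_at_k(rankings, truth, k=1))  # 3 of 4 top-1 hits -> 0.75
    print(ac_at_k(rankings, truth, k=2))  # all four within top 2 -> 1.0
```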

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar pipelines could be tested in other large distributed systems such as cloud orchestration platforms to check whether the causal-graph-plus-agents pattern transfers.
Widespread adoption might shift SRE workflows from manual graph inspection toward reviewing agent-generated reports.
A controlled experiment comparing the multi-agent verification step against a single large language model could quantify the benefit of the memory-augmented design.

Load-bearing premise

The skeleton-based causal graph from anomalous metrics supplies a high-recall structural prior that the multi-agent framework can reliably use to verify true causality without too many false positives or excessive overhead.

What would settle it

Running KRCA on a different hyper-scale microservice deployment and measuring whether AC@1 scores drop below 0.7 for localization or diagnosis time reduction falls below 50 percent would directly test whether the claimed gains hold.
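
The test above can be written down as a small check. The thresholds (0.7 for localization AC@1 and 50 percent for the time reduction) come from the paragraph above; the function names and the sample diagnosis times are hypothetical and serve only to show the arithmetic.

```python
def time_reduction(before_minutes, after_minutes):
    """Fractional reduction in average diagnosis time."""
    if before_minutes <= 0:
        raise ValueError("baseline diagnosis time must be positive")
    return (before_minutes - after_minutes) / before_minutes


def claim_survives(ac1_localization, before_minutes, after_minutes,
                   min_ac1=0.7, min_reduction=0.5):
    """True if a replication stays at or above both thresholds."""
    reduction = time_reduction(before_minutes, after_minutes)
    return ac1_localization >= min_ac1 and reduction >= min_reduction


if __name__ == "__main__":
    # Hypothetical replication results on a different deployment.
    print(claim_survives(0.88, before_minutes=100.0, after_minutes=22.7))
    print(claim_survives(0.65, before_minutes=100.0, after_minutes=22.7))
    print(claim_survives(0.88, before_minutes=100.0, after_minutes=60.0))
```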

Figures

Figures reproduced from arXiv: 2607.01788 by Dan Pei, Jiamin Jiang, Jielong Huang, Jingfei Feng, Nan Qi, Qingliang Zhang, Shenglin Zhang, Tianyu Cui, Wenwei Gu, Yao Wu, Yongqian Su, Yu Luo.

**Figure 2.** Empirical study on the limitations of existing RCA …

**Figure 4.** Illustration of the API-level drilldown process. Starting from the alerting API, KRCA recursively evaluates and prunes downstream APIs based on a scoring function.

… supplies similar historical cases and diagnostic experience. After several rounds of refinement, the final causal graph is used to generate a failure report that identifies the root cause service and failure type. …

**Figure 5.** Skeleton-based causal graph instantiation. (a) The …

**Figure 6.** Overview of the Multi-Agent Collaboration framework. The Main Agent orchestrates domain-specific Sub Agents …

**Figure 7.** Sensitivity analysis of key hyperparameters in …

**Figure 8.** Deployment architecture of KRCA.

… E-commerce, and Algorithms. During this period, we collected internal statistics on 483 emergency incidents. For each incident, the ground-truth root cause service and failure type were established through postmortem analysis by the SREs responsible for the affected services. According to these records, KRCA correctly identified both the root cause service and the failu…
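
The Figure 4 caption describes the drilldown only in outline: start at the alerting API, score downstream APIs, and prune recursively. A minimal Python sketch of that idea follows. The scoring function, the threshold of 0.5, the call-graph layout, and the API names are all assumptions for illustration; the paper's actual scoring function is not given in the text reproduced here.

```python
def drilldown(alerting_api, downstream, score, threshold=0.5):
    """Return the APIs kept after recursive pruning, starting from the alert.

    alerting_api: name of the API that raised the alert
    downstream:   dict mapping an API to the APIs it calls
    score:        callable giving a suspicion score for an API (hypothetical)
    threshold:    children scoring below this value are pruned (assumed)
    """
    kept = []
    visited = set()

    def visit(api):
        if api in visited:  # guard against cycles in the call graph
            return
        visited.add(api)
        kept.append(api)
        for child in downstream.get(api, ()):
            # A pruned child is never expanded, so its subtree is skipped.
            if score(child) >= threshold:
                visit(child)

    visit(alerting_api)
    return kept


if __name__ == "__main__":
    calls = {
        "gateway/feed": ["feed/rank", "feed/ads"],
        "feed/rank": ["user/profile"],
        "feed/ads": ["ads/bid"],
    }
    scores = {"feed/rank": 0.9, "feed/ads": 0.1, "user/profile": 0.7, "ads/bid": 0.95}
    print(drilldown("gateway/feed", calls, lambda api: scores.get(api, 0.0)))
    # -> ['gateway/feed', 'feed/rank', 'user/profile']
```

In the toy run, `ads/bid` scores highly but is never visited because its parent `feed/ads` was pruned. That is the efficiency benefit of pruning and also a possible recall risk of any scheme of this shape.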

Original abstract

Hyper-scale microservice systems have become the standard infrastructure for large-scale Internet companies. These systems consist of numerous loosely coupled microservices that evolve independently through continuous development and deployment. Such complexity makes failures unavoidable, necessitating efficient Root Cause Analysis (RCA) to help Site Reliability Engineers (SREs) quickly localize root cause services and classify failure types. However, existing RCA methods often struggle to adapt to the extreme dynamism and massive scale of these systems. In this paper, we present KRCA, an end-to-end RCA system designed for hyper-scale microservice systems. To manage the vast search space, KRCA employs a multi-stage pipeline that begins with an API-level drilldown to isolate suspicious services. It then instantiates a skeleton-based causal graph from anomalous metrics to serve as a high-recall structural prior, before utilizing a memory-augmented multi-agent framework to verify causality and generate the final failure report. By combining structured causal constraints with multi-agent reasoning, KRCA employs balances diagnostic accuracy with the efficiency requirements of real-time production use. Experimental results show that KRCA achieves AC@1 scores of 0.88 and 0.79 for root cause service localization and failure type classification, outperforming the strongest baseline by at lease 31% in absolute gains. KRCA has been deployed in Kuaishou's production environment for over six months, reducing the average diagnosis time by 77.3%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

KRCA describes a multi-stage RCA pipeline for microservices that mixes causal graphs with multi-agent AI and reports production deployment, but the abstract supplies zero methods or data details so the numbers cannot be checked.


The paper presents KRCA, an end-to-end system for root cause analysis in hyper-scale microservices. It narrows the search with API drilldown, builds a skeleton causal graph from anomalous metrics as a high-recall prior, then runs a memory-augmented multi-agent framework to verify causes and output reports. The authors say it has run in Kuaishou production for over six months, cutting diagnosis time by 77.3 percent, and they report AC@1 scores of 0.88 for service localization and 0.79 for failure classification, beating the strongest baseline by at least 31 percent absolute.

The combination of structured causal constraints with agent reasoning is a concrete engineering step for this domain and the production claim is the part that stands out. It shows someone actually tried to ship the thing rather than just run offline tests.

The soft spot is straightforward: the text gives no experimental setup, no baseline descriptions, no dataset details, no error bars, and no account of how the skeleton graph is constructed or how it handles topology changes. The stress-test concern about recall and false positives in the causal prior therefore holds; without those construction details the reported gains rest on uncheckable numbers.

This is for engineers working on microservice reliability who want a high-level template they can adapt. It does not yet show the clear thinking or reproducible evidence needed for a serious referee. I would not bring it to a reading group or cite it, and I would desk-reject rather than send to review.

Referee Report

3 major / 2 minor

Summary. The paper presents KRCA, an end-to-end root cause analysis system for hyper-scale microservice systems. It uses a multi-stage pipeline consisting of API-level drilldown to isolate suspicious services, instantiation of a skeleton-based causal graph from anomalous metrics as a high-recall structural prior, and a memory-augmented multi-agent framework to verify causality and produce failure reports. The central claims are AC@1 scores of 0.88 for root cause service localization and 0.79 for failure type classification (outperforming the strongest baseline by at least 31% absolute gain), plus a 77.3% reduction in average diagnosis time after six months of production deployment at Kuaishou.

Significance. If the empirical results and deployment claims hold under scrutiny, the work would be significant for practical RCA in large-scale, dynamic microservice environments by demonstrating how causal graph priors can be combined with agentic reasoning to balance accuracy and real-time efficiency. The production deployment evidence, if substantiated with before/after metrics, would constitute a notable strength for an applied systems paper.

major comments (3)

[Abstract] Abstract: The headline AC@1 scores (0.88/0.79), 31% gains, and 77.3% diagnosis-time reduction are stated without any experimental setup, dataset description, baseline definitions, anomaly detection thresholds, metric selection criteria, or statistical details (error bars, number of incidents, exclusion criteria). These omissions make the central performance claims impossible to evaluate or reproduce from the manuscript.
[Method (causal graph instantiation)] Method description of skeleton-based causal graph (the step immediately following API-level drilldown): No algorithm, pseudocode, or parameters are supplied for metric selection, anomaly scoring, edge extraction, or handling of service topology changes. In hyper-scale non-stationary systems this step is load-bearing for the high-recall prior assumption; without these details the downstream multi-agent verification claims cannot be assessed for false-positive or recall failure modes.
[Evaluation / Deployment] Deployment and evaluation sections: The six-month production deployment and 77.3% time reduction are asserted without before/after measurement methodology, incident sampling criteria, or comparison to prior SRE workflows, leaving the real-world impact claim unsupported.
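
As an illustration of the kind of statistical detail these comments ask for, an accuracy such as AC@1 can be reported with an interval that depends on the number of incidents. The Python sketch below uses the Wilson score interval for a binomial proportion; the incident counts in the example are hypothetical and are not taken from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials ** 2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical counts: the same 0.88 accuracy at two sample sizes.
    for hits, n in [(88, 100), (880, 1000)]:
        low, high = wilson_interval(hits, n)
        print(f"n={n}: AC@1={hits / n:.2f}, interval=({low:.3f}, {high:.3f})")
```

The same headline score is far less certain at 100 incidents than at 1000, which is why the incident count matters for evaluating the claim.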

minor comments (2)

[Abstract] Abstract contains two typographical errors: 'at lease' should be 'at least' and 'employs balances' appears to be a phrasing error.
[Method] The term 'memory-augmented multi-agent framework' is introduced without a high-level diagram or pseudocode showing agent roles, memory structure, or interaction protocol.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve detail, clarity, and reproducibility where needed.


Referee: [Abstract] Abstract: The headline AC@1 scores (0.88/0.79), 31% gains, and 77.3% diagnosis-time reduction are stated without any experimental setup, dataset description, baseline definitions, anomaly detection thresholds, metric selection criteria, or statistical details (error bars, number of incidents, exclusion criteria). These omissions make the central performance claims impossible to evaluate or reproduce from the manuscript.

Authors: We agree the abstract is too concise and omits key context. The Evaluation section (Section 4) contains the full experimental setup, including dataset descriptions (production incidents from Kuaishou), baseline definitions, anomaly thresholds, metric criteria, and statistical details with error bars and incident counts. To address the concern directly, we will revise the abstract to briefly note the evaluation scale (e.g., number of incidents and services) and explicitly reference Section 4 for complete setup, thresholds, and statistics. This makes the claims more evaluable while respecting abstract length limits.

revision: yes

Referee: [Method (causal graph instantiation)] Method description of skeleton-based causal graph (the step immediately following API-level drilldown): No algorithm, pseudocode, or parameters are supplied for metric selection, anomaly scoring, edge extraction, or handling of service topology changes. In hyper-scale non-stationary systems this step is load-bearing for the high-recall prior assumption; without these details the downstream multi-agent verification claims cannot be assessed for false-positive or recall failure modes.

Authors: We agree that the skeleton causal graph step requires more algorithmic transparency. In the revised manuscript we will add a dedicated subsection with pseudocode for metric selection, anomaly scoring, edge extraction from the skeleton, and handling of topology changes, including all parameters used. This will allow readers to assess the high-recall prior and any associated false-positive or recall issues.

revision: yes

Referee: [Evaluation / Deployment] Deployment and evaluation sections: The six-month production deployment and 77.3% time reduction are asserted without before/after measurement methodology, incident sampling criteria, or comparison to prior SRE workflows, leaving the real-world impact claim unsupported.

Authors: We agree the deployment claim would be stronger with explicit methodology. We will expand the deployment section to describe the before/after measurement approach, incident sampling criteria (e.g., selection rules and exclusion criteria), and a high-level comparison to prior SRE workflows. Some production details will remain aggregated due to company confidentiality policies, but the added methodology will substantiate the 77.3% reduction claim.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks


The paper describes an applied RCA pipeline (API drilldown, skeleton causal graph from anomalous metrics, memory-augmented multi-agent verification) and reports performance numbers (AC@1 0.88/0.79, 77.3% time reduction) from experiments and six-month production deployment. No equations, parameter fits, or derivations appear; the central claims are therefore not reducible to self-defined quantities or self-citation chains. The skeleton-graph step is presented as an engineering choice whose effectiveness is measured externally rather than assumed by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review provides no visibility into free parameters, background axioms, or independent evidence for introduced components.

invented entities (2)

memory-augmented multi-agent framework (no independent evidence)
purpose: verify causality and generate final failure report
Introduced as core component of KRCA but no independent evidence or falsifiable handle supplied in abstract.

skeleton-based causal graph (no independent evidence)
purpose: high-recall structural prior from anomalous metrics
Constructed as intermediate artifact but no details on construction method or validation outside the system.

pith-pipeline@v0.9.1-grok · 5824 in / 1364 out tokens · 38439 ms · 2026-07-03T09:00:33.200703+00:00 · methodology


Reference graph

Works this paper leans on

66 extracted references · 5 canonical work pages

[1]

Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, and Saravan Rajmohan. 2023. Recommending root-cause and mitigation steps for cloud incidents using large language models. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1737–1749

2023
[2]

Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. 2009. Pearson correlation coefficient. InNoise reduction in speech processing. Springer, 1–4

2009
[3]

O’Reilly Media, Inc

Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. 2016.Site reliability engineering: how Google runs production systems. " O’Reilly Media, Inc. "

2016
[4]

2011.Bayesian inference in statistical analysis

George EP Box and George C Tiao. 2011.Bayesian inference in statistical analysis. John Wiley & Sons

2011
[5]

Pengfei Chen, Yong Qi, Pengfei Zheng, and Di Hou. 2014. Causeinfer: automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems. InIEEE INFOCOM 2014-IEEE Conference on Computer Communications. IEEE, 1887–1895

2014
[6]

Yinfang Chen et al. 2024. Automatic root cause analysis via large language models for cloud incidents. InProceedings of the Nineteenth European Conference on Computer Systems, 674–688

2024
[7]

Yinfang Chen et al. 2025. Stratus: a multi-agent system for autonomous relia- bility engineering of modern clouds.arXiv preprint arXiv:2506.02009

work page arXiv 2025
[8]

Yuxiao Cheng, Lianglong Li, Tingxiong Xiao, Zongren Li, Jinli Suo, Kunlun He, and Qionghai Dai. 2024. Cuts+: high-dimensional causal discovery from irreg- ular time-series. InProceedings of the AAAI Conference on Artificial Intelligence number 10. Vol. 38, 11525–11533

2024
[9]

Tianyu Cui et al. 2025. Logeval: a comprehensive benchmark suite for llms in log analysis.Empirical Software Engineering, 30, 6, 173

2025
[10]

Huaming Du, Yujia Zheng, Baoyu Jing, Yu Zhao, Gang Kou, Guisong Liu, Tao Gu, Weimin Li, and Carl Yang. 2025. Causal discovery through synergizing large language model and data-driven reasoning. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, 543–554

2025
[11]

Tao Feng, Lizhen Qu, Niket Tandon, Zhuang Li, Xiaoxi Kang, and Gholamreza Haffari. 2025. On the reliability of large language models for causal discovery. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 9565–9590

2025
[12]

Ruowei Fu et al. 2025. Llm-powered multi-agent collaboration for intelligent industrial on-call automation. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2222–2234

2025
[13]

Google Cloud Platform. 2021. Online boutique: a cloud-native microservices demo application. https://github.com/GoogleCloudPlatform/microservices-de mo. Accessed: 2026-03-23. (2021)

2021
[14]

Clive WJ Granger. 1980. Testing for causality: a personal viewpoint.Journal of Economic Dynamics and control, 2, 329–352

1980
[15]

Yongqi Han, Qingfeng Du, Ying Huang, Jiaqi Wu, Fulong Tian, and Cheng He
[16]

InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 931–943

The potential of one-shot failure root cause analysis: collaboration of the large language model and small classifier. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 931–943
[17]

Azam Ikram, Sarthak Chakraborty, Subrata Mitra, Shiv Saini, Saurabh Bagchi, and Murat Kocaoglu. 2022. Root cause analysis of failures in microservices through causal discovery.Advances in Neural Information Processing Systems, 35, 31158–31170

2022
[18]

2010.Random walk: a modern introduction

Gregory F Lawler and Vlada Limic. 2010.Random walk: a modern introduction. Vol. 123. Cambridge University Press

2010
[19]

Mingjie Li, Zeyan Li, Kanglin Yin, Xiaohui Nie, Wenchi Zhang, Kaixin Sui, and Dan Pei. 2022. Causal inference-based root cause analysis for online service systems with intervention recognition. InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3230–3240

2022
[20]

Zeyan Li et al. 2023. Generic and robust root cause localization for multi- dimensional data in online service systems.Journal of Systems and Software, 203, 111748

2023
[21]

Zeyan Li et al. 2021. Practical root cause localization for microservice systems via trace analysis. In2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS). IEEE, 1–10

2021
[22]

Cheng-Ming Lin, Ching Chang, Wei-Yao Wang, Kuang-Da Wang, and Wen- Chih Peng. 2024. Root cause analysis in microservice using neural granger causal discovery. InProceedings of the AAAI Conference on Artificial Intelligence number 1. Vol. 38, 206–213

2024
[23]

Dewei Liu, Chuan He, Xin Peng, Fan Lin, Chenxi Zhang, Shengfang Gong, Ziang Li, Jiayu Ou, and Zheshun Wu. 2021. Microhecl: high-efficient root cause localization in large-scale microservice systems. In2021 IEEE/ACM 43rd Inter- national Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 338–347

2021
[24]

Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation forest. In2008 eighth ieee international conference on data mining. IEEE, 413–422

2008
[25]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: how language models use long contexts.Transactions of the association for computational linguistics, 12, 157–173

2024
[26]

Qihan Liu, Pengfei Chen, Guangba Yu, Yuanhao Lai, and Xiaoyun Li. 2025. Causelens: causality-based interpretable root cause analysis for microservice systems. In2025 IEEE/ACM 33rd International Symposium on Quality of Service (IWQoS). IEEE, 1–10

2025
[27]

Yilun Liu, Shimin Tao, Weibin Meng, Feiyu Yao, Xiaofeng Zhao, and Hao Yang
[28]

InProceedings of the 2024 IEEE/ACM 46th international conference on software engineering: Companion proceedings, 364–365

Logprompt: prompt engineering towards zero-shot and interpretable log analysis. InProceedings of the 2024 IEEE/ACM 46th international conference on software engineering: Companion proceedings, 364–365

2024
[29]

Yilun Liu et al. 2025. R-log: incentivizing log analysis capability in llms via reasoning-based reinforcement learning.arXiv preprint arXiv:2509.25987

work page arXiv 2025
[30]

Meng Ma, Jingmin Xu, Yuan Wang, Pengfei Chen, Zonghua Zhang, and Ping Wang. 2020. Automap: diagnose your microservice-based web applications automatically. InProceedings of The Web Conference 2020, 246–258

2020
[31]

Yuan Meng, Shenglin Zhang, Yongqian Sun, Ruru Zhang, Zhilong Hu, Yiyin Zhang, Chenyang Jia, Zhaogang Wang, and Dan Pei. 2020. Localizing failure root causes in a microservice through causality inference. In2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS). IEEE, 1–10

2020
[32]

Changhua Pei et al. 2025. Flow-of-action: sop enhanced llm-based multi-agent system for root cause analysis. InCompanion Proceedings of the ACM on Web Conference 2025, 422–431

2025
[33]

Stephen Robertson, Hugo Zaragoza, and Michael Taylor. 2004. Simple bm25 extension to multiple weighted fields. InProceedings of the thirteenth ACM international conference on Information and knowledge management, 42–49

2004
[34]

Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, and Saravan Rajmohan. 2024. Exploring llm-based agents for root cause analysis. InCompanion proceedings of the 32nd ACM international conference on the foundations of software engineering, 208–219

2024
[35]

Binpeng Shi et al. 2025. Flowxpert: expertizing troubleshooting workflow orchestration with knowledge base and multi-agent coevolution. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, 4839–4850

2025
[36]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: language agents with verbal reinforcement learn- ing.Advances in neural information processing systems, 36, 8634–8652

2023
[37]

2000.Causation, predic- tion, and search

Peter Spirtes, Clark N Glymour, and Richard Scheines. 2000.Causation, predic- tion, and search. MIT press

2000
[38]

Yongqian Sun, Yu Luo, Xidao Wen, Yuan Yuan, Xiaohui Nie, Shenglin Zhang, Tong Liu, and Xi Luo. 2025. Trioxpert: an automated incident management framework for microservice system.arXiv preprint arXiv:2506.10043

work page arXiv 2025
[39]

Yongqian Sun, Binpeng Shi, Mingyu Mao, Minghua Ma, Sibo Xia, Shenglin Zhang, and Dan Pei. 2024. Art: a unified unsupervised framework for incident management in microservice systems. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 1183–1194

2024
[40]

Yongqian Sun et al. 2025. Llm-augmented ticket aggregation for low-cost mobile os defect resolution. InProceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, 215–226

2025
[41]

Yuni Susanti and Michael Färber. 2025. Paths to causality: finding informative subgraphs within knowledge graphs for knowledge-based causal discovery. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, 2778–2789

2025
[42]

Guangya Wan, Yunsheng Lu, Yuqi Wu, Mengxuan Hu, and Sheng Li. 2024. Large language models for causal discovery: current landscape and future directions.arXiv preprint arXiv:2402.11068

work page arXiv 2024
[43]

Chenxu Wang et al. 2025. Towards llm-based failure localization in production- scale networks. InProceedings of the ACM SIGCOMM 2025 Conference, 496– 511

2025
[44]

Dongjie Wang, Zhengzhang Chen, Yanjie Fu, Yanchi Liu, and Haifeng Chen
[45]

In Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, 2269–2278

Incremental causal graph learning for online root cause analysis. In Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, 2269–2278
[46]

Yidan Wang, Zhouruixing Zhu, Qiuai Fu, Yuchi Ma, and Pinjia He. 2024. Mrca: metric-level root cause analysis for microservices via multi-modal data. InPro- ceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 1057–1068

2024
[47]

Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Jihong Wang, Fengbin Yin, Lunting Fan, Lingfei Wu, and Qingsong Wen. 2024. Rcagent: cloud root cause analysis by autonomous agents with tool-augmented large language models. InProceedings of the 33rd ACM International Conference on Information and Knowledge Management, 4966–4974

2024
[48]

Zeying Wang et al. 2025. Kaiops: a platform solution of end-to-end multi-modal aiops for ai training at scale. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 3192–3203

2025
[49]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits rea- soning in large language models.Advances in neural information processing systems, 35, 24824–24837. ASE ’26, October 12–16, 2026, Munich, Germany Jiamin Jiang and Yongqian Sun et al

2022
[50]

Canhua Wu et al. 2021. Identifying root-cause metrics for incident diagnosis in online service systems. In2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE). IEEE, 91–102

2021
[51]

Zhe Xie et al. 2026. Foundroot: towards foundation model for root cause analysis via structured deep thinking

2026
[52]

Zhe Xie et al. 2024. Microservice root cause analysis with limited observability through intervention recognition in the latent space. InProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, 6049–6060

2024
[53]

Junjielong Xu et al. 2025. Openrca: can large language models locate the root cause of software failures? InThe Thirteenth International Conference on Learn- ing Representations

2025
[54]

Jian Yang, Zian Wang, Shuangwu Chen, Huasen He, Yunpeng Hou, and Xi- aofeng Jiang. 2025. Hg-pad: heterogeneous graph structure learning aided performance anomaly diagnosis in microservice systems.IEEE Transactions on Services Computing

2025
[55]

Xiaojie Yang, Hangli Ge, Jiawei Wang, Zipei Fan, Renhe Jiang, Ryosuke Shibasaki, and Noboru Koshizuka. 2025. Causalmob: causal human mobility prediction with llms-derived human intentions toward public events. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, 1773–1784

2025
[56]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: synergizing reasoning and acting in language mod- els. InThe eleventh international conference on learning representations

2022
[57]

Zhenhe Yao et al. 2024. Sparserca: unsupervised root cause analysis in sparse microservice testing traces. In2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 391–402

2024
[58]

Guangba Yu, Pengfei Chen, Yufeng Li, Hongyang Chen, Xiaoyun Li, and Zibin Zheng. 2023. Nezha: interpretable fine-grained root causes analysis for mi- croservices on multi-modal observability data. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Founda- tions of Software Engineering, 553–565

2023
[59]

Shenglin Zhang, Xiaoyu Feng, Runzhou Wang, Minghua Ma, Wenwei Gu, Yongqian Sun, Zedong Jia, Jinrui Sun, and Dan Pei. 2025. Too many cooks: assessing the need for multi-source data in microservice failure diagnosis. In 2025 IEEE 36th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 1–12

2025
[60]

Shenglin Zhang, Chenyu Zhao, Yicheng Sui, Ya Su, Yongqian Sun, Yuzhi Zhang, Dan Pei, and Yizhe Wang. 2021. Robust kpi anomaly detection for large-scale software services with partial labels. In2021 IEEE 32nd international symposium on software reliability engineering (ISSRE). IEEE, 103–114

2021
[61]

Shenglin Zhang et al. 2024. Illuminating the gray zone: non-intrusive gray failure localization in server operating systems. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 126–137

2024
[62]

Shenglin Zhang et al. 2023. Robust failure diagnosis of microservice system through multimodal data.IEEE Transactions on Services Computing, 16, 6, 3851– 3864

2023
[63]

Chenyu Zhao et al. 2023. Robust multimodal failure detection for microservice systems. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 5639–5649

[64]

Yongxin Zhao et al. [n. d.] When llms listen to experts: accurate failure diagnosis in operating systems
[65]

Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, Zhiyuan Liu, Weize Chen, Jianming Wu, Jiesi Liu, Ruohang Feng, and Guoyang Zeng. 2023. D-bot: database diagnosis system using large language models. arXiv preprint arXiv:2312.01454

[66]

Zhouruixing Zhu, Cheryl Lee, Xiaoying Tang, and Pinjia He. 2024. Hemirca: fine-grained root cause analysis for microservices with heterogeneous data sources. ACM Transactions on Software Engineering and Methodology, 33, 8, 1–25.
