NEW Research from Google.

Integration test failures are painful because the signal is buried in messy logs. Massive output, heterogeneous systems, low signal-to-noise ratio, and unclear root causes.

This paper introduces Auto-Diagnose, an LLM-based tool deployed inside Google's Critique code review system. Auto-Diagnose analyzes failure logs, summarizes the most relevant lines, and suggests the root cause in the developer workflow where the failure is already being reviewed.

The deployment numbers are notable. In a manual evaluation of 71 real-world failures, Auto-Diagnose reached 90.14% root-cause diagnosis accuracy. After Google-wide deployment, it was used across 52,635 distinct failing tests. User feedback marked it "Not helpful" in only 5.8% of cases, and it ranked #14 in helpfulness among 370 Critique tools.

Paper: https://arxiv.org/abs/2604.12108

Learn to build effective AI agents in our academy: https://academy.dair.ai/

The image shows the first page of the arXiv preprint. The title, "LLM-Based Automated Diagnosis Of Integration Test Failures At Google," appears in large bold type above the author names Celal Ziftci, Spencer Greene, Ray Liu, and Livio Dalloro, with their Google New York affiliations and email addresses. Visible text includes the Abstract, which introduces Auto-Diagnose as an LLM-based tool integrated into Google's Critique code review system and reports 90.14% root-cause accuracy on a 71-failure evaluation and use across 52,635 failing tests; the start of Section 1 (Introduction) on software testing and log analysis challenges; CCS Concepts; Keywords (Software, Testing, Debugging, Diagnosis, Productivity, LLM); a Preprint Notice for ICSE 2026; and a vertical left-margin stamp reading arXiv:2604.12108v1 [cs.SE] 13 Apr 2026.

Celal Ziftci and three co-authors posted the preprint on 13 Apr 2026. It details Google's use of large language models to automatically diagnose integration test failures. The stated aim is to cut manual debugging time at scale in Google's CI/CD pipelines, which could in turn support faster release cycles and higher reliability.
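As a quick sanity check on the headline figures, the short sketch below recovers the whole-number count behind the reported 90.14% accuracy and converts the #14-of-370 ranking into a percentile. It assumes the accuracy is a plain ratio of correct diagnoses to the 71 evaluated failures; the count of correct diagnoses is inferred here rather than quoted from the post, and the function name is illustrative only.

```python
# Figures quoted in the post. The number of correct diagnoses is NOT quoted;
# it is inferred below under the assumption that accuracy = correct / total.
EVALUATED_FAILURES = 71
REPORTED_ACCURACY_PCT = 90.14
HELPFULNESS_RANK = 14
TOTAL_CRITIQUE_TOOLS = 370


def implied_correct_count(total: int, accuracy_pct: float) -> int:
    """Return the whole count of correct cases whose ratio rounds to accuracy_pct."""
    for correct in range(total + 1):
        if round(100 * correct / total, 2) == accuracy_pct:
            return correct
    raise ValueError("no whole count matches the reported percentage")


correct = implied_correct_count(EVALUATED_FAILURES, REPORTED_ACCURACY_PCT)
rank_percentile = 100 * HELPFULNESS_RANK / TOTAL_CRITIQUE_TOOLS

print(f"{correct} of {EVALUATED_FAILURES} diagnoses judged correct")
print(f"rank #{HELPFULNESS_RANK} of {TOTAL_CRITIQUE_TOOLS} is roughly the top {rank_percentile:.1f}%")
```

Under that assumption, the reported percentage is consistent with 64 correct diagnoses out of 71, and the helpfulness ranking sits within roughly the top 4% of Critique tools.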
| Time | Views | Likes | Bookmarks | Retweets | Replies |
|---|---|---|---|---|---|
| 11:00 AM UTC | +55 | +1 | — | — | — |
| 10:50 AM UTC | +59 | — | — | — | — |
| 10:40 AM UTC | +78 | +2 | — | +1 | — |
| 10:30 AM UTC | +48 | — | — | — | — |
| 10:20 AM UTC | +61 | +1 | +2 | — | — |
| 10:10 AM UTC | +52 | — | — | — | — |
| 10:00 AM UTC | +75 | +1 | — | — | +1 |
| 9:50 AM UTC | +6 | — | — | — | — |
| 9:40 AM UTC | +131 | +1 | — | — | — |
| 9:30 AM UTC | +52 | +2 | — | — | — |