Agentic RCA
ML pipeline that turns production logs into root-cause reports.
An ML-driven root-cause-analysis pipeline for production incidents. It ingests logs from ELK, Azure, and monitoring systems, normalizes heterogeneous unstructured telemetry into structured records, and clusters similar errors by log fingerprint — so recurring incidents are analyzed once, not on every recurrence.
An entity-correlation layer links each cluster to deployment events (ArgoCD), code changes (Azure DevOps), and infrastructure signals (Grafana/Prometheus). An LLM reasoning layer then generates structured RCA reports — root cause, affected services, suspected commits, remediation steps — and resolved incidents feed a continuously updated knowledge base, so known failure patterns are never re-diagnosed.
- org
- Bajaj Finserv Health
- status
- current
- impact
- Faster mean-time-to-diagnosis; recurring incidents clustered, not re-analyzed.
- stack
- PythonELKArgoCDAzure DevOpsGrafanaPrometheusLLMs