vaibhav.vanage
project · current

Agentic RCA

ML pipeline that turns production logs into root-cause reports.

AI Engineering · DevSecOps · cross-domain

An ML-driven root-cause-analysis pipeline for production incidents. It ingests logs from ELK, Azure, and monitoring systems, normalizes heterogeneous unstructured telemetry into structured records, and clusters similar errors by log fingerprint — so recurring incidents are analyzed once, not on every recurrence.
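One way to picture clustering by log fingerprint is the minimal sketch below. It is not the project's actual code: the masking rules, the `service` and `message` field names, and the helper names `template`, `fingerprint`, and `cluster` are all assumptions made for illustration. The idea is to mask the volatile parts of a message (IDs, addresses, numbers), hash what remains, and group records that share a hash.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical masking rules (an assumption, not the project's real rule set).
# Order matters: specific patterns run before the generic number mask.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<uuid>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-f]+\b"), "<hex>"),
    (re.compile(r"\d+"), "<num>"),
]


def template(message: str) -> str:
    """Reduce a raw log message to a stable template by masking volatile tokens."""
    text = message.strip().lower()
    for pattern, placeholder in _MASKS:
        text = pattern.sub(placeholder, text)
    return re.sub(r"\s+", " ", text)


def fingerprint(service: str, message: str) -> str:
    """Short, stable identifier for the pair (service, message template)."""
    key = f"{service}|{template(message)}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def cluster(records: list[dict]) -> dict[str, list[dict]]:
    """Group normalized records (dicts with 'service' and 'message') by fingerprint."""
    clusters: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        clusters[fingerprint(record["service"], record["message"])].append(record)
    return dict(clusters)
```

With these rules, two timeouts from the same service that differ only in duration, target IP, and order number fall into one cluster, so the expensive analysis can run once per fingerprint rather than once per log line.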

An entity-correlation layer links each cluster to deployment events (ArgoCD), code changes (Azure DevOps), and infrastructure signals (Grafana/Prometheus). An LLM reasoning layer then generates structured RCA reports — root cause, affected services, suspected commits, remediation steps — and resolved incidents feed a continuously updated knowledge base, so known failure patterns are never re-diagnosed.
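The correlation and knowledge-base steps can be sketched roughly as follows. Everything here is an assumption for illustration: the `DeployEvent` and `RCAReport` shapes, the two-hour lookback window, and the `build_report` callable that stands in for the LLM reasoning layer. No real ArgoCD or Azure DevOps API is called; deployment events are assumed to have been fetched already.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable


@dataclass(frozen=True)
class DeployEvent:
    """Hypothetical, already-fetched deployment record."""
    service: str
    commit: str
    deployed_at: datetime


@dataclass
class RCAReport:
    """Report fields taken from the description above; the layout is assumed."""
    root_cause: str
    affected_services: list[str]
    suspected_commits: list[str]
    remediation_steps: list[str] = field(default_factory=list)


def suspect_deploys(
    first_seen: datetime,
    services: set[str],
    deploys: list[DeployEvent],
    lookback: timedelta = timedelta(hours=2),  # assumed window, tune per system
) -> list[DeployEvent]:
    """Deploys to the affected services shortly before the cluster first appeared,
    most recent first."""
    window_start = first_seen - lookback
    hits = [
        d for d in deploys
        if d.service in services and window_start <= d.deployed_at <= first_seen
    ]
    return sorted(hits, key=lambda d: d.deployed_at, reverse=True)


def diagnose(
    fp: str,
    knowledge_base: dict[str, RCAReport],
    build_report: Callable[[str], RCAReport],
) -> RCAReport:
    """Return the stored report for a known fingerprint; otherwise build one
    (for example via an LLM call) and store it for next time."""
    if fp not in knowledge_base:
        knowledge_base[fp] = build_report(fp)
    return knowledge_base[fp]
```

A plain dict stands in for the knowledge base here; a production system would presumably persist it and version the reports, but the lookup-before-diagnose shape is the same.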

org
Bajaj Finserv Health
status
current
impact
Faster mean-time-to-diagnosis; recurring incidents clustered, not re-analyzed.
stack
Python · ELK · ArgoCD · Azure DevOps · Grafana · Prometheus · LLMs

// skills

ELK · Grafana · Prometheus
LLM Orchestration
