@tool
def sql_investigate(question: str) -> dict:
    """Run a SQL query against the DuckDB connection ``con`` and return a bounded preview.

    Returns a dict with the row count, column names, and up to 30 preview rows.
    On failure returns ``{"error": <message>}`` instead of raising, so the
    calling agent can observe the failure and retry with a corrected query.
    """
    try:
        df = con.execute(question).df()
        head = df.head(30)  # cap the preview so tool output stays small for the LLM
        return {
            "rows": int(len(df)),
            "columns": list(df.columns),
            "preview": head.to_dict(orient="records"),
        }
    except Exception as e:
        return {"error": str(e)}


@tool
def log_pattern_scan(window_start_iso: str, window_end_iso: str, top_k: int = 8) -> dict:
    """Summarize WARN/ERROR log patterns inside an incident window.

    Filters the module-level ``logs_df`` to [window_start_iso, window_end_iso]
    (inclusive on both ends), then reports the top error kinds / services /
    endpoints among WARN+ERROR rows, error counts by region, and the p95
    latency computed over ALL rows in the window (not only errors).
    """
    ws = pd.to_datetime(window_start_iso)
    we = pd.to_datetime(window_end_iso)
    df = logs_df[(logs_df["ts"] >= ws) & (logs_df["ts"] <= we)].copy()
    if df.empty:
        return {"rows": 0, "top_error_kinds": [], "top_services": [], "top_endpoints": []}
    # Normalize missing error kinds so value_counts groups them under one label.
    df["error_kind_norm"] = df["error_kind"].fillna("").replace("", "NONE")
    err = df[df["level"].isin(["WARN", "ERROR"])].copy()
    top_err = err["error_kind_norm"].value_counts().head(int(top_k)).to_dict()
    top_svc = err["service"].value_counts().head(int(top_k)).to_dict()
    top_ep = err["endpoint"].value_counts().head(int(top_k)).to_dict()
    by_region = (
        err.groupby("region").size().sort_values(ascending=False).head(int(top_k)).to_dict()
    )
    p95_latency = float(np.percentile(df["latency_ms"].values, 95))
    return {
        "rows": int(len(df)),
        "warn_error_rows": int(len(err)),
        "p95_latency_ms": p95_latency,
        "top_error_kinds": top_err,
        "top_services": top_svc,
        "top_endpoints": top_ep,
        "error_by_region": by_region,
    }


@tool
def propose_mitigations(hypothesis: str) -> dict:
    """Map a root-cause hypothesis to a concrete, owned mitigation plan.

    Keyword-matches the lowercased hypothesis against known failure themes
    (DB/connection-pool, timeout/upstream, cache) and returns up to 10
    mitigation actions, each with an owner and an ETA in days. Falls back
    to generic observability/load-test actions when no theme matches.
    """
    h = hypothesis.lower()
    mitigations = []
    if ("conn" in h) or ("pool" in h) or ("db" in h):
        mitigations += [
            {"action": "Increase DB connection pool size (bounded) and add backpressure at db-proxy", "owner": "Platform", "eta_days": 3},
            {"action": "Add circuit breaker + adaptive timeouts between api-gateway and db-proxy", "owner": "Backend", "eta_days": 5},
            {"action": "Tune query hotspots; add indexes for top offending endpoints", "owner": "Data/DBA", "eta_days": 7},
        ]
    if ("timeout" in h) or ("upstream" in h):
        mitigations += [
            {"action": "Implement hedged requests for idempotent calls (carefully) and tighten retry budgets", "owner": "Backend", "eta_days": 6},
            {"action": "Add upstream SLO-aware load shedding at api-gateway", "owner": "Platform", "eta_days": 7},
        ]
    if "cache" in h:
        mitigations += [
            {"action": "Add request coalescing and negative caching to prevent cache-miss storms", "owner": "Backend", "eta_days": 6},
            {"action": "Prewarm cache for top endpoints during deploys", "owner": "SRE", "eta_days": 4},
        ]
    if not mitigations:
        mitigations += [
            {"action": "Add targeted dashboards and alerts for the suspected bottleneck metric", "owner": "SRE", "eta_days": 3},
            {"action": "Run controlled load test to reproduce and validate the hypothesis", "owner": "Perf Eng", "eta_days": 5},
        ]
    mitigations = mitigations[:10]  # keep the plan reviewable
    return {"hypothesis": hypothesis, "mitigations": mitigations}
return {“speculation”: speculation, “mitigation”: mitigation} @instrument defdraft_postmortem(title: str, window_start_iso: str, window_end_iso: str, customer_impact: str, suspicious_root_cause: str, key_facts_json: str, mitigations_json: str) -> dict: attempt:details = json.hundreds(key_facts_json) besides exception: details = {“notice”: “key_facts_json was not legitimate JSON”} attempt: mits = json.hundreds(mitigations_json) besides: mits = {“notice”: “mitigations_json was not legitimate JSON”} doc = { “title”: title, “date_utc”: datetime.utcnow().strftime(“%Y-%m-%d”), “incident_window_utc”: {“begin”: window_start_iso, “finish”: window_end_iso}, “customer_impact”: customer_impact, “suspected_root_cause”:suspected_root_cause, “detection”: { “how_detected”: “Automated anomaly detection + Error price spike triage”, “hole”: [“Add earlier saturation alerting”, “Improve symptom-to-cause correlation dashboards”]
}, “Timeline”: [
{“t”: window_start_iso, “event”: “Symptoms begin (latency/error anomalies)”},
{“t”: “T+10m”, “event”: “On-call begins triage; identifies top services/endpoints”},
{“t”: “T+25m”, “event”: “Mitigation actions initiated (throttling/backpressure)”},
{“t”: window_end_iso, “event”: “Customer impact ends; metrics stabilize”},
]”key_facts”: details, “corrective_actions”: mits.get(“mitigation”, mits), “follow-up”: [
{“area”: “Reliability”, “task”: “Add saturation signals + budget-based retries”, “priority”: “P1”},
{“area”: “Observability”, “task”: “Add golden signals per service/endpoint”, “priority”: “P1”},
{“area”: “Performance”, “task”: “Reproduce with load test and validate fix”, “priority”: “P2″},
]”appendix”: {“notes”: “Generated by Haystack multi-agent workflow (non-RAG).”} } return {“postmortem_json”: doc} llm = OpenAIChatGenerator(mannequin=”gpt-4o-mini”) state_schema = { “metrics_csv_path”: {“sort”: str}, “logs_csv_path”: {“sort”: str}, “metrics_summary”: {“sort”: dict}, “logs_summary”: {“sort”: dict}, “incident_window”: {“sort”: dict}, “investigation_notes”: {“sort”: checklist, “handler”: merge_lists}, “speculation”: {“sort”: str}, “key_facts”: {“sort”: dict}, “mitigation_plan”: {“sort”: dict}, “postmortem”: {“sort”: dict}, } profiler_prompt = “””You might be an incident profiler specialist. Purpose: Rework uncooked metrics/log summaries into crisp, high-signal outcomes. Guidelines: – Want instrument invocations over guesswork. – Output needs to be a JSON object containing keys: window, signs, top_contributors, hypotheses, key_facts. – Hypotheses have to be falsifiable and point out a minimum of one particular service and mechanism. “”” Writer_prompt = “””You’ll use the supplied proof and mitigation plan to create high-quality autopsy JSON. – Make ‘suspected_root_cause’ particular, not normal. Confirm that the remediation motion consists of proprietor and eta_day: coordinator_prompt = “””You might be an incident commander coordinating a non-RAG multi-agent workflow. It is advisable to: 1) Load the enter 2) Discover the incident window (utilizing p95_ms or error_rate) 3) Examine with focused SQL and log sample scanning 4) Ask the professional profiler to synthesize the proof 5) Recommend mitigations 6) Ask the professional author to draft the autopsy JSON Returns the ultimate response beneath. 
# --- Agent wiring -----------------------------------------------------------
# Two specialist agents (profiler, writer) are wrapped as ComponentTools so
# the coordinator agent can delegate to them like any other tool.

profiler_agent = Agent(
    chat_generator=llm,
    tools=[load_inputs, detect_incident_window, sql_investigate, log_pattern_scan],
    system_prompt=profiler_prompt,
    exit_conditions=["text"],
    state_schema=state_schema,
)

writer_agent = Agent(
    chat_generator=llm,
    tools=[draft_postmortem],
    system_prompt=writer_prompt,
    exit_conditions=["text"],
    state_schema=state_schema,
)

profiler_tool = ComponentTool(
    component=profiler_agent,
    name="profiler_specialist",
    description="Synthesize incident evidence into falsifiable hypotheses and key facts (JSON output).",
    outputs_to_string={"source": "last_message"},
)

writer_tool = ComponentTool(
    component=writer_agent,
    name="postmortem_writer_specialist",
    description="Draft postmortem JSON using title/window/impact/rca/facts/mitigations.",
    outputs_to_string={"source": "last_message"},
)

coordinator_agent = Agent(
    chat_generator=llm,
    tools=[
        load_inputs,
        detect_incident_window,
        sql_investigate,
        log_pattern_scan,
        propose_mitigations,
        profiler_tool,
        writer_tool,
        draft_postmortem,
    ],
    system_prompt=coordinator_prompt,
    exit_conditions=["text"],
    state_schema=state_schema,
)
# NOTE(review): this region was an exact, byte-for-byte duplicate of the
# definitions above (the propose_mitigations body, draft_postmortem, the
# llm/state_schema/prompt setup, and the agent wiring), starting mid-function
# with no def header — evidently a copy/paste or scraping artifact. The
# duplicate has been removed; the single set of definitions above is
# authoritative.


