Aqweth sees your production so your on-call engineers don't have to.
// ˈAK-WETH // NOUN · PRODUCTION RELIABILITY
An AI SRE agent for autonomous root-cause analysis. 15 parallel fetch nodes investigate across logs, metrics, traces, deploys, and code — and report back to on-call engineers in plain language within seconds.
Root cause
Null check removed in PaymentProcessor.validate() in commit a3f1c92 (deploy 14:28).
Suggested fix
Revert to v2.4.0 or patch null guard on line 142.
30–90 minutes of every incident goes to gathering evidence, not fixing it.
Aqweth automates the investigation: AI-powered RCA that cuts time to diagnosis from 30–90 minutes to seconds.
Mean time to diagnosis
30–90
minutes
Engineering cost per incident
2–6
eng-hours
Who gets paged
1
engineer
What you're correlating, by hand, at 03:14 a.m.
Production incidents cost more than downtime.
Three structural problems compound every incident response.
problem · 01
Silent degradation
Error rates climb, latency drifts — gradually, beneath alert thresholds. No single metric trips the wire. By the time a page fires, users have already been impacted for minutes and the clearest evidence has started to decay.
problem · 02
Access walls
Engineers are paged at 3 AM and spend the first 20 minutes navigating VPN, requesting elevated access, and waiting for approval. By the time they reach logs, the critical window has passed.
problem · 03
Tool fragmentation
RCA means manually cross-referencing five different systems with no shared timeline. Every tool has a different auth flow, a different query syntax, and a different data model — all under pressure, in the middle of the night.
One investigation. Fifteen sources. Seconds.
Triage runs first — noise dismissed before a single LLM token is spent.
/rca · alert · proactive
Trigger
Invoke with /rca in Slack, connect to your alerting pipeline, or let Aqweth run proactive scans on a schedule. Any alert format, any channel.
Dedupe + classify
Triage
Signal is separated from noise before a single LLM token is spent. Duplicate alerts are merged, severity is classified, irrelevant signals are dropped.
15 fetch nodes · parallel
Fan-out
Up to 15 fetch nodes execute in parallel, each querying a different backend. Slow or offline backends time out gracefully — the rest continue.
RCA card → chat
Synthesise
All evidence is assembled into a structured RCA card with confidence score, root cause, and suggested fix. Streamed directly to the Slack thread that triggered the investigation.
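The triage and fan-out steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not Aqweth's implementation: the `Alert` fields, the `(service, name)` fingerprint used for deduplication, the per-node timeout, and the stand-in fetchers `fetch_logs` and `fetch_traces` are all hypothetical.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape; real alert payloads will differ.
    service: str
    name: str
    severity: str


def dedupe(alerts):
    """Merge duplicate alerts, keeping the first one seen per (service, name)."""
    seen = {}
    for alert in alerts:
        seen.setdefault((alert.service, alert.name), alert)
    return list(seen.values())


async def fan_out(fetchers, timeout_s):
    """Run every fetcher concurrently; a slow or failing one yields None."""

    async def guarded(fetch):
        try:
            return await asyncio.wait_for(fetch(), timeout=timeout_s)
        except Exception:
            # Timeouts and backend errors both land here.
            # Assumed policy: drop that source and let the rest continue.
            return None

    names = list(fetchers)
    results = await asyncio.gather(*(guarded(fetchers[n]) for n in names))
    return dict(zip(names, results))


# Stand-in backends, for illustration only.
async def fetch_logs():
    await asyncio.sleep(0.01)
    return "47 ERRORs"


async def fetch_traces():
    await asyncio.sleep(5)  # simulates a backend that is too slow
    return "never returned"


def investigate():
    return asyncio.run(
        fan_out({"logs": fetch_logs, "traces": fetch_traces}, timeout_s=0.1)
    )
```

In this sketch, `investigate()` returns the logs result and `None` for the slow traces backend, so a single unresponsive source does not hold up the rest of the evidence.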
The deliverable
A cited RCA card. In the chat platform you already use.
Hypothesis with citations. Confidence score. Ranked fixes. Approve or reject — never auto-applied.
- 01
Every claim is cited to the raw evidence — log line, span ID, deploy SHA, ticket. No hallucinated conclusions.
- 02
Below 0.70 confidence, Aqweth automatically escalates to deep reasoning and surfaces uncertainty explicitly on the card.
- 03
Slack and Google Chat shipping today. Microsoft Teams on the roadmap.
payments-api 5xx surge after deploy a3f1c92
Root cause
a3f1c92 removed the null check in PaymentProcessor.validate() at line 142. Every transaction since deploy 14:28 fails validation.
Evidence · 4 sources
Error rate jumped 14× within 90s of rollout.
prometheus · payments_5xx_total · 14:30:12 → 14:31:42
47 ERRORs in 30 min, NullPointerException at line 142.
loki · payments-api · trace 9d4ba2e1
Removed null check in PaymentProcessor.validate() in commit a3f1c92.
github · PR #4417 · merged 14:28 UTC
Similar failure resolved in inc-1879 (similarity 0.91) — 4 prior matches.
fetch_similar_rcas · vector store · embed_role bge-m3
Confidence
0.84
Suggested fixes · ranked
Revert payments-api to a3f0d4b — restores null guard. Runbook: payments-validation.
MTTR < 2 min
Add null guard at line 142 in PaymentProcessor.validate(): forward-fix, no rollback required.
MTTR ~5 min
Aqweth recommends. Your engineers act.
The only production action Aqweth can take is opening a Jira ticket — and only on explicit approval.
No automated remediation, no surprises, no "AI rolled back the deploy while you slept."
RCA card posted
in Slack / Chat
Engineer reviews
evidence, confidence, fix
Approve or reject
human_review interrupt
on approve only ↓
Jira ticket opened
with full RCA evidence attached
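The approval flow above amounts to a gate in front of the one permitted action. A minimal sketch, assuming a hypothetical `RcaCard` shape and a caller-supplied `open_ticket` callable (for example a wrapper around a ticketing client); neither name comes from Aqweth itself:

```python
from dataclasses import dataclass, field


@dataclass
class RcaCard:
    # Hypothetical card shape, loosely following the fields on the sample card.
    root_cause: str
    confidence: float
    fixes: list = field(default_factory=list)


def handle_review(card, approved, open_ticket):
    """Open a ticket only when a human has explicitly approved the card.

    Nothing is created on rejection; the function simply returns None.
    """
    if not approved:
        return None
    return open_ticket(card)
```

Passing the ticket action in as a parameter keeps the gate itself free of side effects, which makes the "on approve only" rule easy to check in isolation.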
Fits the observability stack you already have.
Works with Kubernetes, AWS, and GCP out of the box. Switching backends is one line in aqweth.yaml. No code. No rebuild.
Data residency on your terms.
Run all inference in your cluster. Or use cloud APIs. Or mix both. One config file either way.
Cloud API
Self-hosted
Mix both: embedder + triage self-hosted, reasoning via cloud API. One YAML line per role.
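The per-role split described above can be pictured as a small mapping from role to backend. The sketch below is an assumption for illustration: the real aqweth.yaml schema is not shown on this page, and the role names, the backend labels `self-hosted` and `cloud-api`, and both helper functions are hypothetical.

```python
# Hypothetical role-to-backend mapping; not the actual aqweth.yaml schema.
ROLES = {
    "embedder": "self-hosted",
    "triage": "self-hosted",
    "reasoning": "cloud-api",
}

ALLOWED = {"self-hosted", "cloud-api"}


def backend_for(role, roles=ROLES):
    """Return the backend for a role, rejecting unknown roles or backend values."""
    if role not in roles:
        raise KeyError(f"no backend configured for role {role!r}")
    backend = roles[role]
    if backend not in ALLOWED:
        raise ValueError(f"unsupported backend {backend!r} for role {role!r}")
    return backend


def stays_in_cluster(roles=ROLES):
    """True only if every role runs self-hosted, so no inference leaves the cluster."""
    return all(backend == "self-hosted" for backend in roles.values())
```

A check like `stays_in_cluster` is one way a team could assert a data-residency requirement against its configuration before deploying.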
Always on. Not just when alerts fire.
When an anomaly is detected, Aqweth automatically triggers a full investigation — RCA card in Slack before anyone is paged.
Anomaly scan
z-score + EWMA on error rates and latency per service.
Correlation sweep
Multi-service degradation within a time window.
Health digest
Deterministic summary posted to SRE channel.
Trend report
Week-on-week regressions, no LLM cost.
Nightly embed
Resolved incidents + runbooks → vector store.
Engineers wake up to a completed investigation, not a page.
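One common way to combine a z-score with an EWMA, as the anomaly scan above names, is to score each new point against an exponentially weighted mean and variance of the points before it. The sketch below assumes that reading; the smoothing factor `alpha`, the threshold of 3.0, and the function names are illustrative choices, not Aqweth's settings.

```python
import math


def ewma_zscores(values, alpha=0.3):
    """Score each point against an EWMA mean/variance of the points before it.

    Returns one z-score per input value; the first value has no history
    and scores 0.0, as does any point seen while the variance is still zero.
    """
    if not values:
        return []
    mean = values[0]
    var = 0.0
    scores = [0.0]
    for x in values[1:]:
        std = math.sqrt(var)
        diff = x - mean
        scores.append(diff / std if std > 0 else 0.0)
        # Standard incremental EWMA updates for mean and variance.
        mean = mean + alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return scores


def anomalies(values, threshold=3.0, alpha=0.3):
    """Indices whose z-score magnitude exceeds the threshold."""
    scores = ewma_zscores(values, alpha)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]
```

For a per-service error-rate series such as `[1.0, 1.2, 0.9, 1.1, 1.0, 1.1, 0.9, 1.0, 14.0]`, `anomalies` flags only the final spike. Because the baseline adapts with each point, slow drift raises the mean gradually, while a sudden jump stands out against recent variance.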
Let us run an investigation on one of your incidents.
No deployment required. You nominate an incident from your retro doc — we run the analysis together.
Request access
Or email us at hello@aqweth.ai · No commitment