Pythia — The Incident Oracle for Distributed Systems

What Pythia investigates

Infrastructure failures. Not product bugs.

Most incidents are not bugs in application code. They are failures in the environment surrounding that code — and that distinction matters.

Pythia investigates & resolves

✓ Downstream service unreachable or returning errors
✓ OOMKill, CPU throttling, pod crash-loop
✓ Database connection pool exhaustion or host unreachable
✓ Message queue lag or broker unreachable (Kafka support coming soon)
✓ Bad deployment — config change, image regression
✓ Network policy or certificate expiry blocking traffic
✓ Cascading failures from a single upstream fault

For these, Pythia identifies the root cause and suggests an immediate workaround — so the system can recover while the permanent fix is prepared.

Pythia identifies & hands off

→ Application logic errors (wrong business calculation)
→ Bugs introduced in a recent code change
→ Incorrect API contract between two services
→ Data corruption caused by application-layer code
→ Feature-level failures (wrong output, not a crash)

For these, Pythia draws the line: it names the service, the affected code path, and the symptom — then hands a focused brief to the engineer who owns that service. No guessing, no sprawling war room.
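The split between the two lists above can be pictured as a simple lookup. The sketch below is illustrative only: the category labels, the `Action` enum, and the `route` function are hypothetical names paraphrased from the lists, not Pythia's actual taxonomy or API.

```python
from enum import Enum


class Action(Enum):
    RESOLVE = "identify root cause and suggest a workaround"
    HAND_OFF = "send a scoped brief to the owning engineer"


# Hypothetical labels paraphrased from the "investigates & resolves" list.
INFRASTRUCTURE = {
    "downstream_unreachable", "oom_kill", "cpu_throttling", "crash_loop",
    "db_pool_exhaustion", "queue_lag", "bad_deployment",
    "network_policy", "certificate_expiry", "cascading_failure",
}

# Hypothetical labels paraphrased from the "identifies & hands off" list.
PRODUCT_CODE = {
    "logic_error", "code_regression", "api_contract_mismatch",
    "app_data_corruption", "feature_failure",
}


def route(category: str) -> Action:
    """Map a fault category to the action described in the text."""
    if category in INFRASTRUCTURE:
        return Action.RESOLVE
    if category in PRODUCT_CODE:
        return Action.HAND_OFF
    raise ValueError(f"unknown fault category: {category!r}")
```

An unknown category raises rather than defaulting to either side, which mirrors the idea that the boundary should be drawn explicitly.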

How it works

From alert to root cause, fully automated

Pythia runs entirely inside your infrastructure. No data leaves your environment unless you choose a cloud LLM.

STEP 01

📚

Package your codebase

Run the Pythia packager once against your repositories. It indexes your source code, architecture docs, and runbooks into a local vector store — so Pythia understands your specific services, not just generic Kubernetes patterns. This knowledge lives entirely inside your infrastructure and is queried automatically on every investigation.
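To make the idea of a local vector store concrete, here is a toy sketch of indexing and querying documents by similarity. It is not the Pythia packager: the `LocalIndex` class, the file names, and the word-count "embedding" are all assumptions for illustration, and a real index would presumably use a learned embedding model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class LocalIndex:
    """In-memory index; nothing leaves the process."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, Counter]] = []

    def add(self, source: str, text: str) -> None:
        self._entries.append((source, embed(text)))

    def query(self, text: str, k: int = 1) -> list[str]:
        q = embed(text)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [source for source, _ in ranked[:k]]


# Hypothetical documents, for illustration only.
index = LocalIndex()
index.add("runbooks/payments.md", "payments service connection pool exhausted restart pgbouncer")
index.add("docs/checkout-arch.md", "checkout calls payments and inventory over grpc")
top = index.query("connection pool exhausted in payments")
```

Here the runbook ranks first because it shares four terms with the query, while the architecture doc shares only one.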

STEP 02

⚲

Deploy once

Pythia installs as a single pod in your cluster — or runs on any machine with HTTPS access to the kube-apiserver. No code instrumentation, no sidecars, no per-language agents required.

STEP 03

⚠

Submit an incident

Paste an error message, alert text, or log line. Pythia identifies the source service and expands the blast radius — discovering every upstream and downstream dependency that could be involved.
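Expanding the blast radius amounts to a graph traversal in both directions from the source service. The following is a minimal sketch under stated assumptions: the `CALLS` graph and the `blast_radius` function are hypothetical, and how Pythia actually stores or walks its service graph is not described here.

```python
from collections import deque

# Hypothetical call graph: each key calls the services in its list.
CALLS = {
    "gateway": ["checkout"],
    "checkout": ["payments", "inventory"],
    "payments": ["postgres"],
    "inventory": ["postgres"],
    "search": ["elasticsearch"],
}


def blast_radius(source: str, calls: dict[str, list[str]]) -> set[str]:
    """Return the source plus everything it calls and everything that calls it, transitively."""
    # Invert the graph so callers can be walked as easily as callees.
    callers: dict[str, list[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    def reach(start: str, edges: dict[str, list[str]]) -> set[str]:
        # Breadth-first search along one edge direction.
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    return reach(source, calls) | reach(source, callers)
```

For `payments`, this yields `payments`, `postgres`, `checkout`, and `gateway`. The sibling `inventory` and the unrelated `search` stack are left out of scope, which keeps the investigation focused.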

STEP 04

🔎

Pythia investigates

Pythia collects logs, metrics, Kubernetes events, and deployment state from every service in scope. It correlates these signals across the dependency graph, and pulls relevant excerpts from your packaged docs into the reasoning, to determine what changed and where the fault originated.
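One simple form of "what changed" correlation is to look for deployments that landed shortly before the first error. The sketch below assumes a hypothetical `Signal` record and a 30-minute window chosen purely for illustration; it is not a description of Pythia's correlation logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Signal:
    at: datetime
    service: str
    kind: str    # hypothetical kinds: "deploy", "log_error", "k8s_event", "metric_anomaly"
    detail: str


def changes_before_first_error(signals, window=timedelta(minutes=30)):
    """Return deploy signals that landed within `window` before the earliest error."""
    errors = [s for s in signals if s.kind == "log_error"]
    if not errors:
        return []
    first = min(errors, key=lambda s: s.at)
    return sorted(
        (s for s in signals
         if s.kind == "deploy" and first.at - window <= s.at <= first.at),
        key=lambda s: s.at,
    )


# Hypothetical timeline, for illustration only.
day = datetime(2024, 1, 1)
timeline = [
    Signal(day.replace(hour=10, minute=0), "search", "deploy", "image v41"),
    Signal(day.replace(hour=11, minute=50), "payments", "deploy", "config change"),
    Signal(day.replace(hour=11, minute=55), "payments", "log_error", "connection refused"),
    Signal(day.replace(hour=12, minute=0), "checkout", "log_error", "upstream timeout"),
]
suspects = changes_before_first_error(timeline)
```

Only the `payments` deploy falls inside the window, so it becomes the first lead to check against the other evidence.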

STEP 05

◆

Verdict + workaround

If the fault is in infrastructure — a dependency down, resource exhaustion, a bad deploy — Pythia names it and offers an immediate workaround. If the fault is inside product code, Pythia draws the boundary precisely and hands off a scoped brief to the developer who owns it.

Data Privacy & Deployment

Your data, your infrastructure, your choice

Pythia is fully self-hosted. You choose where it runs and which LLM it uses. Nothing is forced into a SaaS model.

Flavor 1 — In-cluster

Deploy inside your cluster

Pythia runs as a Deployment inside your Kubernetes cluster. It communicates with the kube-apiserver and other cluster services over the internal network. No external ingress is required for Pythia itself.

Best for: Air-gapped environments, strict data residency requirements, and teams who want Pythia co-located with the services it investigates.

Flavor 2 — External machine

Run from any machine

Run Pythia on a laptop, a dedicated server, or a VM — anywhere that has HTTPS access to the Kubernetes API server (kube-apiserver). Pythia authenticates using a standard kubeconfig file. The kube-apiserver does not need to be publicly exposed; a VPN or bastion is sufficient.

Best for: Teams who prefer investigative tooling off-cluster, or who want to run Pythia from a machine with a powerful local GPU for inference.

LLM options — mix and match per stage

💻

Ollama — co-located

Run an Ollama model on the same machine as Pythia. Zero data egress. All investigation prompts stay on your hardware. Recommended for maximum data sovereignty.

📱

Ollama — external server

Point Pythia at an Ollama instance on a separate machine. Useful when you want to offload LLM inference to a dedicated GPU server while keeping Pythia itself lightweight.

☁

Cloud API

Connect to Claude (Anthropic) or any OpenAI-compatible API. Log excerpts and metric snapshots included in prompts will be sent to the provider you configure. Choose this when model quality takes priority over data locality.

Who it's for

Built for the people who own reliability

Site Reliability Engineer

Stop drowning in dashboards at 2 am

An alert fires. Three services are red. You have no idea which one is the cause and which are victims of a cascade. Pythia maps the blast radius and points at the origin — before the incident runs long.

The problem Pythia solves: Cascading failure triage across dozens of services with no clear starting point.

Platform Engineer

Investigate incidents you didn't build

Your team owns the platform, not every service running on it. When product teams escalate, Pythia gives you the service graph context, recent deployment events, and log correlation you need — even for services you've never opened.

The problem Pythia solves: Debugging unfamiliar polyglot services without the original author available.

DevOps & Engineering Lead

Shorter MTTR, without more headcount

Every hour of P0 burns engineering attention and erodes user trust. Pythia compresses the investigation phase so your engineers spend less time in war rooms and more time on fixes and prevention.

The problem Pythia solves: High mean time to resolution on complex distributed system failures.

Capabilities

Everything needed to close the loop

Pythia covers the full investigation cycle — from alert text to verdict to workaround — without human navigation.

🗺

Automatic topology discovery

Builds the service graph from your Kubernetes manifests and live cluster state — no manual wiring, no sidecars, no custom annotations required.

🗣

Polyglot by design

Works across Go, Java, Python, Node, .NET, Ruby, Rust, and any other language stack running in Kubernetes. Pythia collects infrastructure signals rather than instrumenting application code, so no per-language agent is needed.

⚖

Infra vs. code — always separated

Distinguishes infrastructure faults (dependency down, resource exhaustion, bad deploy) from product code bugs — and only attempts to resolve the former.

🎯

Finds the origin, not the symptoms

When a fault cascades and five services go red, Pythia walks the dependency graph and correlates logs, metrics, events, and deploys to name the one service that actually broke — separating the cause from its downstream victims.
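A common heuristic for separating cause from victims is to look for failing services that have no failing dependency of their own. The sketch below illustrates that heuristic only; `find_origins` and the example graph are hypothetical, and Pythia's own method also weighs logs, metrics, events, and deploys as described above.

```python
def find_origins(failing: set[str], depends_on: dict[str, list[str]]) -> set[str]:
    """Failing services with no failing dependency: candidates for the origin."""
    return {
        svc for svc in failing
        if not any(dep in failing for dep in depends_on.get(svc, []))
    }


# Hypothetical dependency graph: each key depends on the services in its list.
DEPENDS_ON = {
    "gateway": ["checkout"],
    "checkout": ["payments", "inventory"],
    "payments": ["postgres"],
    "inventory": ["postgres"],
}

# Five services are red, but only one has no failing dependency.
red = {"gateway", "checkout", "payments", "inventory", "postgres"}
origins = find_origins(red, DEPENDS_ON)
```

In this example the only candidate is `postgres`. The heuristic can return several candidates for independent faults, or none when failing services depend on each other in a cycle, which is why it is a starting point rather than a verdict.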

🤖

Runs on your LLM

Local model via Ollama for full data sovereignty, or connect to Claude or any OpenAI-compatible API. Configurable per investigation stage.

🧠

Design-doc & runbook context

Your runbooks, architecture docs, and codebase are vector-indexed when you run the Pythia packager. During investigations, relevant excerpts are surfaced automatically into the reasoning, so Pythia knows your system, not just the incident.

How Pythia compares

Other tools surface signals. Pythia reaches a verdict.

Observability shows you symptoms. Incident managers route humans. Runbooks rely on memory. Pythia takes the raw error and tells you what broke, whether you can fix it, and what to do right now.

	Pythia	APM / Observability (Datadog, New Relic, Dynatrace)	Incident Management (PagerDuty, incident.io, FireHydrant)	Manual Runbooks (on-call playbooks, wikis)
Names the root-cause service in a multi-service cascade	✓	Shows all as red; you trace	✗	Manual tracing
Classifies fault into operational categories — resource constraint, network, auth, config drift, data integrity, product bug, and more	✓	✗	✗	✗
Delivers an actionable workaround to restore service now	✓	✗	✗	If pre-written
Investigates from a single raw error message	✓	You navigate dashboards	Routes, doesn't investigate	✗
Correlates logs, metrics, events & deploys in one pass	✓	Separate views; you stitch	✗	Manual
Hands off code bugs as a scoped brief to the owning team	✓	✗	Pages a human, no scope	✗
No per-language agent or SDK to instrument	✓	Agent / eBPF per host	✗	✓
Runs fully self-hosted on your own LLM — no data egress	✓	SaaS; data leaves	SaaS; data leaves	✓

Legend: ✓ Built-in · ✗ Not offered · Text entry: Partial / manual

Pythia is not a replacement for your observability stack — it sits on top of it. APM platforms are excellent at capturing signals and even flagging anomalies, but they stop at "here is a service map and some red graphs — now you investigate." Pythia closes that last, most expensive mile: it walks the dependency graph itself, separates the originating fault from its downstream victims, decides whether the cause is infrastructure (fixable now, with a workaround) or product code (a scoped handoff), and hands you a decision instead of a dashboard. The richer your observability data, the sharper its verdict.

FAQ

Common questions

Does my data leave the cluster?

Only if you choose a cloud LLM backend (Anthropic or OpenAI-compatible). In that case, log excerpts and metric snapshots included in investigation prompts are sent to the provider you configure.

If you run Ollama locally or point Pythia at an internal Ollama server, nothing leaves your infrastructure. All investigation data is processed in-memory and stored in a local SQLite database on the machine running Pythia.

Do I need to instrument my services?

No. Pythia builds the service graph from your existing Kubernetes manifests and live cluster state — no SDK integration, no custom annotations, no sidecars. It reads what is already there.

If you have Prometheus, Pythia uses it to significantly improve investigation quality — especially for resource exhaustion and cascade failures. Without it, Pythia still investigates using pod logs and Kubernetes events.

Can I run Pythia outside the cluster?

Yes. Pythia can run on any machine — a laptop, a dedicated server, a VM — as long as it has HTTPS access to the Kubernetes API server (kube-apiserver). Pythia authenticates using a standard kubeconfig file. The API server does not need to be publicly exposed; access through a VPN or bastion host works fine.

This is useful for teams who prefer to run investigative tooling off-cluster, or who want to co-locate Pythia with a powerful GPU machine for local LLM inference.

What LLMs does Pythia support?

Three backends are supported:

Ollama — any model available in the Ollama library (Qwen, Llama, Mistral, Phi, and others)
Anthropic — Claude Haiku, Sonnet, and Opus
OpenAI-compatible — any API implementing the OpenAI chat completions specification

The backend and model can be configured per investigation stage. You can use a fast local model for log extraction and a more capable model for the final synthesis step.
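Per-stage configuration can be pictured as a mapping from stage to backend, which also makes it easy to see which stages would send data off your infrastructure. This is a sketch under stated assumptions: the stage names, model identifiers, and the `Backend` structure are hypothetical, and Pythia's real configuration format is not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    provider: str   # "ollama", "anthropic", or "openai-compatible"
    model: str      # placeholder model identifier
    base_url: str


# Hypothetical stage names and model names; only the three providers come from the text.
STAGES = {
    "log_extraction": Backend("ollama", "example-small-model", "http://localhost:11434"),
    "synthesis": Backend("anthropic", "example-large-model", "https://api.anthropic.com"),
}

# Ollama runs on hardware you control, so it is treated as local here.
LOCAL_PROVIDERS = {"ollama"}


def stages_with_egress(stages: dict[str, Backend]) -> list[str]:
    """List stages whose prompts would be sent to an external provider."""
    return sorted(
        name for name, backend in stages.items()
        if backend.provider not in LOCAL_PROVIDERS
    )


egress = stages_with_egress(STAGES)
```

With this mix, only the `synthesis` stage involves a cloud provider, so only the excerpts included in that stage's prompts would leave your environment.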

What Kubernetes version is required?

Kubernetes 1.20 or later. Pythia uses standard K8s APIs (core/v1, apps/v1, events). No alpha or beta APIs, no custom resource definitions, and no cluster-admin role required — read permissions on the target namespace are sufficient for most investigations.
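As an illustration of namespace-scoped read permissions, the sketch below builds a Kubernetes RBAC `Role` manifest as a Python dictionary. The role name and the exact resource list are assumptions; the document does not specify which resources Pythia reads, so treat this as a starting point to adapt.

```python
import json

READ_VERBS = ["get", "list", "watch"]


def read_only_role(namespace: str, name: str = "pythia-reader") -> dict:
    """Build a namespace-scoped, read-only Role (name and resources are assumptions)."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                # Core API group: pods, their logs, events, and services.
                "apiGroups": [""],
                "resources": ["pods", "pods/log", "events", "services"],
                "verbs": READ_VERBS,
            },
            {
                # apps group: workload controllers and rollout state.
                "apiGroups": ["apps"],
                "resources": ["deployments", "replicasets", "statefulsets"],
                "verbs": READ_VERBS,
            },
        ],
    }


manifest = json.dumps(read_only_role("shop"), indent=2)
```

The JSON output can be applied with `kubectl apply -f`, and a matching `RoleBinding` would still be needed to grant the role to the identity Pythia runs as.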

How long does an investigation take?

Typically 2–5 minutes for a complete investigation, depending on the number of services in scope, log volume, and the LLM backend you've chosen. Local Ollama models are slower than cloud APIs but eliminate data egress entirely. Investigations run fully in the background — you submit and check back.

Do I need Prometheus or distributed tracing already set up?

Prometheus is recommended but not required. When available, metrics significantly improve investigation quality — especially for resource exhaustion and cascade failures. Without it, Pythia still investigates using pod logs and Kubernetes events.

Distributed tracing is on the roadmap but not yet consumed by the investigation pipeline. The current version uses logs, metrics, and Kubernetes events for its analysis.

What happens when Pythia can't determine the cause?

Pythia returns an INCONCLUSIVE verdict rather than guessing. This happens when the collected evidence is insufficient to assign a root cause with high confidence — for example, if the relevant logs were rotated before the investigation started, or the signals genuinely point in more than one direction.

An INCONCLUSIVE result is intentional. A wrong answer with false confidence causes more damage during an incident than an honest "I don't know." When Pythia returns INCONCLUSIVE, it includes the evidence it did collect and the reason it stopped short — giving you a focused starting point rather than a misleading verdict.
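The decision to stop short can be sketched as a confidence check over candidate causes. Everything here is hypothetical: the `Verdict` structure, the `decide` function, and the threshold and margin values are illustrative assumptions, not Pythia's actual scoring.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Verdict:
    status: str                 # "ROOT_CAUSE" or "INCONCLUSIVE"
    cause: Optional[str]
    evidence: list = field(default_factory=list)
    reason: str = ""


def decide(hypotheses: dict, evidence: list,
           threshold: float = 0.8, margin: float = 0.2) -> Verdict:
    """Name a cause only when one hypothesis is both strong and clearly ahead."""
    if not hypotheses:
        return Verdict("INCONCLUSIVE", None, evidence, "no candidate cause found")
    ranked = sorted(hypotheses.items(), key=lambda kv: kv[1], reverse=True)
    best_cause, best_score = ranked[0]
    if best_score < threshold:
        return Verdict("INCONCLUSIVE", None, evidence, "insufficient evidence")
    if len(ranked) > 1 and best_score - ranked[1][1] < margin:
        return Verdict("INCONCLUSIVE", None, evidence,
                       "signals point in more than one direction")
    return Verdict("ROOT_CAUSE", best_cause, evidence)
```

Note that the collected evidence is returned in every case, so an INCONCLUSIVE result still gives the engineer something concrete to start from.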

What's the licensing model?

Pythia is self-hosted. The licensing model is being finalised. If you'd like early access or want to discuss your deployment scenario, reach out at hello@pythia.in.

I already have Datadog or Grafana. Why do I need Pythia?

Observability platforms surface signals — dashboards, alerts, traces. They tell you something is wrong and show you where to look. Pythia takes those signals and produces an investigation verdict: what broke, why, and what to do about it right now.

The two are complementary. Pythia reads Prometheus data and works better when you have rich observability in place — it does not replace your APM, it sits on top of it.

When production breaks, ask Pythia why.

Fix infrastructure fast. Hand off code issues precisely.