data-oncall

A 4-agent incident response loop for DataHub.

Four agents running on three models (Kimi-K2-Thinking, Llama 3.1 8B + LoRA, and MiniMax-M2.5) query DataHub to identify an affected dataset, trace its upstream lineage, diff quality assertions across a clean snapshot and a production instance, and write the postmortem back via the Python SDK. End-to-end in ~65 seconds. Built for the DataHub × Nebius hackathon, EF SF, April 10 2026.


What it does

Given a natural-language incident such as "revenue dashboard wrong", the Coordinator dispatches Detective and Reality-Checker in parallel. Detective identifies the affected dataset and traces upstream lineage. Reality-Checker queries quality assertions on the same datasets in two parallel DataHub instances (one clean, one production) and returns the diff. Fixer then writes the postmortem back to the affected datasets via the Python SDK. Coordinator orchestrates the other three and synthesizes the final report.

DataHub stores schema, lineage, and ownership — what should be true. The hackathon kit ships two parallel Olist snapshots: one clean, one with three planted referential-integrity bugs. Identical schemas, identical lineage; only the data inside differs. We run 11 quality assertions against both and write the results back as DataHub assertion entities. The diff between those assertion sets is the production incident — no SQL execution at runtime, no data movement. The four agents coordinate by reading and writing the same dataset URNs; DataHub is the shared memory layer. (Maps to L5 + L6 in the hackathon's rubric.)

Architecture

                      [ Trigger CLI ]                     [ Browser tab 1: Agent Console ]
                            │                             [ Browser tab 2: DataHub UI    ]
                            ▼                                       ▲
                  Coordinator (Kimi-K2-Thinking)                    │ SSE
                            │                                       │
                ┌───────────┼───────────┐                           │
                ▼           ▼           ▼                           │
            Detective  Reality-Chk    Fixer                         │
            (Llama+    (Llama+      (MiniMax                        │
             LoRA)      LoRA)        M2.5)                          │
                │           │           │                           │
                └───────────┼───────────┘                           │
                            ▼                                       │
                  DataHub @ Studio A 100.114.31.63                  │
                  GraphQL reads + Python SDK writes                 │
                            │                                       │
                            └───────────────────────────────────────┘
                                  (live event stream)

Coordinator dispatches Detective and Reality-Checker concurrently with asyncio.gather, awaits both, then dispatches Fixer with their output and produces the final synthesis. Reads go through DataHub's GraphQL API; writes go through the Python SDK (per DataHub's recommendation against GraphQL mutations for programmatic use).
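The dispatch order described above can be sketched with plain asyncio. This is a minimal illustration, not the project's code: the three agent coroutines are hypothetical stubs that sleep instead of calling DataHub or an LLM, and the returned dictionaries are made-up shapes.

```python
import asyncio


async def detective(incident: str) -> dict:
    # Hypothetical stub standing in for the GraphQL lineage lookup.
    await asyncio.sleep(0.01)
    return {"agent": "detective", "dataset": "olist_order_items"}


async def reality_checker(incident: str) -> dict:
    # Hypothetical stub standing in for the cross-instance assertion diff.
    await asyncio.sleep(0.01)
    return {"agent": "reality_checker", "failing": ["seller_id length = 32"]}


async def fixer(findings: list) -> dict:
    # Hypothetical stub for the write-back step; it only records its inputs.
    await asyncio.sleep(0.01)
    return {"agent": "fixer", "inputs": [f["agent"] for f in findings]}


async def coordinate(incident: str) -> dict:
    # Fan out: the two read-only agents run concurrently.
    # asyncio.gather returns results in the order the awaitables were passed.
    detective_out, checker_out = await asyncio.gather(
        detective(incident), reality_checker(incident)
    )
    # Fan in: the writer starts only after both results are available.
    fixer_out = await fixer([detective_out, checker_out])
    return {
        "detective": detective_out,
        "reality_checker": checker_out,
        "fixer": fixer_out,
    }


report = asyncio.run(coordinate("revenue dashboard wrong"))
```

Because the two read agents do not depend on each other, their wall time is roughly that of the slower one, not the sum of the two.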

Three labs · Four roles · One fine-tune

Agent	Lab	Model	Picked for
Coordinator	Moonshot	`moonshotai/Kimi-K2-Thinking`	Long-horizon planning with visible `<think>` reasoning traces
Detective	Meta	`meta-llama/Meta-Llama-3.1-8B-Instruct` + LoRA	Cheapest fast model on Nebius; narrow NL→GraphQL task fits a small LoRA
Reality-Checker	Meta	(same LoRA endpoint as Detective)	Same model, different system prompt — shared compute, different role
Fixer	MiniMax	`MiniMaxAI/MiniMax-M2.5`	Trained for agentic code generation — drafts the Python SDK quarantine call and the Slack message

COORDINATOR

Kimi-K2-Thinking

Plans the run, dispatches the other agents, writes the final postmortem.

Moonshot · $0.60/$2.50 · 45.7 t/s · eu-north1

DETECTIVE

Llama 3.1 8B + LoRA

Identifies the affected dataset and traces upstream lineage via DataHub GraphQL.

Meta · $0.03/$0.09 · 155 t/s · eu-north1 fast

REALITY-CHECKER

Llama 3.1 8B + LoRA

Diffs quality assertions across olist_source and the production instance.

Meta · same endpoint, different prompt

FIXER

MiniMax-M2.5

Writes incident annotations back to DataHub via Python SDK; drafts the Slack post.

MiniMax · $0.30/$1.20 · 36.8 t/s · us-central1

FINE-TUNE STORY

Base model

…

Fine-tuned

…

Training pairs

…

LoRA config

…

Job ID …

Job type …

Created …

Completed …

Duration …

Cost …

Training loss curve

Epoch	Train loss	Val loss

Accuracy on 8-pattern test set

Base + in-context

TBD

Fine-tuned LoRA

TBD

Improvement run measure_accuracy.py

Sample seed pair (one of 8 patterns)

NL:

Artifacts

🤗 HuggingFace: model weights
📈 Weights & Biases: training run (may require login)
💻 GitHub: full source

The data: Olist + planted issues

The hackathon kit ships the Brazilian E-Commerce dataset by Olist — 99k orders across 9 base tables and 5 SQL views — as two SQLite files with identical schemas:

olist.db → ingested as platform instance olist_source: clean Kaggle data
olist_dirty.db → ingested as olist_dirty: same schema, three planted referential-integrity bugs
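Since the two instances share schemas, each table has one dataset URN per instance, and the agents address the same table in both by swapping the instance name. The sketch below builds such URN pairs as plain strings. It assumes the `<instance>.main.<table>` naming mentioned in the build notes; the `sqlite` platform id and the `PROD` environment are assumptions, and `dataset_urn` is a hypothetical helper, not a DataHub SDK function.

```python
def dataset_urn(platform: str, instance: str, table: str, env: str = "PROD") -> str:
    # General DataHub dataset URN shape:
    #   urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
    # The <instance>.main.<table> name layout is taken from the build notes;
    # the platform id and env defaults here are assumptions.
    name = f"{instance}.main.{table}"
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


TABLES = ["olist_customers", "olist_order_items", "olist_products"]

# One (clean, dirty) URN pair per table.
urn_pairs = {
    table: (
        dataset_urn("sqlite", "olist_source", table),
        dataset_urn("sqlite", "olist_dirty", table),
    )
    for table in TABLES
}
```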

Table	Planted issue	Scope	Downstream blast radius
`olist_customers`	~8% of customer rows physically deleted	7,955 rows removed (99,441 → 91,486)	`v_order_details` drops orders or shows NULL customer fields. Orphan FKs from `olist_orders`.
`olist_order_items`	`seller_id` values truncated by 1 character	5,632 rows modified (32 → 31 chars)	`v_seller_performance` undercounts every affected seller's revenue.
`olist_products`	`product_category_name` set to NULL	988 rows modified	`v_product_sales` silently drops uncategorized products from category aggregations.
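A planted issue like the truncated `seller_id` values can be detected with a single SQL count. The following is a rough sketch using the standard-library `sqlite3` module against a tiny in-memory table; the two-column schema and the toy rows are invented for illustration and are not the real Olist data, and `count_bad_seller_ids` is a hypothetical helper name.

```python
import sqlite3


def count_bad_seller_ids(conn: sqlite3.Connection) -> int:
    # Count rows whose seller_id is not exactly 32 characters long.
    # NULL seller_ids are not counted by this predicate (LENGTH(NULL) is NULL).
    row = conn.execute(
        "SELECT COUNT(*) FROM olist_order_items WHERE LENGTH(seller_id) != 32"
    ).fetchone()
    return row[0]


def make_demo_db(seller_ids: list) -> sqlite3.Connection:
    # Tiny in-memory stand-in for olist.db / olist_dirty.db (toy data only).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE olist_order_items (order_id TEXT, seller_id TEXT)")
    conn.executemany(
        "INSERT INTO olist_order_items VALUES (?, ?)",
        [(f"order-{i}", sid) for i, sid in enumerate(seller_ids)],
    )
    return conn


good = "a" * 32
clean = make_demo_db([good, good, good])
dirty = make_demo_db([good, good[:-1], good[:-1]])  # two ids truncated by one char

clean_failures = count_bad_seller_ids(clean)
dirty_failures = count_bad_seller_ids(dirty)
```

A check of this shape reduces to one integer per instance, which is what makes it cheap to store as an assertion result and compare later.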

The cross-instance assertion diff

Run the 11 quality assertions against olist_source, run them again against olist_dirty, write both result sets back as DataHub assertion entities, then diff them via GraphQL. The Reality-Checker computes the diff in Python (deterministic set difference) and Llama writes the narrative around it.

Assertion	`olist_source`	`olist_dirty`	Verdict
`seller_id` length = 32	✅ pass	❌ 5,632 fail	production-only
`customer` row count = 99,441	✅ pass	❌ 7,955 missing	production-only
`product_category` not null	✅ pass	❌ 988 NULL	production-only
(8 other assertions across 3 tables)	✅ pass	✅ pass	baseline

No SQL execution at runtime, no data movement. The assertions were computed once during setup; the agents query assertion entities, not raw data. That's why the demo runs in seconds and scales to warehouse-size datasets.
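The deterministic diff step can be illustrated as a set difference over failing assertion names. This is a sketch under assumptions: `AssertionResult` is a hypothetical record type, the results are hard-coded here instead of fetched from DataHub, and `order_id not null` is a made-up stand-in for one of the eight baseline assertions the table does not name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssertionResult:
    name: str     # e.g. "seller_id length = 32"
    passed: bool  # outcome of the precomputed check


def failing(results: list) -> set:
    # Names of the assertions that did not pass.
    return {r.name for r in results if not r.passed}


def production_only_failures(clean: list, dirty: list) -> list:
    # Set difference: fails in the dirty instance but not in the clean one.
    return sorted(failing(dirty) - failing(clean))


NAMES = [
    "seller_id length = 32",
    "customer row count = 99,441",
    "product_category not null",
    "order_id not null",  # hypothetical baseline assertion
]

clean_results = [AssertionResult(n, True) for n in NAMES]
dirty_results = [
    AssertionResult(n, n == "order_id not null") for n in NAMES
]

diff = production_only_failures(clean_results, dirty_results)
```

Keeping the diff in plain Python means the verdict never depends on model output; the LLM only narrates a result that is already fixed.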

▶ Live demo

Click TRIGGER to run the 4-agent loop against the live DataHub at Studio A and the live Nebius models. Wall time: ~65 seconds. Cost per run: ~$0.02. The Coordinator's reasoning trace streams in the leftmost pane.

stub mode (no LLM cost)

See it in DataHub

After a run completes, the Fixer agent has written incident annotations to the three affected datasets via the Python SDK. Open them in the DataHub UI to see the warning banner at the top of each page:

→ olist_dirty.olist_order_items Quality tab
→ olist_dirty.olist_customers Quality tab
→ olist_dirty.olist_products Quality tab

These links only work from inside the Tailscale network. The DataHub UI itself is not exposed publicly — only this dashboard is, via Tailscale Funnel.

Tech stack

Layer	What	Notes
LLM inference	Nebius Token Factory	OpenAI-compatible, one API for all 4 models
Models (3 + 1 LoRA)	Kimi-K2-Thinking · Llama 3.1 8B + LoRA · MiniMax-M2.5	Three labs (Moonshot, Meta, MiniMax)
Metadata catalog	DataHub Core	Self-hosted via the kit's `datahub docker quickstart` on Studio A
Data validation	Custom Python (sqlite3 + DataHub SDK)	Started with the GE 0.18 plugin; pivoted to direct SDK writes after a SQLite URN encoding mismatch (see Build notes)
DataHub writes	`acryl-datahub` Python SDK	DataHub recommends SDK over GraphQL mutations for programmatic writes
Backend	FastAPI + sse-starlette	SSE streams agent events to the dashboard in real time
Frontend	Vanilla HTML / CSS / JS	Single page, no framework, ~1k LOC. Prism.js for GraphQL highlighting
Network	Tailscale + Tailscale Funnel	Mac Mini reaches DataHub over the tailnet; Funnel exposes only the dashboard publicly, not DataHub
Hosting	`eliass-mac-mini.tail365038.ts.net:10001`	Funnel auto-provisions HTTPS via Let's Encrypt

The build

~3 hours from empty directory to running demo. Built via Claude Code's /spec → /yolo workflow: write the spec, decompose into 29 tasks, execute sequentially with quality gates and per-task commits. 8 commits on the feature branch, then merged to main. Full spec at docs/specs/data-oncall-execution-plan.md.

Notable gotchas (in chronological order)

Great Expectations 0.18.x ⊥ Python 3.13+. Pydantic v1's typing.Subscripted handling broke on Python 3.14. Pinned the venv to Python 3.12 via requires-python = ">=3.10,<3.13".
The DataHub GE plugin generates SQLite URNs that don't match the kit's ingestion. The plugin encodes the full .db file path as the database-name component. The kit's ingestion uses <instance>.main.<table> instead. Spent 30 minutes debugging, then pivoted from the GE plugin to direct SDK assertion writes — same outcome (assertions in DataHub for the Reality-Checker to query), bypasses the URN mismatch entirely.
Kimi-K2-Thinking returns reasoning in message.reasoning, not message.content. Initial test with max_tokens=20 returned content: None because Kimi spent its budget on reasoning. Fix: read both fields, stream reasoning as thinking SSE events to the dashboard. That's where the live "watch the model think" effect comes from.
DataHub's GraphQL schema doesn't expose AssertionInfo.customProperties. The Reality-Checker initially tried to filter assertions by check name from customProperties — but that field isn't queryable via GraphQL even though it's writable via the Python SDK. Fix: use info.datasetAssertion.nativeType instead.
Tailscale Funnel only allows ports 443/8443/10000–10999. Initially grabbed root 443 by accident, which clobbered an existing internal route to 127.0.0.1:18789 (one of Ben's other services). Detected immediately, restored the original mapping as a tailnet-only serve rule, and moved data-oncall to 10001.
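The Kimi gotcha above (reasoning arriving in `message.reasoning` while `message.content` is `None`) might be handled along these lines. This is a hedged sketch, not the project's code: the message is modelled as a plain dict, `split_message`, `sse_event`, and `to_events` are hypothetical helpers, and the `answer` event name is an assumption (only the `thinking` event name comes from the notes above).

```python
import json


def split_message(message: dict) -> tuple:
    # Either field may be missing or None, so normalise both to strings.
    reasoning = message.get("reasoning") or ""
    content = message.get("content") or ""
    return reasoning, content


def sse_event(event: str, payload: dict) -> str:
    # Server-Sent Events wire format: an "event:" line, a "data:" line,
    # then a blank line that terminates the event.
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def to_events(message: dict) -> list:
    # Emit reasoning first so the dashboard can show it before the answer.
    reasoning, content = split_message(message)
    events = []
    if reasoning:
        events.append(sse_event("thinking", {"text": reasoning}))
    if content:
        events.append(sse_event("answer", {"text": content}))
    return events


# A response cut off by a small max_tokens budget: reasoning only, no content.
truncated = {"reasoning": "Checking which dataset feeds revenue", "content": None}
events = to_events(truncated)
```

Reading both fields means a truncated response still produces a visible `thinking` event instead of an empty result.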

Stats

~3 hours total build time
23+ tasks from spec, 8 commits
~3,500 LOC across Python + HTML/CSS/JS
54.8 seconds median end-to-end run time (verified via public Funnel URL)
~$0.02–0.05 cost per run (Nebius API)
56 SSE events per run, 28 of which are Kimi reasoning chunks