DATA-ONCALL: page a team of agents instead of waking up your on-call engineer

data-oncall

A 4-agent incident response loop for DataHub.

Four agents running on three models (Kimi-K2-Thinking, Llama 3.1 8B + LoRA, and MiniMax-M2.5) query DataHub to identify an affected dataset, trace its upstream lineage, diff quality assertions across a clean snapshot and a production instance, and write the postmortem back via the Python SDK. End-to-end in ~65 seconds. Built for the DataHub × Nebius hackathon, EF SF, April 10, 2026.


What it does

Given a natural-language incident such as "revenue dashboard wrong", four agents work it together. Detective identifies the affected dataset and traces upstream lineage. Reality-Checker, running concurrently with Detective, queries quality assertions on the same datasets in two parallel DataHub instances (one clean, one production) and returns the diff. Fixer then writes the postmortem back to the affected datasets via the Python SDK. Coordinator orchestrates the other three and synthesizes the final report.
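As a rough illustration of the Detective's lineage read, the sketch below builds a GraphQL request body and extracts upstream URNs from a response. It assumes DataHub's `searchAcrossLineage` query and the field names shown, which should be checked against the schema of the running instance. The helper names and the example URN are hypothetical, and no network call is made.

```python
# Assumption: the query and field names follow DataHub's searchAcrossLineage
# GraphQL query as commonly documented; verify against your instance's schema.
UPSTREAM_LINEAGE_QUERY = """
query upstream($urn: String!, $count: Int!) {
  searchAcrossLineage(
    input: {urn: $urn, direction: UPSTREAM, types: [DATASET],
            query: "*", start: 0, count: $count}
  ) {
    total
    searchResults {
      degree
      entity { urn type }
    }
  }
}
"""


def build_lineage_request(dataset_urn: str, count: int = 50) -> dict:
    """Return the JSON body to POST to the GraphQL endpoint (hypothetical helper)."""
    return {
        "query": UPSTREAM_LINEAGE_QUERY,
        "variables": {"urn": dataset_urn, "count": count},
    }


def upstream_urns(response: dict) -> list[str]:
    """Extract upstream dataset URNs from a response, nearest hop first."""
    results = response["data"]["searchAcrossLineage"]["searchResults"]
    ordered = sorted(results, key=lambda r: r["degree"])
    return [r["entity"]["urn"] for r in ordered]
```

Keeping the query as a fixed template and passing the URN as a variable means the model only has to pick the dataset, not write GraphQL syntax from scratch on every run.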

DataHub stores schema, lineage, and ownership — what should be true. The hackathon kit ships two parallel Olist snapshots: one clean, one with three planted referential-integrity bugs. Identical schemas, identical lineage; only the data inside differs. We run 11 quality assertions against both and write the results back as DataHub assertion entities. The diff between those assertion sets is the production incident — no SQL execution at runtime, no data movement. The four agents coordinate by reading and writing the same dataset URNs; DataHub is the shared memory layer. (Maps to L5 + L6 in the hackathon's rubric.)

Architecture

                      [ Trigger CLI ]                     [ Browser tab 1: Agent Console ]
                            │                             [ Browser tab 2: DataHub UI    ]
                            ▼                                       ▲
                  Coordinator (Kimi-K2-Thinking)                    │ SSE
                            │                                       │
                ┌───────────┼───────────┐                           │
                ▼           ▼           ▼                           │
            Detective  Reality-Chk    Fixer                         │
            (Llama+    (Llama+      (MiniMax                        │
             LoRA)      LoRA)        M2.5)                          │
                │           │           │                           │
                └───────────┼───────────┘                           │
                            ▼                                       │
                  DataHub @ Studio A 100.114.31.63                  │
                  GraphQL reads + Python SDK writes                 │
                            │                                       │
                            └───────────────────────────────────────┘
                                  (live event stream)
  

Coordinator dispatches Detective and Reality-Checker concurrently with asyncio.gather, awaits both, then dispatches Fixer with their output and produces the final synthesis. Reads go through DataHub's GraphQL API; writes go through the Python SDK (per DataHub's recommendation against GraphQL mutations for programmatic use).
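The dispatch order described above can be sketched with plain asyncio. The three agent coroutines here are hypothetical stand-ins that sleep instead of calling a model or DataHub, and the dictionary shapes are invented for illustration. Only the control flow (gather two, then run the third on their output) reflects the description.

```python
import asyncio


async def detective(incident: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for the LLM call and GraphQL reads
    return {"agent": "detective", "dataset": "hypothetical_dataset"}


async def reality_checker(incident: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for the assertion diff
    return {"agent": "reality_checker", "production_only": []}


async def fixer(findings: dict, diff: dict) -> dict:
    await asyncio.sleep(0.01)  # stand-in for the SDK write-back
    return {
        "agent": "fixer",
        "annotated": findings["dataset"],
        "based_on": [findings["agent"], diff["agent"]],
    }


async def coordinator(incident: str) -> dict:
    # Detective and Reality-Checker run concurrently; gather returns their
    # results in argument order once both have finished.
    findings, diff = await asyncio.gather(
        detective(incident), reality_checker(incident)
    )
    # Fixer needs both outputs, so it is dispatched only after the gather.
    fix = await fixer(findings, diff)
    return {"incident": incident, "findings": findings, "diff": diff, "fix": fix}
```

Calling `asyncio.run(coordinator("revenue dashboard wrong"))` returns the combined report; the two concurrent stages together take roughly as long as the slower of the two.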

Three labs · Four roles · One fine-tune

| Agent | Lab | Model | Picked for |
| --- | --- | --- | --- |
| Coordinator | Moonshot | moonshotai/Kimi-K2-Thinking | Long-horizon planning with visible <think> reasoning traces |
| Detective | Meta | meta-llama/Meta-Llama-3.1-8B-Instruct + LoRA | Cheapest fast model on Nebius; narrow NL→GraphQL task fits a small LoRA |
| Reality-Checker | Meta | (same LoRA endpoint as Detective) | Same model, different system prompt; shared compute, different role |
| Fixer | MiniMax | MiniMaxAI/MiniMax-M2.5 | Trained for agentic code generation; drafts the Python SDK quarantine call and the Slack message |

COORDINATOR
Kimi-K2-Thinking
Plans the run, dispatches the other agents, writes the final postmortem.
Moonshot · $0.60/$2.50 · 45.7 t/s · eu-north1
DETECTIVE
Llama 3.1 8B + LoRA
Identifies the affected dataset and traces upstream lineage via DataHub GraphQL.
Meta · $0.03/$0.09 · 155 t/s · eu-north1 fast
REALITY-CHECKER
Llama 3.1 8B + LoRA
Diffs quality assertions across olist_source and the production instance.
Meta · same endpoint, different prompt
FIXER
MiniMax-M2.5
Writes incident annotations back to DataHub via Python SDK; drafts the Slack post.
MiniMax · $0.30/$1.20 · 36.8 t/s · us-central1

FINE-TUNE STORY

Base model
Fine-tuned
Training pairs
LoRA config
Job ID
Job type
Created
Completed
Duration
Cost

Training loss curve

| Epoch | Train loss | Val loss |

Accuracy on 8-pattern test set

Base + in-context
TBD
Fine-tuned LoRA
TBD
Improvement: run measure_accuracy.py

Sample seed pair (one of 8 patterns)
NL:

Artifacts

The data: Olist + planted issues

The hackathon kit ships the Brazilian E-Commerce dataset by Olist (99k orders across 9 base tables and 5 SQL views) as two SQLite files with identical schemas. The dirty copy carries three planted issues:

| Table | Planted issue | Scope | Downstream blast radius |
| --- | --- | --- | --- |
| olist_customers | ~8% of customer rows physically deleted | 7,955 rows removed (99,441 → 91,486) | v_order_details drops orders or shows NULL customer fields. Orphan FKs from olist_orders. |
| olist_order_items | seller_id values truncated by 1 character | 5,632 rows modified (32 → 31 chars) | v_seller_performance undercounts every affected seller's revenue. |
| olist_products | product_category_name set to NULL | 988 rows modified | v_product_sales silently drops uncategorized products from category aggregations. |
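The three planted issues map onto simple SQL checks. The sketch below shows how such checks could be run with sqlite3, using a tiny in-memory stand-in for the kit's files. The function names, the `customer_id` column, and the toy rows are assumptions for illustration; the table names and the `seller_id` and `product_category_name` columns come from the table above.

```python
import sqlite3


def run_assertions(conn: sqlite3.Connection, expected_customers: int) -> dict[str, bool]:
    """Return {assertion name: passed} for the three checks the planted bugs break."""
    def scalar(sql: str) -> int:
        return conn.execute(sql).fetchone()[0]

    # Rows with a NULL seller_id are not counted here; LENGTH(NULL) is NULL.
    bad_sellers = scalar(
        "SELECT COUNT(*) FROM olist_order_items WHERE LENGTH(seller_id) != 32"
    )
    null_categories = scalar(
        "SELECT COUNT(*) FROM olist_products WHERE product_category_name IS NULL"
    )
    customers = scalar("SELECT COUNT(*) FROM olist_customers")
    return {
        "seller_id length = 32": bad_sellers == 0,
        "product_category not null": null_categories == 0,
        "customer row count": customers == expected_customers,
    }


def toy_db(dirty: bool) -> sqlite3.Connection:
    """Build a miniature stand-in database (hypothetical; not the kit's data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE olist_customers (customer_id TEXT);
        CREATE TABLE olist_order_items (seller_id TEXT);
        CREATE TABLE olist_products (product_category_name TEXT);
        """
    )
    seller = "a" * 32
    conn.execute(
        "INSERT INTO olist_order_items VALUES (?)",
        (seller[:-1] if dirty else seller,),  # truncated by one character
    )
    conn.execute(
        "INSERT INTO olist_products VALUES (?)", (None if dirty else "toys",)
    )
    rows = [("c1",)] if dirty else [("c1",), ("c2",)]  # one customer deleted
    conn.executemany("INSERT INTO olist_customers VALUES (?)", rows)
    return conn
```

Against the real files, `expected_customers` would be 99,441, the clean row count listed in the table.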

The cross-instance assertion diff

Run the 11 quality assertions against olist_source, run them again against olist_dirty, and write both result sets back as DataHub assertion entities. At incident time the Reality-Checker reads both sets via GraphQL and computes the diff in Python (a deterministic set difference); Llama only writes the narrative around it.

| Assertion | olist_source | olist_dirty | Verdict |
| --- | --- | --- | --- |
| seller_id length = 32 | ✅ pass | ❌ 5,632 fail | production-only |
| customer row count = 99,441 | ✅ pass | ❌ 7,955 missing | production-only |
| product_category not null | ✅ pass | ❌ 988 NULL | production-only |
| (8 other assertions across 3 tables) | ✅ pass | ✅ pass | baseline |

No SQL execution at runtime, no data movement. The assertions were computed once during setup; the agents query assertion entities, not raw data. That's why the demo runs in seconds and scales to warehouse-size datasets.
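The deterministic part of the diff is small enough to show in full. The following is a minimal sketch, assuming each instance's results have already been fetched into a name-to-passed mapping; the function names and the "fails-in-both" bucket are hypothetical additions, while "production-only" and "baseline" match the verdicts in the table.

```python
def failing(results: dict[str, bool]) -> set[str]:
    """Names of assertions that did not pass."""
    return {name for name, passed in results.items() if not passed}


def diff_assertions(clean: dict[str, bool], prod: dict[str, bool]) -> dict[str, list[str]]:
    """Classify assertions by where they fail, using plain set operations."""
    clean_fail, prod_fail = failing(clean), failing(prod)
    shared = set(clean) & set(prod)
    return {
        # Fails in production but passes on the clean snapshot: the incident.
        "production-only": sorted(prod_fail - clean_fail),
        # Fails in both instances: a pre-existing problem, not this incident.
        "fails-in-both": sorted(prod_fail & clean_fail),
        # Passes in both instances.
        "baseline": sorted(shared - clean_fail - prod_fail),
    }
```

Because the classification never passes through a model, the same inputs always produce the same verdicts, and the language model cannot hallucinate a failing assertion into the report.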

▶ Live demo

Click TRIGGER to run the 4-agent loop against the live DataHub at Studio A and the live Nebius models. Wall time: ~65 seconds. Cost per run: ~$0.02. The Coordinator's reasoning trace streams in the leftmost pane.


Tech stack

| Layer | What | Notes |
| --- | --- | --- |
| LLM inference | Nebius Token Factory | OpenAI-compatible, one API for all 4 models |
| Models (3 + 1 LoRA) | Kimi-K2-Thinking · Llama 3.1 8B + LoRA · MiniMax-M2.5 | Three labs (Moonshot, Meta, MiniMax) |
| Metadata catalog | DataHub Core | Self-hosted via the kit's datahub docker quickstart on Studio A |
| Data validation | Custom Python (sqlite3 + DataHub SDK) | Started with the GE 0.18 plugin; pivoted to direct SDK writes after a SQLite URN encoding mismatch (see Build notes) |
| DataHub writes | acryl-datahub Python SDK | DataHub recommends SDK over GraphQL mutations for programmatic writes |
| Backend | FastAPI + sse-starlette | SSE streams agent events to the dashboard in real time |
| Frontend | Vanilla HTML / CSS / JS | Single page, no framework, ~1k LOC. Prism.js for GraphQL highlighting |
| Network | Tailscale + Tailscale Funnel | Mac Mini reaches DataHub over the tailnet; Funnel exposes only the dashboard publicly, not DataHub |
| Hosting | eliass-mac-mini.tail365038.ts.net:10001 | Funnel auto-provisions HTTPS via Let's Encrypt |
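For readers unfamiliar with SSE, the sketch below shows the wire format a browser's EventSource receives for one agent event. It uses only the standard library and does not reproduce the project's backend, where the SSE library handles this framing. The event name and payload fields are hypothetical.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Frame one server-sent event: an event line, a data line, a blank line."""
    # json.dumps without indent emits a single line, so one data: field suffices.
    payload = json.dumps(data, ensure_ascii=False)
    return f"event: {event}\ndata: {payload}\n\n"
```

For example, `format_sse("agent_step", {"agent": "detective"})` yields the text `event: agent_step`, then `data: {"agent": "detective"}`, then a blank line that tells the browser the event is complete.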

The build

~3 hours from empty directory to running demo. Built via Claude Code's /spec → /yolo workflow: write the spec, decompose into 29 tasks, execute sequentially with quality gates and per-task commits. 8 commits on the feature branch, then merged to main. Full spec at docs/specs/data-oncall-execution-plan.md.

Notable gotchas (in chronological order)

Stats