A 4-agent incident response loop for DataHub.
Four agents running on three models (Kimi-K2-Thinking, Llama 3.1 8B + LoRA, and MiniMax-M2.5) query DataHub to identify an affected dataset, trace its upstream lineage, diff quality assertions across a clean snapshot and a production instance, and write the postmortem back via the Python SDK. End-to-end in ~65 seconds. Built for the DataHub × Nebius hackathon, EF SF, April 10 2026.
Given a natural-language incident such as "revenue dashboard wrong", four agents respond. Detective identifies the affected dataset and traces upstream lineage. Reality-Checker queries quality assertions on the same datasets in two parallel DataHub instances (one clean, one production) and returns the diff. Detective and Reality-Checker run in parallel; Fixer then writes the postmortem back to the affected datasets via the Python SDK. Coordinator orchestrates the other three and synthesizes the final report.
DataHub stores schema, lineage, and ownership — what should be true. The hackathon kit ships two parallel Olist snapshots: one clean, one with three planted referential-integrity bugs. Identical schemas, identical lineage; only the data inside differs. We run 11 quality assertions against both and write the results back as DataHub assertion entities. The diff between those assertion sets is the production incident — no SQL execution at runtime, no data movement. The four agents coordinate by reading and writing the same dataset URNs; DataHub is the shared memory layer. (Maps to L5 + L6 in the hackathon's rubric.)
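Because both snapshots share schemas and lineage, the URNs the agents read and write should differ only in the platform instance. A minimal sketch of that pairing, assuming the `<instance>.main.<table>` naming mentioned in the build notes; the platform id `sqlite`, the `PROD` environment, and the helper names are assumptions for illustration, not taken from the kit:

```python
def dataset_urn(instance: str, table: str,
                platform: str = "sqlite", env: str = "PROD") -> str:
    """Build a DataHub dataset URN for a table under a platform instance.

    The platform id and env defaults are assumptions; adjust to the ingestion config.
    """
    name = f"{instance}.main.{table}"
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def paired_urns(table: str) -> dict[str, str]:
    """Return the clean and production URNs for the same logical table."""
    return {
        "clean": dataset_urn("olist_source", table),
        "production": dataset_urn("olist_dirty", table),
    }


if __name__ == "__main__":
    for role, urn in paired_urns("olist_customers").items():
        print(role, urn)
```

Keeping URN construction in one helper means every agent addresses the same entity with the same string, which is what lets DataHub act as shared memory.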
              [ Trigger CLI ]           [ Browser tab 1: Agent Console ]
                     │                  [ Browser tab 2: DataHub UI ]
                     ▼                                 ▲
       Coordinator (Kimi-K2-Thinking)                  │ SSE
                     │                                 │
         ┌───────────┼───────────┐                     │
         ▼           ▼           ▼                     │
     Detective  Reality-Chk    Fixer                   │
      (Llama+     (Llama+    (MiniMax                  │
       LoRA)       LoRA)       M2.5)                   │
         │           │           │                     │
         └───────────┼───────────┘                     │
                     ▼                                 │
     DataHub @ Studio A 100.114.31.63                  │
     GraphQL reads + Python SDK writes                 │
                     │                                 │
                     └─────────────────────────────────┘
                             (live event stream)
Coordinator dispatches Detective and Reality-Checker concurrently with
asyncio.gather, awaits both, then dispatches Fixer with their output and
produces the final synthesis. Reads go through DataHub's GraphQL API; writes go through
the Python SDK (per DataHub's recommendation against GraphQL mutations for programmatic use).
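The dispatch order described above can be sketched with stub agents. This is an illustration of the `asyncio.gather` pattern only: the function names, return shapes, and sleeps are placeholders, not the project's actual agent code.

```python
import asyncio


async def detective(incident: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for LLM and GraphQL latency
    return {"datasets": ["olist_customers"]}  # placeholder result


async def reality_checker(incident: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for the assertion queries
    return {"production_only": ["customer row count = 99,441"]}  # placeholder


async def fixer(findings: dict, diff: dict) -> dict:
    await asyncio.sleep(0.01)  # stand-in for the SDK write-back
    return {"annotated": findings["datasets"]}


async def coordinate(incident: str) -> dict:
    # Detective and Reality-Checker run concurrently; gather preserves order.
    findings, diff = await asyncio.gather(
        detective(incident), reality_checker(incident)
    )
    # Fixer needs both results, so it is dispatched only after they return.
    fix = await fixer(findings, diff)
    return {"incident": incident, "findings": findings, "diff": diff, "fix": fix}


if __name__ == "__main__":
    print(asyncio.run(coordinate("revenue dashboard wrong")))
```

With this shape, the wall time for the first stage is the slower of the two agents rather than their sum; only the Fixer step is strictly sequential.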
| Agent | Lab | Model | Picked for |
|---|---|---|---|
| Coordinator | Moonshot | moonshotai/Kimi-K2-Thinking | Long-horizon planning with visible `<think>` reasoning traces |
| Detective | Meta | meta-llama/Meta-Llama-3.1-8B-Instruct + LoRA | Cheapest fast model on Nebius; narrow NL→GraphQL task fits a small LoRA |
| Reality-Checker | Meta | (same LoRA endpoint as Detective) | Same model, different system prompt: shared compute, different role |
| Fixer | MiniMax | MiniMaxAI/MiniMax-M2.5 | Trained for agentic code generation; drafts the Python SDK quarantine call and the Slack message |
olist_source and the production instance.…

| Epoch | Train loss | Val loss |
|---|---|---|
The hackathon kit ships the Brazilian E-Commerce dataset by Olist — 99k orders across 9 base tables and 5 SQL views — as two SQLite files with identical schemas:
- olist.db → ingested as platform instance olist_source: clean Kaggle data
- olist_dirty.db → ingested as olist_dirty: same schema, three planted referential-integrity bugs

| Table | Planted issue | Scope | Downstream blast radius |
|---|---|---|---|
| olist_customers | ~8% of customer rows physically deleted | 7,955 rows removed (99,441 → 91,486) | v_order_details drops orders or shows NULL customer fields. Orphan FKs from olist_orders. |
| olist_order_items | seller_id values truncated by 1 character | 5,632 rows modified (32 → 31 chars) | v_seller_performance undercounts every affected seller's revenue. |
| olist_products | product_category_name set to NULL | 988 rows modified | v_product_sales silently drops uncategorized products from category aggregations. |
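The three planted issues above map onto simple SQL checks. The sketch below runs them with `sqlite3` against a tiny in-memory stand-in; the toy rows and the `toy_db` and `run_checks` helpers are hypothetical, and the kit's actual 11 assertions are not reproduced here.

```python
import sqlite3


def toy_db(dirty: bool) -> sqlite3.Connection:
    """Build a three-table in-memory stand-in for olist.db or olist_dirty.db."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE olist_customers (customer_id TEXT);
        CREATE TABLE olist_order_items (order_id TEXT, seller_id TEXT);
        CREATE TABLE olist_products (product_id TEXT, product_category_name TEXT);
    """)
    customers = ["c1", "c2"] if dirty else ["c1", "c2", "c3"]  # one row deleted
    seller = "a" * 31 if dirty else "a" * 32                   # truncated by 1 char
    category = None if dirty else "toys"                       # category nulled
    conn.executemany("INSERT INTO olist_customers VALUES (?)", [(c,) for c in customers])
    conn.execute("INSERT INTO olist_order_items VALUES (?, ?)", ("o1", seller))
    conn.execute("INSERT INTO olist_products VALUES (?, ?)", ("p1", category))
    return conn


def run_checks(conn: sqlite3.Connection, expected_customers: int) -> dict[str, int]:
    """Return the number of violations per check (0 means the check passes)."""
    def scalar(sql: str) -> int:
        return conn.execute(sql).fetchone()[0]

    return {
        "customer row count": expected_customers
        - scalar("SELECT COUNT(*) FROM olist_customers"),
        "seller_id length = 32": scalar(
            "SELECT COUNT(*) FROM olist_order_items WHERE LENGTH(seller_id) != 32"
        ),
        "product_category not null": scalar(
            "SELECT COUNT(*) FROM olist_products WHERE product_category_name IS NULL"
        ),
    }


if __name__ == "__main__":
    print("clean:", run_checks(toy_db(dirty=False), expected_customers=3))
    print("dirty:", run_checks(toy_db(dirty=True), expected_customers=3))
```

Returning violation counts rather than booleans keeps the scope figures (rows missing, rows modified) available for the write-back step.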
Run the 11 quality assertions against olist_source, run them again against
olist_dirty, write both result sets back as DataHub assertion entities, then
diff them via GraphQL. The Reality-Checker computes the diff in Python (deterministic set
difference) and Llama writes the narrative around it.
| Assertion | olist_source | olist_dirty | Verdict |
|---|---|---|---|
| seller_id length = 32 | ✅ pass | ❌ 5,632 fail | production-only |
| customer row count = 99,441 | ✅ pass | ❌ 7,955 missing | production-only |
| product_category not null | ✅ pass | ❌ 988 NULL | production-only |
| (8 other assertions across 3 tables) | ✅ pass | ✅ pass | baseline |
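The verdict column in the table above is a plain set difference over failing assertion names. A minimal sketch, assuming each instance's results have already been fetched into a name-to-pass mapping (the `diff_assertions` helper and the dict shape are illustrative, not the Reality-Checker's actual code):

```python
def diff_assertions(source: dict[str, bool], dirty: dict[str, bool]) -> dict[str, set[str]]:
    """Classify assertions by comparing pass/fail results from two instances."""
    source_fail = {name for name, ok in source.items() if not ok}
    dirty_fail = {name for name, ok in dirty.items() if not ok}
    return {
        # Fails in production but not in the clean snapshot.
        "production-only": dirty_fail - source_fail,
        # Present in both instances and passing in both.
        "baseline": (set(source) & set(dirty)) - source_fail - dirty_fail,
    }


if __name__ == "__main__":
    names = ["seller_id length = 32", "customer row count = 99,441",
             "product_category not null"]
    source = {name: True for name in names} | {"other check": True}
    dirty = {name: False for name in names} | {"other check": True}
    for verdict, found in diff_assertions(source, dirty).items():
        print(verdict, sorted(found))
```

Because the classification is deterministic, the LLM only has to narrate the result and never decides which assertions differ.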
No SQL execution at runtime, no data movement. The assertions were computed once during setup; the agents query assertion entities, not raw data. That's why the demo runs in seconds and scales to warehouse-size datasets.
Click TRIGGER to run the 4-agent loop against the live DataHub at Studio A and the live Nebius models. Wall time: ~65 seconds. Cost per run: ~$0.02. The Coordinator's reasoning trace streams in the leftmost pane.
After a run completes, the Fixer agent has written incident annotations to the three affected datasets via Python SDK. Open them in DataHub UI to see the warning banner at the top of each page:
These links only work from inside the Tailscale network. The DataHub UI itself is not exposed publicly — only this dashboard is, via Tailscale Funnel.
| Layer | What | Notes |
|---|---|---|
| LLM inference | Nebius Token Factory | OpenAI-compatible, one API for all 4 models |
| Models (3 + 1 LoRA) | Kimi-K2-Thinking · Llama 3.1 8B + LoRA · MiniMax-M2.5 | Three labs (Moonshot, Meta, MiniMax) |
| Metadata catalog | DataHub Core | Self-hosted via the kit's datahub docker quickstart on Studio A |
| Data validation | Custom Python (sqlite3 + DataHub SDK) | Started with the GE 0.18 plugin; pivoted to direct SDK writes after a SQLite URN encoding mismatch (see Build notes) |
| DataHub writes | acryl-datahub Python SDK | DataHub recommends SDK over GraphQL mutations for programmatic writes |
| Backend | FastAPI + sse-starlette | SSE streams agent events to the dashboard in real time |
| Frontend | Vanilla HTML / CSS / JS | Single page, no framework, ~1k LOC. Prism.js for GraphQL highlighting |
| Network | Tailscale + Tailscale Funnel | Mac Mini reaches DataHub over the tailnet; Funnel exposes only the dashboard publicly, not DataHub |
| Hosting | eliass-mac-mini.tail365038.ts.net:10001 | Funnel auto-provisions HTTPS via Let's Encrypt |
~3 hours from empty directory to running demo. Built via Claude Code's
/spec → /yolo workflow: write the spec, decompose into 29 tasks, execute
sequentially with quality gates and per-task commits. 8 commits on the feature branch,
then merged to main. Full spec at docs/specs/data-oncall-execution-plan.md.
- typing.Subscripted handling broke on Python 3.14. Pinned the venv to
  Python 3.12 via requires-python = ">=3.10,<3.13".
- The GE plugin put the .db file path in the database-name component of its
  dataset URNs. The kit's ingestion uses `<instance>.main.<table>` instead.
  Spent 30 minutes debugging, then pivoted from the GE plugin to direct SDK
  assertion writes: same outcome (assertions in DataHub for the Reality-Checker
  to query), and it bypasses the URN mismatch entirely.
- Kimi-K2-Thinking returns its reasoning in message.reasoning,
  not message.content. Initial test with max_tokens=20 returned
  content: None because Kimi spent its budget on reasoning. Fix: read both
  fields, stream reasoning as thinking SSE events to the
  dashboard. That's where the live "watch the model think" effect comes from.
- AssertionInfo.customProperties:
  the Reality-Checker initially tried to filter assertions by check name from
  customProperties, but that field isn't queryable via GraphQL even though it's
  writable via the Python SDK. Fix: use info.datasetAssertion.nativeType instead.
- data-oncall first went up on port 443 by accident, which clobbered an
  existing internal route to 127.0.0.1:18789 (one of Ben's other services).
  Detected immediately, restored the original mapping as a tailnet-only serve
  rule, and moved data-oncall to 10001.
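The Kimi fix in the notes above (read both fields, stream reasoning as thinking events) can be sketched as follows. The message is modeled as a plain dict for illustration; the `answer` event name and both helper names are assumptions, while `thinking` and the two field names come from the note.

```python
import json


def sse_frame(event: str, text: str) -> str:
    """Format one Server-Sent Events frame: event name, JSON data, blank line."""
    payload = json.dumps({"text": text})
    return f"event: {event}\ndata: {payload}\n\n"


def message_to_events(message: dict) -> list[str]:
    """Turn a chat message into SSE frames, tolerating a missing field.

    With a small max_tokens the model may spend the whole budget on
    reasoning, leaving content as None, so neither field is assumed present.
    """
    frames = []
    if message.get("reasoning"):
        frames.append(sse_frame("thinking", message["reasoning"]))
    if message.get("content"):
        frames.append(sse_frame("answer", message["content"]))  # hypothetical event name
    return frames


if __name__ == "__main__":
    truncated = {"reasoning": "Checking upstream lineage first", "content": None}
    for frame in message_to_events(truncated):
        print(frame, end="")
```

Treating both fields as optional avoids the `content: None` failure seen in the initial test while still surfacing whatever the model produced.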