AI data engineering assistant

Every pipeline has tribal knowledge that lives nowhere.

cairndata captures it. An AI assistant that learns your pipeline's business context, monitors upstream repos for breaking changes, and writes dbt models with knowledge no doc ever captured.

claude code — ~/dbt-project
Loading context...
Knowledge graph: 847 entities
Last session: fixed stg_main__users after tier column migration
Schema cache: 3 days old (OK)
> check upstream PRs for breaking changes
I'll scan your 3 upstream repos (2 Tier 1, 1 Tier 2)...
PR #442 in backend-api — ALTER TABLE orders ADD source_channel
Impact: HIGH
stg_main__orders — missing column (I can fix this)
int_order_segments — groups by source (grain risk)
Want me to update the staging model?

Upstream changes break pipelines. Nobody connects the dots.

Backend adds a column to the users table. Nobody tells you.

Your staging model doesn't have it. Data is incomplete. You find out when an analyst asks "why is the tier field empty?"

Someone changes a status enum from 4 values to 6.

Your JOIN assumed 4 values. New values create fan-out. Metrics jump 30% overnight. Nobody knows why.

A migration runs Friday. Your pipeline runs Monday.

72 hours of bad data in the warehouse. The analyst already sent the report to C-level. Finance is "reconciling."

You join a new project. You spend 3 weeks just understanding the pipeline.

Because the real knowledge lives in Slack threads, PR comments from 2 years ago, and one person's head. No doc covers it.

You can't watch every PR in every upstream repo. The team that changed the schema doesn't even know your pipeline exists. This isn't a people problem — it's a tooling gap.

Three things. All with persistent context.

cairndata is an AI assistant that gets better at understanding your pipeline with every session. It remembers what it learns.

01

Pipeline memory

cairndata builds a knowledge graph of your project — tables, sources, business meanings, gotchas, past incidents. On first run it scans your project and upstream repos. Then it learns from every session. After a month, it knows more about your pipeline than a new team member after a quarter.

knowledge graph entity
Entity: stg_main__orders
Source: backend-api.public.orders
Business: "All customer orders, incl. drafts"
Gotcha: "status has 6 values since March, not 4 as the original doc says"
Last change: 2026-02-15, added source_channel
02

Upstream monitoring

The review-upstream-prs skill scans open PRs in your upstream repos, filters them by tier priority and keywords, and analyzes their impact on your pipeline — including grain impact analysis that detects when enum expansion causes fan-out in downstream JOINs. It then generates a prioritized impact report.

impact report excerpt
PR #442 — backend-api
Change: ADD COLUMN source_channel to orders
Impact: HIGH
stg_main__orders: missing column (fixable)
fct_revenue: joins on order_id (no grain risk)
int_order_segments: groups by source → NEW VALUES WILL CREATE NEW ROWS
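The tier-and-keyword triage the skill performs can be pictured as a simple filter-and-sort pass. This sketch is hypothetical — the repo names, tier assignments, keyword list, and scoring are invented for illustration, not cairndata's actual config:

```python
# Hypothetical tier/keyword PR triage -- repo names, tiers, and
# keywords are illustrative, not cairndata's real configuration.
TIERS = {"backend-api": 1, "payments": 1, "marketing-site": 2}
KEYWORDS = ("ALTER TABLE", "ADD COLUMN", "DROP COLUMN", "enum", "migration")

def triage(prs: list[dict]) -> list[dict]:
    """Keep PRs whose titles mention schema-related keywords,
    ordered with higher-priority (lower-tier-number) repos first."""
    hits = [
        pr for pr in prs
        if any(kw.lower() in pr["title"].lower() for kw in KEYWORDS)
    ]
    return sorted(hits, key=lambda pr: TIERS.get(pr["repo"], 99))

prs = [
    {"repo": "marketing-site", "title": "Tweak hero copy"},
    {"repo": "backend-api", "title": "ALTER TABLE orders ADD source_channel"},
]
for pr in triage(prs):
    print(pr["repo"], "-", pr["title"])
```

In practice the skill also looks at the diff itself, not just titles, but the principle is the same: noisy repos and cosmetic PRs drop out before any impact analysis runs.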
03

Model writing with context

cairndata writes dbt models knowing your project's conventions (CTE structure, naming, test patterns), warehouse schema (from cache), and business context (from the knowledge graph). It debugs problems by checking what it knows about the model first — before querying the warehouse.

context-aware vs stateless
Without context:
> "Write staging model for users table"
Generic model, misses status=3 means churned

With cairndata:
> "Write staging model for users table"
Knows status values (knowledge graph)
Knows naming convention (existing models)
Knows tier column added last week (journal)
Adds accepted_values test for status
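That accepted_values test is a standard dbt schema test. A rough sketch of how known values from the knowledge graph could be rendered into the matching schema.yml fragment — the helper function and the example status values are hypothetical, invented here for illustration:

```python
# Hypothetical helper: render a dbt accepted_values test block from
# column values recorded in the knowledge graph. Indentation matches
# the columns: section of a typical dbt schema.yml.
def accepted_values_yaml(column: str, values: list[str]) -> str:
    lines = [
        f"      - name: {column}",
        "        tests:",
        "          - accepted_values:",
        f"              values: [{', '.join(values)}]",
    ]
    return "\n".join(lines)

# Example values are invented; cairndata would pull the real ones
# from the knowledge graph entry for the model.
statuses = ["active", "trial", "churned", "paused", "pending", "deleted"]
print(accepted_values_yaml("status", statuses))
```

The point is not the YAML itself but where the values come from: a stateless agent has to guess them, while an agent with the knowledge graph already knows all six.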

Not another dashboard. A partner that learns.

01

Your machine, your files

A repo you clone. Skills, config, knowledge graph — all files on your disk. No dashboards, no vendor APIs. Runs on top of Claude Code, so AI calls go through Anthropic, but your project data stays local.

02

Contextual, not just reactive

Data observability tools tell you data is bad — after the fact. cairndata reads upstream PRs and understands why data changed, because it has your pipeline's context and history.

03

Grows with you

Knowledge graph, schema cache, session journal — every session makes cairndata smarter about your pipeline. Tribal knowledge captured in code, not in people's heads.

04

Fully hackable

Skills are markdown files. Config is a text file. The knowledge graph is JSON. Everything is readable, editable, and version-controllable. Don't like how a skill works? Change it. Want to add your own? Drop a file.

Five minutes. Three commands.

terminal
$ git clone https://github.com/cairndata/cairndata
$ cd cairndata && ./setup.sh ~/my-dbt-project
GCP Project ID: my-project
BigQuery datasets: analytics,staging
Upstream repos: org/backend,org/payments
Setup complete!
$ cd ~/my-dbt-project && claude
Knowledge graph empty. Want me to bootstrap it?
1
Clone the repo. Everything lives in ~/.cairndata/ and ~/.claude/skills/. No global installs.
2
Run setup.sh with your dbt project path. Configure GCP project, datasets, upstream repos. Creates your project config.
3
Say "Bootstrap the knowledge graph." cairndata scans your project, learns conventions, maps sources. You're productive from session one.
Requires: dbt, gh CLI, bq CLI, Claude Code

Start session. Work. cairndata remembers.

1
Start a session
cairndata loads the last 3 journal entries, searches the knowledge graph, and checks schema cache freshness. You know what changed since yesterday.
2
Say what you need
"Check upstream PRs." "Write a staging model for the new orders table." "Debug why fct_revenue spiked." cairndata picks the right skill.
3
Work with context
The agent doesn't start from zero. It knows the table, its history, its gotchas. It proposes fixes and asks for your approval.
4
cairndata remembers
Discoveries, decisions, edge cases — saved to the knowledge graph and session journal. Next session starts smarter.

Questions you probably have

Claude Code without cairndata starts from zero every session. cairndata adds a persistent context layer — knowledge graph, schema cache, session journal — so the agent remembers your pipeline across sessions. Plus 9 specialized skills that know how to analyze upstream PRs, write dbt models to your conventions, and debug with historical context. The difference is a contractor who shows up every day with no memory versus one who has been on your team for months.
It runs locally. The schema cache is YAML on your disk. The knowledge graph is a JSON file. Nothing is sent anywhere beyond standard Claude API calls. You can inspect every file cairndata creates in ~/.cairndata/.
No. cairndata complements dbt Cloud (or dbt Core). Its focus is upstream monitoring and context persistence — not orchestration, scheduling, or IDE features. Think of it as the knowledge layer that sits alongside your existing dbt setup.
The architecture is warehouse-agnostic. BigQuery is supported out of the box (bq CLI for schema inspection). Snowflake and Redshift support is on the roadmap — the core skills (knowledge graph, upstream monitoring, model writing) work regardless of warehouse.
Pricing details will be announced at launch. Join the early access list to get notified — early adopters will get a significant discount. cairndata runs on top of Claude Code (Anthropic's CLI), which requires a separate subscription.
Currently GitHub only (via gh CLI). GitLab and Bitbucket support is planned. The skill architecture makes it straightforward to add new providers.

Stop losing weeks to upstream changes.

Get notified when cairndata launches. One email, no spam. You'll know before your pipeline does.