
AI Usage Playbook — Data Engineering

Core Principles: All recommendations in this playbook align with Xebia's official Core Principles for Working with AI. Refer to that document for the foundational rules that govern every AI interaction at Xebia.


How This Playbook Is Organized

This playbook consists of two complementary parts.

Part I — Tools, Techniques, and Landscape covers the AI-assisted data engineering ecosystem: a capability map showing where AI is reliable and where it isn't, detailed workflows for each stage of the data engineering lifecycle, DE-specific tooling, privacy and data governance, and common pitfalls. Think of Part I as the reference library you come back to when you need to evaluate a specific use case, set up tooling for a new engagement, or make an informed recommendation to a client.

Part II — Quick Start is a hands-on, step-by-step guide for working with an AI agent on a real data engineering task. It walks you through a concrete workflow — from preparing your dbt project for AI all the way to committing reviewed, tested code — using Claude Code as the primary tool, dbt + Snowflake as the reference stack, and GitHub for version control. A separate Jupyter/notebook entry point covers the workflow for engineers who work primarily in Databricks or notebook environments. The goal: give you one clear path you can follow today, then adapt as you build confidence.

Before both parts, the Key Principles section establishes the operating model that everything else builds on. It's short and it frames every decision in the playbook.

Start with the Key Principles, then skim Part I to orient yourself. Move to Part II (Quick Start) when you have a real task in front of you.


What This Playbook Assumes

You're a data engineer who already knows the craft. You understand dbt, SQL, orchestration, and warehouse platforms. This playbook won't explain what a CTE is or how Airflow scheduling works — it assumes that knowledge and focuses entirely on how AI changes the way you do the work you already know how to do.

You've read the Core Principles for Working with AI. We won't repeat them here. Where a principle is particularly relevant to a section, we'll call it out briefly and link back.

General AI tooling reference: For AI model selection, coding assistant profiles, cost optimization, MCP fundamentals, and the agent skills standard, see the [AI Tooling Playbook] (link TBD when published). This DE playbook covers only what's specific to data engineering work; general tooling guidance lives there.


Reference Stack

Examples throughout use dbt + Snowflake as the default stack, with Airflow referenced for orchestration context. Where BigQuery or Databricks behave meaningfully differently, we note it. The concepts and workflows apply regardless of your specific platform — the stack choice keeps examples concrete rather than abstract.


Working with AI in Data Engineering — Key Principles

AI generates. You decide. That division of labour runs through everything in this playbook.

It applies whether you're scaffolding a staging model, defining a metric in your semantic layer, or reviewing a pipeline that an agent wrote end to end. AI can accelerate a large share of data engineering work — but the judgment calls about what's correct, what's complete, and what's safe to ship remain yours.


Use AI Effectively

Use AI to onboard faster. On a new engagement, point the agent at the existing project and ask it to summarise the structure, conventions, and key models. It won't replace proper onboarding — the agent doesn't know about the legacy system that stores prices in cents or the upstream table that's only reliable after 6am — but a few well-directed questions can save you days of reading through code. (Section 2.4 expands on the mechanics of AI-assisted onboarding.)

Structure your context in layers and build it incrementally. Organisation-wide knowledge (SQL style preferences, prompting patterns) persists across projects. Project-specific context (dbt conventions, platform configuration, naming standards) persists within an engagement. Task-specific input (the current ticket or schema) is session-only. Organise your context files to match. Start with core references and add more as the kit matures — treat these files as a long-term team asset, versioned and reviewed like any shared artifact.

Provide structured context, not vague prompts. The more structured your input — schema definitions, metric specifications, lineage, edge case decisions — the more reliable the output. Use MCP servers to give agents direct access to metadata from your platform, semantic layer, or catalogue. (Section 3.2 covers DE-specific MCP server setup and the security trade-offs involved.) Research consistently shows AI accuracy improves significantly with structured business context versus raw schemas. Feed the agent as much structured information as the client's policies allow.

Boilerplate goes to the agent. Business logic stays with you. Agents excel at repetitive, pattern-clear tasks: staging models, YAML schemas with standard tests, DAG scaffolding, model documentation, legacy SQL conversions during migrations. Where they struggle is anything requiring judgment — complex business logic, grain decisions, metric definitions, edge case handling. That's where your time is best spent. (The Capability Map in Section 1 provides a task-by-task breakdown of where AI is reliable and where it's not.)

Let AI draft the spec, not just the code. Have the agent produce a specification or data contract before implementation. Let it ask questions when it hits ambiguity — that's where hidden assumptions surface. Work through the contract until it's agreed, then implement from it with far fewer surprises. (Section 2.2 walks through a concrete data contract workflow.)

Use AI to cross-reference and verify consistency. Agents can trace lineage when you rename a column, spot conflicting metric definitions across teams, and scan for test coverage gaps. This is work you'd otherwise do manually by grepping through the project, and it's exactly where things get missed. The prerequisite is access — MCP servers and workspace-aware tools make this possible. (Section 2.8 covers cross-referencing workflows in detail.)

Know when to skip AI entirely. Not every task benefits from AI in the loop. A 5-line SQL fix you already understand, a debugging session where you know exactly where to look, a nuanced conversation about what a metric should mean — these are often faster done directly. The Capability Map (Section 1) flags where AI is unreliable, but even for tasks it handles well, there's a complexity floor: if prompting and reviewing takes longer than just writing the code, skip it.

Getting value from AI is the first half of working well with it. The second half is making sure you stay in control — because AI that produces plausible-looking but subtly wrong output is more dangerous than AI that obviously fails.

Maintain Human Oversight

Control what your agents can reach. CLI agents and IDE extensions read everything in your project directory. Use ignore mechanisms to exclude credentials, connection strings, and real data. Connect agents to metadata through read-only service accounts or MCP servers. Agents should never have write access to any environment — if a workflow requires it, scope it to a sandbox as an explicit decision.

AI can assist with data classification — but the decisions are yours. Agents can flag likely PII and suggest masking patterns — use that as a starting point. The actual classification and handling decisions need human sign-off aligned with the client's governance policies. AI is pattern-matching; you're making a compliance decision. (Section 4 covers the full data governance approach.)

Treat AI-generated code like a junior engineer's PR. Review everything. Test everything. AI-generated SQL can silently get the grain wrong, use the wrong column, or include records it shouldn't. Benchmarks consistently show AI code carries more issues than human-written code, particularly in logic and correctness. Clean-looking code demands the most scrutiny. (Section 6 catalogues the most common pitfalls in AI-generated data engineering work.)

Every AI-generated model needs validation before it moves forward. Validate grain, join logic, filter conditions, aggregation behaviour, and naming consistency. A model that's structurally correct but uses the wrong grain can pass every test and still produce silently wrong numbers. (Section 2.3 covers AI-assisted data quality and testing workflows.)

Metric definitions and business logic decisions don't get delegated. What "revenue" means, how returns are handled, which cost column to use — these are human decisions. AI can propose; the sign-off is yours. Getting a metric wrong compounds across every downstream consumer.

If the input isn't solid, the output won't be either. When requirements are shifting or business logic is undocumented, AI will confidently generate something anyway. If the humans haven't agreed on what the model should do, AI can't fill that gap. Resolve ambiguity before you prompt.

Domain knowledge is your best debugging tool. AI doesn't know about the business rule that changed last quarter, the upstream table that's had data quality issues since the migration, or the fact that the amount column in the legacy system is stored in cents. When the output looks plausible but something feels off, trust your instinct and dig in.


1. AI-Assisted DE Capability Map

This section is a quick reference. When you have a specific task and want to know whether AI can help — and how much to trust the output — scan the relevant table below.

The Key Principles section covers the mindset: AI generates, you decide. The workflow sections (Section 2) cover the detailed how-to for each task type. This section sits between the two — a lookup table that helps you set expectations before you start.

How to read the reliability ratings:

  • High — AI produces usable output with standard context. You still review, but the output typically needs minor adjustments rather than rewrites.
  • Medium — AI produces a solid draft that needs meaningful human review and editing. Useful as a starting point, not a finished product.
  • Low — AI output requires heavy validation or serves only as a rough scaffold. The value is in acceleration, not accuracy. Expect to rewrite significant portions.

SQL and Transformations

| Task | AI Reliability | Context Needed | Watch For |
| --- | --- | --- | --- |
| Generate staging model from source DDL | High | Source schema, naming conventions, example staging model | Column type mapping, null handling, casting conventions |
| Generate intermediate/mart models from defined spec | Medium-High | Source models, grain definition, metric specs, join relationships | Grain errors, wrong join type, missing filters |
| Write complex business logic (revenue calc, attribution, cohort analysis) | Low — draft only | Metric definitions, edge case decisions, example calculations, business rules doc | Wrong assumptions about business rules, silent grain errors, incorrect handling of NULLs and edge cases |
| Refactor legacy SQL (CTEs, readability) | High | The SQL itself, style conventions | Semantic changes disguised as refactors — verify output matches original results |
| Optimize query performance | Medium | Query profile/explain plan, table stats, platform (Snowflake vs. BigQuery vs. Databricks — advice diverges) | Platform-specific recommendations that don't apply to your engine, optimisations that change result semantics |
| Translate SQL dialect (Oracle → Snowflake, SQL Server → BigQuery) | Medium-High | Source SQL, target dialect docs, known incompatibilities list | Semantic differences masked by syntactic similarity (e.g., NULL handling, implicit casting, date arithmetic) |
| Generate dbt incremental model logic | Medium | Source change patterns, volume, SLA, merge keys, target platform | Wrong merge strategy, missing handling for late-arriving records, delete detection gaps |

Pipeline and Orchestration

| Task | AI Reliability | Context Needed | Watch For |
| --- | --- | --- | --- |
| Scaffold Airflow DAG from description | High | DAG conventions, operator preferences, connection names, schedule requirements | Hardcoded values, missing error handling, wrong schedule expression, missing retries |
| Generate dbt YAML (sources, schema tests) | High | dbt project structure, existing YAML examples, column descriptions | Generic descriptions that don't reflect actual semantics — "The ID column" is useless documentation |
| Design incremental loading strategy | Medium | Source change patterns, volume, SLA, target platform capabilities | Assumptions about source reliability, missing late-arrival handling, no backfill strategy |
| Write infrastructure-as-code (Terraform, Pulumi) for data platform resources | Medium-High | Existing IaC patterns, provider versions, naming conventions, security policies | Missing security configurations, overly permissive IAM policies, hardcoded values that should be variables |
| Generate CI/CD pipeline configuration for dbt | Medium-High | Existing CI/CD patterns, deployment strategy, environment structure | Missing environment-specific configs, wrong job dependencies, missing dbt test step |

Data Quality and Testing

| Task | AI Reliability | Context Needed | Watch For |
| --- | --- | --- | --- |
| Generate dbt generic tests (not_null, unique, relationships, accepted_values) | High | Schema, known constraints, existing test patterns | Missing tests on critical columns, overly broad accepted_values lists |
| Generate dbt singular tests for business rules | Medium | Business rule documentation, example scenarios, edge cases | Tests that verify structure but not semantics — a test can pass and still not catch the actual bug |
| Write Great Expectations / Soda checks | Medium | Schema, data profile, known data quality issues, check format examples | Generic thresholds that don't reflect actual data distribution, missing row-level checks |
| Generate test data respecting referential integrity | Medium | Schema with foreign keys, cardinality expectations, realistic value ranges | Synthetic data that's technically valid but unrealistic — all customers from the same country, all orders placed today |
| Identify test coverage gaps | Medium-High | dbt project manifest, existing test files, critical model list | May miss domain-specific tests that aren't inferrable from schema alone |
| Design anomaly detection rules | Low-Medium | Historical data patterns, known seasonality, alerting thresholds, false positive tolerance | Rules that are either too sensitive (alert fatigue) or too loose (miss real issues) |

Documentation and Knowledge Management

| Task | AI Reliability | Context Needed | Watch For |
| --- | --- | --- | --- |
| Generate dbt model descriptions from SQL | Medium-High | The model SQL, upstream model descriptions, business context | Descriptions of what the SQL does rather than what the data means — "Joins orders with customers" vs. "All completed orders enriched with customer segment for marketing attribution" |
| Generate column-level descriptions | Medium | Table context, business glossary, example descriptions | Generic filler ("The created_at timestamp") instead of useful context ("UTC timestamp when the order was placed; populated by the checkout service") |
| Create data dictionary from schema | Medium | Schema DDL, business context, existing documentation | Accurate structure, shallow semantics — the dictionary will describe types and nullability well but miss business meaning |
| Reverse-engineer intent from undocumented SQL | Medium | The SQL, upstream/downstream context, any available comments | Plausible but wrong interpretation of intent — AI guesses what the author meant based on patterns, not actual knowledge |
| Write pipeline runbooks for failure scenarios | Medium | Common failure patterns, alerting setup, environment details, escalation paths | Missing environment-specific steps, generic troubleshooting that doesn't account for your actual infrastructure |
| Generate lineage documentation from code | Medium-High | dbt project graph, model SQL, source definitions | Accurate column-level lineage is still unreliable — model-level lineage is more trustworthy |
| Onboarding summaries for new team members | Medium-High | Full project access (dbt project, CLAUDE.md, schema context) | Will miss institutional knowledge: why things are the way they are, historical decisions, known quirks |

Migration and Modernization

| Task | AI Reliability | Context Needed | Watch For |
| --- | --- | --- | --- |
| Translate SQL between dialects at scale | Medium-High | Source SQL corpus, dialect mapping rules, known incompatibilities | Syntactically correct but semantically different translations — test every migrated query against source results |
| Reverse-engineer ETL tool logic (Informatica, SSIS, Talend) into code | Medium | ETL tool export/XML, target framework conventions, mapping documentation | Misinterpreted transformation logic, missed conditional branching, lost error handling |
| Generate schema mapping between source and target | Medium-High | Source schema, target schema, existing mappings, naming conventions | Incorrect assumptions about column equivalence based on name similarity |
| Plan migration phases and dependency order | Medium | Full source system inventory, dependency graph, team capacity constraints | Overly optimistic parallelism, missing dependencies that aren't explicit in metadata |
| Generate migration validation queries (source-to-target reconciliation) | High | Source and target schemas, key columns, aggregation logic | Missing edge case validations — reconciliation counts match but data content differs |

Streaming and Real-Time Pipelines

| Task | AI Reliability | Context Needed | Watch For |
| --- | --- | --- | --- |
| Scaffold Flink jobs and generate Flink SQL for aggregations, filtering, lookups | Medium-High | Flink version, project conventions, source/sink config, windowing requirements | Missing error handling, incorrect serialisation, wrong window boundaries and watermark behaviour — correct syntax with wrong semantics is common |
| Write Avro/Protobuf schemas for streaming contracts | High | Domain model, compatibility requirements, existing schemas | Missing backward/forward compatibility annotations, wrong default values |
| Translate batch SQL to streaming or design processing guarantees | Low | Batch SQL, streaming semantics, pipeline topology, sink capabilities, acceptable latency/accuracy trade-offs | Batch-to-streaming is not a syntax translation — a GROUP BY and a tumbling window aggregation behave fundamentally differently. Processing guarantees (exactly-once, at-least-once) are architecture decisions, not generation tasks |

How to Use This Map

Pick the task you're about to do. Check the reliability rating. If it's High, you can let the agent generate and focus your review on the "Watch For" column. If it's Medium, plan to spend meaningful time reviewing and editing. If it's Low, treat the output as a conversation starter — use it to accelerate your thinking, not as a draft to ship.

The "Context Needed" column tells you what to prepare before you prompt. The single biggest factor in output quality is not the model or the tool — it's whether you gave the agent enough structured context to work with. Section 2 covers the detailed workflows for each of these task types, including example prompts and review checklists.


2. AI-Assisted Data Engineering Workflows

This is the section you'll return to most. It covers each stage of the data engineering lifecycle with practical guidance: what AI does well, what to watch for, and example prompts you can adapt to your project.

Every workflow assumes you've read the Key Principles — especially the operating model (AI generates, you decide) and the context strategy (structured input produces reliable output, vague prompts produce plausible garbage). The Capability Map in Section 1 gives you a quick-reference rating for each task; this section goes deeper into the mechanics.


2.1 Pipeline Development

Pipeline development covers the structural work: generating transformation logic, scaffolding orchestration code, writing infrastructure-as-code, translating between platforms, and managing schema evolution. This is distinct from the SQL transformation work in Section 2.2 — here we're talking about the plumbing that connects and schedules transformations, not the transformations themselves.

Scaffolding Airflow DAGs

AI handles DAG scaffolding well. Give it your conventions (operator preferences, connection naming, retry policies, alert routing) and a description of what the DAG should do, and the output is typically close to production-ready. The main review points are schedule expressions (AI often guesses wrong on cron syntax for business calendars), error handling (missing on_failure_callback or no retry config), and hardcoded values that should come from variables or connections.

Example prompt:

Create an Airflow DAG that:
- Runs daily at 06:00 UTC
- Extracts data from the payments API (use the `payments_api` connection)
- Loads raw JSON into Snowflake stage `raw.payments.daily_extract`
- Triggers a dbt run for the payments model group
- Sends a Slack notification on failure to #data-alerts

Follow the DAG conventions in CLAUDE.md. Use our standard 
`SnowflakeOperator` wrapper, not the raw provider operator.

What to watch for: The output will look clean and follow the structure you asked for. Check that it actually uses your project's operator wrappers (not generic Airflow operators), that connection names match your connections.yaml, and that the schedule accounts for timezone and DST if relevant.
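The DST point is easy to verify directly. A minimal sketch using Python's stdlib `zoneinfo` (no Airflow required; the timezone and dates are illustrative) shows why "daily at 06:00 local time" is not a fixed UTC hour:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# A "daily at 06:00 Amsterdam" schedule does not map to a fixed UTC hour:
# the UTC offset changes when DST starts and ends.
ams = ZoneInfo("Europe/Amsterdam")

winter = datetime(2024, 1, 15, 6, 0, tzinfo=ams)   # CET, UTC+1
summer = datetime(2024, 7, 15, 6, 0, tzinfo=ams)   # CEST, UTC+2

print(winter.astimezone(ZoneInfo("UTC")).hour)  # 5
print(summer.astimezone(ZoneInfo("UTC")).hour)  # 4
```

If the business actually means 06:00 local, schedule in the local timezone; if it means a fixed UTC hour, state that explicitly in the DAG rather than letting the agent pick a default.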

Infrastructure-as-Code

For Terraform or Pulumi modules that provision data platform resources (Snowflake warehouses, BigQuery datasets, IAM roles, storage buckets), AI generates solid boilerplate from a description. The risk is in security configuration — AI tends toward permissive defaults. Always review IAM policies, network rules, and encryption settings manually. Check that resources use variables for environment-specific values rather than hardcoding dev settings.

SQL Dialect Translation

Translating SQL between dialects (Oracle → Snowflake, SQL Server → BigQuery) is a common task during migrations. AI handles syntax translation well — function name mappings, type conversions, join syntax differences. Where it fails is semantic translation: differences in NULL handling, implicit type casting, date arithmetic behaviour, and window function edge cases between engines. The translated query may parse and run but produce different results on edge cases.

The validation discipline here is non-negotiable: run the translated query against the same data and compare results row-for-row against the source system. Section 2.6 covers the full migration workflow.
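The row-for-row comparison can be scripted once and reused across every translated query. A minimal sketch, assuming rows arrive as hashable tuples (the helper name and example rows are illustrative):

```python
from collections import Counter

def diff_result_sets(source_rows, target_rows):
    """Compare two query result sets row-for-row, ignoring row order.

    Returns (missing_in_target, unexpected_in_target) as row -> count dicts.
    Counter handles duplicate rows correctly, which a plain set diff would not.
    """
    src, tgt = Counter(source_rows), Counter(target_rows)
    return dict(src - tgt), dict(tgt - src)

# Example: the translated query silently dropped a NULL-keyed row.
source = [("2024-01", 100), ("2024-01", 100), (None, 7)]
target = [("2024-01", 100), ("2024-01", 100)]
missing, unexpected = diff_result_sets(source, target)
print(missing)      # {(None, 7): 1}
print(unexpected)   # {}
```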

Schema Design and Evolution

AI can generate migration scripts (ALTER TABLE statements, new table DDL) from a description of what changed and why. It's also useful for impact analysis: "I'm adding a currency_code column to orders — what downstream models need updating?" This works well when the agent has access to the dbt project graph (via MCP server or manifest file). Without that access, it can only guess.
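With the manifest available, the impact-analysis question reduces to a graph walk. A sketch against a toy dict mirroring the `child_map` section of dbt's `target/manifest.json` (the model names are invented):

```python
from collections import deque

def downstream_of(manifest, node_id):
    """Breadth-first walk over manifest['child_map'] to collect every
    model downstream of node_id (direct and transitive)."""
    child_map = manifest["child_map"]
    seen, queue = set(), deque([node_id])
    while queue:
        for child in child_map.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

# Toy manifest standing in for target/manifest.json
manifest = {"child_map": {
    "model.shop.stg_orders": ["model.shop.int_orders"],
    "model.shop.int_orders": ["model.shop.fct_orders"],
    "model.shop.fct_orders": [],
}}
print(downstream_of(manifest, "model.shop.stg_orders"))
# ['model.shop.fct_orders', 'model.shop.int_orders']
```

This is also what an MCP-connected agent is effectively doing for you; the script version is useful when you want a deterministic answer in CI.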


2.2 SQL and Transformation Work

This is where most data engineers spend most of their time, and where AI provides the most daily value. It's also where the most dangerous mistakes happen — because incorrect SQL runs without errors and produces results that look reasonable until someone checks the numbers.

Query Generation from Requirements

The baseline use case: you have a business requirement ("monthly revenue by product category, excluding refunds, for the last 12 months") and you need SQL. AI does this well when you provide sufficient context: the relevant table schemas, what "revenue" means in your context (gross vs. net, which column, how refunds are recorded), and the target platform.

Example prompt:

Using the schema context in `schema-context.md`, write a Snowflake SQL 
query that calculates monthly revenue by product category for the last 
12 complete months.

Business rules:
- Revenue = `order_items.unit_price * order_items.quantity`
- Exclude rows where `orders.status = 'refunded'`
- Use `orders.completed_at` (UTC) for the time dimension, not `created_at`
- A "complete month" means the month has ended — exclude the current month
- Product category comes from `products.category_l1`

Output columns: month (date truncated to first of month), category, 
total_revenue, order_count

The specificity matters. "Calculate revenue" without defining which column, which status filter, and which timestamp will produce SQL that looks right but uses the wrong assumptions. The more business rules you provide upfront, the fewer iterations you need.

Query Optimization

AI can analyze query plans and suggest optimizations, but platform-specific knowledge matters enormously here. Snowflake, BigQuery, and Databricks have fundamentally different execution models — clustering keys vs. partitioning vs. Z-ordering, micro-partitions vs. slots vs. shuffle partitions. Generic optimization advice ("add an index") is often wrong or irrelevant.

Provide the query profile or EXPLAIN output alongside the query. Specify the platform. Ask for platform-specific recommendations rather than generic advice. And always benchmark before and after — an "optimization" that looks theoretically correct can make things worse in practice depending on data distribution and cluster configuration.
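The before/after discipline is worth scripting. A crude sketch of the idea on SQLite (on a real warehouse you would time the actual query on the actual platform, with caching disabled or controlled; all names here are illustrative):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
conn.executemany("INSERT INTO events (kind) VALUES (?)",
                 [("view" if i % 3 == 0 else "click",) for i in range(50_000)])

QUERY = "SELECT COUNT(*) FROM events WHERE kind = 'view'"

def bench(sql, runs=5):
    """Median wall-clock time over a few runs — crude, but catches regressions."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        conn.execute(sql).fetchall()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

result_before, t_before = conn.execute(QUERY).fetchone()[0], bench(QUERY)
conn.execute("CREATE INDEX idx_events_kind ON events (kind)")   # the "optimization"
result_after, t_after = conn.execute(QUERY).fetchone()[0], bench(QUERY)

# Two checks: the results are identical, and the speedup is measured, not assumed.
print(result_before == result_after, f"{t_before:.5f}s -> {t_after:.5f}s")
```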

Refactoring Legacy SQL

AI is reliably good at structural refactoring: converting nested subqueries to CTEs, extracting repeated logic into reusable CTEs, improving readability, adding comments. This is high-reliability work because the correctness criterion is clear — the refactored query must produce identical results.

The critical step: run both the original and refactored queries against the same data and diff the results. AI occasionally makes subtle semantic changes during "cleanup" — reordering a COALESCE, changing a LEFT JOIN to an INNER JOIN, or altering NULL handling. These look like style improvements but change the output.
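The kind of semantic drift to diff for can be a single keyword. A sketch in SQLite: one join type changed during "cleanup", and a guest-checkout row disappears (table contents are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INT, customer_id INT);
CREATE TABLE customers (customer_id INT, segment TEXT);
INSERT INTO orders VALUES (1, 100), (2, NULL);   -- order 2: guest checkout, no customer
INSERT INTO customers VALUES (100, 'retail');
""")

# Original query: LEFT JOIN keeps orders without a matching customer.
left_join = conn.execute(
    "SELECT COUNT(*) FROM orders o LEFT JOIN customers c USING (customer_id)"
).fetchone()[0]

# "Refactored" query: INNER JOIN silently drops the guest order.
inner_join = conn.execute(
    "SELECT COUNT(*) FROM orders o JOIN customers c USING (customer_id)"
).fetchone()[0]

print(left_join, inner_join)  # 2 1
```

A result diff catches this immediately; a visual review of tidy-looking CTEs often does not.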

Generating dbt Models

For dbt projects, AI accelerates the model development cycle. The pattern that works best:

  1. Provide the source schema (or point the agent at it via MCP)
  2. Describe the target model: what grain, what business logic, where it fits in the DAG (staging → intermediate → mart)
  3. Reference an existing model of the same type as a pattern to follow
  4. Ask for the model SQL, the YAML schema file with tests, and documentation — all in one pass

Example prompt:

Create a staging model `stg_payments__transactions` from the source 
`raw.stripe.transactions`.

Follow the pattern in `models/staging/stripe/stg_payments__charges.sql`.

Requirements:
- Rename columns to our snake_case convention
- Cast `amount` from cents to dollars (divide by 100, cast to decimal(12,2))
- Convert `created` (Unix epoch) to `created_at` (timestamp_ntz, UTC)
- Add `_loaded_at` from the metadata column
- Filter out test transactions (`livemode = true` only)

Also generate the YAML schema file with appropriate tests 
(unique, not_null on the primary key, accepted_values for `status`).

Referencing an existing model as a pattern is the most reliable way to get consistent output. The agent extracts naming conventions, casting patterns, and test structures from the example rather than relying on its own defaults.
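The two casts in that prompt are exactly where subtle bugs hide, so they are worth sanity-checking outside the warehouse too. A Python sketch of the same conversions (values illustrative):

```python
from datetime import datetime, timezone
from decimal import Decimal

# Cents -> dollars: do the division in Decimal, not float, so monetary
# values round cleanly to two places (the decimal(12,2) cast in the model).
amount_cents = 1999
amount_usd = (Decimal(amount_cents) / 100).quantize(Decimal("0.01"))
print(amount_usd)  # 19.99

# Unix epoch -> UTC timestamp: the Python equivalent of converting the
# `created` column to `created_at` (timestamp_ntz, UTC) in the staging model.
created = 1704067200  # 2024-01-01 00:00:00 UTC
created_at = datetime.fromtimestamp(created, tz=timezone.utc)
print(created_at.isoformat())  # 2024-01-01T00:00:00+00:00
```

When reviewing the generated SQL, check the same two things: no integer division on the amount, and no accidental local-timezone shift on the epoch conversion.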

Data Contracts

Drafting specs before code pays off especially well with data contracts.

A data contract defines what a data product exposes to its consumers: schema, grain, freshness SLA, quality guarantees, and ownership. Having AI draft the initial contract from a source schema surfaces assumptions early — before they're baked into SQL that three downstream teams depend on.

The workflow: provide the source schema and a description of what the data product should expose. Ask the agent to draft a data contract in your team's format. Review it with the team, especially grain definition, column semantics, and SLA commitments. Iterate until agreed. Then implement the model from the contract, not from ad-hoc requirements.

The contract becomes both a specification for implementation and a regression baseline — if the model's output violates the contract, something is wrong.
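The "regression baseline" use is mechanical enough to script. A minimal sketch where the contract is a plain dict checked against sample output rows — the contract format, field names, and rows are all illustrative, not a standard:

```python
# Illustrative contract: declared grain plus required columns and types.
contract = {
    "grain": ["order_id"],
    "columns": {"order_id": int, "status": str, "total_amount": float},
}

def violates_contract(rows, contract):
    """Return a list of human-readable violations (empty list = compliant)."""
    issues = []
    grain = contract["grain"]
    keys = [tuple(r[c] for c in grain) for r in rows]
    if len(keys) != len(set(keys)):
        issues.append("duplicate grain keys")
    for i, row in enumerate(rows):
        for col, typ in contract["columns"].items():
            if col not in row:
                issues.append(f"row {i}: missing {col}")
            elif not isinstance(row[col], typ):
                issues.append(f"row {i}: {col} is not {typ.__name__}")
    return issues

rows = [
    {"order_id": 1, "status": "completed", "total_amount": 42.0},
    {"order_id": 1, "status": "refunded", "total_amount": 42.0},  # grain violation
]
print(violates_contract(rows, contract))  # ['duplicate grain keys']
```

In practice the same checks are usually expressed as dbt tests or contract-enforcement features; the point is that the agreed contract, not the SQL, is the source of truth they derive from.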


2.3 Data Quality and Testing

AI-generated data tests share the same fundamental problem as AI-generated tests in application code: they can look thorough while testing nothing meaningful. A dbt test suite where every model has not_null and unique on the primary key feels complete but catches almost none of the real data quality issues — wrong grain, duplicate business keys across sources, values outside business-valid ranges, broken referential integrity with systems that don't enforce foreign keys.

Core Principle reminder: "Ensure Test Coverage — AI-generated code should include or be accompanied by appropriate tests. Verify test quality — AI often generates superficial tests that pass but don't validate actual behaviour." (Core Principles for Working with AI)

Generating dbt Tests

Start by telling the agent what to test, not just "write tests." Specify the business rules that matter:

Example prompt:

Generate dbt tests for `fct_orders`. Focus on:

1. Grain: one row per order_id — test uniqueness
2. Referential integrity: every `customer_id` exists in `dim_customers`, 
   every `product_id` exists in `dim_products`
3. Business rules:
   - `total_amount` must equal sum of related `order_items.line_total`
   - `status` must be one of: 'pending', 'completed', 'refunded', 'cancelled'
   - `completed_at` must be NULL when status is 'pending', 
     and NOT NULL when status is 'completed'
   - `total_amount` > 0 for all non-cancelled orders
4. Freshness: data should not be older than 24 hours

Use generic tests where possible. Write singular tests for the 
business rule validations that can't be expressed as generic tests.

After generating, validate the tests by intentionally breaking the model — change a join condition, remove a filter, alter a calculation — and verify that the tests catch the breakage. If they don't, they're not testing what matters.

Test Data Generation

AI can generate synthetic test data that respects schema constraints and referential integrity. This is useful for local development, CI pipelines, and environments where production data isn't available. Provide the schema, cardinality expectations, and any business constraints (e.g., "order dates should span the last 2 years," "30% of customers should have no orders," "amounts should follow a realistic distribution, not all be $100").

Watch for synthetic data that's technically valid but unrealistic — uniform distributions, all timestamps on the same day, no NULL values where production data commonly has them. Unrealistic test data produces tests that pass in CI but miss production failures.
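Generating in dependency order is the simplest way to get referential integrity by construction. A stdlib-only sketch with the distribution constraints from above baked in (all table and column names are invented):

```python
import random
from datetime import date, timedelta

random.seed(42)  # reproducible fixtures for CI

# Customers first, then orders referencing only existing customer IDs.
# Constraints from the text: ~30% of customers have no orders, dates span
# the last two years, and a commonly-NULL column actually contains NULLs.
customers = [{"customer_id": i, "country": random.choice(["NL", "DE", "US", "PL"])}
             for i in range(1, 101)]
with_orders = random.sample([c["customer_id"] for c in customers], k=70)

start = date.today() - timedelta(days=730)
orders = [{
    "order_id": n,
    "customer_id": random.choice(with_orders),              # FK always valid
    "ordered_on": start + timedelta(days=random.randrange(730)),
    "coupon_code": random.choice([None, None, None, "WELCOME10"]),  # mostly NULL
} for n in range(1, 501)]

valid_ids = {c["customer_id"] for c in customers}
print(all(o["customer_id"] in valid_ids for o in orders))  # True
```

When reviewing agent-generated equivalents, check the same properties: valid foreign keys, spread-out dates, and NULLs where production actually has them.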

Coverage Gap Analysis

Point the agent at your dbt project and ask it to compare the set of models against the set of tested models. Which models have no tests at all? Which have only structural tests (not_null, unique) but no business rule validation? Which source freshness checks are missing?

This is a high-value, low-risk use of AI — it's scanning and comparing, not generating logic. The output is a prioritised list of gaps, not code. You decide which gaps matter enough to address.
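The same scan can be done deterministically from the manifest when you want it in CI rather than in a chat. A sketch against a toy dict mirroring the relevant shape of `target/manifest.json` — model nodes and test nodes with `depends_on` (the project names are invented):

```python
# Toy manifest: two models, one of which has a test depending on it.
manifest = {"nodes": {
    "model.shop.fct_orders": {"resource_type": "model"},
    "model.shop.dim_customers": {"resource_type": "model"},
    "test.shop.unique_fct_orders_order_id": {
        "resource_type": "test",
        "depends_on": {"nodes": ["model.shop.fct_orders"]},
    },
}}

models = {uid for uid, n in manifest["nodes"].items()
          if n["resource_type"] == "model"}
tested = {dep for n in manifest["nodes"].values()
          if n["resource_type"] == "test"
          for dep in n["depends_on"]["nodes"]}

print(sorted(models - tested))  # ['model.shop.dim_customers'] — untested models
```

The script finds models with zero tests; distinguishing structural-only coverage from real business-rule coverage still needs the agent (or you) to read the tests themselves.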


2.4 Documentation and Knowledge Management

Onboarding with AI

Here's the practical onboarding workflow.

When you join a project, point the agent at the dbt project with the existing CLAUDE.md and schema context loaded. Ask it to:

  1. Summarise the model graph — how many sources, staging, intermediate, and mart models. What's the overall structure?
  2. Identify the most-used mart models (check which ones have the most downstream dependencies)
  3. Explain the testing strategy — what's tested, what isn't, what patterns are used
  4. Flag undocumented models — which models have no or minimal descriptions?
  5. Map the key data flows: "How does raw order data become the fct_orders mart?"

This won't replace a proper handover conversation with the team. The agent doesn't know about the upstream system that's only reliable after 6am, the business rule that changed last quarter, or the model that nobody touches because it's fragile. But it compresses the "reading through code" phase from days to hours, so you can spend your onboarding time asking better questions to the humans.

Documentation is one of the highest-ROI uses of AI in data engineering. The work is repetitive, pattern-clear, and chronically deprioritised under delivery pressure. AI doesn't produce finished documentation; it produces useful first drafts that you refine — and that's usually enough to get from 0% coverage to 70% in a fraction of the time it would take to write from scratch.

Generating dbt YAML Descriptions

The most immediate win. Point the agent at a dbt model's SQL and ask it to generate the YAML schema file with column descriptions. The result will be structurally correct and contextually reasonable — but it will describe what the SQL does, not what the data means. "Joins orders with customers and calculates total" is a description of the transformation. "Total revenue per customer, including only completed orders, refreshed daily at 06:00 UTC" is a description of the data. You need the latter.

Provide a few examples of good descriptions from your project. The agent extracts the pattern — the right level of detail, the right framing, whether to include freshness and grain information — and applies it consistently across new models.

Reverse-Engineering Undocumented Pipelines

On established projects, the most valuable documentation work is often reverse-engineering: taking existing SQL that nobody wrote docs for and producing a description of what it does and why. AI handles this well for structural understanding (what tables are joined, what filters are applied, what the grain is). It handles it less well for intent ("why is there a 30-day lookback filter?" — maybe it's a business rule, maybe it's a performance optimisation, maybe it was a bug that became a feature).

Use AI-generated documentation of undocumented code as a starting point, then fill in the intent from team knowledge. The structural analysis alone saves significant time.

Runbooks

Pipeline failure runbooks follow a consistent structure: what the alert means, what to check first, how to verify the issue, how to remediate, and who to escalate to. AI can generate runbook drafts from your alerting configuration and failure patterns. The drafts will be structurally sound but will miss environment-specific details (which Slack channel to escalate in, which service account to use for manual reruns, what the data team's on-call rotation looks like). Fill those in manually.


2.5 Debugging and Incident Response

When a pipeline breaks at 3am and you're staring at a wall of Airflow logs, AI becomes a surprisingly effective triage partner. Not because it understands your infrastructure — it doesn't — but because it's fast at parsing structured error output and correlating it with code.

Root Cause Analysis from Logs

Paste the error log and the relevant model or DAG code. Ask the agent to identify the likely root cause and suggest a fix. For common failure patterns — schema changes in source systems, permission errors, timeout issues, data type mismatches — AI is accurate and fast. For subtler issues (a query that succeeds but returns wrong results because an upstream table's data changed), AI needs more context: the expected vs. actual output, the upstream table's recent change history, and any data quality check results.

Example prompt:

This dbt model failed during the nightly run. Here's the error:

[paste Snowflake error output]

And here's the model SQL:

[paste model SQL]

The model ran successfully yesterday. What likely changed? 
Check for schema changes, data type issues, or NULL handling 
problems in the source tables referenced.

Failure Pattern Analysis

Over time, pipelines develop recurring failure patterns — the same model fails every month-end because of volume spikes, a specific source system is unreliable on Mondays, a particular join produces duplicates when a source table has late-arriving records. AI can help identify these patterns if you provide historical failure data (a list of failed runs with timestamps, affected models, and error types). It's looking for correlations, not causation — but the correlations are often enough to point you in the right direction.
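The shape of the input matters here: structured failure records correlate far better than raw logs. A sketch of the kind of correlation counting involved, with the record format — (run date, model, error type) tuples — as an assumption about how you export your run history:

```python
from collections import Counter
from datetime import date

def failure_patterns(failures):
    """failures: iterable of (run_date, model, error_type) tuples.
    Returns simple correlation counts — signals to investigate, not root causes."""
    by_weekday = Counter(d.strftime("%A") for d, _, _ in failures)
    by_model = Counter(model for _, model, _ in failures)
    # crude month-end heuristic: runs on day 28 or later
    month_end = sum(1 for d, _, _ in failures if d.day >= 28)
    return {"by_weekday": by_weekday, "by_model": by_model,
            "month_end_failures": month_end}
```

A "fct_orders fails twice as often at month-end" count is exactly the kind of correlation the agent surfaces from the same data — it still takes a human to decide whether the cause is volume, late-arriving records, or something else.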

Data Drift and Schema Change Impact

When a source system changes its schema — a column is renamed, a type changes, a new column appears — the impact can cascade through your pipeline. Point the agent at the source change and your dbt project graph and ask it to trace the impact: which staging models break, which downstream models are affected, which tests need updating.

This is a high-value use case because doing it manually means grepping through dozens of files and mentally tracing dependencies. The agent does this faster and more completely — provided it has access to the full project. MCP servers or workspace-aware tools are essential here.
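The underlying traversal is a plain breadth-first search. A sketch over dbt's dependency graph, assuming the `child_map` section of `manifest.json` (parent id → list of child ids) — the agent does the same walk, and additionally reads the SQL to tell you how each affected model breaks:

```python
from collections import deque

def downstream_impact(child_map, changed_id):
    """Every node transitively downstream of changed_id."""
    affected, queue = set(), deque([changed_id])
    while queue:
        for child in child_map.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return sorted(affected)
```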

Incident Postmortem Drafts

After resolving an incident, AI can draft the postmortem from the raw materials: the timeline of alerts and actions, the root cause analysis, the resolution steps, and the follow-up items. Provide a chronological log of what happened and when, and ask for a structured postmortem following your team's format. The draft will capture the facts accurately; you add the analysis of what should change to prevent recurrence, since that requires judgment about process, priorities, and trade-offs that AI doesn't have context for.


2.6 Migration and Modernization

Platform migrations — on-prem to cloud, one warehouse to another, ETL tools to code-based pipelines — are where AI delivers measurable ROI in data engineering. The work is repetitive, pattern-heavy, and voluminous: hundreds of SQL objects to translate, thousands of mappings to verify. AI won't get every translation right, but it compresses the bulk of the work and lets you focus review time on the edge cases.

SQL Dialect Translation at Scale

The core migration task: converting a corpus of SQL from one dialect to another (Oracle → Snowflake, SQL Server → BigQuery, Teradata → Databricks). AI handles the mechanical translation well — function name mappings, type conversions, syntax differences. It handles the semantic differences less reliably.

Every dialect has quirks that affect results without affecting syntax: how NULLs sort, how implicit casting works, how date arithmetic rounds, whether empty strings equal NULL. These differences produce queries that parse, run, and return results — just not the same results as the source system.

The non-negotiable validation step: for every migrated query, run source-to-target reconciliation. Compare row counts, key aggregates, and sample row-level data between the original and translated versions. AI can generate these reconciliation queries too (see the Capability Map — it's a High reliability task), which creates a useful workflow: AI translates the query, AI generates the reconciliation check, you review both and run them.
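The reconciliation checks are mechanical enough to template yourself. A minimal generator — the table names and aggregate list are placeholders, and a real migration would extend this with sampled row-level comparison:

```python
def reconciliation_queries(source_table, target_table, aggregates):
    """Emit a paired row-count + aggregates query for each side.

    aggregates: list of (sql_function, column) pairs, e.g. [("SUM", "amount")].
    Run each query on its own system, then diff the two result rows."""
    queries = {}
    for label, table in (("source", source_table), ("target", target_table)):
        exprs = ", ".join(f"{fn}({col}) AS {fn.lower()}_{col}"
                          for fn, col in aggregates)
        queries[label] = f"SELECT COUNT(*) AS row_count, {exprs} FROM {table}"
    return queries
```

A count-plus-aggregates diff catches volume and magnitude drift; it won't catch row-level differences that cancel out in the aggregate, which is why sampled row comparison remains part of the validation step.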

Reverse-Engineering ETL Tools

Legacy ETL tools (Informatica, SSIS, Talend, DataStage) store transformation logic in proprietary formats — XML definitions, visual mappings, metadata repositories. AI can parse these exports and generate equivalent code in dbt, Python, or Spark. The output is a draft, not a finished product. ETL tools encode conditional logic, error handling, and data routing in ways that don't map cleanly to SQL, and AI may miss or simplify branching paths.

Provide the ETL tool export alongside a description of the expected behaviour. Review the generated code against the original tool's documentation, not just against the export file — the export may not capture runtime configuration or server-level settings.

Schema Mapping

Generating the mapping between source and target schemas is repetitive manual work that AI accelerates. Provide both schemas and any existing mapping documentation. AI matches columns by name, type, and context, generates the mapping table, and flags columns that don't have obvious matches.

Review the flagged items — these are usually the cases that require business knowledge (the source column AMT in the legacy system maps to gross_amount in the target, not net_amount, and the conversion requires dividing by 100 because the legacy system stores cents). Name-based matching gets you 80% of the way; the remaining 20% is where the real domain knowledge lives.
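The name-based first pass is simple enough to run yourself before involving AI at all. A sketch using `difflib` fuzzy matching — the cutoff is an arbitrary starting point, and anything flagged goes to a human with domain knowledge (this is exactly where an AMT → gross_amount mapping would land):

```python
from difflib import get_close_matches

def draft_schema_mapping(source_cols, target_cols, cutoff=0.6):
    """Map each source column to its closest target column by name;
    flag columns with no confident match for human review."""
    targets = [t.lower() for t in target_cols]
    mapping, flagged = {}, []
    for col in source_cols:
        match = get_close_matches(col.lower(), targets, n=1, cutoff=cutoff)
        if match:
            mapping[col] = match[0]
        else:
            flagged.append(col)
    return mapping, flagged
```

AI's advantage over this script is context — it can use column comments, sample values, and type information to propose matches that pure name similarity misses. The review burden on flagged items is the same either way.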

Migration Planning

AI can help sequence migration phases by analysing dependency graphs — which objects need to migrate first because others depend on them, which can be parallelised, where the critical path is. Provide the full source system inventory and dependency information. The output is a useful starting point for planning, but expect it to underestimate complexity at dependency boundaries (cross-system dependencies, shared lookup tables, circular references that need to be broken).
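The sequencing itself is a topological sort, which you can verify independently of whatever the AI proposes. A sketch using Python's `graphlib` — note that `prepare()` raises `CycleError` on circular dependencies, surfacing exactly the references that need to be broken by hand:

```python
from graphlib import TopologicalSorter

def migration_waves(dependencies):
    """dependencies: {object: set of objects it depends on}.
    Returns waves; everything within one wave can migrate in parallel."""
    ts = TopologicalSorter(dependencies)
    ts.prepare()                      # raises CycleError on circular deps
    waves = []
    while ts.is_active():
        ready = list(ts.get_ready())  # objects whose dependencies are all migrated
        waves.append(sorted(ready))
        ts.done(*ready)
    return waves
```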


2.7 Streaming and Real-Time Pipelines

Streaming is where the gap between "AI generates plausible code" and "the code is actually correct" is widest. Batch SQL has a clear correctness criterion: given the same input, does the query produce the expected output? Streaming adds temporal semantics, state management, ordering guarantees, and failure recovery — dimensions that AI handles unreliably because the failure modes are subtle and context-dependent.

That said, AI is still useful for streaming work. The key is knowing where to trust it and where to increase your review intensity.

The examples below use Apache Flink because of its explicit handling of time, state, and watermarks — which is where AI struggles most. The core principles apply equally to Kafka Streams, ksqlDB, and Spark Structured Streaming: review window semantics carefully, validate state management, and test failure recovery scenarios. AI's SQL generation strengths transfer to ksqlDB and Spark SQL, but the same caveats about temporal semantics and exactly-once guarantees apply across all frameworks.

Flink Job Scaffolding

AI generates solid Flink job structure — source/sink configuration, basic pipeline topology, serialisation setup, checkpoint configuration. This is boilerplate-heavy work where the patterns are well-established. Provide your Flink version, project conventions, and source/sink configurations. Review checkpoint intervals, state backend configuration, and error handling — these are the settings where wrong defaults cause problems that only surface under load.

Example prompt:

Scaffold a Flink job (Python/PyFlink) that:
- Consumes from Kafka topic `events.page_views` (Avro, schema registry at [URL])
- Computes a 5-minute tumbling window count of page views per user_id
- Writes results to Kafka topic `analytics.pageview_counts` (JSON)
- Uses RocksDB state backend with 1-minute checkpoint interval

Follow our project structure in CLAUDE.md. Include proper 
deserialization error handling (dead-letter queue for malformed messages).

Flink SQL

For simple streaming operations — filtering, projection, joins against relatively static lookup tables, tumbling window aggregations — Flink SQL generated by AI is reasonably reliable. The syntax is close enough to standard SQL that AI's strengths in SQL generation carry over.

For complex event processing — session windows with custom triggers, pattern detection with MATCH_RECOGNIZE, stateful operations that accumulate across windows — AI output needs heavy review. The semantics of streaming SQL diverge from batch SQL in ways that AI frequently gets wrong: how late data is handled, when windows fire, what happens when state exceeds memory limits, how watermarks interact with window boundaries.

Batch-to-Streaming Translation

A common task during modernisation: taking batch SQL logic and converting it to a streaming equivalent. This is where AI is least trustworthy in the streaming domain, because the translation is not mechanical.

A batch GROUP BY product_category aggregates a complete, bounded dataset. A streaming tumbling window aggregation over the same grouping key operates on an unbounded stream with late-arriving data, out-of-order events, and windows that fire at different times. The SQL looks similar. The semantics are fundamentally different. AI tends to produce syntactically equivalent translations that miss the semantic gap — no watermark configuration, no late data policy, no handling of what happens when events arrive after the window closes.

Treat batch-to-streaming translations from AI as conversation starters, not drafts. Use them to accelerate your thinking about the streaming design, then implement the streaming logic based on your understanding of the event ordering, latency requirements, and acceptable accuracy trade-offs.

Streaming Contracts (Avro/Protobuf)

Schema definition for streaming data — Avro schemas for Kafka topics, Protobuf definitions for gRPC services — is a high-reliability AI task. Provide the domain model and compatibility requirements (backward, forward, full). AI generates clean schemas with appropriate types, defaults, and documentation. Review backward/forward compatibility annotations and default values carefully — wrong defaults in a streaming schema propagate to every consumer and are painful to fix retroactively.

Monitoring and Alerting

AI can generate monitoring rule configurations for streaming pipelines: consumer lag thresholds, throughput anomaly detection, checkpoint duration alerts, error rate alerting. Provide your current SLAs, known failure patterns, and alerting infrastructure. The main review point is threshold calibration — AI tends to suggest thresholds that are either too tight (generating noise) or too loose (missing real incidents). Start with the AI-generated thresholds, then calibrate based on observed baseline behaviour over a few days.


2.8 Cross-Referencing and Consistency Verification

Agents excel at cross-referencing — the kind of consistency checking that humans do by grepping through files and mentally tracking dependencies. Here are the specific workflows where that capability pays off.

These workflows depend on agent access. With a workspace-aware tool (Claude Code, Copilot in agent mode) or MCP servers connected to your project, they're practical and fast. Without access, you're back to copy-pasting files and losing most of the benefit.

Column Rename Impact Analysis

You're renaming user_id to customer_id in a staging model. Before you make the change, you need to know every downstream model that references the old name. Feed the agent the proposed change and the dbt project, and ask it to trace every reference — model SQL, YAML schema files, test definitions, documentation.

Example prompt:

I'm planning to rename the column `user_id` to `customer_id` 
in `stg_accounts__users`. Before I make this change, find every 
reference to this column in downstream models, tests, YAML files, 
and documentation. List each reference with the file path and line.

This is faster and more complete than a manual grep because the agent understands SQL context — it can distinguish between a reference to stg_accounts__users.user_id and an unrelated user_id column in a different model.

Metric Definition Consistency

In larger projects, the same business concept (revenue, churn, active users) gets defined in multiple places: dbt metrics, BI tool semantic layers, documentation, ad-hoc queries shared in Slack. These definitions drift over time. Feed the agent all the definitions you can find for a given metric and ask it to compare them and flag inconsistencies.

This is analysis, not code generation — the output is a comparison table showing where definitions agree and where they diverge. The human decides which definition is correct and how to reconcile the others.

Test Coverage Gap Analysis

Point the agent at your dbt project and ask: which models have no tests? Which have only generic tests (not_null, unique) but no business rule tests? Which sources are missing freshness checks? Which relationships tests are defined but reference models that no longer exist?

The output is a prioritised gap list. You decide which gaps matter based on the model's criticality, how often it changes, and how many consumers depend on it. A mart model used by ten dashboards with no business rule tests is a higher priority than a staging model with no freshness check.

Documentation Completeness Audit

Similar to test coverage, but for documentation: which models have no descriptions? Which have descriptions that are obviously auto-generated filler ("This model joins table A with table B")? Which columns lack descriptions? Which sources have no documentation at all?

Run this periodically — documentation degrades as models evolve and new ones get added without descriptions. A quarterly audit with AI doing the scanning and you doing the prioritisation keeps coverage from sliding.

The limits of AI-assisted lineage. These cross-referencing workflows work well within a dbt project or a single codebase where the agent can read all the files. Real data environments span orchestrators, ingestion tools, warehouses, and BI layers — and AI can't trace lineage across those boundaries without explicit access to each system. If your organisation uses a data catalog (DataHub, Atlan, OpenMetadata), an MCP server for that catalog (see Section 3.2) can extend the agent's reach. But for most teams, cross-tool lineage remains a manual effort or a platform-level feature — not something you can hand to an agent.


3. Tooling for Data Engineers

This section covers AI tooling that's specific to data engineering work. General tool profiles, model selection, cost optimisation, and MCP fundamentals live in the external tooling playbook — we don't repeat them here.

📖 Reference: For AI model selection, coding assistant profiles (Claude Code, GitHub Copilot, Cursor, Codex), cost optimisation strategies, MCP fundamentals, and the agent skills standard — see the [AI Tooling Playbook] (link TBD when published).


3.1 IDE and Environment Considerations

Data engineers work across a wider range of environments than most software developers. Your daily workflow might span a terminal (dbt commands, git), a SQL editor (DataGrip, DBeaver, the Snowflake console), a notebook (Databricks, Jupyter), and an IDE (VS Code) — sometimes all in the same day. Where you bring AI into that workflow depends on what you're doing.

File-based development (dbt, Airflow, IaC): Terminal-based agents like Claude Code and Codex CLI work well here. Your project is a git repository with SQL files, Python scripts, and YAML configs. The agent navigates the project, reads files, runs commands (dbt compile, sqlfluff lint), and edits code — the same way it works for application development. IDE extensions (Copilot in VS Code, Cursor) add inline completions and chat within the editor. For dbt projects specifically, the terminal agent workflow described in the Quick Start (Part II) is the recommended starting point.

SQL editors and warehouse consoles: DataGrip, DBeaver, the Snowflake worksheet, and the BigQuery console are where many data engineers write and test queries. Most of these don't have deep AI integration yet. The practical approach: use a general-purpose AI chat (Claude.ai, ChatGPT) or a terminal agent in a side window. Paste your schema context, write the query collaboratively, then move the result into your SQL editor for execution. It adds a copy-paste step, but the AI assistance more than compensates.

DataGrip and other JetBrains SQL tools now support AI features through the JetBrains AI plugin — check the current feature set, as it evolves quickly.

Notebooks (Databricks, Jupyter, Colab): Covered in the Jupyter Quick Start (Part II). The short version: Databricks Assistant is built in and has catalog access; Jupyter needs an extension or a side-window AI chat; Colab has Gemini built in. Notebooks work well for prototyping and exploration with AI, less well for structured development that needs version control and review.


3.2 DE-Specific MCP Servers

📖 Reference: For MCP fundamentals, setup instructions, security risks, and the security checklist — see the [AI Tooling Playbook].

MCP servers give AI agents access to external tools and data sources at runtime. For data engineers, the most valuable MCP servers are the ones that provide schema and metadata access — eliminating the need to paste DDL into every session.

Database MCP Servers

Servers for Postgres, Snowflake, BigQuery, and other databases let the agent query information_schema, inspect table definitions, check row counts, and — if configured — run read-only queries. This is powerful: instead of maintaining schema summary files manually, the agent gets live metadata.

The trade-off is security. Connecting an AI agent to a database means the agent can see — and potentially include in its context — whatever the connection credentials allow. On a development database with synthetic data, this is low-risk and high-reward. On anything with real customer data, treat the MCP server connection like any new service account.

Recommended configuration for database MCP servers:

  • Use a dedicated read-only service account, not your personal credentials
  • Restrict access to information_schema and metadata views where possible — the agent rarely needs to query actual data rows
  • Never connect to production databases. Development or staging environments with synthetic or anonymised data only
  • If the agent does need sample data for debugging, create a curated sample dataset and expose only that
  • Log queries made through the MCP server for audit purposes

Orchestrator Integrations

Astronomer's official MCP server for Airflow (open-source, works with Airflow 2.x and 3.x) exposes DAG status, task logs, and management operations — trigger, pause/unpause, variable and connection handling. The security implications of an agent that can trigger pipeline reruns are real; scope its permissions carefully and treat it as a write-capable connection.

Data Catalog Connectors

If your organisation uses a data catalog (Atlan, DataHub, Alation, OpenMetadata), an MCP server that exposes catalog metadata to AI agents could provide rich context: table descriptions, data lineage, ownership information, quality scores. This space is emerging. Check whether your catalog vendor offers an MCP integration or API that could serve as one.

On trusting MCP servers and skills

MCP servers and agent skills execute with the permissions of the credentials they're given, and skills inject instructions directly into the agent's context in ways that aren't always visible. Prefer official integrations from verified providers. For community or custom skills, review the source before installing — understand what instructions they inject, what tools they invoke, and whether they have access to credentials or sensitive paths. A skill from an unknown source is no different from running an untrusted script.

For a curated list of official MCP servers and skills from data engineering tool providers, see Appendix F.


3.3 dbt Integrations

dbt sits at the centre of most modern data engineering stacks, and the AI integration points are maturing.

dbt Cloud AI features: dbt Cloud includes AI-powered documentation generation and query explanation. These features use the dbt project metadata (manifest, catalog) for context, which gives them an advantage over general-purpose tools for dbt-specific tasks. Check the dbt Cloud docs for current capabilities — they expand with each release.

dbt MCP servers: Community-maintained MCP servers for dbt expose project metadata (models, tests, lineage) to AI agents. These can be useful for cross-referencing workflows (Section 2.8) and documentation generation (Section 2.4). Apply the same caution as with any community-maintained tool: review the source code before installing, pin the version, and check for updates before upgrading. The external tooling playbook's MCP security guidance applies fully here.

dbt-specific agent skills: If your team uses Claude Code, consider creating skills for recurring dbt tasks: generating a new staging model from a source, adding standard tests to a model, generating YAML documentation for undocumented models. A well-crafted skill encodes your project's specific conventions and produces output that's consistent from the first invocation, saving the iteration time you'd spend teaching a generic agent your patterns.

dbt Power User (VS Code extension): Altimate AI's dbt Power User adds model generation, query explanation, test scaffolding, and lineage visualization directly in VS Code. If your team already works in VS Code, it's worth evaluating alongside terminal-based agents — the IDE integration means less context-switching for tasks like documentation and test generation. Altimate AI also publishes open-source data engineering skills for Claude Code and other agents (listed in Appendix F).


3.4 Semantic Layers and AI

Semantic layers define metrics, dimensions, and business logic in a single, governed place — so that "revenue" means the same thing whether it's queried from a dashboard, a notebook, or a Slack bot. They're increasingly relevant to AI workflows because they give models structured business context that's harder to get wrong than raw SQL.

dbt Semantic Layer uses MetricFlow to define metrics as code within your dbt project. AI agents with access to the dbt project can read metric definitions, understand how measures compose, and generate queries that use the semantic layer API rather than writing raw SQL against tables. This reduces the risk of metric definition drift — the agent uses the governed definition instead of guessing how to calculate revenue.

Snowflake Cortex Analyst works with semantic models that describe your tables, dimensions, measures, and relationships. Snowflake now recommends semantic views (SQL-based, defined in Snowflake) as the primary approach, though YAML-based semantic models still work for backward compatibility (as of March 2026 — check Snowflake docs for current guidance). When users ask natural language questions, Cortex Analyst generates SQL grounded in the semantic model rather than guessing from raw schemas. For data engineers, the work is in building and maintaining the semantic model — AI can help draft it from existing table structures, but the metric definitions and business logic are human decisions.

Databricks AI Functions and Unity Catalog take a different approach — Unity Catalog serves as the metadata layer, and AI functions like ai_query (the general-purpose function for calling LLMs from SQL), ai_classify, ai_extract, and ai_similarity operate within SQL queries. The catalog's table and column descriptions act as implicit semantic context. The richer your catalog metadata, the better the AI output. Databricks is also building deeper semantic layer capabilities through its acquisitions of, and investment in, this space — check current docs for the latest.

Where AI fits in semantic layer work:

AI is useful for drafting semantic model definitions from existing schemas and dbt models — the structural scaffolding is repetitive and pattern-clear. It's also good at detecting inconsistencies between semantic layer definitions and the SQL that predates them. What it can't do is decide what a metric should mean. The semantic layer is where business logic gets codified, and that remains a human responsibility — same principle as metric definitions in the Key Principles section.


3.5 Platform-Native AI Features

Every major data platform now includes built-in AI features. These are worth knowing about, but they change fast — this section provides an honest snapshot with links to current documentation. Don't rely on the specifics here; check the vendor docs.

Snowflake — Cortex AI
  Good for: SQL generation, data exploration, in-warehouse ML functions, document AI
  Limitations: tied to the Snowflake ecosystem; model selection limited to what Snowflake offers; works best with well-structured metadata

BigQuery — Gemini in BigQuery
  Good for: SQL generation and explanation, query optimisation suggestions, data exploration
  Limitations: quality depends heavily on dataset metadata completeness; best with well-documented tables

Databricks — Databricks Assistant
  Good for: notebook code generation (SQL, Python, Scala), data exploration, error explanation
  Limitations: works within Databricks notebooks; less useful for dbt or file-based workflows outside the Databricks environment

When to use platform-native tools vs. general-purpose tools:

Platform-native tools have one significant advantage: they already have access to your warehouse metadata. Snowflake Cortex knows your table schemas without you pasting DDL. Databricks Assistant can see your catalog. This makes them useful for quick, in-console exploration — "what tables contain customer data?", "explain this query," "suggest a more efficient join strategy for this specific table distribution."

General-purpose tools (Claude Code, Copilot, Cursor) are more flexible. They work across platforms, handle broader tasks (orchestration, IaC, documentation, multi-file refactors), and give you more control over context and prompting. For structured development work — building models, writing tests, creating documentation, managing pipelines — general-purpose tools are typically more effective because they can see your entire project, not just the warehouse.

In practice, most data engineers use both: platform-native tools for quick exploration and ad-hoc queries, general-purpose tools for project-level development work.


4. Security, Privacy, and Data Governance

Data engineers handle production data routinely — raw event streams, customer records, financial transactions, health data. This makes the privacy and security stakes higher than in most application development work. The "sanitise before pasting" guidance from the Core Principles applies, but the surface area is larger and the risks are more specific.

📖 Reference: For general AI tool data handling policies, privacy tiers, subscription vs. API security guarantees, and the base sanitisation checklist — see the [AI Tooling Playbook].

Core Principle reminder: "Protect Sensitive Information and Control Data Usage — Understand your boundaries. Follow your organization's policies on what data can be shared with GenAI tools. Ensure your inputs aren't collected for training." (Core Principles for Working with AI)


Protecting Production Data from Accidental Destruction

AI-generated code can destroy production data. This isn't hypothetical — databases have been wiped because an AI-generated script ran a DROP TABLE, a DELETE without a WHERE clause, or a TRUNCATE against the wrong environment. But it's not just SQL. Maintenance scripts, Terraform runs, and migration tools have caused the same damage — any code path that touches infrastructure or data is a vector.

The core problem: AI generates write operations as confidently as read operations. It doesn't distinguish between a development sandbox and a production warehouse. A prompt like "clean up the staging table" can produce a statement that works perfectly — against whichever database the session happens to be connected to.

Models can also be surprisingly creative in how they reach resources. An agent told to accomplish a task may find non-obvious paths — invoking a CLI tool, calling an API, generating a script that shells out to another tool, or chaining permissions in ways you didn't anticipate. This means relying solely on command-by-command approval is not enough. If the credentials and network access exist, a sufficiently capable model will find a way to use them. The real safeguard is ensuring the access doesn't exist in the first place.

Practices that prevent this:

Enforce read-only access by default. Every service account, MCP connection, and agent credential should be read-only unless there's a specific, documented reason for write access. If a workflow genuinely requires writes, use a dedicated account scoped to exactly the tables and operations needed — never broad WRITE or ALL PRIVILEGES grants.

Route production writes through automated pipelines, not standing access. Ideally, no human or agent has standing write access to production. Changes reach production through CI/CD pipelines, orchestrators (Airflow, Dagster, Prefect), or managed deployment tools — with code review, approval gates, and audit trails. If someone needs to run a one-off fix against production, that should be a controlled, time-limited exception with a documented reason — not a credential that's always available. Where direct access is unavoidable, use service account impersonation rather than granting persistent credentials — it provides the same access with a time-bound scope and a clear audit trail tied to the requesting identity.

Separate credentials per environment. Production, staging, and development should use different credentials with different permission levels. Never store production connection strings in the same configuration files or environment variables that agents or local tools can access. If an agent can reach it, assume it eventually will.

Review every write operation as a destructive action. INSERT, UPDATE, DELETE, DROP, TRUNCATE, MERGE, ALTER — treat AI-generated versions of these with the same scrutiny you'd give someone running rm -rf on a shared server. Read the WHERE clauses. Check the target schema. Verify the environment.

Never run AI-generated DDL or DML against production without a dry run. Use EXPLAIN plans, SELECT equivalents of DELETE statements (to see what would be affected), and transaction wrappers that you can roll back. On platforms that support it, use BEGIN ... ROLLBACK to validate before committing.
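The dry-run pattern can be sketched in a few lines. This uses Python's sqlite3 as a stand-in for the warehouse, and the table, column, and predicate names are invented for illustration — the point is the sequence: preview with a SELECT, validate inside a transaction, roll back.

```python
import sqlite3

# In-memory database standing in for a development warehouse.
# Autocommit mode so we can manage the transaction explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE staging_events (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO staging_events VALUES (?, ?)",
                 [(1, "stale"), (2, "stale"), (3, "active")])

# 1. Dry run: the SELECT equivalent of the DELETE shows what would be hit.
predicate = "status = 'stale'"
affected = conn.execute(
    f"SELECT COUNT(*) FROM staging_events WHERE {predicate}").fetchone()[0]

# 2. Validate inside an explicit transaction, then roll back.
conn.execute("BEGIN")
conn.execute(f"DELETE FROM staging_events WHERE {predicate}")
remaining = conn.execute("SELECT COUNT(*) FROM staging_events").fetchone()[0]
conn.execute("ROLLBACK")  # nothing committed; the table is untouched

after_rollback = conn.execute(
    "SELECT COUNT(*) FROM staging_events").fetchone()[0]
```

On platforms without transactional DDL, the SELECT preview is the only safety net — which is exactly why it should never be skipped.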

Use platform safeguards. Most data platforms offer protections — enable them. Snowflake's FAIL_SAFE and TIME_TRAVEL, BigQuery's table snapshots, Postgres pg_dump before migrations, dbt's --defer and --empty flags for safe testing. These aren't substitutes for careful review, but they're your safety net when review fails.

Tag and isolate agent-initiated operations. If agents execute queries, use session tags or query comments so that any destructive operation can be traced back to its origin. When something goes wrong at 2am, you need to know whether it was a scheduled job, a human, or an agent.
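One lightweight way to do this is to prefix every agent-issued query with a structured comment that survives into the warehouse's query history. A minimal sketch — the tag fields and their names are illustrative, not a standard:

```python
import datetime

def tag_query(sql: str, origin: str, session_id: str) -> str:
    """Prefix a query with a traceable comment so any operation can be
    attributed to its origin in the warehouse's query history."""
    ts = datetime.datetime.now(
        datetime.timezone.utc).isoformat(timespec="seconds")
    comment = f"/* origin={origin} session={session_id} issued_at={ts} */"
    return f"{comment}\n{sql}"

tagged = tag_query("SELECT COUNT(*) FROM stg_orders",
                   origin="agent:claude-code", session_id="abc123")
```

Platforms with native session tagging (e.g. Snowflake's QUERY_TAG session parameter) make this cleaner, but a comment prefix works everywhere and costs nothing.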

The Heightened Risk for Data Engineers

When an application developer pastes a code snippet into an AI tool, the risk is primarily about intellectual property and credentials. When a data engineer pastes a query result, the risk extends to customer data, financial records, and regulated information. The difference isn't just theoretical — a table scan result from a users table can contain thousands of PII records in a single paste.

Schema definitions (DDL) are almost always safe to share — they describe structure, not data. But even schemas can be sensitive: a table named suspected_fraud_cases or employee_termination_queue reveals information about internal processes that a client may not want exposed.

Data Classification Awareness

Before involving AI in any task that touches data, consider what tier of data you're working with.

Not all tables are equal. dim_date is a reference table with no privacy implications. raw.stripe.charges contains financial transaction data. users.pii is self-explanatory. Your awareness of data classification should determine how much context you provide to AI tools and through which channels.

If your project has a data classification scheme (most regulated industries do), reference it. If it doesn't, apply common sense: anything that could identify a person, reveal financial information, or expose health data requires the strictest handling. When in doubt, use schema-only context — no data rows.

What to Share and What to Withhold

✗ Production data rows containing PII, financial data, or health records
✗ Real customer IDs, account numbers, or transaction details
✗ Connection strings with embedded credentials
✗ Schema names that reveal confidential business structures (consider aliases)
✗ Query results from production — even aggregated data can be sensitive
✗ Sample data exports from production tables, even "just a few rows"

✓ Schema definitions (DDL) without data — these are almost always safe
✓ Synthetic/fake data rows that match the schema structure
✓ Anonymised samples where actual data patterns are needed for debugging
✓ Column names and types referenced generically when the schema itself is confidential
✓ dbt model SQL and YAML — code, not data
✓ Error messages and stack traces (check for embedded data values first)

Practical tip: When you need AI to help debug a data issue and sample data is necessary, create a synthetic dataset that reproduces the problem pattern. If a join is producing unexpected duplicates, create a minimal set of fake rows that demonstrates the duplication — don't paste the actual production rows where the issue occurs.
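A minimal synthetic reproduction of such a fan-out join might look like this — sqlite3 stands in for the warehouse, and every row is fake:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount INTEGER)")
conn.execute("CREATE TABLE shipments (order_id INTEGER, carrier TEXT)")

# One order, two shipment rows for it — the duplication pattern we want
# the AI to see, with no production data involved.
conn.execute("INSERT INTO orders VALUES (1, 100)")
conn.executemany("INSERT INTO shipments VALUES (?, ?)",
                 [(1, "ups"), (1, "fedex")])

rows = conn.execute("""
    SELECT o.order_id, o.amount, s.carrier
    FROM orders o JOIN shipments s ON o.order_id = s.order_id
""").fetchall()
# The join fans out: one order becomes two rows, doubling summed amounts.
```

Three fake rows are enough for the agent to see the cardinality problem and propose a fix — no production paste required.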

PII and AI Tool Context

AI tools retain conversation context for the duration of a session (and sometimes across sessions, depending on the platform and tier). Data rows pasted into an AI session become part of that context. Even with enterprise-tier tools that don't use data for training, the data still passes through the provider's infrastructure and may be logged for abuse detection or debugging.

The rule is simple: never paste production data rows containing PII into any AI tool. Use schema-only context. If you need data examples, use synthetic data. If you're debugging a data issue that requires real data patterns, anonymise the relevant columns before sharing.

For teams that regularly need AI assistance with data-specific debugging, consider maintaining a library of anonymised sample datasets for common source systems. This pays for itself quickly — every team member can load realistic (but safe) sample data into an AI session without the risk of accidentally sharing real customer records.
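One simple technique for the anonymisation step is stable pseudonymisation: replacing identifying values with salted hashes so join behaviour and duplicate patterns survive, but the originals don't. A sketch, with invented column values and salt:

```python
import hashlib

def pseudonymise(value: str, salt: str) -> str:
    """Replace an identifying value with a stable, irreversible token.
    The same input maps to the same token, so joins and duplicate
    patterns are preserved in the anonymised sample."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()
    return f"anon_{digest[:12]}"

rows = [
    {"email": "alice@example.com", "amount": 100},
    {"email": "alice@example.com", "amount": 40},
    {"email": "bob@example.com", "amount": 75},
]
safe = [{**r, "email": pseudonymise(r["email"], salt="engagement-secret")}
        for r in rows]
```

Keep the salt out of the AI session and out of version control — with the salt, the mapping is reproducible; without it, the tokens are just opaque identifiers.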

Lineage and Governance

AI-generated transformations need to be traceable. If your organisation tracks data lineage — which transformations produce which outputs, where data originates, how it flows through the pipeline — AI-generated models should be documented in exactly the same way as human-written ones.

This means: clear model descriptions in dbt YAML, documented business logic in comments, and lineage metadata that accurately reflects the transformation chain. "AI generated this model" is not a substitute for proper documentation. The model's documentation should describe what the model does and why, regardless of who (or what) wrote it.

Client and Regulatory Constraints

Different clients and industries impose different constraints on AI tool usage with their data.

GDPR, HIPAA, SOX, PCI-DSS, and industry-specific regulations may restrict what data context you can share with AI tools — even with enterprise-tier data handling guarantees. Some clients require that all data processing stays within specific geographic regions (EU, US). Some prohibit the use of AI tools on their data entirely. Some allow it with specific approved tools and configurations.

Verify the constraints at the start of every engagement. Check the client's AI usage policies and data processing agreements. Confirm which tools are approved, that model training on client data is off, and that built-in platform AI features (Snowflake Cortex, Databricks Assistant, BigQuery Gemini) are explicitly covered. Don't assume that because you used a tool on the last project, it's approved for this one.

MCP Server Security for Data Access

Connecting AI agents to databases via MCP servers introduces a specific risk vector that deserves separate attention. When you configure a database MCP server, the agent can query whatever the connection credentials allow. The recommendations from Section 3.2 apply: dedicated read-only service accounts, restricted to metadata where possible, never connected to production.

The additional governance consideration: queries made through MCP servers should be auditable. If your organisation requires logging of all queries against a given database, MCP server queries need to be included in that logging. Check whether the MCP server implementation you're using supports query logging, and if not, ensure the database's own audit logging captures the MCP service account's activity.


5. Data Products and Domain Ownership

AI doesn't know where your domain ends and another team's begins. It will happily suggest a join to a table your domain shouldn't depend on, or generate a model that duplicates logic another team owns. When your team owns a data product end-to-end, the boundaries that matter most are exactly the ones AI is blind to — and that changes how you work with it.

The tools stay the same. What changes is the context you provide, the scope of what you let AI touch, and where you focus your review.


Context Boundaries Get Tighter — and That's Better

In a domain-owned model, your CLAUDE.md and context files describe your domain's data products, not the entire warehouse. You don't need the agent to understand every table in the organisation — just your sources, your models, and your consumers.

This is actually good news for AI. Smaller, more focused context produces more reliable output. A schema summary that covers 15 tables in your domain is more useful than a 500-table warehouse DDL dump that the agent will struggle to reason about coherently.

Structure context files per data product or domain. If your team owns the orders domain and the payments domain, maintain separate context files for each. When working on orders, load the orders context. This keeps the agent focused and reduces the chance of cross-contamination — the agent suggesting a join to a payments table that your domain shouldn't depend on.

Data Contracts Become First-Class AI Use Cases

If your organisation uses data contracts — schema definitions, SLA commitments, quality guarantees published to downstream consumers — AI can help draft and review them.

The workflow: provide the source schema, a description of what the data product should expose, and existing contracts as examples. Ask the agent to draft a contract in your team's format. The agent extracts the pattern (structure, level of detail, what's included) from the examples and applies it to the new contract.

The human validates three things that AI can't judge reliably: whether the contract's semantics accurately reflect the data (the "revenue" column actually contains gross revenue, not net), whether the SLA commitments are achievable given the pipeline's actual behaviour, and whether the quality guarantees are specific enough to be meaningful to consumers.

Data contracts are also where the "draft the spec before the code" principle from the Key Principles pays off most. A contract that's been reviewed and agreed before implementation catches misunderstandings that would otherwise surface as production incidents reported by the consuming team.

Ownership Means Accountability for AI Output

In a data mesh, your team is the published owner of a data product that other teams depend on. When a consumer team reports that the numbers are wrong, the response can't be "the AI generated that model." You reviewed it, you tested it, you shipped it — it's yours.

This isn't different from the standard "you own what you merge" principle that applies everywhere. But domain ownership makes it more visible. Your data product has consumers, an SLA, and quality expectations. AI-generated code that doesn't meet those expectations is your problem to fix, regardless of who wrote the first draft.

The practical implication: invest more review time in AI-generated models that feed data products, not less. The Key Principles guidance about treating AI output like a junior engineer's PR applies with extra weight here.

Discovery and Documentation

Data products need to be discoverable and trustworthy. That requires documentation: catalog entries, column descriptions, usage examples, quality metrics, freshness guarantees, known limitations.

This is the kind of documentation that gets deprioritised under delivery pressure — and where AI delivers high ROI. Generating draft descriptions, column-level documentation, and usage examples is a high-reliability AI task (see the Capability Map and Section 2.4). The effort to review and refine AI-generated documentation is a fraction of the effort to write it from scratch.

For teams adopting data products, a practical first step: run an AI-assisted documentation audit (Section 2.8) across your domain's models. Generate descriptions for everything that's undocumented. Refine them. Publish them. Then maintain them as the models evolve — documentation that accurately describes last quarter's schema is worse than no documentation at all.

Cross-Domain Impact Analysis

When your domain's schema changes — a column is renamed, a type changes, a model is deprecated — the impact extends beyond your project to every consuming domain.

AI can help assess that impact, but only if you provide the consumer contracts or dependency graph as context. Without that information, the agent can trace dependencies within your project (Section 2.8) but can't know what external consumers depend on.

If your organisation maintains a central data catalog or contract registry, connect it to your AI workflow (via MCP server or exported metadata). If not, maintain a simple consumer list per data product — which teams use it, what they use it for, which columns they depend on. Even a manually maintained markdown file is better than nothing when you need to assess the blast radius of a breaking change.
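Even a hand-maintained consumer list supports a quick blast-radius check. A sketch with hypothetical team and column names — the registry shape is an assumption, not a standard format:

```python
# Hypothetical consumer registry for one data product: who uses it,
# and which columns they depend on.
consumers = {
    "marketing-analytics": ["customer_id", "net_ltv", "first_order_at"],
    "finance-reporting": ["customer_id", "total_revenue"],
    "growth-experiments": ["customer_id", "last_order_at"],
}

def blast_radius(changed_columns: set[str]) -> list[str]:
    """Return the teams affected by a change to the given columns."""
    return sorted(team for team, cols in consumers.items()
                  if changed_columns & set(cols))

affected = blast_radius({"net_ltv"})
```

The same registry, provided as context, is what lets an agent extend its impact analysis beyond your own project's lineage graph.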


6. Common Pitfalls in AI-Generated Data Engineering Work

AI-generated data engineering code has a specific failure mode that makes it more dangerous than buggy application code: it runs without errors and produces results that look reasonable. A broken API call throws an exception. A wrong SQL query returns a number — just not the right number. You find out when a stakeholder asks why the dashboard doesn't match the spreadsheet, or worse, when a business decision gets made on incorrect data.

Core Principle reminder: Review, Test, and Verify All Outputs — Critically examine every line of AI-generated code. Test thoroughly for functionality, security vulnerabilities, logic errors, and edge cases. Never commit code you don't understand. Core Principles for Working with AI


SQL That Runs but Produces Wrong Results

The most common and costly category. AI gets join types wrong (especially composite keys and date-range joins), silently introduces grain errors through unresolved many-to-many relationships, misses or misapplies filter conditions (wrong status values, wrong date column), and mishandles NULLs in WHERE clauses. The output looks plausible — the numbers are just wrong. Always validate against known-good results before promoting.
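The NULL pitfall in particular takes three fake rows to demonstrate — sqlite3 as stand-in, values invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "completed"), (2, "cancelled"), (3, None)])

# Intent: "everything except cancelled orders" — but NULL <> 'cancelled'
# evaluates to NULL, so the row with unknown status is silently dropped.
naive = conn.execute(
    "SELECT id FROM orders WHERE status <> 'cancelled' ORDER BY id"
).fetchall()

# Explicit NULL handling keeps the unknown-status row.
explicit = conn.execute(
    "SELECT id FROM orders WHERE status <> 'cancelled' OR status IS NULL "
    "ORDER BY id"
).fetchall()
```

Both queries run without error and return plausible results — which is exactly why this class of bug survives a casual review.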


Transformations That Work on Sample Data but Fail at Scale

Queries that run fine on 10,000 development rows fail or degrade on 500 million production rows. Common causes: missing partition pruning, joins that explode from one-to-many to many-to-many at real cardinality, and window functions that spill to disk. Always check query profiles at representative data volumes.


Pipeline Code That Doesn't Handle Real-World Conditions

AI-generated pipelines assume clean, first-run scenarios. They lack idempotency (reruns produce duplicates), miss late-arriving data (simple MAX(loaded_at) patterns skip records that arrive out of order), and ignore schema evolution (no defensive handling for source schema drift).
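The late-arriving-data failure is worth seeing concretely. A sketch of the naive high-watermark pattern, with sqlite3 as stand-in and invented timestamps:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, loaded_at INTEGER)")
conn.execute("CREATE TABLE target (id INTEGER, loaded_at INTEGER)")

# First incremental run: rows loaded up to t=100.
conn.executemany("INSERT INTO source VALUES (?, ?)", [(1, 90), (2, 100)])
watermark = 0
conn.execute("INSERT INTO target SELECT * FROM source WHERE loaded_at > ?",
             (watermark,))
watermark = conn.execute("SELECT MAX(loaded_at) FROM target").fetchone()[0]

# A late-arriving row lands with loaded_at = 95 — behind the watermark.
conn.execute("INSERT INTO source VALUES (3, 95)")

# Second run: the naive MAX(loaded_at) filter skips the late row forever.
conn.execute("INSERT INTO target SELECT * FROM source WHERE loaded_at > ?",
             (watermark,))
missing = conn.execute(
    "SELECT COUNT(*) FROM source WHERE id NOT IN (SELECT id FROM target)"
).fetchone()[0]
```

Common mitigations include a lookback window on the watermark or an ingestion-time column that is guaranteed monotonic — but the agent won't add either unless prompted.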


Data Quality Rules That Don't Catch Real Issues

AI-generated tests default to structural checks (not_null, unique, accepted_values) that confirm table shape but miss business logic errors. Watch for overly broad accepted values that include data quality issues as valid, missing behavioural tests that verify calculations and business rules, and absent source freshness checks.


Streaming-Specific Pitfalls

AI-generated Flink SQL commonly misconfigures window boundaries and watermarks, neglects state cleanup and TTL (causing unbounded state growth in production), and produces jobs that lack end-to-end exactly-once semantics because sink connectors aren't configured for transactional writes.


Over-Reliance on AI for Performance Optimisation

AI's optimisation advice is often generically correct but specifically wrong for your platform, data distribution, or cluster configuration. Always benchmark before and after with representative data volumes — check the query profile, not just the result.


The Review Checklist

Before approving any AI-generated data engineering code, check:

✓ Grain: does the model produce the expected number of rows per entity per time period?
✓ Joins: is every join type correct? Are join conditions complete (no missing composite key columns)?
✓ Filters: do WHERE clauses match the actual data values in the source? Is NULL handling correct?
✓ Aggregations: are all non-aggregated columns in the GROUP BY? Is the aggregation level correct?
✓ Incremental logic: is the model idempotent? Does it handle late-arriving data?
✓ Tests: do the tests verify business behaviour, not just table structure?
✓ Performance: does the query perform acceptably at production data volumes?
✓ Naming: do column names, model names, and file locations follow project conventions?
✓ Validation: has the output been compared against known-good results?
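The grain check at the top of the list reduces to a single query. Here it runs against sqlite3 with fake rows, one of which violates the expected one-row-per-customer grain:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fct_customer_ltv (customer_id INTEGER, net_ltv INTEGER)")
conn.executemany("INSERT INTO fct_customer_ltv VALUES (?, ?)",
                 [(1, 100), (2, 50), (2, 50)])  # customer 2: grain violation

violations = conn.execute("""
    SELECT customer_id, COUNT(*) AS n
    FROM fct_customer_ltv
    GROUP BY customer_id
    HAVING COUNT(*) > 1
""").fetchall()
```

An empty result means the grain holds; anything else names the offending entities. In a dbt project the same check belongs in the model's schema YAML as a unique test, so it runs on every build rather than only during review.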

7. Measuring AI Impact in Data Engineering

You'll be asked whether AI is making a difference. The honest answer requires measurable signals, appropriate caveats, and a baseline established before AI adoption is fully embedded.


Observable Metrics

  • Pipeline development velocity — cycle time from task start to production deployment, averaged across sprints
  • Data incident rate — data quality incidents per sprint or release (schema violations, failed checks, consumer-reported issues)
  • Defect escape rate — proportion of issues caught in CI/staging vs. discovered in production
  • Test coverage — percentage of models with meaningful tests (business rule validation, not just not_null/unique)
  • Documentation coverage — percentage of models and columns with descriptions, extractable from dbt manifest/catalog
  • Migration throughput — objects migrated per sprint (SQL translated, reconciliation tests written, validations completed)
  • Incident response time — time from alert to root cause identification, and from root cause to resolution
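Documentation coverage in particular is cheap to compute from the dbt manifest. The real manifest.json nests this information under "nodes" keyed by unique_id with a more elaborate schema that varies by dbt version; this sketch assumes a simplified shape to show the calculation:

```python
# Simplified stand-in for dbt's manifest.json node entries.
manifest_nodes = {
    "model.proj.fct_orders": {
        "description": "One row per order.",
        "columns": {"order_id": {"description": "PK."},
                    "amount": {"description": ""}},
    },
    "model.proj.dim_customers": {
        "description": "",
        "columns": {"customer_id": {"description": "PK."}},
    },
}

def doc_coverage(nodes: dict) -> tuple[float, float]:
    """Return (model-level, column-level) documentation coverage
    as fractions, counting non-empty descriptions."""
    models = [bool(n["description"]) for n in nodes.values()]
    columns = [bool(c["description"])
               for n in nodes.values() for c in n["columns"].values()]
    return sum(models) / len(models), sum(columns) / len(columns)

model_cov, column_cov = doc_coverage(manifest_nodes)
```

Tracked per sprint, the two fractions give a trend line that needs no manual counting — exactly the kind of low-effort baseline the section below argues for.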

What to Be Honest About

These metrics show direction, not precision. Sprint load varies, task complexity varies, team composition changes. Isolating AI's contribution from other factors (experience growth, tooling improvements, project phase) is genuinely hard.

Present signals as evidence of a trend, not proof of a specific improvement factor. "We observe a consistent reduction in cycle time and data incident rate over three sprints since adopting AI-assisted workflows" is credible. "AI saves us exactly 3.2 hours per model" is not — unless you have detailed task-level data behind it.

Start tracking before AI is fully embedded. The baseline you build now is the only one that supports an honest before-and-after comparison.


Part II — Quick Start

Introduction

You've read Part I. You understand where AI helps, where it doesn't, and what to watch for. Now you have a real data engineering task in front of you and one question: "What do I actually do, step by step?"

This Quick Start answers that. It gives you a concrete, repeatable workflow for agentic data engineering — one that works well as a starting point and that you can adapt as you build confidence.

The toolset for this Quick Start:

  • Claude Code — Anthropic's terminal-native agentic coding assistant
  • dbt + Snowflake — transformation framework and warehouse
  • GitHub — version control and collaboration
  • Airflow — referenced for orchestration context, but the Quick Start focuses on dbt model development as the most common entry point

This isn't the only way to work. But it's a concrete path you can follow today. Once you're comfortable with it, swap components, add spec-driven frameworks (see Part I, Section 4 of the SE playbook), or integrate other tools.

The setup steps are specific to this stack, but the underlying strategy applies broadly. Whether you're working with Databricks notebooks, a BigQuery + Dataform project, a Spark-based pipeline, or a different orchestrator entirely — the same principles hold: load structured context, plan before implementing, review AI output as you would a junior engineer's PR. The commands and config syntax change; the workflow doesn't.

Using a different CLI agent? The workflow in this Quick Start translates directly to other terminal-based agents — Codex CLI, Gemini CLI, Aider, and others follow the same pattern: load context, plan, implement, review. The tool-specific differences are in configuration syntax, not in how you work. CLAUDE.md is Claude Code's project instructions file; the equivalent in other tools is AGENTS.md (an emerging cross-tool convention), .aider.conf, or similar. Whatever the filename, the content is the same: project structure, conventions, MCP connections, and what the agent should and shouldn't do.

Reference: For Claude Code installation, authentication, and general configuration — see the [AI Tooling Playbook]. This Quick Start covers only DE-specific setup.

Prerequisites: Claude Pro, Max, or Team subscription (or Anthropic Console account), Git and GitHub set up, Node.js 18+ installed, dbt CLI configured for your project, access to a Snowflake development environment.


Step 1 — Prepare Your Data Project for AI

1.1 Install and Configure Claude Code

Follow the installation guide in the [AI Tooling Playbook] for base setup. Once installed, launch Claude Code from your dbt project root:

cd ~/projects/your-dbt-project
claude

Grant DE-specific permissions. Claude Code asks for approval before running commands. For a dbt project, you'll want to pre-approve the commands you run frequently (this uses Claude Code's permission syntax — other agents have equivalent configuration):

/permissions
Bash(dbt *)           — allow all dbt commands
Bash(sqlfluff *)      — allow SQL linting
Edit(/models/**)      — allow edits within the models directory
Edit(/tests/**)       — allow edits within the tests directory
Edit(/macros/**)      — allow edits within the macros directory

Keep approvals scoped to your project directories. Don't grant blanket write access — the guardrails exist for good reasons, especially when working with client code.

1.2 Create Your CLAUDE.md

The CLAUDE.md file tells the agent about your project. It's loaded into every session automatically, so keep it concise and specific.

Bootstrap it with Claude Code:

> /init

Review the generated file, then refine it. A good CLAUDE.md for a dbt project covers:

# Example CLAUDE.md for a dbt + Snowflake project

## Stack
- dbt Core [your version], Snowflake adapter
- Snowflake (Enterprise edition), database: ANALYTICS
- Airflow 2.x for orchestration (DAGs in separate repo)
- SQLFluff for linting (Snowflake dialect)

## Project Structure
- models/staging/ — one subdirectory per source system, prefix `stg_`
- models/intermediate/ — shared transformations, prefix `int_`
- models/marts/ — business-facing models, prefix `fct_` or `dim_`
- tests/singular/ — custom business rule tests
- macros/ — project-specific Jinja macros

## SQL Conventions
- CTEs over subqueries, always
- One model per file
- Model names match file names
- snake_case for all identifiers
- Explicit column lists in SELECT (no SELECT *)
- COALESCE for NULL handling — never rely on implicit NULL behaviour
- Timestamps as TIMESTAMP_NTZ in UTC, suffixed `_at`
- Booleans prefixed with `is_` or `has_`
- Money as DECIMAL(12,2), converted from cents at the staging layer

## dbt Conventions
- Incremental models use merge strategy with `unique_key`
- All models have schema YAML with at minimum: unique + not_null on PK
- Staging models: rename, cast, filter test data — no business logic
- Sources defined in `_sources.yml` per source system

## Key Commands
- `dbt compile` — validate SQL without running
- `dbt run -s <model>` — run a specific model
- `dbt test -s <model>` — test a specific model
- `dbt run && dbt test` — full build + test
- `sqlfluff lint models/` — lint all models
- `sqlfluff fix models/<file>` — auto-fix a specific file

What not to include: Don't explain dbt concepts, SQL fundamentals, or Snowflake features. The agent knows those. State your preferences and constraints — the things that make your project different from defaults.

1.3 Provide Schema Context

The harder part of DE-specific AI setup: giving the agent enough schema and metric knowledge to generate correct SQL without pasting thousands of lines of DDL every session. The right approach depends on what your project has in place.

Preferred: semantic layer + MCP. If your project uses a semantic layer (dbt Semantic Layer with MetricFlow, Snowflake Cortex Analyst, or Databricks Unity Catalog), this is the best source of context for the agent. Metric definitions, grain, dimensions, and relationships are already codified — the agent uses governed definitions rather than guessing how revenue is calculated. Connect the agent to it via an MCP server so it can query metadata on demand.

Even without a full semantic layer, a database MCP server connected to your development warehouse lets the agent query information_schema, inspect table structures, check row counts, and explore relationships live. No manual maintenance, always current. See Section 3.2 for setup and security configuration — use a read-only service account against dev/staging only, never production.

The dbt manifest and catalog JSON files are another strong option: they give the agent model definitions, column descriptions, lineage, and test coverage without any live database connection.

Fallback: manual schema summary files. If none of the above are set up yet, create a docs/schema-context.md that describes your key tables:

## Source: raw.stripe

### raw.stripe.charges
- Grain: one row per charge attempt
- Primary key: id (varchar)
- Key columns: amount (integer, cents), currency (varchar),
  status (varchar: succeeded/failed/pending),
  created (integer, Unix epoch UTC)
- Volume: ~500K rows/day
- Refresh: real-time via Fivetran, typical latency <5 minutes

### raw.stripe.refunds
- Grain: one row per refund
- Primary key: id (varchar)
- Foreign key: charge_id → charges.id (one refund per charge)
- Key columns: amount (integer, cents), reason (varchar, nullable),
  created (integer, Unix epoch UTC)

Cover the tables the agent will work with most — it doesn't need to be exhaustive. Treat this as a stepping stone: manual files are faster to set up, but MCP or manifest-based approaches are more reliable and worth investing in as the engagement matures.

1.4 Set Up Linters and Formatters

You can configure SQLFluff for your Snowflake dialect and hook it into Claude Code's PostToolUse so every written SQL file gets formatted automatically:

{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write",
      "hooks": [{
        "type": "command",
        "command": "sqlfluff fix --dialect snowflake $CLAUDE_FILE_PATH"
      }]
    }]
  }
}

For Python files (macros, scripts), add a similar hook for ruff or black.


Step 2 — Plan Before You Build

You have a task: build a new mart model, add a data source, create a set of quality checks, or refactor an existing model. Before writing any SQL, plan.

2.1 Enter Plan Mode

Press Shift+Tab twice to enter plan mode. In this mode, Claude reads files and reasons about your project but won't edit anything or run commands.

2.2 Write Your Prompt

Give Claude the task with data-specific context. Be explicit about grain, business rules, and constraints.

Example prompt:

Build a mart model `fct_customer_lifetime_value` that calculates 
lifetime value per customer.

Source tables (see schema-context.md):
- stg_stripe__charges (one row per charge, amount in dollars)
- stg_stripe__refunds (one row per refund, amount in dollars)
- dim_customers (one row per customer)

Business rules:
- LTV = total charges minus total refunds per customer
- Include only charges with status = 'succeeded'
- Include only customers with at least one succeeded charge
- Use charge `completed_at` for time-based calculations
- Calculate: total_revenue, total_refunds, net_ltv, first_order_at, 
  last_order_at, total_order_count

Grain: one row per customer_id
Materialization: table (not incremental — the table is small enough)

Follow the patterns in `fct_orders.sql`. Clarify anything before 
proposing a plan.

2.3 Clarify Through Dialogue

Claude will likely ask about edge cases. Answer them:

Claude: "How should I handle customers whose only charges have 
         been fully refunded? Include them with net_ltv = 0, 
         or exclude them?"
You:    "Include them with net_ltv = 0. They're still customers."

Claude: "Should refunds be capped at the charge amount, or can a 
         refund exceed the original charge (e.g., goodwill credits)?"
You:    "Refunds can exceed charge amount — don't cap."

2.4 Review and Save the Plan

Once the plan looks right, ask Claude to save it:

> Save this plan as docs/plans/fct-customer-ltv-plan.md

Data-specific plan review checklist:

  • Does the plan address the correct grain? (one row per customer, not per order)
  • Are joins specified with the right cardinality? (charges to refunds is one-to-many)
  • Is the materialisation appropriate for the data volume?
  • Are there assumptions about data freshness or completeness that need validating?
  • Are edge cases explicitly addressed (NULLs, zero values, negative amounts)?

Step 3 — Implement with the Agent

3.1 Switch to Implementation Mode

Press Escape or Shift+Tab to return to normal mode. If context is getting heavy from the planning dialogue, start a fresh session:

claude
> Read docs/plans/fct-customer-ltv-plan.md and implement it step by step.
  After each step, run dbt compile and dbt test. Commit each completed 
  step with a descriptive message.

3.2 Work in Small Steps

Don't let the agent build everything at once. A typical progression:

Step 1: Create the model SQL. Claude writes fct_customer_lifetime_value.sql following the plan and the pattern from fct_orders.sql. Run dbt compile to verify syntax. Review the SQL: check join logic, filter conditions, aggregation grain.

Step 2: Run and validate. Run dbt run -s fct_customer_lifetime_value. Check the results against known-good data — compare total revenue to a manual calculation, spot-check a few customers, verify row count matches your expectation for the customer base size.

Step 3: Add tests. Claude generates the YAML schema file with tests: uniqueness on customer_id, not_null on key columns, a singular test verifying that net_ltv = total_revenue - total_refunds for a sample of customers. Run dbt test -s fct_customer_lifetime_value.

Step 4: Add documentation. Claude generates column descriptions in the YAML file. Review them — replace generic descriptions with business-meaningful ones.

Step 5: Commit. Atomic commit with a descriptive message.

3.3 Validate Against Known-Good Data

This step is worth emphasising because it's the one most people skip. After the model runs successfully and tests pass, compare the output against an independent source of truth:

  • Does total revenue across all customers match the Stripe dashboard (approximately)?
  • Pick 3–5 specific customers and verify their LTV manually
  • Are there customers with negative LTV? Is that expected?
  • Does the row count match the number of customers with at least one succeeded charge?

If there's no existing baseline, build a quick one — even a few manual spot checks provide confidence that the AI-generated SQL is correct.
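The checks above can be sketched as one aggregate query, assuming the column names from this example (`total_revenue`, `net_ltv`) and Snowflake's `count_if`:

```sql
-- Aggregate sanity checks to compare against the Stripe dashboard
select
    count(*)              as customer_count,     -- matches customers with >= 1 charge?
    sum(total_revenue)    as total_revenue,      -- approximately matches the dashboard?
    count_if(net_ltv < 0) as negative_ltv_count  -- expected, or a sign bug?
from {{ ref('fct_customer_lifetime_value') }}
```

Save queries like this next to the model; they become the baseline for the next change.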


Step 4 — Review and Finalize

4.1 Run In-Session Review

Before creating a PR, run multiple review passes:

Pass 1: Ask the agent to review the changes for general correctness, style, and potential issues. Once you settle on a prompt that works well for your project, turn it into a custom slash command (.claude/commands/code-review.md) so the team can reuse it consistently — /code-review is a common, intuitive name for this command.

Pass 2: If you have a project-specific review skill, run it. This catches violations of conventions that the general review doesn't know about.

Pass 3 (for complex models): Ask specifically about performance: "Review this model for performance at production scale. Check for full table scans, missing cluster key utilisation, and inefficient join patterns for Snowflake."

4.2 Run the Full Test Suite

dbt test

Not just the new model's tests — the full suite. Verify that existing models still pass. Regressions in unrelated models are a common AI pitfall when changes touch shared macros or staging models.

4.3 Create the Pull Request

> Create a pull request for the customer LTV model. Reference the 
  plan in docs/plans/. Include a summary of business rules implemented 
  and how the output was validated.

A good data engineering PR description includes: what the model does (business purpose, not just technical description), what data it affects, how it was validated, and any assumptions or limitations.

4.4 Address Review Feedback

If reviewers leave comments — human or CI-based — address them within Claude Code:

> Read the review comments on PR #42 and address each one. 
  Run dbt test after each fix.

Step 5 — Evolve Your Workflow

Once the basic workflow feels natural, build on it.

Custom slash commands for recurring patterns. If you frequently create new staging models, create a command that generates the SQL, YAML, and tests in one pass:

<!-- .claude/commands/new-staging-model.md -->
Create a staging model for the specified source table.

Steps:
1. Generate the staging SQL: rename to snake_case, cast types, 
   filter test data, add _loaded_at metadata
2. Generate the YAML schema with not_null and unique on PK, 
   and column descriptions
3. Add source freshness definition if not already present
4. Run dbt compile and dbt test
5. Follow the patterns in the most recent staging model in the 
   same source directory

Source table: $ARGUMENTS

Invoke with: /new-staging-model raw.stripe.disputes

Build skills for project-specific workflows. A staging model skill, a migration validation skill, a documentation audit skill — each encoding your team's exact conventions so the output is consistent from the first run. These are most valuable for workflows that have no equivalent in officially available tooling and would otherwise require repeating the same detailed prompt every time.

Learn from mistakes. After a session where the agent struggled and eventually found the right approach, open a second session: "In another session we worked on X and the agent had trouble with Y. Find that session, analyse what went wrong, and update CLAUDE.md or create a skill to prevent the same mistake next time."

Scale to the team. Once your personal workflow is solid, make it a team asset. Check your CLAUDE.md and custom skills into the repo so everyone benefits from the same context and conventions. Agree on which skills are team-standard versus personal experiments. When reviewing AI-assisted PRs, focus on the same things you always would — grain, joins, business logic — but also check whether the AI was given adequate context. A PR where the model is structurally correct but uses the wrong source table often means the CLAUDE.md is missing a key convention, not that the author was careless.


Appendix


A. Model Selection for DE Tasks

Match the AI model tier to the task complexity (as of March 2026). Model deprecation cycles are now measured in months — verify current availability before selecting. Use the Coding and Agentic indexes to compare specific models — don't rely on headline rankings.

📖 Reference: For detailed model selection strategy, the quality-vs-price scatter plot approach, and provider comparison — see the [AI Tooling Playbook].

| Task Type | Optimise For | Recommended Tier | Examples |
| --- | --- | --- | --- |
| Complex migration analysis, cross-system lineage, architecture review, multi-step debugging | Quality and depth — cost is secondary | Frontier | Claude Opus 4, GPT-4.1, Gemini 2.5 Pro |
| dbt model generation, SQL refactoring, Airflow DAG scaffolding, quality rule generation, code review | Quality-to-cost ratio | Mid-tier | Claude Sonnet 4, GPT-4.1-mini, Gemini 2.5 Flash |
| YAML generation, documentation, column descriptions, simple SQL formatting, boilerplate | Speed and price — quality threshold, not maximum | Lightweight | Claude Haiku 4, GPT-4.1-nano, Gemini 2.0 Flash-Lite |

For most daily data engineering work, a mid-tier model is the right choice. Reserve frontier models for tasks where depth and reasoning matter — complex planning, migration analysis, or debugging subtle data issues across multiple models.


B. Context File Templates (CLAUDE.md / AGENTS.md)

Copy, paste, and adapt to your project. Delete what doesn't apply, add what's missing. Aim for under 100 lines — every line costs tokens on every session. The filename depends on your agent: CLAUDE.md for Claude Code, AGENTS.md for the cross-tool convention, CONVENTIONS.md for Aider.

dbt + Snowflake project

# [Project Name]

## Stack
- dbt Core [version], Snowflake adapter
- Snowflake [edition], database: [DATABASE_NAME]
- Airflow [version] for orchestration
- SQLFluff for linting (Snowflake dialect)
- Python 3.11 for macros and scripts

## Project Structure
- models/staging/[source]/ — one subdirectory per source, prefix `stg_`
- models/intermediate/ — shared transformations, prefix `int_`
- models/marts/ — business-facing models, prefix `fct_` or `dim_`
- tests/singular/ — custom business rule tests
- macros/ — project-specific Jinja macros
- docs/schema-context.md — key table descriptions and grain
- docs/business-rules.md — metric definitions and edge cases

## SQL Conventions
- CTEs over subqueries, always
- One model per file, model name matches filename
- snake_case for all identifiers
- Explicit column lists (no SELECT *)
- COALESCE for NULL handling
- Timestamps: TIMESTAMP_NTZ in UTC, suffix `_at`
- Booleans: prefix `is_` or `has_`
- Money: DECIMAL(12,2), convert from cents at staging layer
- Dates: DATE type, suffix `_date`
- IDs: VARCHAR, suffix `_id`

## dbt Conventions
- Incremental models: merge strategy, define `unique_key`
- All models: schema YAML with unique + not_null on PK at minimum
- Staging: rename, cast, filter test data — no business logic
- Sources: defined in `_sources.yml` per source system
- Freshness: defined for all sources, warn 12h / error 24h default
- Tags: `daily`, `hourly`, or `manual` on every model

## Testing
- Generic tests on every model PK (unique, not_null)
- Relationship tests for all foreign keys
- Singular tests for business rules (see tests/singular/)
- Run: `dbt test -s <model>` after each model change
- Run: `dbt test` before opening a PR

## Key Commands
- `dbt compile` — validate SQL
- `dbt run -s <model>+` — run model and downstream
- `dbt test -s <model>` — test specific model
- `dbt run && dbt test` — full build + test
- `sqlfluff lint models/` — lint all
- `sqlfluff fix <file>` — auto-fix one file

## What NOT to Do
- No SELECT * in any model
- No hardcoded dates or IDs
- No new packages without team discussion
- No direct references to raw tables from marts — always go through staging
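As one concrete instance of the conventions above, a staging model for a hypothetical Stripe charges source might look like this; the table and column names are illustrative, not prescriptive:

```sql
-- models/staging/stripe/stg_stripe__charges.sql (hypothetical source)
with renamed as (
    select
        id::varchar                    as charge_id,
        customer::varchar              as customer_id,
        (amount / 100)::decimal(12, 2) as amount,         -- convert from cents
        status::varchar                as charge_status,
        created::timestamp_ntz         as created_at,     -- UTC, suffix _at
        _loaded_at
    from {{ source('stripe', 'charges') }}
    where livemode = true                                 -- filter test data
)

select
    charge_id,
    customer_id,
    amount,
    charge_status,
    created_at,
    _loaded_at
from renamed
```

A worked example like this in the repo gives the agent a pattern to imitate, which is usually more effective than the rules alone.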

Airflow project

# [Project Name] Airflow

## Stack
- Apache Airflow [version] on [infrastructure: MWAA / Composer / self-hosted]
- Python 3.11
- Snowflake as primary data warehouse
- dbt triggered via BashOperator / dbt Cloud API operator

## Project Structure
- dags/ — one file per DAG, named `dag_[domain]_[purpose].py`
- dags/common/ — shared utilities, operators, sensors
- plugins/ — custom operators and hooks
- tests/ — DAG validation and unit tests
- config/ — environment-specific configuration

## DAG Conventions
- DAG ID matches filename (without `dag_` prefix)
- Schedule: use cron expressions, never `@daily` shorthand
- Timezone: all schedules in UTC
- Default args: retries=2, retry_delay=300s, email_on_failure=True
- Owner: team name, not individual
- Tags: domain tag + frequency tag on every DAG
- Catchup: False unless explicitly needed for backfill

## Operator Preferences
- SnowflakeOperator: use the project wrapper in `common/operators.py`
- BashOperator for dbt CLI commands
- PythonOperator only when Snowflake/Bash won't work
- No SubDagOperator — use TaskGroup instead

## Connections and Variables
- Connection names: `[system]_[environment]` (e.g., `snowflake_prod`)
- Variables: `[domain]_[name]` (e.g., `payments_api_batch_size`)
- Never hardcode credentials, connection strings, or environment-specific values

## Error Handling
- All DAGs must have `on_failure_callback` pointing to Slack alerting
- Critical DAGs: add SLA miss detection
- Failed tasks: alert to #data-alerts channel

## Testing
- DAG validation: `python -m pytest tests/test_dag_integrity.py`
- Task unit tests for custom operators
- Run all tests before opening a PR

## Key Commands
- `airflow dags test [dag_id] [date]` — test a DAG locally
- `python -m pytest tests/` — run all tests
- `ruff check dags/` — lint all DAG files

C. Prompt Library

Tested prompts for common data engineering tasks. Each includes the prompt, required context, and what to watch for in the output.

Generate a staging model

Create a staging model `stg_[source]__[table]` from `raw.[schema].[table]`.
Follow the pattern in [existing staging model path].
Requirements: [rename rules, type casts, filters, metadata columns].
Generate the YAML schema file with appropriate tests.

Context: Source DDL or schema context, existing staging model as pattern. Watch for: Missing type casts, wrong timestamp handling, missing _loaded_at.

Translate SQL dialect

Translate this [source dialect] SQL to [target dialect].
Preserve exact semantics — same results, same edge case behaviour.
Flag any constructs where the translation might change behaviour
(NULL handling, implicit casting, date arithmetic).

[paste SQL]

Context: Source SQL, any known dialect-specific gotchas. Watch for: Semantic differences. Run source-to-target reconciliation on every translated query.

Generate business rule tests

Generate dbt tests for `[model_name]` covering these business rules:
1. [Rule 1 — be specific about expected behaviour]
2. [Rule 2]
3. [Rule 3]

Use generic tests where possible. Write singular tests for rules
that can't be expressed generically. Follow the patterns in tests/singular/.

Context: Model SQL, business rules documentation, existing test examples. Watch for: Tests that verify structure but not behaviour. Intentionally break the model and verify tests catch it.

Test coverage gap analysis

Analyse this dbt project for test coverage gaps.
For each model, report: number of generic tests, number of singular tests,
whether relationships tests exist for foreign keys, whether source
freshness is defined. Flag models with no business rule tests.
Prioritise gaps by model criticality (number of downstream dependents).

Context: Full dbt project access. Watch for: The output is a gap list, not code. You decide priorities.

Project onboarding summary

I'm new to this project. Analyse the dbt project and provide:
1. Overall structure: sources, staging, intermediate, mart counts
2. Key mart models (most downstream dependents)
3. Testing strategy: what's tested, what patterns are used, major gaps
4. Documentation status: what's documented, what isn't
5. Key data flows: how does raw data become the main mart models?

Context: Full project access, CLAUDE.md, schema context. Watch for: Missing institutional knowledge. Supplement with human onboarding.

Generate reconciliation queries

Generate source-to-target reconciliation queries for the migrated
[object_name]. Compare:
- Row counts
- Key aggregate values: [list specific metrics]
- Sample row-level comparison on [key columns]

Source: [source system/dialect], Target: [target system/dialect]

Context: Source and target schemas, key columns, business-critical metrics. Watch for: Missing edge case validations. Row counts can match while data values differ.
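A sketch of the kind of query this prompt should produce, with placeholder table, schema, and column names:

```sql
-- Row counts and a key aggregate, side by side (all names are placeholders)
select 'source' as side, count(*) as row_count, sum(amount) as total_amount
from source_db.legacy.orders
union all
select 'target' as side, count(*) as row_count, sum(amount) as total_amount
from analytics.marts.fct_orders
```

Matching counts and aggregates are necessary but not sufficient; always follow up with the row-level sample comparison on key columns.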

Pipeline failure diagnosis

This dbt model failed during the nightly run. Here's the error:
[paste error]

And here's the model SQL:
[paste SQL or reference model path]

The model ran successfully yesterday. What likely changed?
Check for schema changes, data type issues, NULL handling,
or upstream data quality problems.

Context: Error output, model SQL, any recent changes to upstream models. Watch for: AI may suggest the most common cause, not the actual cause. Verify its hypothesis before applying the fix.


D. Notebook / Jupyter Quick Start

This appendix supplements the main Quick Start for data engineers working primarily in Databricks or Jupyter rather than a file-based dbt project. The thinking process is the same — plan, implement in small steps, validate — but the mechanics differ. This section covers where notebook environments diverge; the main Quick Start has the full workflow.

Environment setup

  • Databricks — Databricks Assistant is built in and has Unity Catalog access. It already knows your table schemas. For structured work that needs review and version control, use Databricks-Git integration (Repos) to sync notebooks to a repo, then apply the main Quick Start workflow for the production version.
  • Jupyter (local or JupyterHub) — install jupyter-ai for in-notebook AI, or use a side-window chat (Claude.ai, ChatGPT) with manual copy-paste. VS Code with the Jupyter extension gives you Copilot and agent mode natively in .ipynb files.
  • Google Colab — Gemini is integrated by default.

The workflow

Load context first (paste schema summary or CLAUDE.md content into the session), ask for a plan before writing code, implement and validate cell by cell. Each cell is a checkpoint — if output looks wrong, stop and investigate before moving on.

Promote to production

Notebook logic should graduate to version-controlled files before going to production. Once validated, use AI to translate, e.g.:

Convert the transformation logic in this notebook into a dbt model
following our project conventions (see CLAUDE.md). Name it
`[model_name]`, include tests and documentation, handle incremental loading.

E. Data Anonymisation Libraries

When you need realistic but safe data for debugging, testing, or AI context, these libraries cover most scenarios.

Fake data generation

  • Faker (Python) — generate realistic names, emails, addresses, dates, IDs, and custom providers. The standard choice for seeding test datasets and replacing PII in sample data.
  • Mimesis — faster than Faker for bulk generation; useful for large synthetic datasets with locale-specific data.

PII detection and anonymisation

  • Microsoft Presidio — detect and anonymise PII in text and structured data. Supports custom recognisers, configurable anonymisation operators (redact, replace, hash, encrypt), and integrates with pandas DataFrames. Practical for automating sanitisation of query results or log outputs before sharing with AI tools.

Synthetic tabular data

  • SDV (Synthetic Data Vault) — generate synthetic tabular data that preserves statistical properties, distributions, and referential integrity across related tables. Useful when you need sample data that behaves like production (realistic distributions, correlated columns, foreign key relationships) rather than just structurally valid rows.

In-warehouse anonymisation

Most data platforms also provide native functions for hashing, masking, and tokenisation — Snowflake's SHA2/MD5, BigQuery's AEAD encryption functions, and Databricks Unity Catalog's column masking policies. These are preferable when the data never needs to leave the warehouse.
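For example, a quick in-warehouse hash in Snowflake before extracting a sample to share as AI context (table and column names assumed):

```sql
-- One-way hash of the direct identifier; the rest of the row stays usable
select
    sha2(email, 256) as email_hash,  -- irreversible, but stable for joins across tables
    order_total,
    created_at
from analytics.marts.dim_customers
limit 100
```

Note that hashing alone may not satisfy strict anonymisation requirements: hashed values remain linkable, and other columns can still be identifying in combination.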


F. Skills and MCP Servers from Official Providers

New official integrations are shipping monthly — most major data platforms now have or are building MCP servers, and the skills ecosystem is growing faster than any static list can track. Treat this as a starting point: check the provider's official docs and GitHub for the current state before setting anything up. The VoltAgent/awesome-agent-skills and mcpservers.org directories are useful for finding integrations not listed here.

See Section 3.2 for guidance on trusting MCP servers and skills before installing.

Skills

| Name | Provider | Capabilities |
| --- | --- | --- |
| data-engineering-skills | Altimate AI | Skills for dbt and Snowflake: model creation, query optimisation, test generation. Benchmarked on real-world dbt tasks (ADE-bench). |
| astronomer/agents — skills | Astronomer | Airflow-specific skills covering DAG patterns, testing strategies, and debugging workflows. Works with Claude Code, Cursor, Copilot, and 25+ other agents. |
| airbytehq/skills | Airbyte | Skills giving agents access to 21+ third-party APIs (Salesforce, HubSpot, Stripe, Jira, Slack, and more) for connector and ingestion workflows. |

MCP Servers

| Name | Provider | Capabilities |
| --- | --- | --- |
| dbt MCP Server | dbt Labs | Exposes dbt project metadata: lineage, model definitions, semantic layer metrics, and test results. Platform-agnostic — works with any supported warehouse. |
| astro-airflow-mcp | Astronomer | MCP server for Airflow: DAG status, task logs, trigger and pause/unpause, variable and connection management. Works with open-source Airflow 2.x and 3.x. |
| Databricks managed MCP | Databricks | Managed servers for Genie Space (NL-to-SQL over curated tables), Databricks SQL (query execution), Vector Search, and Unity Catalog Functions. Custom self-hosted servers also supported. |
| DataHub MCP | Acryl Data | Data catalog access: dataset discovery, lineage, usage patterns, and NL-to-SQL generation grounded in governed metadata. |
| Airbyte knowledge MCP | Airbyte | Connector knowledge base for AI agents. Separate PyAirbyte MCP manages connectors and sync operations across hundreds of sources without custom integration code. |
| BigQuery MCP | Google Cloud | Fully managed remote MCP server for BigQuery: SQL execution, dataset analysis, and reporting. Preview; auto-enabled with BigQuery since March 2026. |

A Note on Where We Are

This playbook is a snapshot, not a finished map. The tooling and practices are evolving fast — what's here reflects what works now. Experiment, share what you learn with the team, and update this document as things change.

Core Principles for Working with AI: Read more