AI Usage Playbook — Data Engineering
How This Playbook Is Organized
Part I — Tools, Techniques, and Landscape covers the AI-assisted data engineering ecosystem: a capability map showing where AI is reliable and where it isn't, detailed workflows for each stage of the data engineering lifecycle, DE-specific tooling, privacy and data governance, and common pitfalls. Think of Part I as the reference library you come back to when you need to evaluate a specific use case, set up tooling for a new engagement, or make an informed recommendation to a client.
Part II — Quick Start is a hands-on, step-by-step guide for working with an AI agent on a real data engineering task. It walks you through a concrete workflow — from preparing your dbt project for AI all the way to committing reviewed, tested code — using Claude Code as the primary tool, dbt + Snowflake as the reference stack, and GitHub for version control. A separate Jupyter/notebook entry point covers the workflow for engineers who work primarily in Databricks or notebook environments. The goal: give you one clear path you can follow today, then adapt as you build confidence.
Before both parts, the Key Principles section establishes the operating model that everything else builds on. It's short and it frames every decision in the playbook. After Part II, the Appendix collects reference material — context file templates, a prompt library, and tool directories — to look up as needed.
Start with the Key Principles, then skim Part I to orient yourself. Move to Part II (Quick Start) when you have a real task in front of you.
What This Playbook Assumes
You're a data engineer who already knows the craft. You understand dbt, SQL, orchestration, and warehouse platforms. This playbook won't explain what a CTE is or how Airflow scheduling works — it assumes that knowledge and focuses entirely on how AI changes the way you do the work you already know how to do.
📖 Reference: You've read the Core Principles for Working with AI. We won't repeat them here. Where a principle is particularly relevant to a section, we'll call it out briefly and link back.
📖 Reference: For AI model selection, coding assistant profiles, cost optimization, MCP fundamentals, and the agent skills standard, see the AI Common Playbook. This DE playbook covers only what's specific to data engineering work; general tooling guidance lives there.
Reference Stack
Examples throughout use dbt + Snowflake as the default stack, with Airflow referenced for orchestration context. Where BigQuery or Databricks behave meaningfully differently, we note it. The concepts and workflows apply regardless of your specific platform — the stack choice keeps examples concrete rather than abstract.
Working with AI in Data Engineering — Key Principles
AI generates. You decide. That division of labour runs through everything in this playbook — whether you're scaffolding a staging model, defining a metric, or reviewing a pipeline an agent wrote end to end.
Use AI Effectively
Provide structured, DE-specific context. Schema definitions, metric specs, grain decisions, and edge case documentation produce far more reliable output than vague prompts. Use MCP servers to give agents direct access to metadata rather than pasting DDL manually. Organise context files in layers — org-wide SQL conventions, project-specific dbt standards, session-only task input — and version them as team assets.
Boilerplate goes to the agent. Business logic stays with you. Staging models, YAML schemas, DAG scaffolding, documentation, legacy SQL conversions — agents handle these well. Complex business logic, grain decisions, metric definitions, and edge cases require your judgment. The Capability Map breaks this down task by task.
Let AI draft the spec, not just the code. Have the agent produce a data contract before implementation. Let it ask questions when it hits ambiguity — that's where hidden assumptions surface. (Section 2.2 walks through this workflow.)
Use AI to cross-reference and verify. Agents can trace lineage on column renames, spot conflicting metric definitions, and scan for test coverage gaps — work you'd otherwise do manually. (Section 2.8 covers these workflows.)
Maintain Human Oversight
📖 For general agent security, access control, and the "treat AI output like a junior engineer's PR" principle — see the AI Common Playbook and the Core Principles. What follows is DE-specific.
Validate every AI-generated model. Check grain, join logic, filters, aggregation, and naming. A structurally correct model with the wrong grain can pass every test and still produce silently wrong numbers.
Metric definitions don't get delegated. What "revenue" means, how returns are handled, which cost column to use — these are human decisions. Getting a metric wrong compounds across every downstream consumer.
Resolve ambiguity before you prompt. When requirements are shifting or business logic is undocumented, AI will confidently generate something anyway. If the humans haven't agreed on what the model should do, AI can't fill that gap.
Part I — Tools, Techniques, and Landscape
1. AI-Assisted DE Capability Map
This section is a quick reference. When you have a specific task and want to know whether AI can help — and how much to trust the output — scan the relevant table below.
The workflow sections cover the detailed how-to. This section is the lookup table — set expectations before you start.
How to read the reliability ratings:
- High — AI produces usable output with standard context. You still review, but the output typically needs minor adjustments rather than rewrites.
- Medium — AI produces a solid draft that needs meaningful human review and editing. Useful as a starting point, not a finished product.
- Low — AI output requires heavy validation or serves only as a rough scaffold. The value is in acceleration, not accuracy. Expect to rewrite significant portions.
SQL and Transformations
| Task | AI Reliability | Context Needed | Watch For |
|---|---|---|---|
| Generate staging model from source DDL | High | Source schema, naming conventions, example staging model | Column type mapping, null handling, casting conventions |
| Generate intermediate/mart models from defined spec | Medium-High | Source models, grain definition, metric specs, join relationships | Grain errors, wrong join type, missing filters |
| Write complex business logic (revenue calc, attribution, cohort analysis) | Low — draft only | Metric definitions, edge case decisions, example calculations, business rules doc | Wrong assumptions about business rules, silent grain errors, incorrect handling of NULLs and edge cases |
| Refactor legacy SQL (CTEs, readability) | High | The SQL itself, style conventions | Semantic changes disguised as refactors — verify output matches original results |
| Optimize query performance | Medium | Query profile/explain plan, table stats, platform (Snowflake vs. BigQuery vs. Databricks — advice diverges) | Platform-specific recommendations that don't apply to your engine, optimisations that change result semantics |
| Translate SQL dialect — DML and DDL | Medium-High | Source SQL, target dialect docs, known incompatibilities list | Semantic differences masked by syntactic similarity (e.g., NULL handling, implicit casting, date arithmetic) |
| Translate SQL dialect — procedural code (stored procs, cursors, dynamic SQL, PL/SQL packages) | Medium | Source code, target platform procedural capabilities, known construct gaps | Procedural logic, error handling patterns, and control flow don't map cleanly across platforms — expect significant manual rework |
| Generate dbt incremental model logic | Medium | Source change patterns, volume, SLA, merge keys, target platform | Wrong merge strategy, missing handling for late-arriving records, delete detection gaps |
Pipeline and Orchestration
| Task | AI Reliability | Context Needed | Watch For |
|---|---|---|---|
| Scaffold Airflow DAG from description | High | DAG conventions, operator preferences, connection names, schedule requirements | Hardcoded values, missing error handling, wrong schedule expression, wrong retry config (too few, too short delay) |
| Generate dbt YAML (sources, schema tests) | High | dbt project structure, existing YAML examples, column descriptions | Generic descriptions that don't reflect actual semantics — "The ID column" is useless documentation |
| Design incremental loading strategy | Low | Source change patterns, volume, SLA, target platform capabilities | Assumptions about source reliability, missing late-arrival handling, no backfill strategy |
| Write infrastructure-as-code (Terraform, Pulumi) for data platform resources | Medium-High | Existing IaC patterns, provider versions, naming conventions, security policies | Missing security configurations, overly permissive IAM policies, hardcoded values that should be variables |
| Generate CI/CD pipeline configuration for dbt | Medium-High | Existing CI/CD patterns, deployment strategy, environment structure | Missing environment-specific configs, wrong job dependencies, missing dbt test step |
Data Quality and Testing
| Task | AI Reliability | Context Needed | Watch For |
|---|---|---|---|
| Generate dbt generic tests (not_null, unique, relationships, accepted_values) | High | Schema, known constraints, existing test patterns | Missing tests on critical columns, overly broad accepted_values lists |
| Generate dbt singular tests for business rules | Medium | Business rule documentation, example scenarios, edge cases | Tests that verify structure but not semantics — a test can pass and still not catch the actual bug |
| Write Great Expectations / Soda checks | Medium | Schema, data profile, known data quality issues, check format examples | Generic thresholds that don't reflect actual data distribution, missing row-level checks |
| Generate test data respecting referential integrity | Medium | Schema with foreign keys, cardinality expectations, realistic value ranges | Synthetic data that's technically valid but unrealistic — all customers from the same country, all orders placed today |
| Identify test coverage gaps | Medium-High | dbt project manifest, existing test files, critical model list | May miss domain-specific tests that aren't inferrable from schema alone |
| Design anomaly detection rules | Low-Medium | Historical data patterns, known seasonality, alerting thresholds, false positive tolerance | Rules that are either too sensitive (alert fatigue) or too loose (miss real issues) |
Documentation and Knowledge Management
| Task | AI Reliability | Context Needed | Watch For |
|---|---|---|---|
| Generate dbt model descriptions from SQL | Medium-High | The model SQL, upstream model descriptions, business context | Descriptions of what the SQL does rather than what the data means — "Joins orders with customers" vs. "All completed orders enriched with customer segment for marketing attribution" |
| Generate column-level descriptions | Medium | Table context, business glossary, example descriptions | Generic filler ("The created_at timestamp") instead of useful context ("UTC timestamp when the order was placed; populated by the checkout service") |
| Create data dictionary from schema | Medium | Schema DDL, business context, existing documentation | Accurate structure, shallow semantics — the dictionary will describe types and nullability well but miss business meaning |
| Reverse-engineer intent from undocumented SQL | Medium | The SQL, upstream/downstream context, any available comments | Plausible but wrong interpretation of intent — AI guesses what the author meant based on patterns, not actual knowledge |
| Write pipeline runbooks for failure scenarios | Medium | Common failure patterns, alerting setup, environment details, escalation paths | Missing environment-specific steps, generic troubleshooting that doesn't account for your actual infrastructure |
| Generate lineage documentation from code | Medium-High | dbt project graph, model SQL, source definitions | Column-level lineage is still unreliable — model-level lineage is more trustworthy |
| Onboarding summaries for new team members | Medium-High | Full project access (dbt project, CLAUDE.md, schema context) | Will miss institutional knowledge: why things are the way they are, historical decisions, known quirks |
Migration and Modernization
| Task | AI Reliability | Context Needed | Watch For |
|---|---|---|---|
| Translate SQL between dialects at scale | Medium-High | Source SQL corpus, dialect mapping rules, known incompatibilities | Syntactically correct but semantically different translations — test every migrated query against source results |
| Reverse-engineer ETL tool logic (Informatica, SSIS, Talend) into code | Medium | ETL tool export/XML, target framework conventions, mapping documentation | Misinterpreted transformation logic, missed conditional branching, lost error handling |
| Generate schema mapping between source and target | Medium-High | Source schema, target schema, existing mappings, naming conventions | Incorrect assumptions about column equivalence based on name similarity |
| Plan migration phases and dependency order | Medium | Full source system inventory, dependency graph, team capacity constraints | Overly optimistic parallelism, missing dependencies that aren't explicit in metadata |
| Generate migration validation queries (source-to-target reconciliation) | High | Source and target schemas, key columns, aggregation logic | Missing edge case validations — reconciliation counts match but data content differs |
Streaming and Real-Time Pipelines
| Task | AI Reliability | Context Needed | Watch For |
|---|---|---|---|
| Scaffold Flink jobs and generate Flink SQL for aggregations, filtering, lookups | Medium-High | Flink version, project conventions, source/sink config, windowing requirements | Missing error handling, incorrect serialisation, wrong window boundaries and watermark behaviour — correct syntax with wrong semantics is common |
| Write Avro/Protobuf schemas for streaming contracts | High | Domain model, compatibility requirements, existing schemas | Missing backward/forward compatibility annotations, wrong default values |
| Translate batch SQL to streaming or design processing guarantees | Low | Batch SQL, streaming semantics, pipeline topology, sink capabilities, latency/accuracy trade-off decisions | Batch-to-streaming is not a syntax translation — a GROUP BY and a tumbling window aggregation behave fundamentally differently. Processing guarantees (exactly-once, at-least-once) are architecture decisions, not generation tasks |
2. AI-Assisted Data Engineering Workflows
Each stage of the data engineering lifecycle with practical guidance: what AI does well, what to watch for, and example prompts you can adapt to your project.
Every workflow assumes you've read the Key Principles — especially the operating model (AI generates, you decide) and the context strategy (structured input produces reliable output, vague prompts produce plausible garbage). The Capability Map gives you a quick-reference rating for each task; this section goes deeper into the mechanics.
2.1 Pipeline Development
The plumbing: scaffolding orchestration code, writing infrastructure-as-code, translating between platforms, and managing schema evolution. SQL transformation work lives in Section 2.2.
Scaffolding Airflow DAGs
AI handles DAG scaffolding well. Give it your conventions (operator preferences, connection naming, retry policies, alert routing) and a description of what the DAG should do, and the output is typically close to production-ready. The main review points are schedule expressions (AI often guesses wrong on cron syntax for business calendars), error handling (missing on_failure_callback or no retry config), and hardcoded values that should come from variables or connections.
Example prompt:
Create an Airflow DAG that:
- Runs daily at 06:00 UTC
- Extracts data from the payments API (use the `payments_api` connection)
- Loads raw JSON into Snowflake stage `raw.payments.daily_extract`
- Triggers a dbt run for the payments model group
- Sends a Slack notification on failure to #data-alerts
Follow the DAG conventions in CLAUDE.md. Use our standard
`SnowflakeOperator` wrapper, not the raw provider operator.
What to watch for: The output will look clean and follow the structure you asked for. Check that it actually uses your project's operator wrappers (not generic Airflow operators), that connection names match your connections.yaml, and that the schedule accounts for timezone and DST if relevant.
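To calibrate your review, here's a minimal hand-written sketch of the shape a good response takes. Everything in it is a hypothetical stand-in — the dag_id, callback, and task body are illustrative, and it uses a generic operator rather than the project wrapper the prompt asks for:

```python
# Illustrative sketch only — names and conventions are not your project's.
# Uses the Airflow 2.4+ `schedule` parameter (older versions: schedule_interval).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_slack(context):
    """Stub failure callback — replace with your #data-alerts notifier."""
    print(f"Task failed: {context['task_instance'].task_id}")


def extract_payments():
    """Stub extract step — replace with the payments API pull."""
    print("extracting payments")


with DAG(
    dag_id="payments_daily",
    schedule="0 6 * * *",          # 06:00 UTC — verify cron syntax and DST behaviour
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,                          # check against your retry policy
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": notify_slack,   # the piece AI output most often omits
    },
):
    PythonOperator(task_id="extract_payments", python_callable=extract_payments)
```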
Infrastructure-as-Code
For Terraform or Pulumi modules that provision data platform resources (Snowflake warehouses, BigQuery datasets, IAM roles, storage buckets), AI generates solid boilerplate from a description. The risk is in security configuration — AI tends toward permissive defaults. Always review IAM policies, network rules, and encryption settings manually. Check that resources use variables for environment-specific values rather than hardcoding dev settings.
SQL Dialect Translation — Covered in Section 2.6. The key risk: syntax translates cleanly but semantics differ (NULL handling, implicit casting, date arithmetic). Row-for-row validation is non-negotiable.
Schema Design and Evolution
AI can generate migration scripts (ALTER TABLE statements, new table DDL) from a description of what changed and why. It's also useful for impact analysis: "I'm adding a currency_code column to orders — what downstream models need updating?" This works well when the agent has access to the dbt project graph (via MCP server or manifest file). Without that access, it can only guess.
2.2 SQL and Transformation Work
SQL and transformation work is where AI provides the most daily value — and where the most dangerous mistakes happen — because incorrect SQL runs without errors and produces results that look reasonable until someone checks the numbers.
Query Generation from Requirements
The baseline use case: you have a business requirement ("monthly revenue by product category, excluding refunds, for the last 12 months") and you need SQL. AI does this well when you provide sufficient context: the relevant table schemas, what "revenue" means in your context (gross vs. net, which column, how refunds are recorded), and the target platform.
Example prompt:
Using the schema context in `schema-context.md`, write a Snowflake SQL
query that calculates monthly revenue by product category for the last
12 complete months.
Business rules:
- Revenue = `order_items.unit_price * order_items.quantity`
- Exclude rows where `orders.status = 'refunded'`
- Use `orders.completed_at` (UTC) for the time dimension, not `created_at`
- A "complete month" means the month has ended — exclude the current month
- Product category comes from `products.category_l1`
Output columns: month (date truncated to first of month), category,
total_revenue, order_count
The specificity matters. "Calculate revenue" without defining which column, which status filter, and which timestamp will produce SQL that looks right but uses the wrong assumptions. The more business rules you provide upfront, the fewer iterations you need.
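For a review baseline, here's a hand-checked sketch of what a correct answer to that prompt looks like — table and column names come straight from the prompt's business rules:

```sql
-- Sketch of the expected output (Snowflake). Note the half-open date range,
-- and that `status <> 'refunded'` also drops rows with NULL status —
-- confirm that's the intended behaviour for your data.
select
    date_trunc('month', o.completed_at)  as month,
    p.category_l1                        as category,
    sum(oi.unit_price * oi.quantity)     as total_revenue,
    count(distinct o.order_id)           as order_count
from orders o
join order_items oi on oi.order_id = o.order_id
join products p     on p.product_id = oi.product_id
where o.status <> 'refunded'
  and o.completed_at >= dateadd('month', -12, date_trunc('month', current_date))
  and o.completed_at <  date_trunc('month', current_date)  -- exclude current month
group by 1, 2
order by 1, 2;
```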
Query Optimization
Platform-specific knowledge matters enormously — Snowflake, BigQuery, and Databricks have fundamentally different execution models. Provide the query profile or EXPLAIN output, specify the platform, and ask for platform-specific recommendations. Always benchmark before and after.
Refactoring Legacy SQL
High-reliability work — the correctness criterion is clear. Always diff results between the original and refactored queries. AI makes subtle semantic changes during "cleanup" (reordering a COALESCE, changing a LEFT JOIN to INNER, altering NULL handling) that look like style improvements but change the output.
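A minimal result-diff sketch for that check, assuming both versions can be materialised in the same warehouse (table names are hypothetical):

```sql
-- Empty result = the refactor preserved output. Any rows = a semantic change.
with original as (
    select order_id, total_amount from analytics.fct_orders_legacy
),
refactored as (
    select order_id, total_amount from analytics.fct_orders_refactored
)
select 'only_in_original' as side, *
from (select * from original except select * from refactored)
union all
select 'only_in_refactored' as side, *
from (select * from refactored except select * from original);
```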
Generating dbt Models
For dbt projects, AI accelerates the model development cycle. The pattern that works best:
- Provide the source schema (or point the agent at it via MCP)
- Describe the target model: what grain, what business logic, where it fits in the DAG (staging → intermediate → mart)
- Reference an existing model of the same type as a pattern to follow
- Ask for the model SQL, the YAML schema file with tests, and documentation — all in one pass
Example prompt:
Create a staging model `stg_payments__transactions` from the source
`raw.stripe.transactions`.
Follow the pattern in `models/staging/stripe/stg_payments__charges.sql`.
Requirements:
- Rename columns to our snake_case convention
- Cast `amount` from cents to dollars (divide by 100, cast to decimal(12,2))
- Convert `created` (Unix epoch) to `created_at` (timestamp_ntz, UTC)
- Add `_loaded_at` from the metadata column
- Filter out test transactions (`livemode = true` only)
Also generate the YAML schema file with appropriate tests
(unique, not_null on the primary key, accepted_values for `status`).
Referencing an existing model as a pattern is the most reliable way to get consistent output. The agent extracts naming conventions, casting patterns, and test structures from the example rather than relying on its own defaults.
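The output for that prompt should look roughly like this sketch — the metadata column name is an assumption, since the prompt doesn't name it; check your actual source:

```sql
-- models/staging/stripe/stg_payments__transactions.sql — illustrative shape only
with source as (
    select * from {{ source('stripe', 'transactions') }}
),

renamed as (
    select
        id                                   as transaction_id,
        cast(amount / 100 as decimal(12, 2)) as amount,      -- cents → dollars
        to_timestamp_ntz(created)            as created_at,  -- Unix epoch → UTC
        status,
        _loaded_at_utc                       as _loaded_at   -- hypothetical metadata column
    from source
    where livemode = true                                    -- drop test transactions
)

select * from renamed
```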
Data Contracts
Drafting specs before code pays off especially well with data contracts.
A data contract defines what a data product exposes to its consumers: schema, grain, freshness SLA, quality guarantees, and ownership. Having AI draft the initial contract from a source schema surfaces assumptions early — before they're baked into SQL that three downstream teams depend on.
The workflow: provide the source schema and a description of what the data product should expose. Ask the agent to draft a data contract in your team's format. Review it with the team, especially grain definition, column semantics, and SLA commitments. Iterate until agreed. Then implement the model from the contract, not from ad-hoc requirements.
The contract becomes both a specification for implementation and a regression baseline — if the model's output violates the contract, something is wrong.
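There's no single standard contract format. As an illustration of the fields worth pinning down, a minimal YAML sketch — every field name here is an assumption; use your team's format:

```yaml
# Illustrative contract shape, not a standard
data_product: fct_payments
owner: data-platform
grain: one row per payment_id
schema:
  - {name: payment_id, type: varchar, constraints: [not_null, unique]}
  - {name: amount_usd, type: "decimal(12,2)", constraints: [not_null]}
  - {name: completed_at, type: timestamp_ntz, notes: "UTC; NULL until settled"}
freshness_sla: loaded by 07:00 UTC daily
quality_guarantees:
  - amount_usd > 0 for all non-refunded payments
```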
2.3 Data Quality and Testing
AI-generated tests share the same fundamental problem as AI-generated tests in application code: they can look thorough while testing nothing meaningful. A dbt test suite where every model has not_null and unique on the primary key feels complete but catches almost none of the real data quality issues — wrong grain, duplicate business keys across sources, values outside business-valid ranges, broken referential integrity with systems that don't enforce foreign keys.
Core Principle reminder: Ensure Test Coverage — AI-generated code should include or be accompanied by appropriate tests. Verify test quality — AI often generates superficial tests that pass but don't validate actual behaviour. Core Principles for Working with AI
Generating dbt Tests
Start by telling the agent what to test, not just "write tests." Specify the business rules that matter:
Example prompt:
Generate dbt tests for `fct_orders`. Focus on:
1. Grain: one row per order_id — test uniqueness
2. Referential integrity: every `customer_id` exists in `dim_customers`,
every `product_id` exists in `dim_products`
3. Business rules:
- `total_amount` must equal sum of related `order_items.line_total`
- `status` must be one of: 'pending', 'completed', 'refunded', 'cancelled'
- `completed_at` must be NULL when status is 'pending',
and NOT NULL when status is 'completed'
- `total_amount` > 0 for all non-cancelled orders
4. Freshness: data should not be older than 24 hours
Use generic tests where possible. Write singular tests for the
business rule validations that can't be expressed as generic tests.
After generating, validate the tests by intentionally breaking the model — change a join condition, remove a filter, alter a calculation — and verify that the tests catch the breakage. If they don't, they're not testing what matters.
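As a reference point, the first business rule from the prompt expressed as a singular test — a select that returns failing rows (the order-items model name is hypothetical):

```sql
-- tests/assert_order_totals_match_line_items.sql
-- dbt fails this test if any order's total drifts from the sum of its items.
select
    o.order_id,
    o.total_amount,
    sum(oi.line_total) as items_total
from {{ ref('fct_orders') }} o
join {{ ref('fct_order_items') }} oi
    on oi.order_id = o.order_id
group by o.order_id, o.total_amount
having abs(o.total_amount - sum(oi.line_total)) > 0.01
```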
Test Data Generation
AI can generate synthetic test data that respects schema constraints and referential integrity. This is useful for local development, CI pipelines, and environments where production data isn't available. Provide the schema, cardinality expectations, and any business constraints (e.g., "order dates should span the last 2 years," "30% of customers should have no orders," "amounts should follow a realistic distribution, not all be $100").
Watch for synthetic data that's technically valid but unrealistic — uniform distributions, all timestamps on the same day, no NULL values where production data commonly has them. Unrealistic test data produces tests that pass in CI but miss production failures.
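On Snowflake, the GENERATOR table function makes this concrete. A sketch with illustrative names and distributions — note the deliberate NULLs and skew, exactly what uniform AI-generated data tends to lack:

```sql
create or replace table dev_db.synthetic.orders as
select
    seq4()                                                   as order_id,
    uniform(1, 5000, random())                               as customer_id,
    dateadd('day', -uniform(0, 730, random()), current_date) as order_date,   -- spans 2 years
    round(10 * exp(uniform(0, 40, random()) / 10.0), 2)      as total_amount, -- right-skewed
    iff(uniform(1, 100, random()) <= 5, null, 'completed')   as status        -- ~5% NULLs
from table(generator(rowcount => 100000));
```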
Coverage Gap Analysis
A high-value, low-risk use of AI — scanning and comparing, not generating logic. See Section 2.8 for the detailed workflow.
2.4 Documentation and Knowledge Management
Onboarding with AI. When you join a project, point the agent at the dbt project with CLAUDE.md and schema context loaded. Ask it to summarise the model graph, identify the most-used mart models, explain the testing strategy, flag undocumented models, and map key data flows ("How does raw order data become fct_orders?"). This compresses the "reading through code" phase from days to hours — so you spend onboarding time asking better questions to the humans, not navigating files. It won't replace a proper handover; the agent doesn't know about the upstream quirks or the business rule that changed last quarter.
Generating dbt YAML description stubs. Point the agent at a model's SQL and ask for the YAML schema with column descriptions. The result will describe what the SQL does, not what the data means — "Joins orders with customers" vs. "Total revenue per customer, completed orders only, refreshed daily at 06:00 UTC." Provide a few examples of good descriptions from your project and the agent will match the pattern consistently.
Reverse-engineering undocumented pipelines. AI handles structural understanding well (joins, filters, grain). It handles intent less well ("why is there a 30-day lookback?" — business rule, performance hack, or legacy bug?). Use AI output as a starting point, fill in intent from team knowledge.
Runbooks. AI generates structurally sound runbook drafts from alerting configuration and failure patterns. Fill in environment-specific details manually (escalation channels, service accounts, on-call rotation).
2.5 Debugging and Incident Response
AI is fast at parsing structured error output and correlating it with code — useful when you're triaging a pipeline failure at 3am.
Root cause analysis from logs. Paste the error log and the relevant model or DAG code. For common failures (schema changes, permission errors, timeouts, type mismatches) AI is accurate and fast. For subtler issues — a query that succeeds but returns wrong results — provide expected vs. actual output and upstream change history.
Failure pattern analysis. Provide historical failure data (timestamps, affected models, error types) and let AI find correlations: the model that fails every month-end due to volume spikes, the source that's unreliable on Mondays. Correlations, not causation — but often enough to point you in the right direction.
Schema change impact. Point the agent at the source change and your dbt project graph to trace which staging models break, which downstream models are affected, and which tests need updating. Faster and more complete than manual grepping — provided the agent has access to the full project.
2.6 Migration and Modernization
Platform migrations — on-prem to cloud, one warehouse to another, ETL tools to code-based pipelines — are where AI earns its keep. The work is repetitive, pattern-heavy, and voluminous: hundreds of SQL objects to translate, thousands of mappings to verify. AI won't get every translation right, but it compresses the bulk of the work and lets you focus review time on the edge cases.
SQL Dialect Translation at Scale
AI handles mechanical translation well (function mappings, type conversions, syntax). Semantic differences are the risk — NULL sorting, implicit casting, date arithmetic, empty-string-vs-NULL behaviour produce queries that parse and run but return different results.
Non-negotiable: run source-to-target reconciliation on every migrated query. AI can generate the reconciliation checks too (High reliability in the Capability Map): AI translates, AI generates the validation, you review both.
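A sketch of the shape those checks take, assuming both sides have been staged into the same warehouse for comparison (table names hypothetical):

```sql
select 'row_count' as check_name,
       src.v as source_value, tgt.v as target_value, src.v = tgt.v as passed
from (select count(*) as v from staging.legacy_orders) src,
     (select count(*) as v from analytics.orders) tgt
union all
select 'revenue_sum',
       src.v, tgt.v, abs(src.v - tgt.v) < 0.01
from (select sum(total_amount) as v from staging.legacy_orders) src,
     (select sum(total_amount) as v from analytics.orders) tgt;
-- Matching counts and sums is necessary, not sufficient — add per-key
-- EXCEPT diffs (as in Section 2.2) for critical tables.
```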
Reverse-Engineering ETL Tools
Legacy ETL tools (Informatica, SSIS, Talend, DataStage) store transformation logic in proprietary formats — XML definitions, visual mappings, metadata repositories. AI can parse these exports and generate equivalent code in dbt, Python, or Spark. The output is a draft, not a finished product. ETL tools encode conditional logic, error handling, and data routing in ways that don't map cleanly to SQL, and AI may miss or simplify branching paths.
Provide the ETL tool export alongside a description of the expected behaviour. Review the generated code against the original tool's documentation, not just against the export file — the export may not capture runtime configuration or server-level settings.
Schema Mapping
Generating the mapping between source and target schemas is repetitive manual work that AI accelerates. Provide both schemas and any existing mapping documentation. AI matches columns by name, type, and context, generates the mapping table, and flags columns that don't have obvious matches.
Review the flagged items — these are usually the cases that require business knowledge (the source column AMT in the legacy system maps to gross_amount in the target, not net_amount, and the conversion requires dividing by 100 because the legacy system stores cents). Name-based matching gets you 80% of the way; the remaining 20% is where the real domain knowledge lives.
Migration Planning
AI can help sequence migration phases by analysing dependency graphs — which objects need to migrate first because others depend on them, which can be parallelised, where the critical path is. Provide the full source system inventory and dependency information. The output is a useful starting point for planning, but expect it to underestimate complexity at dependency boundaries (cross-system dependencies, shared lookup tables, circular references that need to be broken).
2.7 Streaming and Real-Time Pipelines
Streaming is where the gap between "AI generates plausible code" and "the code is actually correct" is widest. Batch SQL has a clear correctness criterion: given the same input, does the query produce the expected output? Streaming adds temporal semantics, state management, ordering guarantees, and failure recovery — dimensions that AI handles unreliably because the failure modes are subtle and context-dependent.
That said, AI is still useful for streaming work. The key is knowing where to trust it and where to increase your review intensity.
The examples below use Apache Flink because of its explicit handling of time, state, and watermarks — which is where AI struggles most. The core principles apply equally to Kafka Streams, ksqlDB, and Spark Structured Streaming: review window semantics carefully, validate state management, and test failure recovery scenarios. AI's SQL generation strengths transfer to ksqlDB and Spark SQL, but the same caveats about temporal semantics and exactly-once guarantees apply across all frameworks.
Flink Job Scaffolding
AI generates solid Flink job structure — source/sink configuration, basic pipeline topology, serialisation setup, checkpoint configuration. This is boilerplate-heavy work where the patterns are well-established. Provide your Flink version, project conventions, and source/sink configurations. Review checkpoint intervals, state backend configuration, and error handling — these are the settings where wrong defaults cause problems that only surface under load.
Example prompt:
Scaffold a Flink job (Python/PyFlink) that:
- Consumes from Kafka topic `events.page_views` (Avro, schema registry at [URL])
- Computes a 5-minute tumbling window count of page views per user_id
- Writes results to Kafka topic `analytics.pageview_counts` (JSON)
- Uses RocksDB state backend with 1-minute checkpoint interval
- Uses event-time processing with a 10-second out-of-orderness bound (watermark)
- Sets allowedLateness = 1 minute in window aggregation
Follow our project structure in CLAUDE.md. Include proper
deserialization error handling (dead-letter queue for malformed messages).
Flink SQL
For simple streaming operations — filtering, projection, joins against relatively static lookup tables, tumbling window aggregations — Flink SQL generated by AI is reasonably reliable. The syntax is close enough to standard SQL that AI's strengths in SQL generation carry over.
For complex event processing — session windows with custom triggers, pattern detection with MATCH_RECOGNIZE, stateful operations that accumulate across windows — AI output needs heavy review. The semantics of streaming SQL diverge from batch SQL in ways that AI frequently gets wrong: missing watermark configuration, no state time-to-live management, or producing a changelog output stream where an append-only one is expected.
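To ground both paragraphs, a Flink SQL sketch of the simple case — the tumbling-window count from the scaffolding prompt. The WATERMARK declaration is exactly the configuration AI output most often omits (connector properties abbreviated; verify syntax against your Flink version):

```sql
create table page_views (
    user_id    string,
    event_time timestamp(3),
    watermark for event_time as event_time - interval '10' second  -- out-of-orderness bound
) with (
    'connector' = 'kafka',
    'topic'     = 'events.page_views',
    'format'    = 'avro'
    -- bootstrap servers and schema registry properties omitted
);

select window_start, user_id, count(*) as view_count
from table(
    tumble(table page_views, descriptor(event_time), interval '5' minutes)
)
group by window_start, window_end, user_id;
```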
Batch-to-Streaming Translation
AI is least trustworthy here because the translation isn't mechanical. A batch GROUP BY aggregates bounded data; a streaming window aggregation handles late-arriving data, out-of-order events, and window firing semantics. AI produces syntactically equivalent translations that miss the semantic gap — no watermark configuration, no late data policy, no handling of post-window-close arrivals. Treat the output as a conversation starter, not a draft.
Streaming Contracts (Avro/Protobuf)
High-reliability AI task. Provide the domain model and compatibility requirements (backward, forward, full). Review backward/forward compatibility annotations and default values carefully — wrong defaults propagate to every consumer and are painful to fix retroactively.
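A minimal Avro sketch of the compatibility mechanic worth reviewing: a field added after consumers exist must be optional with a default, so readers on the new schema can still decode records written without it (the schema itself is illustrative):

```json
{
  "type": "record",
  "name": "PageView",
  "namespace": "events",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "page", "type": "string"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
```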
Monitoring and Alerting
AI can generate consumer lag thresholds, throughput anomaly detection, and checkpoint duration alerts. The main review point is threshold calibration — AI tends toward thresholds that are either too tight or too loose. Start with AI-generated thresholds, then calibrate against observed baseline behaviour.
2.8 Cross-Referencing and Consistency Verification
Agents excel at cross-referencing — the kind of consistency checking that humans do by grepping through files and mentally tracking dependencies. Here are the specific workflows where that capability pays off.
These workflows depend on agent access. With a workspace-aware tool (Claude Code, Copilot in agent mode) or MCP servers connected to your project, they're practical and fast. Without access, you're back to copy-pasting files and losing most of the benefit.
Column Rename Impact Analysis
You're renaming user_id to customer_id in a staging model. Before you make the change, you need to know every downstream model that references the old name. Feed the agent the proposed change and the dbt project, and ask it to trace every reference — model SQL, YAML schema files, test definitions, documentation.
Example prompt:
I'm planning to rename the column `user_id` to `customer_id`
in `stg_accounts__users`. Before I make this change, find every
reference to this column in downstream models, tests, YAML files,
and documentation. List each reference with the file path and line.
This is faster and more complete than a manual grep because the agent understands SQL context — it can distinguish between a reference to stg_accounts__users.user_id and an unrelated user_id column in a different model.
Metric Definition Consistency
In larger projects, the same business concept (revenue, churn, active users) gets defined in multiple places: dbt metrics, BI tool semantic layers, documentation, ad-hoc queries shared in Slack. These definitions drift over time. Feed the agent all the definitions you can find for a given metric and ask it to compare them and flag inconsistencies.
This is analysis, not code generation — the output is a comparison table showing where definitions agree and where they diverge. The human decides which definition is correct and how to reconcile the others.
Test Coverage Gap Analysis
Point the agent at your dbt project and ask: which models have no tests? Which have only generic tests (not_null, unique) but no business rule tests? Which sources are missing freshness checks? Which relationships tests are defined but reference models that no longer exist?
The output is a prioritised gap list. You decide which gaps matter based on the model's criticality, how often it changes, and how many consumers depend on it. A mart model used by ten dashboards with no business rule tests is a higher priority than a staging model with no freshness check.
Documentation Completeness Audit
Same approach as test coverage: scan for models with no descriptions, auto-generated filler, missing column docs. Run quarterly — documentation degrades as models evolve.
The Limits of AI-Assisted Lineage
These cross-referencing workflows work well within a dbt project or a single codebase where the agent can read all the files. Real data environments span orchestrators, ingestion tools, warehouses, and BI layers — and AI can't trace lineage across those boundaries without explicit access to each system. If your organisation uses a data catalog (DataHub, Atlan, OpenMetadata), an MCP server for that catalog (see Section 3.2) can extend the agent's reach. But for most teams, cross-tool lineage remains a manual effort or a platform-level feature — not something you can hand to an agent.
3. Tooling for Data Engineers
This section covers AI tooling that's specific to data engineering work. General tool profiles, model selection, cost optimisation, and MCP fundamentals live in the external tooling playbook — we don't repeat them here.
📖 Reference: For tool profiles (Claude Code, GitHub Copilot, Cursor, Codex) and general setup guidance — see AI Common Playbook, 3. AI Coding Assistants.
3.1 IDE and Environment Considerations
Data engineers work across a wider range of environments than most software developers. Your daily workflow might span a terminal (dbt commands, git), a SQL editor (DataGrip, DBeaver, the Snowflake console), a notebook (Databricks, Jupyter), and an IDE (VS Code) — sometimes all in the same day. Where you bring AI into that workflow depends on what you're doing.
File-based development (dbt, Airflow, IaC): Terminal agents and IDE extensions work here the same way as in application development. For dbt projects, the terminal agent workflow in Part II is the recommended starting point.
SQL editors and warehouse consoles: Most (DataGrip, DBeaver, Snowflake worksheet, BigQuery console) don't have deep AI integration yet. Use a terminal agent or general-purpose chat in a side window — paste schema context, write the query collaboratively, move the result to your SQL editor. JetBrains SQL tools have growing AI features via their AI plugin.
Notebooks: See Appendix D. Short version: Databricks Assistant has built-in catalog access; Jupyter needs jupyter-ai or a side-window chat; Colab has Gemini. Notebooks suit prototyping; promote to version-controlled files for production.
3.2 DE-Specific MCP Servers
📖 Reference: For MCP fundamentals, setup instructions, security risks, and the security checklist — see the AI Common Playbook, Model Context Protocol (MCP).
MCP servers give AI agents access to external tools and data sources at runtime. For data engineers, the most valuable MCP servers are the ones that provide schema and metadata access — eliminating the need to paste DDL into every session.
Database MCP Servers
Servers for Postgres, Snowflake, BigQuery, and other databases let the agent query information_schema, inspect table definitions, check row counts, and — if configured — run read-only queries. This is powerful: instead of maintaining schema summary files manually, the agent gets live metadata.
The trade-off is security. Connecting an AI agent to a database means the agent can see — and potentially include in its context — whatever the connection credentials allow. On a development database with synthetic data, this is low-risk and high-reward. On anything with real customer data, treat the MCP server connection like any new service account.
Recommended configuration for database MCP servers:
- Use a dedicated read-only service account, not your personal credentials
- Restrict access to `information_schema` and metadata views where possible — the agent rarely needs to query actual data rows (a Snowflake sketch of such an account follows this list)
- Avoid connecting to production databases. If production metadata access is necessary (e.g., `information_schema` only), use a tightly scoped read-only account with no access to data rows. Default to development or staging environments with synthetic or anonymised data
- If the agent does need sample data for debugging, create a curated sample dataset and expose only that
- Log queries made through the MCP server for audit purposes
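A Snowflake sketch of such an account — role, user, and database names are hypothetical. The REFERENCES grant exposes table structure (via DESCRIBE/SHOW) without granting access to data rows:

```sql
create role if not exists ai_agent_ro;
grant usage on database analytics to role ai_agent_ro;
grant usage on all schemas in database analytics to role ai_agent_ro;
grant references on all tables in database analytics to role ai_agent_ro;  -- structure, not data
-- deliberately no SELECT grants; add "on future tables" variants as needed
grant role ai_agent_ro to user ai_agent;  -- dedicated service user, not a person
```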
Orchestrator Integrations
Astronomer's official MCP server for Airflow (open-source, works with Airflow 2.x and 3.x) exposes DAG status, task logs, and management operations — trigger, pause/unpause, variable and connection handling. The security implications of an agent that can trigger pipeline reruns are real; scope its permissions carefully and treat it as a write-capable connection.
Data Catalog Connectors
If your organisation uses a data catalog (Atlan, DataHub, Alation, OpenMetadata), an MCP server that exposes catalog metadata to AI agents could provide rich context: table descriptions, data lineage, ownership information, quality scores. This space is emerging. Check whether your catalog vendor offers an MCP integration or API that could serve as one.
On trusting MCP servers and skills
MCP servers and agent skills execute with the permissions of the credentials they're given, and skills inject instructions directly into the agent's context in ways that aren't always visible. Prefer official integrations from verified providers. For community or custom skills, review the source before installing — understand what instructions they inject, what tools they invoke, and whether they have access to credentials or sensitive paths. A skill from an unknown source is no different from running an untrusted script.
For a curated list of official MCP servers and skills from data engineering tool providers, see Appendix F.
3.3 dbt Integrations
dbt sits at the centre of most modern data engineering stacks, and the AI integration points are maturing.
dbt Cloud AI features: dbt Cloud includes AI-powered documentation generation and query explanation. These features use the dbt project metadata (manifest, catalog) for context, which gives them an advantage over general-purpose tools for dbt-specific tasks. Check the dbt Cloud docs for current capabilities — they expand with each release.
dbt MCP servers: Community-maintained MCP servers for dbt expose project metadata (models, tests, lineage) to AI agents. These can be useful for cross-referencing workflows (Section 2.8) and documentation generation (Section 2.4). Apply the same caution as with any community-maintained tool: review the source code before installing, pin the version, and check for updates before upgrading. The external tooling playbook's MCP security guidance applies fully here.
dbt-specific agent skills: If your team uses Claude Code, consider creating skills for recurring dbt tasks: generating a new staging model from a source, adding standard tests to a model, generating YAML documentation for undocumented models. A well-crafted skill encodes your project's specific conventions and produces output that's consistent from the first invocation, saving the iteration time you'd spend teaching a generic agent your patterns.
dbt Power User (VS Code extension): Altimate AI's dbt Power User adds model generation, query explanation, test scaffolding, and lineage visualization directly in VS Code. If your team already works in VS Code, it's worth evaluating alongside terminal-based agents — the IDE integration means less context-switching for tasks like documentation and test generation. Altimate AI also publishes open-source data engineering skills for Claude Code and other agents (listed in Appendix F).
3.4 Semantic Layers and AI
Semantic layers define metrics, dimensions, and business logic in a single, governed place — so that "revenue" means the same thing whether it's queried from a dashboard, a notebook, or a Slack bot. For AI workflows, this matters in both directions.
Semantic layers make AI more reliable. When an LLM generates a query, it's guessing at business logic unless you give it something better. A semantic layer provides exactly that — governed metric definitions, join paths, and query instructions that constrain the model to produce correct SQL rather than plausible SQL. Instead of the LLM deciding how to calculate revenue, it follows the definition you've already codified. This is the most practical lever for making natural-language-to-SQL workflows trustworthy.
The major platforms each approach this differently: dbt Semantic Layer (MetricFlow-based metrics as code), Snowflake Cortex Analyst (SQL generation grounded in a YAML-defined semantic model), and Databricks Unity Catalog with AI Functions (catalog metadata + in-SQL AI operations). The richer your semantic metadata, the better the AI output across all of them.
AI can also help build the semantic layer. Drafting metric definitions from existing schemas is repetitive, pattern-clear work that agents handle well. They can also detect inconsistencies between semantic layer definitions and legacy SQL. What they can't do is decide what a metric should mean — that's still a human decision.
3.5 Platform-Native AI Features
Every major data platform now includes built-in AI features. These are worth knowing about, but they change fast — treat this as a snapshot and check the vendor docs for current state.
| Platform | Feature | Good For | Limitations |
|---|---|---|---|
| Snowflake | Cortex AI | SQL generation, data exploration, in-warehouse ML functions, document AI | Tied to Snowflake ecosystem; model selection limited to what Snowflake offers; works best with well-structured metadata |
| BigQuery | Gemini in BigQuery | SQL generation and explanation, query optimisation suggestions, data exploration | Quality depends heavily on dataset metadata completeness; best with well-documented tables |
| Databricks | Databricks Assistant | Notebook code generation (SQL, Python, Scala), data exploration, error explanation | Works within Databricks notebooks; less useful for dbt or file-based workflows outside the Databricks environment |
When to use platform-native vs. general-purpose tools: Platform-native tools already have your warehouse metadata — useful for quick in-console exploration and ad-hoc queries. General-purpose tools (Claude Code, Copilot, Cursor) see your entire project and handle broader tasks (orchestration, IaC, documentation, multi-file refactors). In practice, use both.
4. Security, Privacy, and Data Governance
Data engineers handle production data routinely — raw event streams, customer records, financial transactions, health data. This makes the privacy and security stakes higher than in most application development work. The "sanitise before pasting" guidance from the Core Principles applies, but the surface area is larger and the risks are more specific.
📖 Reference: For general AI tool data handling policies, privacy tiers, subscription vs. API security guarantees, and the base sanitisation checklist — see the AI Common Playbook, 4. Privacy and Data Handling.
Core Principle reminder: Protect Sensitive Information and Control Data Usage — Understand your boundaries. Follow your organization's policies on what data can be shared with GenAI tools. Ensure your inputs aren't collected for training. Core Principles for Working with AI
Protecting Production Data from Accidental Destruction
AI-generated code can destroy production data. This isn't hypothetical — databases have been wiped because an AI-generated script ran a DROP TABLE, a DELETE without a WHERE clause, or a TRUNCATE against the wrong environment. But it's not just SQL. Maintenance scripts, Terraform runs, and migration tools have caused the same damage — any code path that touches infrastructure or data is a vector.
The core problem: AI generates write operations as confidently as read operations. It doesn't distinguish between a development sandbox and a production warehouse. A prompt like "clean up the staging table" can produce a statement that works perfectly — against whichever database the session happens to be connected to.
Models can also be surprisingly creative in how they reach resources. An agent told to accomplish a task may find non-obvious paths — invoking a CLI tool, calling an API, generating a script that shells out to another tool, or chaining permissions in ways you didn't anticipate. This means relying solely on command-by-command approval is not enough. If the credentials and network access exist, a sufficiently capable model will find a way to use them. The real safeguard is ensuring the access doesn't exist in the first place.
Practices that prevent this:
Enforce read-only access by default. Every service account, MCP connection, and agent credential should be read-only unless there's a specific, documented reason for write access. If a workflow genuinely requires writes, use a dedicated account scoped to exactly the tables and operations needed — never broad WRITE or ALL PRIVILEGES grants.
Route production writes through automated pipelines, not standing access. Ideally, no human or agent has standing write access to production. Changes reach production through CI/CD pipelines, orchestrators (Airflow, Dagster, Prefect), or managed deployment tools — with code review, approval gates, and audit trails. If someone needs to run a one-off fix against production, that should be a controlled, time-limited exception with a documented reason — not a credential that's always available. Where direct access is unavoidable, use service account impersonation rather than granting persistent credentials — it provides the same access with a time-bound scope and a clear audit trail tied to the requesting identity.
Separate credentials per environment. Production, staging, and development should use different credentials with different permission levels. Never store production connection strings in the same configuration files or environment variables that agents or local tools can access. If an agent can reach it, assume it eventually will.
Review every write operation as a destructive action. INSERT, UPDATE, DELETE, DROP, TRUNCATE, MERGE, ALTER — treat AI-generated versions of these with the same scrutiny you'd give someone rm -rf on a shared server. Read the WHERE clauses. Check the target schema. Verify the environment.
Never run AI-generated DDL or DML against production without a dry run. Use EXPLAIN plans, SELECT equivalents of DELETE statements (to see what would be affected), and transaction wrappers that you can roll back. On platforms that support it, use BEGIN ... ROLLBACK to validate before committing.
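A minimal Snowflake sketch of that pattern — the table and retention window are hypothetical:

```sql
-- 1. Preview what the DELETE would touch
select count(*) as rows_affected, min(loaded_at), max(loaded_at)
from raw.payments.daily_extract
where loaded_at < dateadd('day', -90, current_timestamp);

-- 2. Run inside a transaction; inspect before anything becomes permanent
begin;
delete from raw.payments.daily_extract
where loaded_at < dateadd('day', -90, current_timestamp);
rollback;  -- swap for COMMIT only after the count matches step 1
```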
Use platform safeguards. Most data platforms offer protections — enable them. Snowflake's FAIL_SAFE and TIME_TRAVEL, BigQuery's table snapshots, Postgres pg_dump before migrations, dbt's --defer and --empty flags for safe testing. These aren't substitutes for careful review, but they're your safety net when review fails.
Tag and isolate agent-initiated operations. If agents execute queries, use session tags or query comments so that any destructive operation can be traced back to its origin. When something goes wrong at 2am, you need to know whether it was a scheduled job, a human, or an agent.
Data Classification Awareness
Not all tables are equal. dim_date is a reference table with no privacy implications. raw.stripe.charges contains financial transaction data. users.pii is self-explanatory. Your awareness of data classification should determine how much context you provide to AI tools and through which channels.
If your project has a data classification scheme (most regulated industries do), reference it. If it doesn't, apply common sense: anything that could identify a person, reveal financial information, or expose health data requires the strictest handling. When in doubt, use schema-only context — no data rows.
What to Share and What to Withhold
✗ Production data rows containing PII, financial data, or health records
✗ Real customer IDs, account numbers, or transaction details
✗ Connection strings with embedded credentials
✗ Schema names that reveal confidential business structures (consider aliases)
✗ Query results from production — even aggregated data can be sensitive
✗ Sample data exports from production tables, even "just a few rows"
✓ Schema definitions (DDL) without data — these are almost always safe
✓ Synthetic/fake data rows that match the schema structure
✓ Anonymised samples where actual data patterns are needed for debugging
✓ Column names and types referenced generically when the schema itself is confidential
✓ dbt model SQL and YAML — code, not data
✓ Error messages and stack traces (check for embedded data values first)
Practical tip: When you need AI to help debug a data issue and sample data is necessary, create a synthetic dataset that reproduces the problem pattern. If a join is producing unexpected duplicates, create a minimal set of fake rows that demonstrates the duplication — don't paste the actual production rows where the issue occurs.
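For example, a fan-out join repro needs only a handful of literal rows — a sketch with entirely fabricated values:

```sql
-- Two subscription rows per customer turn one order into two result rows:
-- the duplication pattern, demonstrated with zero production data.
with orders as (
    select 1 as order_id, 101 as customer_id, 50.00 as amount
),
subscriptions as (
    select 101 as customer_id, 'basic' as plan
    union all
    select 101, 'pro'
)
select o.order_id, o.amount, s.plan
from orders o
join subscriptions s on s.customer_id = o.customer_id;
-- returns 2 rows for order_id 1 — hand this to the agent, not real rows
```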
Client and Regulatory Constraints
GDPR, HIPAA, SOX, PCI-DSS, and industry-specific regulations may restrict what data context you can share with AI tools — even with enterprise-tier data handling guarantees. Some clients require that all data processing stays within specific geographic regions (EU, US). Some prohibit the use of AI tools on their data entirely. Some allow it with specific approved tools and configurations.
Verify the constraints at the start of every engagement. Check the client's AI usage policies and data processing agreements. Confirm which tools are approved, that model training on client data is off, and that built-in platform AI features (Snowflake Cortex, Databricks Assistant, BigQuery Gemini) are explicitly covered. Don't assume that because you used a tool on the last project, it's approved for this one.
5. Common Pitfalls in AI-Generated Data Engineering Work
AI-generated data engineering code has a specific failure mode that makes it more dangerous than buggy application code: it runs without errors and produces results that look reasonable. A wrong API response throws an exception. A wrong SQL query returns a number — just not the right number. You find out when a stakeholder asks why the dashboard doesn't match the spreadsheet, or worse, when a business decision gets made on incorrect data.
Core Principle reminder: Review, Test, and Verify All Outputs — Critically examine every line of AI-generated code. Test thoroughly for functionality, security vulnerabilities, logic errors, and edge cases. Never commit code you don't understand. Core Principles for Working with AI
SQL That Runs but Produces Wrong Results
The most common and costly category. AI gets join types wrong (especially composite keys and date-range joins), silently introduces grain errors through unresolved many-to-many relationships, misses or misapplies filter conditions (wrong status values, wrong date column), and mishandles NULLs in WHERE clauses. The output looks plausible — the numbers are just wrong. Always validate against known-good results before promoting.
Transformations That Work on Sample Data but Fail at Scale
Queries that run fine on 10,000 development rows fail or degrade on 500 million production rows. Common causes: missing partition pruning, joins that explode from one-to-many to many-to-many at real cardinality, and window functions that spill to disk. Always check query profiles at representative data volumes.
Pipeline Code That Doesn't Handle Real-World Conditions
AI-generated pipelines assume clean, first-run scenarios. They lack idempotency (reruns produce duplicates), miss late-arriving data (simple MAX(loaded_at) patterns skip records that arrive out of order), and ignore schema evolution (no defensive handling for source schema drift).
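A common fix for the late-arrival gap is a lookback window in the incremental filter. A dbt sketch — the source, key, and 3-day window are assumptions to tune for your data:

```sql
{{ config(materialized='incremental', unique_key='event_id') }}

select * from {{ source('app', 'events') }}
{% if is_incremental() %}
-- reprocess a trailing window instead of a bare MAX(loaded_at) cutoff,
-- so late-arriving rows are picked up; unique_key keeps reruns idempotent
where loaded_at > (select dateadd('day', -3, max(loaded_at)) from {{ this }})
{% endif %}
```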
Data Quality Rules That Don't Catch Real Issues
AI-generated tests default to structural checks (not_null, unique, accepted_values) that confirm table shape but miss business logic errors. Watch for overly broad accepted values that include data quality issues as valid, missing behavioural tests that verify calculations and business rules, and absent source freshness checks.
Streaming-Specific Pitfalls
AI-generated Flink SQL commonly misconfigures window boundaries and watermarks, neglects state cleanup and TTL (causing unbounded state growth in production), and produces jobs that lack end-to-end exactly-once semantics because sink connectors aren't configured for transactional writes.
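A sketch of the two configuration points that most often go missing — connector options are elided, and the interval and TTL values are assumptions to tune:
-- A watermark bounds out-of-orderness; without it, event-time windows
-- may never fire or may silently drop late events.
CREATE TABLE payments (
    payment_id STRING,
    amount DECIMAL(12, 2),
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH ('connector' = 'kafka' /* topic, servers, format elided */);

-- Bound state growth for joins and aggregations that would otherwise retain state forever.
SET 'table.exec.state.ttl' = '1 d';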
Over-Reliance on AI for Performance Optimisation
AI's optimisation advice is often generically correct but specifically wrong for your platform, data distribution, or cluster configuration. Always benchmark before and after with representative data volumes — check the query profile, not just the result.
The Review Checklist
Before approving any AI-generated data engineering code, check:
✓ Grain: does the model produce the expected number of rows per entity per time period? (see the sketch after this checklist)
✓ Joins: is every join type correct? Are join conditions complete (no missing composite key columns)?
✓ Filters: do WHERE clauses match the actual data values in the source? Is NULL handling correct?
✓ Aggregations: are all non-aggregated columns in the GROUP BY? Is the aggregation level correct?
✓ Incremental logic: is the model idempotent? Does it handle late-arriving data?
✓ Tests: do the tests verify business behaviour, not just table structure?
✓ Performance: does the query perform acceptably at production data volumes?
✓ Naming: do column names, model names, and file locations follow project conventions?
✓ Validation: has the output been compared against known-good results?
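A mechanical starting point for the grain and validation checks — `your_model` and the grain columns are placeholders to adapt:
-- Returns zero rows when the model is truly one row per entity per period.
select entity_id, date_day, count(*) as row_count
from {{ ref('your_model') }}
group by entity_id, date_day
having count(*) > 1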
6. Measuring AI Impact in Data Engineering
You'll be asked whether AI is making a difference. That requires measurable signals, appropriate caveats, and a baseline established before AI adoption is fully embedded.
📖 Reference: Universal metrics (ROI, Developer Experience) — see AI Common Playbook, 6. Measuring AI Impact — Universal Metrics.
Observable Metrics
- Pipeline development velocity — cycle time from task start to production deployment, averaged across sprints
- Data incident rate — data quality incidents per sprint or release (schema violations, failed checks, consumer-reported issues)
- Defect escape rate — proportion of issues caught in CI/staging vs. discovered in production
- Test coverage — percentage of models with meaningful tests (business rule validation, not just not_null/unique)
- Documentation coverage — percentage of models and columns with descriptions, extractable from dbt manifest/catalog
- Migration throughput — objects migrated per sprint (SQL translated, reconciliation tests written, validations completed)
- Incident response time — time from alert to root cause identification, and from root cause to resolution
These metrics show direction, not precision — isolating AI's contribution from other factors is genuinely hard. Present signals as evidence of a trend, not proof of a specific improvement factor. Start tracking before AI is fully embedded: the baseline you build now is the only one that supports an honest before-and-after comparison.
Part II — Quick Start
Introduction
You have a real data engineering task in front of you and one question: "What do I actually do, step by step?"
This Quick Start answers that. It gives you a concrete, repeatable workflow for agentic data engineering — one that works well as a starting point and that you can adapt as you build confidence.
The toolset for this Quick Start:
- Claude Code — Anthropic's terminal-native agentic coding assistant
- dbt + Snowflake — transformation framework and warehouse
- GitHub — version control and collaboration
- Airflow — referenced for orchestration context, but the Quick Start focuses on dbt model development as the most common entry point
It's a concrete path you can follow today. Once you're comfortable with it, swap components, add spec-driven frameworks (see Development Playbook, Part I, 2. Spec-Driven Development), or integrate other tools.
The setup steps are specific to this stack, but the underlying strategy applies broadly. Whether you're working with Databricks notebooks, a BigQuery + Dataform project, a Spark-based pipeline, or a different orchestrator entirely — the same principles hold: load structured context, plan before implementing, review AI output as you would a junior engineer's PR. The commands and config syntax change; the workflow doesn't.
Using a different CLI agent? The workflow in this Quick Start translates directly to other terminal-based agents — Codex CLI, Gemini CLI, Aider, and others follow the same pattern: load context, plan, implement, review. The tool-specific differences are in configuration syntax, not in how you work. CLAUDE.md is Claude Code's project instructions file; the equivalent in other tools is AGENTS.md (an emerging cross-tool convention), .aider.conf.yml, or similar. Whatever the filename, the content is the same: project structure, conventions, MCP connections, and what the agent should and shouldn't do.
📖 Reference: For Claude Code installation, authentication, and general configuration — see the AI Common Playbook, Claude Code. This Quick Start covers only DE-specific setup.
Prerequisites: Claude Pro, Max, or Team subscription (or Anthropic Console account), Git and GitHub set up, Node.js 18+ installed, dbt CLI configured for your project, access to a Snowflake development environment.
Step 1 — Prepare Your Data Project for AI
1.1 Install and Configure Claude Code
Follow the installation guide in the AI Common Playbook for base setup. Once installed, launch Claude Code from your dbt project root:
cd ~/projects/your-dbt-project
claude
Grant DE-specific permissions. Claude Code asks for approval before running commands. For a dbt project, pre-approve the commands you run frequently by adding them to your project's .claude/settings.json:
{
"permissions": {
"allow": [
"Bash(dbt *)",
"Bash(sqlfluff *)",
"Edit(models/**)",
"Edit(tests/**)",
"Edit(macros/**)"
]
}
}
Keep approvals scoped to your project directories. Don't grant blanket write access — the guardrails exist for good reasons, especially when working with client code.
1.2 Create Your CLAUDE.md
The CLAUDE.md file tells the agent about your project. It's loaded into every session automatically, so keep it concise and specific.
Bootstrap it with Claude Code:
> /init
Review the generated file, then refine it. A good CLAUDE.md covers: stack and versions, project structure and naming conventions, SQL and dbt coding standards, key commands, and what not to do. See Appendix B for ready-to-use templates for dbt and Airflow projects.
What not to include: Don't explain dbt concepts, SQL fundamentals, or Snowflake features. The agent knows those. State your preferences and constraints — the things that make your project different from defaults.
1.3 Provide Schema Context
The harder part of DE-specific AI setup: giving the agent enough schema and metric knowledge to generate correct SQL without pasting thousands of lines of DDL every session. The right approach depends on what your project has in place.
Preferred: semantic layer + MCP. If your project uses a semantic layer (dbt Semantic Layer with MetricFlow, Snowflake Cortex Analyst, or Databricks Unity Catalog), this is the best source of context for the agent. Metric definitions, grain, dimensions, and relationships are already codified — the agent uses governed definitions rather than guessing how revenue is calculated. Connect the agent to it via an MCP server so it can query metadata on demand.
Even without a full semantic layer, a database MCP server connected to your development warehouse lets the agent query information_schema, inspect table structures, check row counts, and explore relationships live. No manual maintenance, always current. See Section 3.2 for setup and security configuration — use a read-only service account against dev/staging only, never production.
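The kind of lookup this enables, run by the agent on demand rather than pasted by you:
select column_name, data_type, is_nullable
from information_schema.columns
where table_schema = 'STRIPE' and table_name = 'CHARGES'
order by ordinal_position;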
The dbt manifest and catalog JSON files are another strong option: they give the agent model definitions, column descriptions, lineage, and test coverage without any live database connection.
Fallback: manual schema summary files. If none of the above are set up yet, create a docs/schema-context.md that describes your key tables:
## Source: raw.stripe
### raw.stripe.charges
- Grain: one row per charge attempt
- Primary key: id (varchar)
- Key columns: amount (integer, cents), currency (varchar),
status (varchar: succeeded/failed/pending),
created (integer, Unix epoch UTC)
- Volume: ~500K rows/day
- Refresh: real-time via Fivetran, typical latency <5 minutes
### raw.stripe.refunds
- Grain: one row per refund
- Primary key: id (varchar)
- Foreign key: charge_id → charges.id (a charge can have multiple refunds)
- Key columns: amount (integer, cents), reason (varchar, nullable),
created (integer, Unix epoch UTC)
Cover the tables the agent will work with most — it doesn't need to be exhaustive. Treat this as a stepping stone: manual files are faster to set up, but MCP or manifest-based approaches are more reliable and worth investing in as the engagement matures.
1.4 Set Up Linters and Formatters
Configure SQLFluff for your Snowflake dialect and set up a Claude Code PostToolUse hook to auto-format SQL files after every Write or Edit. Do the same with ruff or black for Python files (macros, scripts). This catches formatting issues before they reach review.
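A sketch of such a hook in .claude/settings.json — verify the current hooks schema in the Claude Code docs before relying on it; the jq filter is illustrative and assumes the hook receives the tool input as JSON on stdin:
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | grep '\\.sql$' | xargs -r sqlfluff fix"
          }
        ]
      }
    ]
  }
}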
Step 2 — Plan Before You Build
You have a task: build a new mart model, add a data source, create a set of quality checks, or refactor an existing model. Before writing any SQL, plan.
2.1 Enter Plan Mode
Press Shift+Tab twice to enter plan mode. In this mode, Claude reads files and reasons about your project but won't edit anything or run commands.
2.2 Write Your Prompt
Give Claude the task with data-specific context. Be explicit about grain, business rules, and constraints.
Example prompt:
Build a mart model `fct_customer_lifetime_value` that calculates
lifetime value per customer.
Source tables (see schema-context.md):
- stg_stripe__charges (one row per charge, amount in dollars)
- stg_stripe__refunds (one row per refund, amount in dollars)
- dim_customers (one row per customer)
Business rules:
- LTV = total charges minus total refunds per customer
- Include only charges with status = 'succeeded'
- Include only customers with at least one succeeded charge
- Use charge `completed_at` for time-based calculations
- Calculate: total_revenue, total_refunds, net_ltv, first_order_at,
last_order_at, total_order_count
Grain: one row per customer_id
Materialization: table (not incremental — the table is small enough)
Follow the patterns in `fct_orders.sql`. Clarify anything before
proposing a plan.
2.3 Clarify Through Dialogue
Claude will likely ask about edge cases. Answer them:
Claude: "How should I handle customers whose only charges have
been fully refunded? Include them with net_ltv = 0,
or exclude them?"
You: "Include them with net_ltv = 0. They're still customers."
Claude: "Should refunds be capped at the charge amount, or can a
refund exceed the original charge (e.g., goodwill credits)?"
You: "Refunds can exceed charge amount — don't cap."
2.4 Review and Save the Plan
Once the plan looks right, ask Claude to save it:
> Save this plan as docs/plans/fct-customer-ltv-plan.md
Data-specific plan review checklist:
- Does the plan address the correct grain? (one row per customer, not per order)
- Are joins specified with the right cardinality? (charges to refunds is one-to-many)
- Is the materialisation appropriate for the data volume?
- Are there assumptions about data freshness or completeness that need validating?
- Are edge cases explicitly addressed (NULLs, zero values, negative amounts)?
Step 3 — Implement with the Agent
3.1 Switch to Implementation Mode
Press Escape or Shift+Tab to return to normal mode. If context is getting heavy from the planning dialogue, start a fresh session:
claude
> Read docs/plans/fct-customer-ltv-plan.md and implement it step by step.
After each step, run dbt compile and dbt test. Commit each completed
step with a descriptive message.
3.2 Work in Small Steps
Don't let the agent build everything at once. A typical progression:
Step 1: Create the model SQL. Claude writes fct_customer_lifetime_value.sql following the plan and the pattern from fct_orders.sql. Run dbt compile to verify syntax. Review the SQL: check join logic, filter conditions, aggregation grain.
Step 2: Run and validate. Run dbt run -s fct_customer_lifetime_value. Check the results against known-good data — compare total revenue to a manual calculation, spot-check a few customers, verify row count matches your expectation for the customer base size.
Step 3: Add tests. Claude generates the YAML schema file with tests: uniqueness on customer_id, not_null on key columns, a singular test verifying that net_ltv = total_revenue - total_refunds for a sample of customers. Run dbt test -s fct_customer_lifetime_value.
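The singular test from Step 3 might look like this — a sketch using the column names from the plan:
-- tests/singular/assert_net_ltv_reconciles.sql
-- Fails if any customer's net_ltv drifts from its components.
select customer_id
from {{ ref('fct_customer_lifetime_value') }}
where net_ltv != total_revenue - total_refunds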
Step 4: Add documentation. Claude generates column descriptions in the YAML file. Review them — replace generic descriptions with business-meaningful ones.
Step 5: Commit. Atomic commit with a descriptive message.
3.3 Validate Against Known-Good Data
After tests pass, compare against an independent source of truth. For this example: does total revenue match the Stripe dashboard? Do 3–5 individual customer LTVs look correct? Does the row count match expectations? If there's no baseline, a few manual spot checks are enough to catch the most common AI SQL errors.
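A spot-check query along these lines works for the revenue comparison — the `dev` schema name is a placeholder for your development target:
-- Independent recomputation from staging, compared against the mart total.
select
    (select sum(amount) from dev.stg_stripe__charges where status = 'succeeded')
      - (select coalesce(sum(amount), 0) from dev.stg_stripe__refunds) as expected_net,
    (select sum(net_ltv) from dev.fct_customer_lifetime_value) as mart_net;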
Step 4 — Review and Finalize
4.1 Run In-Session Review
Before creating a PR: ask the agent to review for correctness, style, and potential issues. For complex models, add a performance pass: "Check for full table scans, missing cluster key utilisation, and inefficient join patterns for Snowflake." Once you settle on a review prompt that works, save it as .claude/commands/code-review.md for the team.
4.2 Run the Full Test Suite
dbt test
Run the full suite, not just the new model's tests — regressions in shared macros or staging models are a common AI pitfall.
4.3 Create the Pull Request and Address Feedback
> Create a pull request for the customer LTV model. Reference the
plan in docs/plans/. Include a summary of business rules implemented
and how the output was validated.
A good DE PR description: what the model does (business purpose), how it was validated, and any assumptions or limitations. Address review comments within Claude Code: > Read the review comments on PR #42 and address each one. Run dbt test after each fix.
Step 5 — Evolve Your Workflow
Custom slash commands for recurring patterns. Create commands in .claude/commands/ for tasks you repeat frequently — a staging model generator, a documentation auditor, a migration validator. Each command encodes your project's exact conventions so the output is consistent from the first run. For example, a /new-staging-model command would generate SQL, YAML, and tests in one pass, invoked with /new-staging-model raw.stripe.disputes.
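A sketch of what .claude/commands/new-staging-model.md could contain — $ARGUMENTS is Claude Code's placeholder for whatever follows the command name:
Create a staging model for the source table $ARGUMENTS.
Follow the pattern in our existing staging models: rename to snake_case,
cast timestamps to TIMESTAMP_NTZ in UTC, convert money from cents to
DECIMAL(12,2), and add the standard metadata columns. Generate the SQL
file and the YAML schema entry with unique and not_null tests on the
primary key, then run `dbt compile` to verify.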
Learn from mistakes. After a session where the agent struggled, open a fresh session and ask it to analyse what went wrong and update CLAUDE.md or create a skill to prevent the same mistake.
Scale to the team. Check CLAUDE.md and custom commands into the repo so everyone benefits. When reviewing AI-assisted PRs, check whether the AI was given adequate context — a model that uses the wrong source table usually means the CLAUDE.md is missing a key convention, not that the author was careless.
Appendix
What follows is supplementary reference material — not part of the playbook itself. Use these templates, prompt libraries, and tool directories as a lookup when you need them.
A. Model Selection for DE Tasks
📖 Reference: For model selection strategy, the quality-vs-price scatter plot approach, and provider comparison — see the AI Common Playbook, 2. Cost & Efficiency.
| Task Type | Optimise For | Recommended Tier | Examples |
|---|---|---|---|
| Complex migration analysis, cross-system lineage, architecture review, multi-step debugging | Quality and depth — cost is secondary | Frontier | Claude Opus, GPT-5.x, Gemini 3.x Pro |
| dbt model generation, SQL refactoring, Airflow DAG scaffolding, quality rule generation, code review | Quality-to-cost ratio | Mid-tier | Claude Sonnet, GPT-5.x-mini, Gemini 3.x Flash |
| YAML generation, documentation, column descriptions, simple SQL formatting, boilerplate | Speed and price — quality threshold, not maximum | Lightweight | Claude Haiku, GPT-5.x-nano, Gemini 3.x Flash Lite |
For most daily data engineering work, a mid-tier model is the right choice. Reserve frontier models for tasks where depth and reasoning matter — complex planning, migration analysis, or debugging subtle data issues across multiple models.
B. Context File Templates (CLAUDE.md / AGENTS.md)
Copy, paste, and adapt to your project. Delete what doesn't apply, add what's missing. Aim for under 100 lines — every line costs tokens on every session. The filename depends on your agent: CLAUDE.md for Claude Code, AGENTS.md for the cross-tool convention, .aider.conf.yml for Aider.
📖 Reference: For detailed reference table and loading behavior — see the AI Common Playbook, Project Instruction Files.
dbt + Snowflake project
# [Project Name]
## Stack
- dbt Core [version], Snowflake adapter
- Snowflake [edition], database: [DATABASE_NAME]
- Airflow [version] for orchestration
- SQLFluff for linting (Snowflake dialect)
- Python 3.11 for macros and scripts
## Project Structure
- models/staging/[source]/ — one subdirectory per source, prefix `stg_`
- models/intermediate/ — shared transformations, prefix `int_`
- models/marts/ — business-facing models, prefix `fct_` or `dim_`
- tests/singular/ — custom business rule tests
- macros/ — project-specific Jinja macros
- docs/schema-context.md — key table descriptions and grain
- docs/business-rules.md — metric definitions and edge cases
## SQL Conventions
- CTEs over subqueries, always
- One model per file, model name matches filename
- snake_case for all identifiers
- Explicit column lists (no SELECT *)
- COALESCE for NULL handling
- Timestamps: TIMESTAMP_NTZ in UTC, suffix `_at`
- Booleans: prefix `is_` or `has_`
- Money: DECIMAL(12,2), convert from cents at staging layer
- Dates: DATE type, suffix `_date`
- IDs: VARCHAR, suffix `_id`
## dbt Conventions
- Incremental models: merge strategy, define `unique_key`
- All models: schema YAML with unique + not_null on PK at minimum
- Staging: rename, cast, filter test data — no business logic
- Sources: defined in `_sources.yml` per source system
- Freshness: defined for all sources, warn 12h / error 24h default
- Tags: `daily`, `hourly`, or `manual` on every model
## Testing
- Generic tests on every model PK (unique, not_null)
- Relationship tests for all foreign keys
- Singular tests for business rules (see tests/singular/)
- Run: `dbt test -s <model>` after each model change
- Run: `dbt test` before opening a PR
## Key Commands
- `dbt compile` — validate SQL
- `dbt run -s <model>+` — run model and downstream
- `dbt test -s <model>` — test specific model
- `dbt run && dbt test` — full build + test
- `sqlfluff lint models/` — lint all
- `sqlfluff fix <file>` — auto-fix one file
## What NOT to Do
- No SELECT * in any model
- No hardcoded dates or IDs
- No new packages without team discussion
- No direct references to raw tables from marts — always go through staging
Airflow project
# [Project Name] Airflow
## Stack
- Apache Airflow [version] on [infrastructure: MWAA / Composer / self-hosted]
- Python 3.11
- Snowflake as primary data warehouse
- dbt triggered via BashOperator / dbt Cloud API operator
## Project Structure
- dags/ — one file per DAG, named `dag_[domain]_[purpose].py`
- dags/common/ — shared utilities, operators, sensors
- plugins/ — custom operators and hooks
- tests/ — DAG validation and unit tests
- config/ — environment-specific configuration
## DAG Conventions
- DAG ID matches filename (without `dag_` prefix)
- Schedule: use cron expressions, never `@daily` shorthand
- Timezone: all schedules in UTC
- Default args: retries=2, retry_delay=300s, email_on_failure=True
- Owner: team name, not individual
- Tags: domain tag + frequency tag on every DAG
- Catchup: False unless explicitly needed for backfill
## Operator Preferences
- SnowflakeOperator: use the project wrapper in `common/operators.py`
- BashOperator for dbt CLI commands
- PythonOperator only when Snowflake/Bash won't work
- No SubDagOperator — use TaskGroup instead
## Connections and Variables
- Connection names: `[system]_[environment]` (e.g., `snowflake_prod`)
- Variables: `[domain]_[name]` (e.g., `payments_api_batch_size`)
- Never hardcode credentials, connection strings, or environment-specific values
## Error Handling
- All DAGs must have `on_failure_callback` pointing to Slack alerting
- Critical DAGs: add SLA miss detection
- Failed tasks: alert to #data-alerts channel
## Testing
- DAG validation: `python -m pytest tests/test_dag_integrity.py`
- Task unit tests for custom operators
- Run all tests before opening a PR
## Key Commands
- `airflow dags test [dag_id] [date]` — test a DAG locally
- `python -m pytest tests/` — run all tests
- `ruff check dags/` — lint all DAG files
C. Prompt Library
Tested prompts for common data engineering tasks. Each includes the prompt, required context, and what to watch for in the output.
Generate a staging model
Create a staging model `stg_[source]__[table]` from `raw.[schema].[table]`.
Follow the pattern in [existing staging model path].
Requirements: [rename rules, type casts, filters, metadata columns].
Generate the YAML schema file with appropriate tests.
Context: Source DDL or schema context, existing staging model as pattern.
Watch for: Missing type casts, wrong timestamp handling, missing _loaded_at.
Translate SQL dialect
Translate this [source dialect] SQL to [target dialect].
Preserve exact semantics — same results, same edge case behaviour.
Flag any constructs where the translation might change behaviour
(NULL handling, implicit casting, date arithmetic).
[paste SQL]
Context: Source SQL, any known dialect-specific gotchas.
Watch for: Semantic differences. Run source-to-target reconciliation on every translated query.
Generate business rule tests
Generate dbt tests for `[model_name]` covering these business rules:
1. [Rule 1 — be specific about expected behaviour]
2. [Rule 2]
3. [Rule 3]
Use generic tests where possible. Write singular tests for rules
that can't be expressed generically. Follow the patterns in tests/singular/.
Context: Model SQL, business rules documentation, existing test examples.
Watch for: Tests that verify structure but not behaviour. Intentionally break the model and verify tests catch it.
Test coverage gap analysis
Analyse this dbt project for test coverage gaps.
For each model, report: number of generic tests, number of singular tests,
whether relationships tests exist for foreign keys, whether source
freshness is defined. Flag models with no business rule tests.
Prioritise gaps by model criticality (number of downstream dependents).
Context: Full dbt project access.
Watch for: The output is a gap list, not code. You decide priorities.
Project onboarding summary
I'm new to this project. Analyse the dbt project and provide:
1. Overall structure: sources, staging, intermediate, mart counts
2. Key mart models (most downstream dependents)
3. Testing strategy: what's tested, what patterns are used, major gaps
4. Documentation status: what's documented, what isn't
5. Key data flows: how does raw data become the main mart models?
Context: Full project access, CLAUDE.md, schema context.
Watch for: Missing institutional knowledge. Supplement with human onboarding.
Generate reconciliation queries
Generate source-to-target reconciliation queries for the migrated
[object_name]. Compare:
- Row counts
- Key aggregate values: [list specific metrics]
- Sample row-level comparison on [key columns]
Source: [source system/dialect], Target: [target system/dialect]
Context: Source and target schemas, key columns, business-critical metrics.
Watch for: Missing edge case validations. Row counts can match while data values differ.
Pipeline failure diagnosis
This dbt model failed during the nightly run. Here's the error:
[paste error]
And here's the model SQL:
[paste SQL or reference model path]
The model ran successfully yesterday. What likely changed?
Check for schema changes, data type issues, NULL handling,
or upstream data quality problems.
Context: Error output, model SQL, any recent changes to upstream models.
Watch for: AI may suggest the most common cause, not the actual cause. Verify its hypothesis before applying the fix.
D. Notebook / Jupyter Development
This appendix supplements the main Quick Start for data engineers working primarily in Databricks or Jupyter rather than a file-based dbt project. The thinking process is the same — plan, implement in small steps, validate — but the mechanics differ. This section covers where notebook environments diverge; the main Quick Start has the full workflow.
Environment setup
- Databricks — Databricks Assistant is built in and has Unity Catalog access. It already knows your table schemas. For structured work that needs review and version control, use Databricks-Git integration (Repos) to sync notebooks to a repo, then apply the main Quick Start workflow for the production version.
- Jupyter (local or JupyterHub) — install `jupyter-ai` for in-notebook AI, or use a side-window chat (Claude.ai, ChatGPT) with manual copy-paste. VS Code with the Jupyter extension gives you Copilot and agent mode natively in `.ipynb` files.
- Google Colab — Gemini is integrated by default.
The workflow
Load context first (paste schema summary or CLAUDE.md content into the session), ask for a plan before writing code, implement and validate cell by cell. Each cell is a checkpoint — if output looks wrong, stop and investigate before moving on.
Promote to production
Notebook logic should graduate to version-controlled files before going to production. Once validated, use AI to translate, e.g.:
Convert the transformation logic in this notebook into a dbt model
following our project conventions (see CLAUDE.md). Name it
`[model_name]`, include tests and documentation, handle incremental loading.
E. Data Anonymisation Libraries
When you need realistic but safe data for debugging, testing, or AI context, these libraries cover most scenarios.
Fake data generation
- Faker (Python) — generate realistic names, emails, addresses, dates, IDs, and custom providers. The standard choice for seeding test datasets and replacing PII in sample data.
- Mimesis — faster than Faker for bulk generation; useful for large synthetic datasets with locale-specific data.
PII detection and anonymisation
- Microsoft Presidio — detect and anonymise PII in text and structured data. Supports custom recognisers, configurable anonymisation operators (redact, replace, hash, encrypt), and integrates with pandas DataFrames. Practical for automating sanitisation of query results or log outputs before sharing with AI tools.
Synthetic tabular data
- SDV — Synthetic Data Vault — generate synthetic tabular data that preserves statistical properties, distributions, and referential integrity across related tables. Useful when you need sample data that behaves like production (realistic distributions, correlated columns, foreign key relationships) rather than just structurally valid rows.
In-warehouse anonymisation
Most data platforms also provide native functions for hashing, masking, and tokenisation — Snowflake's SHA2/MD5, BigQuery's AEAD encryption functions, and Databricks Unity Catalog's column masking policies. These are preferable when the data never needs to leave the warehouse.
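For example, deterministic hashing removes the raw values while keeping joins intact — `customers` is a placeholder table:
-- Snowflake: the same input always yields the same hash, so foreign keys still line up.
select sha2(email, 256) as email_hash, created_at
from customers;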
F. Skills and MCP Servers from Official Providers
New official integrations are shipping monthly — most major data platforms now have or are building MCP servers, and the skills ecosystem is growing faster than any static list can track. Treat this as a starting point: check the provider's official docs and GitHub for the current state before setting anything up. The VoltAgent/awesome-agent-skills and mcpservers.org directories are useful for finding integrations not listed here.
See Section 3.2 for guidance on trusting MCP servers and skills before installing.
📖 Reference: For the agent skills standard and what well-structured skills provide — see AI Common Playbook, Agent Skills Standard.
Skills
| Name | Provider | Capabilities |
|---|---|---|
| data-engineering-skills | Altimate AI | Skills for dbt and Snowflake: model creation, query optimisation, test generation. Benchmarked on real-world dbt tasks (ADE-bench). |
| astronomer/agents — skills | Astronomer | Airflow-specific skills covering DAG patterns, testing strategies, and debugging workflows. Works with Claude Code, Cursor, Copilot, and 25+ other agents. |
| airbytehq/skills | Airbyte | Skills giving agents access to 21+ third-party APIs (Salesforce, HubSpot, Stripe, Jira, Slack, and more) for connector and ingestion workflows. |
MCP Servers
| Name | Provider | Capabilities |
|---|---|---|
| dbt MCP Server | dbt Labs | Exposes dbt project metadata: lineage, model definitions, semantic layer metrics, and test results. Platform-agnostic — works with any supported warehouse. |
| astro-airflow-mcp | Astronomer | MCP server for Airflow: DAG status, task logs, trigger and pause/unpause, variable and connection management. Works with open-source Airflow 2.x and 3.x. |
| Databricks managed MCP | Databricks | Managed servers for Genie Space (NL-to-SQL over curated tables), Databricks SQL (query execution), Vector Search, and Unity Catalog Functions. Custom self-hosted servers also supported. |
| DataHub MCP | Acryl Data | Data catalog access: dataset discovery, lineage, usage patterns, and NL-to-SQL generation grounded in governed metadata. |
| Airbyte knowledge MCP | Airbyte | Connector knowledge base for AI agents. Separate PyAirbyte MCP manages connectors and sync operations across hundreds of sources without custom integration code. |
| BigQuery MCP | Google Cloud | Fully managed remote MCP server for BigQuery: SQL execution, dataset analysis, and reporting. Auto-enabled with BigQuery. |
This playbook evolves alongside the tools and practices it covers. Contributions, corrections, and suggestions are welcome.
Core Principles for Working with AI: Read more