AI Common Playbook
Universal guidance for working with AI effectively — for all roles: backend, frontend, DevOps, UX, QA, and beyond.
Core Principles: All recommendations in this playbook align with Xebia's official Core Principles for Working with AI. Refer to that document for the foundational rules that govern every AI interaction at Xebia.
1. AI Models and Providers
Not every model excels at every job. Using a frontier reasoning model for a simple text classification task wastes both time and money — matching capability to task is an engineering decision worth getting right.
Core Principle reminder: Choose the Right Model for the Task — Don't chase every new release, but periodically research which models are current best-in-class for your use cases. Core Principles for Working with AI
How to Compare and Evaluate AI Models
The AI model market moves fast — new releases weekly, bold benchmark claims daily. Artificial Analysis offers independent, vendor-neutral evaluations across all major providers. Use it before making any model decision.
The site measures every model across three dimensions that map directly to engineering trade-offs:
- Intelligence — A composite score built from multiple benchmarks. Don't stop at the headline number. Switch to the Coding Index (code generation, completion, debugging) and Agentic Index (multi-step tool use, self-correction, planning) tabs — these are far more relevant to development work than the general ranking. A model ranked #4 overall might rank #1 for coding.
- Speed — Output tokens per second across API providers. Critical for agentic workflows and interactive assistants: latency compounds across tool calls. Check this whenever you're building anything that chains multiple model calls.
- Price — USD per 1M tokens (input and output shown separately). At CI/CD or high-volume scale, a 5x price difference is an architectural decision, not a detail.
How to get the most out of the site:
Don't treat the default leaderboard ranking as a universal truth. The most useful view is the Quality vs. Price scatter plot — it shows you the "efficient frontier," the cluster of models that deliver near-top performance at a fraction of the frontier cost. For most development tasks, the right model lives there, not at the very top.
On individual model pages, check the provider comparison table: the same model (e.g., Gemini 3.1 Pro) is often served by multiple API providers at meaningfully different prices and latency profiles. When you integrate a model into a product, this table tells you which provider to route through.
Model Selection Strategy
Match capability to the task — overkill wastes money, underpowered models waste time and quality.
Start with the right index, not the headline score. For coding tasks, use the Coding Index. For multi-step agents, use the Agentic Index. A model that ranks #5 overall may outperform everything else on the task you actually care about.
Match model tier to task complexity:
| Task type | What to optimize for | Where to look |
|---|---|---|
| Architecture analysis, security review, complex refactoring, long-context reasoning | Quality and depth — cost is secondary | Frontier models at the top of the Coding/Agentic Index |
| Feature implementation, code review, debugging, test generation | Quality-to-cost ratio | Models near the efficient frontier on the scatter plot |
| CI/CD automation, inline autocomplete, classification, short-context generation | Speed and price — quality threshold, not maximum | Lightweight models; check Speed and Price tabs |
Practical team workflow:
- Identify your top use cases and open the relevant Artificial Analysis index (Coding, Agentic, or general Intelligence)
- Shortlist two or three models clustering near the efficient frontier on the scatter plot
- Run a quick eval on representative samples from your actual codebase or tasks
- Record your findings in a shared decision log — include the date, since rankings shift
- Revisit quarterly or when a major new model drops
A model that is 5x cheaper and only marginally less capable for your specific task is almost always the correct call at volume. Use data to make that argument, not intuition.
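That argument is easy to make concrete. A minimal sketch, using entirely illustrative prices, eval scores, and volume (substitute your own measurements):

```python
# All numbers below are illustrative placeholders -- substitute your own
# eval scores, prices, and monthly volume.

def monthly_cost(tokens_per_month_m: float, price_per_m: float) -> float:
    """USD per month, given volume in millions of tokens."""
    return tokens_per_month_m * price_per_m

frontier  = {"eval_score": 0.92, "price_per_m": 15.00}  # $/1M tokens (blended)
efficient = {"eval_score": 0.89, "price_per_m": 3.00}

volume_m = 500  # assumed 500M tokens/month across the team

saving = (monthly_cost(volume_m, frontier["price_per_m"])
          - monthly_cost(volume_m, efficient["price_per_m"]))
gap = frontier["eval_score"] - efficient["eval_score"]
print(f"~${saving:,.0f}/month saved for a {gap:.0%} eval-score gap")
```

A number like this, paired with a task-specific eval result, is a far stronger basis for a model decision than a leaderboard position.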
AI Assistants — Use Cases Beyond Coding
Research and analysis — Quickly survey a technology, compare frameworks, or understand a new domain. Feed documentation and ask the model to summarize trade-offs.
Explaining difficult concepts — Use AI to break down complex topics (distributed consensus, event sourcing, Kubernetes networking) for different audiences. Specify the expertise level of your target audience for best results.
Consultation and design review — Share your architecture sketch, API design, or database schema and ask the model to critique it. Provide your constraints and standards as context.
Client interview preparation — Brush up on knowledge for the role, summarize a client's domain, generate likely technical questions.
Pair learning — Use AI as a study partner to learn difficult topics, explain concepts from multiple angles.
Official Prompt Engineering Guides
Every major AI provider publishes prompt engineering documentation — these are the key references:
| Provider | Guide | Focus |
|---|---|---|
| Anthropic | Prompting best practices | Comprehensive reference for Claude models: clarity, examples, XML structuring, thinking, agentic systems |
| Anthropic | Effective context engineering for AI agents | Context management strategies for building reliable agents — the evolution beyond prompt engineering |
| OpenAI | Prompt engineering guide | Strategies for GPT and reasoning models, including agentic and coding-specific patterns |
| Google | Prompt design strategies (Gemini) | Zero-shot, few-shot, system instructions, and multimodal prompting for Gemini models |

| Microsoft / Azure | Prompt engineering techniques (Azure OpenAI) | Practical prompt construction for enterprise Azure deployments |
Why this matters: The Core Principles document (principles 1-9) covers what to do. These official guides from the model providers cover exactly how to do it with their specific models. Techniques that work well with one model family may need adaptation for another.
Agent Skills Standard
Without a shared format for agent instructions, every team writes its own ad-hoc prompts for things like code review or test generation — the results vary between people, stay locked to one tool, and are hard to share across projects. The Agent Skills open standard solves this by defining a portable format for packaging reusable agent capabilities as skills.
A skill is a folder with a required SKILL.md file (the instructions the agent follows) plus optional scripts, references, and assets. For example, a "code-review" skill might contain review criteria, a checklist of common issues to flag, and a script that collects the diff — any agent that supports the standard can pick it up and run it the same way. Install skills per-user or per-project to share tested workflows across teams.
- Standard website: agentskills.io
- Anthropic implementation: Claude Agent Skills | Skill authoring best practices
- OpenAI implementation: Codex Skills (built into the Codex app and CLI)
2. Cost & Efficiency
Every AI interaction has a price tag. At team and client scale, knowing where the money goes matters more than individual awareness.
Subscription vs. API Pricing
Subscriptions (ChatGPT Plus, Claude Pro, Copilot Pro, Cursor Pro) charge a flat monthly fee with usage caps. Best for individual daily productivity. When evaluating plans, check: message/request limits and overage costs, which models are included, privacy guarantees (free tiers may train on your data), IP indemnity, and team management features. Client work requires business-tier plans for compliance — check current pricing on each provider's page.
API / pay-per-use charges per token (≈ 4 characters), with output tokens typically 3-5x more expensive than input. Cost per call: (input_tokens / 1M × input_price) + (output_tokens / 1M × output_price). Watch for hidden multipliers: oversized context (some providers double rates above 200K tokens), reasoning/thinking tokens (billed as output even when invisible), agentic loops (each cycle is a full API call), and verbose output.
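The cost formula translates directly into a helper function. A minimal sketch; the prices shown are placeholders, not current rates:

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """USD cost of one API call; prices are in USD per 1M tokens."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)

# Placeholder prices ($3/1M input, $15/1M output) -- check your provider's page:
print(f"${cost_per_call(2_000, 1_000, 3.00, 15.00):.4f} per call")
```

Note how the output side dominates even at a modest 3x token count: this is why reasoning tokens and verbose output matter so much.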
Cost Reduction Levers
| Lever | Impact |
|---|---|
| Model tiering | Highest impact. Frontier models for complex tasks only; mid-tier for daily work; lightweight for autocomplete, classification, simple generation |
| Prompt caching | ~90% discount on repeated context blocks (system prompts, project instructions, RAG knowledge bases) |
| Batch API | 50% discount for non-urgent workloads (nightly analysis, bulk generation) with 24-hour turnaround |
| Effort controls | Dial down reasoning depth per request — a frontier model at reduced effort can match mid-tier cost |
| Context window management | In agentic sessions, cost per message grows linearly with conversation length. Clear context between tasks, compact it proactively, delegate verbose operations to subagents |
| Prompt engineering | Shorter prompts, structured output formats, referencing instruction files instead of re-pasting context |
For teams: estimate token consumption before integrating any API-based workflow, set spending alerts on every provider dashboard, log AI costs per project for ROI tracking, and revisit model choices quarterly as pricing trends consistently downward.
Agentic Session Economics
A single agentic coding session is not a single API call — it is a loop of dozens to hundreds of calls, each carrying the full (and growing) conversation context. This cost multiplier matters most when you adopt agentic workflows at scale.
Subscription vs. API — the first decision:
For individual developers, a subscription plan is almost always more cost-effective than API billing. Agentic sessions consume far more tokens than single-call pricing suggests — every turn re-sends the growing conversation context, and active developers burn through tokens quickly. Most providers offer tiered subscription plans that absorb this cost at a flat rate, making them significantly cheaper than pay-per-token billing for daily agentic work.
Rule of thumb: If you use an agentic coding tool daily, compare subscription tiers against your estimated API spend — subscriptions typically offer a 5-25x cost advantage for active users. Reserve API billing for CI/CD automation, programmatic integrations, and team-wide deployments where you need fine-grained cost control and spending limits.
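The break-even point is simple to estimate. A sketch with assumed numbers (the plan price, blended token rate, and session size are all placeholders):

```python
# All figures are placeholders: check your provider's current plan pricing.

SUBSCRIPTION_USD_PER_MONTH = 100.0   # assumed flat subscription price
API_BLENDED_PRICE_PER_M = 5.0        # assumed $/1M tokens, input+output blended
TOKENS_PER_SESSION_M = 0.5           # ~500K tokens per agentic session

def api_cost(sessions_per_month: int) -> float:
    """What the same usage would cost on pay-per-token API billing."""
    return sessions_per_month * TOKENS_PER_SESSION_M * API_BLENDED_PRICE_PER_M

break_even = SUBSCRIPTION_USD_PER_MONTH / (
    TOKENS_PER_SESSION_M * API_BLENDED_PRICE_PER_M)
print(f"API billing overtakes the subscription at ~{break_even:.0f} sessions/month")
```

With these assumptions, a developer running a couple of agentic sessions per working day is already past break-even on API billing.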
Important caveat: Subscription plans typically have usage caps that reset periodically. During sustained heavy use (multi-agent teams, large refactors), you may hit rate limits and experience throttling. API billing has no such caps — only the rate limits you configure. For teams running automated pipelines or needing guaranteed throughput, API billing may still be the right choice despite higher per-token cost.
Check current pricing and plan details on official pages: Anthropic Claude · OpenAI · GitHub Copilot · Cursor · Google Gemini
How costs compound in agentic sessions:
Not every AI interaction costs the same. The gap between a quick question and a full agentic workflow is significant — knowing where your work falls on this spectrum drives your AI spend.
- Single API call — one request, one response. The baseline: you send a prompt, get an answer, done. Most pricing pages show this baseline, which barely reflects what agentic work actually costs.
- Interactive chat session — a multi-turn conversation where you and the model go back and forth (think ChatGPT or Claude web UI). Each turn re-sends the growing history, so costs accelerate as the conversation lengthens.
- Agentic coding session — the model works semi-autonomously: reading files, writing code, running tests, interpreting results, and looping back. Dozens to hundreds of API calls happen under the hood, each carrying the full conversation context plus tool outputs.
- Multi-agent team — several agents run in parallel (e.g., one plans, one implements, one reviews), each maintaining its own context window. Token consumption multiplies across agents, not just across turns.
| Interaction type | Typical token consumption | Cost multiplier vs. single call |
|---|---|---|
| Single API call | ~2K input + ~1K output | 1x |
| Interactive chat session (10-20 turns) | ~50K-200K input + ~10K-50K output | 25-75x |
| Agentic coding session (file reads, edits, test runs) | ~200K-500K input + ~50K-150K output | 100-400x |
| Multi-agent team (3-5 parallel agents) | ~1M-3M input + ~200K-500K output | 500-1500x |
Multi-agent teams consume roughly 7x more tokens than standard sessions because each agent maintains its own context window. At team scale, these costs become an architectural decision, not a rounding error. Use the pricing comparison tools listed below to estimate actual costs for your model and provider.
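To turn the multipliers above into dollar figures, plug the table's token ranges into placeholder prices. The midpoint figures and the $3/$15 rates below are assumptions, not quotes:

```python
# Placeholder prices: $3/1M input, $15/1M output. Token figures are rough
# midpoints of the ranges in the table above.

IN_PRICE, OUT_PRICE = 3.0, 15.0  # USD per 1M tokens (assumed)

def session_cost(input_m: float, output_m: float) -> float:
    """USD cost given input/output volume in millions of tokens."""
    return input_m * IN_PRICE + output_m * OUT_PRICE

scenarios = {
    "single API call":   (0.002, 0.001),
    "chat session":      (0.125, 0.030),
    "agentic session":   (0.350, 0.100),
    "multi-agent team":  (2.000, 0.350),
}
for name, (inp, out) in scenarios.items():
    print(f"{name:17s} ~${session_cost(inp, out):.2f}")
```

Run once with your own model's prices: the absolute figures change, but the shape of the curve (cents, then dollars, then tens of dollars) stays the same.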
Model tiering within a session:
You do not need to use the same model for every step. The most cost-effective pattern is to tier models by task complexity within a single workflow:
| Task phase | Recommended model tier* | Why | Relative cost |
|---|---|---|---|
| Architecture analysis, complex planning, multi-step reasoning | Frontier (e.g., Claude Opus, OpenAI GPT-5.x, Gemini 3.x Pro) | Highest intelligence — worth the premium for decisions that shape the entire feature | $$$ |
| Feature implementation, code generation, debugging, refactoring | Mid-tier (e.g., Claude Sonnet, OpenAI GPT-5.x-mini, Gemini 3.x Flash) | Best quality-to-cost ratio for the bulk of coding work | $$ |
| Subagent tasks, linting, simple lookups, classification | Lightweight (e.g., Claude Haiku, OpenAI GPT-5.x-nano, Gemini 3.x Flash Lite) | Fast and cheap — ideal for delegated operations where speed matters more than depth | $ |
* Specific model names change frequently — check your provider's current lineup for the latest options.
Most agentic coding tools let you switch models mid-session or set defaults per task type. In API-based workflows, route different pipeline stages to different models programmatically.
Rule of thumb: Plan with a frontier model, implement with mid-tier, delegate simple tasks to lightweight. This pattern can reduce session costs by 40-60% compared to using the frontier model for everything.
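The rule of thumb can be sanity-checked with a back-of-envelope split. The tier prices and the token split across phases below are assumptions chosen for illustration:

```python
# Back-of-envelope check of the tiering rule of thumb. The tier prices and
# the token split across phases are assumptions, not measured values.

TIER_PRICE = {"frontier": 15.0, "mid": 3.0, "light": 0.5}  # $/1M tokens

SPLIT = {"frontier": 0.3,   # planning and architecture
         "mid":      0.6,   # implementation
         "light":    0.1}   # delegated subagent tasks

tiered = sum(SPLIT[t] * TIER_PRICE[t] for t in TIER_PRICE)
all_frontier = TIER_PRICE["frontier"]  # every token at frontier price
print(f"Tiered spend is {tiered / all_frontier:.0%} of all-frontier, "
      f"a {1 - tiered / all_frontier:.0%} saving")
```

With this split the saving lands near the upper end of the quoted 40-60% range; a more planning-heavy workflow lands lower.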
Context window is your primary cost driver:
In agentic sessions, every message you send includes the entire conversation history — all previous turns, file contents, tool outputs, and thinking tokens. This means cost per message grows linearly with conversation length. A message near the end of a long session can cost 10-25x more than the same message at the start, simply because of accumulated context.
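The arithmetic behind this is worth seeing once: per-turn input grows linearly, so cumulative input over a session grows quadratically. A sketch with an assumed per-turn growth rate:

```python
# Assumed growth: each turn adds ~2K new tokens (your message, the reply,
# tool output) to a history that is re-sent in full on every turn.

TOKENS_PER_TURN = 2_000
INPUT_PRICE_PER_M = 3.0  # assumed $/1M input tokens, no caching

def turn_input_tokens(turn: int) -> int:
    """Input tokens sent on a given turn: the entire history so far."""
    return turn * TOKENS_PER_TURN

total = sum(turn_input_tokens(t) for t in range(1, 51))  # a 50-turn session
print(f"Turn 1 sends {turn_input_tokens(1):,} tokens; turn 50 sends "
      f"{turn_input_tokens(50):,}")
print(f"Cumulative input: {total:,} tokens "
      f"(~${total / 1_000_000 * INPUT_PRICE_PER_M:.2f} without caching)")
```

Turn 50 costs 50x turn 1, and the session as a whole has re-sent over 2.5M input tokens under these assumptions. Keeping the context lean attacks both numbers.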
Practical strategies to keep context lean:
- `/clear` between tasks — When switching to unrelated work, clear the conversation. Stale context wastes tokens on every subsequent message. Use `/rename` before clearing so you can `/resume` later
- Use `/compact` proactively — When context grows large but you are still mid-task, `/compact` summarizes the conversation while preserving key details. Add focus hints: `/compact Focus on code samples and API usage`
- Auto-compaction is your safety net, not your strategy — Claude Code automatically compacts when approaching context limits, but by that point you have already paid for the bloated context across many messages. Compact earlier, not later
- Delegate verbose operations to subagents — Running a full test suite, fetching documentation, or processing log files can dump thousands of lines into your context. Delegate these to subagents — only a summary returns to your main conversation
- Write specific prompts — "Improve this codebase" triggers broad scanning across many files. "Add input validation to the login function in `auth.ts`" lets Claude work efficiently with minimal file reads
- Move detailed instructions from CLAUDE.md to skills — Your CLAUDE.md is loaded into every session. If it contains detailed instructions for specific workflows (PR reviews, database migrations), those tokens are present even when you are doing unrelated work. Skills load on-demand only when invoked — aim to keep CLAUDE.md under ~500 lines
Prompt caching — the hidden cost saver:
Every agentic turn is a new API call that re-sends the system prompt, tool definitions, CLAUDE.md content, and conversation history. Without caching, you pay full price for this repeated content on every single turn. With caching, repeated content costs 90% less after the first send.
Most providers now offer prompt caching — Anthropic, OpenAI, and Google all support it in some form. Some tools enable it automatically; for API-based workflows, check your provider's documentation.
- How it works under the hood: Caching is prefix-based — the provider matches what you send against what you sent before, starting from the first token. The moment content diverges, everything after that point is billed as new input. This means stable content (system instructions, tool definitions, project files) should sit at the beginning of your prompt, and variable content (latest user message, tool results) at the end. If you build prompts via the API, keep this order — reordering or editing early sections invalidates the cache for everything that follows.
- What gets cached: System prompts, project instruction files, tool definitions, and the stable prefix of your conversation history. A cache hit typically costs a fraction of the standard input price (exact discount varies by provider)
- Cache lifetime: Usually a few minutes. As long as you send messages within this window, cached content stays warm. Some providers offer extended caching at higher write cost for batch workloads
- Cache-friendly habits: Keep project instruction files stable between sessions — even small edits early in the prompt invalidate everything downstream. Use the same system prompts across team members to maximize shared cache hits
- Where caching matters most: In a 50-turn agentic session, the system prompt and tool definitions are sent 50 times. Without caching, you pay full price each time. With caching, the savings compound significantly across long sessions and teams
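For API-based workflows, cache-friendly request ordering looks roughly like this. The sketch uses Anthropic's documented `cache_control` field; the model id and message contents are placeholders, and other providers use different mechanisms:

```python
# Placeholder model id and contents; the cache_control field follows
# Anthropic's documented prompt-caching API. Other providers differ.

request = {
    "model": "claude-sonnet-latest",  # placeholder -- use a real model id
    "system": [
        {
            "type": "text",
            "text": "<long, stable system prompt + project instructions>",
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        },
    ],
    "messages": [
        # Stable prefix: earlier turns, unchanged between calls (cache hits)
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
        # Variable suffix: only this part is billed as fresh input
        {"role": "user", "content": "<latest tool results and question>"},
    ],
}
```

The key design choice is the ordering: everything above the `cache_control` marker must be byte-identical between calls, so edits belong at the end of the message list, never at the top.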
Further reading: Prompt caching (Anthropic docs) | Prompt caching (OpenAI docs) | Context caching (Google docs) | Research paper: Don't Break the Cache: Prompt Caching for Long-Horizon Agentic Tasks
Budgeting agentic workflows for teams:
Agentic coding is not free-form — it requires the same financial discipline as any infrastructure cost. Treat AI token spend as a line item in your project budget.
| Budget lever | How to use it |
|---|---|
| Workspace/project spend limits | Most providers offer spending caps or budget alerts via their console or dashboard. Set these up before enabling agentic workflows — they prevent runaway costs from long-running agents or automation |
| Rate limits per team size | Scale tokens-per-minute allocation per user as the team grows. Fewer users are active concurrently in larger teams, so per-user allocation can decrease |
| Monitor usage actively | Use your tool's built-in usage tracking (session cost reports, token counters, usage dashboards). Review these regularly — a single runaway session can consume a week's budget |
| Reasoning/thinking token budget | Providers bill reasoning tokens (extended thinking, chain-of-thought) as output — the most expensive token type. For simpler tasks, reduce reasoning depth or effort level. Check your provider's docs for how to control this |
| Multi-agent governance | When running multiple agents in parallel, each maintains its own context window. Keep teams small, tasks focused, and clean up idle agents promptly |
Recommended team workflow for cost control:
- Start with a small pilot group (3-5 developers) to establish baseline usage patterns
- Set conservative workspace spend limits based on pilot data
- Log AI costs per project — track alongside other infrastructure costs
- Review monthly and adjust: model mix, context management habits, and automation scope
- Revisit quarterly as model pricing consistently trends downward
Pricing Comparison Resources
AI pricing changes constantly — do not rely on memorized numbers. Bookmark these:
- Artificial Analysis — Quality-vs-price scatter plots across all major models: artificialanalysis.ai
- Price Per Token — Daily-updated pricing for 300+ models with cost calculators: pricepertoken.com
- Helicone LLM Cost — Side-by-side cost comparison for 300+ models: helicone.ai/llm-cost
- Vellum LLM Cost Comparison — Visual cost comparison by input/output size: vellum.ai/llm-cost-comparison
Official pricing pages: Anthropic Claude · OpenAI · Google Gemini · GitHub Copilot · Cursor
3. AI Coding Assistants — Tools and Setup
An inline autocomplete tool and a full agentic system solve fundamentally different problems — picking the right category matters as much as picking the right model.
Categories of AI Coding Tools
| Category | How It Works | Examples | Best For |
|---|---|---|---|
| IDE-integrated autocomplete | Real-time suggestions as you type; tab to accept | GitHub Copilot (inline), Cursor (Tab), JetBrains AI | Fast completions, boilerplate, repetitive patterns |
| IDE chat / edit mode | Conversational coding within your editor; targeted edits | GitHub Copilot Chat, Cursor Ask/Chat, JetBrains AI Chat | Explanations, targeted refactors, Q&A about code |
| IDE agent mode | Autonomous multi-file editing with tool use and self-correction | GitHub Copilot Agent Mode, Cursor Agent, Windsurf, JetBrains Junie | Feature implementation, multi-file refactors, complex tasks |
| Terminal-based agents | CLI tools that understand your repo and execute commands | Claude Code, OpenAI Codex CLI, Google Gemini CLI, Aider | Full project work, git workflows, DevOps tasks, CI/CD |
| Cloud-based agents | Asynchronous agents that work in cloud sandboxes on assigned tasks | GitHub Copilot Coding Agent, OpenAI Codex (cloud), Cursor Cloud Agents | Parallelized work, issue-based delegation, background tasks |
Tool Profiles
| | GitHub Copilot | Cursor | Claude Code | OpenAI Codex |
|---|---|---|---|---|
| Form factor | IDE plugin + CLI + cloud | AI-native IDE (VS Code fork) | Terminal agent + IDE extensions | Cloud + CLI + IDE + desktop app |
| IDE support | VS Code¹, Visual Studio, JetBrains, Eclipse, Xcode, Neovim | Own IDE; JetBrains (Agent Client Protocol) | Terminal; VS Code¹, JetBrains | VS Code¹; desktop app |
| Inline completions | ✓ + Next Edit Suggestions | ✓ + predictive edits | — | — |
| Agent mode | ✓ | ✓ | ✓ | ✓ |
| Cloud / async agents | Coding agent (Issues → PR) | Cloud Agents | GitHub / GitLab CI | Cloud sandboxes; parallel tasks |
| Plan mode | ✓ | ✓ | ✓ | via `$create-plan` skill |
| Project config | `.github/copilot-instructions.md` | `.cursor/rules` | `CLAUDE.md` | `AGENTS.md` |
| MCP | ✓ | ✓ | ✓ | ✓ |
| Models | GPT-5.x, Claude, Gemini | Claude, GPT, Gemini, Cursor | Claude | GPT-5.x-Codex |
| Plans | Free · Pro · Pro+ · Business · Enterprise | Hobby (free) · Pro · Pro+ · Ultra · Business | Pro / Max subscription or API | Plus / Pro / Business / Edu / Enterprise |
¹ Including VS Code forks (Cursor, Windsurf)
GitHub Copilot
GitHub Copilot covers the full development lifecycle:
- Inline suggestions and Next Edit Suggestions (NES) — predicts not just the next line, but the next logical edit location
- Plan mode — analyzes your request, generates a step-by-step implementation plan, and lets you review and refine it before writing any code. Available across VS Code, JetBrains, Eclipse, and Xcode
- Agent mode — autonomously edits multiple files, runs terminal commands, and self-corrects errors
- Coding agent — asynchronous cloud-based agent that works on GitHub Issues, creates branches, and opens pull requests for review
- Code review — AI-generated review suggestions on pull requests
- Copilot CLI — terminal-native coding agent with plan mode, autopilot mode, built-in specialized agents, and MCP support
- Customization system — a layered configuration model (repository instructions, file-scoped rules, reusable prompts, custom agents, skills) for tailoring Copilot to your project
Documentation: docs.github.com/en/copilot | Copilot customization
IDE feature parity varies. VS Code has the most complete Copilot feature set. JetBrains IDEs support the main instructions file, prompt files, custom agents, and skills, but do not support multiple `.instructions.md` files with `applyTo` patterns. Always verify which customization features your IDE supports in the official docs.
Cursor
Cursor is an AI-native IDE (a fork of VS Code) designed around AI workflows:
- Codebase indexing — indexes your entire project so AI can reason across all files, not just the one you have open
- Agent mode — full agentic workflow that plans, edits, runs commands, and iterates
- `.cursorrules` / `.cursor/rules` — project-level configuration files to enforce coding standards, architectural patterns, and team conventions
- Cloud Agents — autonomous agents running in their own cloud VMs with full development environments, computer use, and the ability to test changes and produce artifacts (screenshots, recordings, logs). Launchable from web, mobile, Slack, or GitHub
- Automations — always-on cloud agents that run on a schedule or in response to events
- JetBrains integration — available in IntelliJ, PyCharm, WebStorm and other JetBrains IDEs via Agent Client Protocol (ACP)
- Composer model — Cursor's own frontier model optimized for code editing
- Privacy Mode — configures zero data retention with model providers when enabled
Documentation: docs.cursor.com
Claude Code
Claude Code is Anthropic's agentic coding tool that runs in your terminal. It understands your codebase, edits files, executes commands, and manages git workflows through natural language:
- Terminal-native — lives in your terminal, no IDE dependency required
- CLAUDE.md — project-level documentation files that teach Claude about your codebase, conventions, and architecture. The better these are maintained, the better the results
- Plan mode — a read-only research and planning phase (`Shift+Tab` × 2) where Claude analyzes the codebase, asks clarifying questions, and produces a structured plan before writing any code
- Agentic workflow — reads files, runs tests, commits code, creates PRs
- Multi-turn conversations — maintains context across a session for iterative work
- Sub-agents — can spawn focused sub-tasks in their own context
- Agent teams — coordinate multiple Claude Code instances working in parallel. One session acts as team lead, assigning tasks and synthesizing results, while teammates work independently in their own context windows and communicate directly with each other. Best suited for parallel code review, cross-layer changes, or debugging with competing hypotheses
- Multi-IDE integration — beyond the terminal, Claude Code offers official extensions for VS Code (and forks like Cursor) and JetBrains IDEs, with interactive diff viewing, selection context sharing, and diagnostic integration
- GitHub integration — tag `@claude` on GitHub issues and PRs for automated code review and implementation
- MCP support — extend with Model Context Protocol servers for additional tool access
Documentation: code.claude.com/docs
OpenAI Codex
OpenAI Codex is a software engineering agent platform available as a cloud agent, CLI tool, and IDE extension:
- Cloud agent — runs tasks in isolated cloud sandboxes, works on multiple tasks in parallel, and proposes pull requests
- Codex CLI — lightweight local agent for terminal-based coding workflows
- Desktop app — dedicated interface for managing multiple agents in parallel and collaborating on long-running tasks
- Skills — extensible task bundles that combine instructions, scripts, and resources (e.g., `$skill-creator`, `$create-plan`)
- AGENTS.md support — respects project-level agent instructions
Documentation: developers.openai.com/codex | GPT-5 prompting guide (OpenAI)
Other Tools
Beyond the tools profiled here, you will find IDE-integrated assistants, terminal/CLI-based agents, cloud provider-native tools (AWS, Google, Azure), and open-source multi-model agents. When evaluating any tool not listed here, the Use Approved Tools Only principle applies — check data handling policies, licensing, and get organizational sign-off first.
Project Instruction Files
Every major coding assistant uses a project-level instruction file — a document the agent loads automatically at the start of each session. These files tell the agent what the project is about, how it is structured, and what conventions to follow.
| Tool | Main instruction file | Secondary / scoped files |
|---|---|---|
| Claude Code | `CLAUDE.md` | Sub-directory `CLAUDE.md` files, `.claude/commands/`, Agent Skills |
| GitHub Copilot | `.github/copilot-instructions.md` | `.github/instructions/*.instructions.md` (with `applyTo` patterns), prompt files, custom agents, skills |
| Cursor | `.cursorrules` or `.cursor/rules/` | Multiple rule files in `.cursor/rules/` with glob patterns |
| OpenAI Codex | `AGENTS.md` | Sub-directory `AGENTS.md` files, Skills |
Auto-loaded entry files must stay concise. Because these are always in context, keep them short and focused: tech stack, key commands, critical conventions, and references to detailed docs. Put granular rules in secondary files that load only when relevant. Always verify how your specific assistant handles instruction loading — for example, Copilot's `.instructions.md` files require an `applyTo` pattern to activate, and not all features work the same across IDEs.
Model Context Protocol (MCP)
Coding assistants become more useful when they reach beyond your editor into issue trackers, databases, documentation wikis, and monitoring dashboards. The Model Context Protocol (MCP) is the open standard (created by Anthropic in 2024, now governed by the Linux Foundation) that makes this possible without bespoke integrations. You write one MCP server for Jira, and it works with every MCP-compatible coding assistant — no per-tool, per-agent glue code.
How it works: Your coding assistant runs an MCP client; each external integration (a database, Jira, a browser, Sentry) runs as an MCP server exposing tools over the protocol. The assistant discovers available tools at runtime and calls them as needed during coding workflows — querying a database schema while generating a migration, pulling a Jira ticket description into context for implementation, or checking Sentry logs while debugging.
Common MCP servers for development teams:
| Server | What It Provides |
|---|---|
| Filesystem | Read, write, and search files on disk |
| PostgreSQL / MySQL | Query databases, inspect schemas |
| GitHub | Create issues, read PRs, search repositories |
| Jira / Confluence | Read and create tickets; search documentation |
| Playwright / Browser | Control a browser for testing and scraping |
| Sentry | Access error logs and stack traces |
| Context7 | Connect to up-to-date third-party documentation |
Directory of servers: github.com/modelcontextprotocol/servers
Setting up MCP servers:
- Claude Code: Configure in `.mcp.json` at project root. Use `claude mcp add <name> -- <command>`.
- Codex CLI: Configure in `~/.codex/config.toml`. Use `codex mcp add <name> -- <command>`.
- Cursor: Add via Settings → MCP or in `.cursor/mcp.json`.
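As an illustration, a minimal `.mcp.json` for Claude Code might look like the following. The server package and environment variable are examples; verify the exact schema and environment-variable expansion support in your tool's documentation:

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_PAT}"
      }
    }
  }
}
```

Note the token is referenced from the environment rather than committed to the repository, in line with the least-privilege guidance in the security section below.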
MCP Security Risks
When you connect MCP servers to your coding assistant, you're giving an AI agent — one that processes untrusted input like code comments, issue descriptions, and web content — direct access to your databases, repositories, and internal tools, running with your permissions. Every major security research team that has examined MCP in coding workflows — Palo Alto Unit 42, Invariant Labs, JFrog, Checkmarx — has found serious, exploitable vulnerabilities in real-world scenarios.
| Risk | What Happens | Real Example | Mitigation |
|---|---|---|---|
| Tool poisoning | Malicious server embeds hidden instructions in tool descriptions that the coding assistant follows silently | Invariant Labs: a "random fact" server exfiltrated an entire WhatsApp history through a legitimate server connected to the same assistant session | Audit tool schema definitions in source code before installing — not just the README |
| Overprivileged tokens | Compromised server with broad token scopes leaks access to all connected services | GitHub MCP server with a broad Personal Access Token (PAT) allowed a prompt-injected agent to exfiltrate private repos into a public PR | Least privilege: narrowly scoped, short-lived, dedicated credentials per server — never personal all-access tokens |
| Rug pulls | Server silently changes tool definitions between sessions, adding capabilities you never approved | Documented by eSentire: tool approved on Day 1 can reroute API keys by Day 7 | Pin server versions; review changelogs before updating; prefer known publishers |
| Command injection | AI-generated input passed unsanitized to shell commands in server implementations | CVE-2025-6514 (CVSS 9.6) in mcp-remote — 437k+ downloads, affected Cloudflare and Hugging Face integrations | Never concatenate input into shell commands; use parameterized APIs; sandbox servers in containers |
| Lethal Trifecta via MCP | Your coding assistant gains private data access (database server) + external communication (HTTP tool); a prompt injection in a code comment or issue description chains them for exfiltration | Multiple demonstrations combining database servers with outbound HTTP tools | Don't combine sensitive-data and outbound-communication servers carelessly; require human approval for external actions |
| Supply chain | No centralized review for community servers; name impersonation (e.g., mcp-github mimicking github-mcp); tampered one-click installers | Academic researchers documented unofficial installers distributing tampered packages to coding tool users | Install only from trusted sources; verify publisher identity; review source code |
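The command-injection mitigation in the table comes down to one rule: never let model-generated text reach a shell. A minimal sketch in Python (assuming a Unix-like environment where `cat` is available):

```python
# Model-generated input containing a shell injection attempt.
import subprocess

user_input = "README.md; rm -rf ~/important"  # prompt-injected "filename"

# UNSAFE: concatenating into a shell string would execute the injected rm:
#   subprocess.run(f"cat {user_input}", shell=True)

# SAFER: pass an argument vector; no shell ever interprets the string,
# so the whole payload is treated as one (nonexistent) filename.
result = subprocess.run(["cat", user_input], capture_output=True, text=True)
print(result.returncode != 0)  # True: cat fails instead of running rm
```

The same principle applies to database access (parameterized queries, never string-built SQL) and to any API a custom MCP server wraps.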
MCP security checklist:
✓ Audit tool descriptions and source code before installing any MCP server
✓ Apply least-privilege: narrow token scopes, restricted filesystem/network access
✓ Pin server versions; review changes before updating
✓ Sandbox MCP servers in containers when possible
✓ Never combine sensitive-data and external-communication servers carelessly
✓ Run SAST (Static Application Security Testing) / SCA (Software Composition Analysis) on custom MCP servers; treat them as production code
✓ Require human approval for high-risk tool calls (data export, sending messages, file deletion)
✓ Keep servers and dependencies updated — check for CVEs regularly
Further reading: MCP Security Best Practices (official spec) | Red Hat: MCP Security Risks | Palo Alto Unit 42: MCP Attack Vectors
Documentation: modelcontextprotocol.io | SDKs: github.com/modelcontextprotocol
4. Privacy and Data Handling
Every AI tool you use has a data handling policy. Understanding these policies and configuring tools correctly is non-negotiable — client trust depends on it.
Core Principle reminder: Protect Sensitive Information and Control Data Usage — Understand your boundaries. Follow your organization's policies on what data can be shared with GenAI tools. Ensure providers don't use your inputs for training. Core Principles for Working with AI
Data Handling by Tool
Official data privacy documentation: Always verify current policies directly — they change frequently. Anthropic Privacy Center | OpenAI Enterprise Privacy | GitHub Copilot Trust Center | Cursor Privacy
| Tool | Data Used for Training? | Privacy/ZDR Option | Encryption | Compliance |
|---|---|---|---|---|
| Claude (API / Enterprise) | No | Zero data retention available | In transit + at rest | SOC 2 Type II |
| GitHub Copilot Business/Enterprise | No (code not used for training) | Business policy controls | In transit + at rest | SOC 2, FedRAMP |
| Cursor (Privacy Mode) | No (when Privacy Mode enabled) | Privacy Mode = zero data retention with providers | In transit + at rest | SOC 2 Type II |
| OpenAI Codex (ChatGPT Business/Enterprise) | No (Business/Enterprise) | Zero data retention | In transit + at rest | SOC 2 |
| ChatGPT Free/Plus | May be used for training | Opt-out available | In transit | Limited |
Rules for Xebia Teams
- Always verify client approval — Before using any AI tool on a client project, confirm that the client or project stakeholders have explicitly approved GenAI usage
- Use approved tiers only — Enterprise/Business tiers with proper data handling. Free tiers of consumer tools are prohibited for client work
- Sanitize all inputs — Remove credentials, API keys, PII, internal hostnames, and proprietary business logic before submitting to any AI tool
- Enable privacy modes — Turn on Privacy Mode in Cursor, use API-based access for Claude, use Business/Enterprise plans for Copilot and Codex
- Understand residency — Know where your data is processed. Some clients require data to stay within specific regions (EU, US)
- When in doubt, ask — Contact your project's security team or consult Xebia's compliance guidelines. Never assume it is okay
Data Sanitization Checklist
Before pasting anything into an AI tool, remove or replace:
✗ Passwords, API keys, tokens, secrets
✗ Database connection strings, SSH keys, certificates
✗ Names, emails, addresses, social security numbers
✗ Credit card numbers, financial account details
✗ Customer names, contract details, internal codenames
✗ Specific IPs, hostnames, internal URLs
✓ Use placeholders: "prod-db-01.company.com" → "[DB_HOST]"
✓ Use placeholders: "sk_live_a1b2c3..." → "[API_KEY]"
✓ Use placeholders: "Jan Kowalski" → "[CUSTOMER_NAME]"
✓ Use environment variable references: process.env.DB_PASSWORD
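The checklist above can be partially automated with a scrubbing pass before anything leaves your machine. A minimal sketch — the patterns below are illustrative, not exhaustive, and should be adapted to your project's naming conventions:

```python
# Hedged sketch: regex-based scrubbing before pasting text into an AI tool.
# Patterns are illustrative, NOT exhaustive -- adapt to your project.
import re

RULES = [
    (re.compile(r"sk_live_[A-Za-z0-9]+"), "[API_KEY]"),       # secret keys
    (re.compile(r"[\w.-]+@[\w.-]+\.\w+"), "[EMAIL]"),         # email addresses
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP_ADDR]"),  # IPv4 addresses
    (re.compile(r"prod-[\w.-]+\.company\.com"), "[DB_HOST]"),   # internal hostnames
]

def sanitize(text: str) -> str:
    """Apply each scrubbing rule in order, replacing matches with placeholders."""
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text

snippet = "Connect to prod-db-01.company.com as admin@corp.io with key sk_live_a1b2c3"
print(sanitize(snippet))
# Connect to [DB_HOST] as [EMAIL] with key [API_KEY]
```

Regex scrubbing catches the mechanical cases (keys, emails, hostnames); names, contract details, and proprietary business logic still need a human pass.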
5. Security Risks
Core Principles reminder: Review the full security section in Core Principles for Working with AI for Xebia's authoritative guidance on: verifying outputs, protecting sensitive data, IP protection, access controls, organizational policies, and approved tools.
OWASP Top 10 for LLM Applications (2025)
The OWASP Top 10 for LLM Applications is the standard reference for securing AI-powered applications — worth reading before you build, deploy, or integrate LLM-based tools.
Full reference: genai.owasp.org/llm-top-10
The Lethal Trifecta
Security researcher Simon Willison identified the "Lethal Trifecta" — a security pattern worth internalizing if you use AI agents. When an agent combines these three capabilities, attackers can exploit prompt injection to steal your data:
┌─────────────────────┐ ┌────────────────────────┐ ┌──────────────────────┐
│ 1. Access to │ │ 2. Exposure to │ │ 3. Ability to │
│ PRIVATE DATA │ + │ UNTRUSTED CONTENT │ + │ COMMUNICATE │
│ (emails, files, │ │ (web pages, uploaded │ │ EXTERNALLY │
│ databases) │ │ docs, user inputs) │ │ (send data out) │
└─────────────────────┘ └────────────────────────┘ └──────────────────────┘
↓
⚠️ LETHAL TRIFECTA — Data theft possible
Why it matters: An attacker plants malicious instructions in a document or web page. When the agent processes that content, the instructions trick it into reading your private data and sending it to the attacker's server (e.g., via a URL in a Markdown link). This is not theoretical — researchers have demonstrated it against ChatGPT, Google Gemini, Microsoft Copilot, GitHub Copilot Chat, Slack, and others.
How to protect yourself:
- Avoid combining all three — If possible, ensure your agent doesn't have all three capabilities simultaneously
- Restrict external communication — Limit the agent's ability to make arbitrary network requests or embed URLs in outputs
- Treat all external content as untrusted — Documents, web pages, emails, and user inputs can all contain hidden prompt injection
- Use human-in-the-loop — Require manual approval before the agent takes high-risk actions (sending data, making API calls, modifying files outside the workspace)
- Monitor and audit — Log all agent actions, especially data access and external communications
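The human-in-the-loop point can be made concrete with a small gate in front of tool dispatch. The tool names and the `approve` callback below are hypothetical, not from any real assistant:

```python
# Gate high-risk tool calls behind an explicit approval callback.
# Tool names here are illustrative, not from any real assistant.
from typing import Callable

HIGH_RISK = {"send_email", "http_post", "delete_file"}

def guarded_call(tool: str, args: dict, approve: Callable[[str, dict], bool]) -> str:
    """Execute low-risk tools directly; block high-risk ones unless approved."""
    if tool in HIGH_RISK and not approve(tool, args):
        return f"BLOCKED: {tool} needs human approval"
    return f"executed {tool}"

deny_all = lambda tool, args: False  # simulates a reviewer rejecting the action

print(guarded_call("read_file", {"path": "notes.md"}, deny_all))
# executed read_file
print(guarded_call("http_post", {"url": "https://example.com"}, deny_all))
# BLOCKED: http_post needs human approval
```

The key design choice is that the gate sits outside the agent: a prompt injection can influence what the agent asks for, but not whether the approval step runs.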
Full article: simonwillison.net/2025/Jun/16/the-lethal-trifecta/
6. Measuring AI Impact — Universal Metrics
AI tools change how you work, but "it feels faster" won't convince a client or a budget holder. You need numbers.
Focus these metrics on client value — where AI actually speeds up delivery, reduces defects, or lowers cost per feature.
Client Value & Cost Efficiency (ROI)
Most of our projects run Time and Materials (T&M), so every hour saved translates directly to client savings — more features in the same budget, or lower cost for the same scope.
To make the case, compare what AI tooling costs against what it saves:
ROI Formula:
Value of saved time = (Estimated saved hours per month) × (Hourly rate)
ROI = Value of saved time / Monthly AI subscription cost
Example: A tool subscription (e.g., Claude Max or GitHub Copilot Enterprise) costs around $100 per month. If the tool — by quickly generating boilerplate, tests, or assisting in debugging — saves the developer just 5 hours a month, and their rate is $50/h, the generated savings amount to $250. In this scenario, the tool pays for itself at 2.5x ROI — and 5 hours per month is a conservative estimate for active users. Track your own numbers to build a credible case.
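The worked example above is simple enough to script; plugging in your own team's numbers makes the case concrete:

```python
def roi(saved_hours_per_month: float, hourly_rate: float, monthly_cost: float) -> float:
    """ROI = value of saved time / monthly AI subscription cost."""
    value_of_saved_time = saved_hours_per_month * hourly_rate
    return value_of_saved_time / monthly_cost

# Numbers from the example: 5 h/month saved at $50/h against a $100 subscription.
print(roi(saved_hours_per_month=5, hourly_rate=50, monthly_cost=100))  # 2.5
```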
Developer Experience (DevEx)
Hard metrics capture what's measurable. The rest comes from the engineers themselves — how they experience the work day-to-day (the SPACE framework is a good lens for this).
- Cognitive Load: Does AI free engineers from repetitive work so they can spend more time on design and problem-solving? Ask them.
- Satisfaction: Short surveys on how comfortable engineers feel working with AI. Higher satisfaction correlates with better retention and more stable delivery — worth measuring even if it is harder to put a number on.
Role-specific playbooks may include additional metrics tailored to their domain.
This playbook evolves alongside the tools and practices it covers. Contributions, corrections, and suggestions are welcome.
Core Principles for Working with AI: Read more