AI Common Playbook
Universal guidance for working with AI effectively — for all roles: backend, frontend, DevOps, UX, QA, and beyond.
Core Principles: All recommendations in this playbook align with Xebia's official Core Principles for Working with AI. Refer to that document for the foundational rules that govern every AI interaction at Xebia.
1. AI Models and Providers
Not every model excels at every job. Using a frontier reasoning model for a simple text classification task wastes both time and money — matching capability to task is an engineering decision worth getting right.
Core Principle reminder: Choose the Right Model for the Task — Don't chase every new release, but periodically research which models are current best-in-class for your use cases. Core Principles for Working with AI
How to Compare and Evaluate AI Models
The AI model market moves fast — new releases weekly, bold benchmark claims daily. Artificial Analysis offers independent, vendor-neutral evaluations across all major providers. Use it before making any model decision.
The site measures every model across three dimensions that map directly to engineering trade-offs:
- Intelligence — A composite score built from multiple benchmarks. Don't stop at the headline number. Switch to the Coding Index (code generation, completion, debugging) and Agentic Index (multi-step tool use, self-correction, planning) tabs — these are far more relevant to development work than the general ranking. A model ranked #4 overall might rank #1 for coding.
- Speed — Output tokens per second across API providers. Critical for agentic workflows and interactive assistants: latency compounds across tool calls. Check this whenever you're building anything that chains multiple model calls.
- Price — USD per 1M tokens (input and output shown separately). At CI/CD or high-volume scale, a 5x price difference is an architectural decision, not a detail.
How to get the most out of the site:
Don't treat the default leaderboard ranking as a universal truth. The most useful view is the Quality vs. Price scatter plot — it shows you the "efficient frontier," the cluster of models that deliver near-top performance at a fraction of the frontier cost. For most development tasks, the right model lives there, not at the very top.
On individual model pages, check the provider comparison table: the same model (e.g., Gemini 3.1 Pro) is often served by multiple API providers at meaningfully different prices and latency profiles. When you integrate a model into a product, this table tells you which provider to route through.
Model Selection Strategy
Match capability to the task — overkill wastes money, underpowered models waste time and quality.
Start with the right index, not the headline score. For coding tasks, use the Coding Index. For multi-step agents, use the Agentic Index. A model that ranks #5 overall may outperform everything else on the task you actually care about.
Match model tier to task complexity:
| Task type | What to optimize for | Where to look |
|---|---|---|
| Architecture analysis, security review, complex refactoring, long-context reasoning | Quality and depth — cost is secondary | Frontier models at the top of the Coding/Agentic Index |
| Feature implementation, code review, debugging, test generation | Quality-to-cost ratio | Models near the efficient frontier on the scatter plot |
| CI/CD automation, inline autocomplete, classification, short-context generation | Speed and price — quality threshold, not maximum | Lightweight models; check Speed and Price tabs |
Practical team workflow:
- Identify your top use cases and open the relevant Artificial Analysis index (Coding, Agentic, or general Intelligence)
- Shortlist two or three models clustering near the efficient frontier on the scatter plot
- Run a quick eval on representative samples from your actual codebase or tasks
- Record your findings in a shared decision log — include the date, since rankings shift
- Revisit quarterly or when a major new model drops
A model that is 5x cheaper and only marginally less capable for your specific task is almost always the correct call at volume. Use data to make that argument, not intuition.
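That argument is easy to make concrete. A minimal sketch, using entirely illustrative prices, eval scores, and volume (substitute your own measurements):

```python
# All numbers below are illustrative placeholders -- substitute your own
# eval scores, prices, and monthly volume.

def monthly_cost(tokens_per_month_m: float, price_per_m: float) -> float:
    """USD per month, given volume in millions of tokens."""
    return tokens_per_month_m * price_per_m

frontier  = {"eval_score": 0.92, "price_per_m": 15.00}  # $/1M tokens (blended)
efficient = {"eval_score": 0.89, "price_per_m": 3.00}

volume_m = 500  # assumed 500M tokens/month across the team

saving = (monthly_cost(volume_m, frontier["price_per_m"])
          - monthly_cost(volume_m, efficient["price_per_m"]))
gap = frontier["eval_score"] - efficient["eval_score"]
print(f"~${saving:,.0f}/month saved for a {gap:.0%} eval-score gap")
```

A number like this, paired with a task-specific eval result, is a far stronger basis for a model decision than a leaderboard position.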
AI Assistants — Use Cases Beyond Coding
Research and analysis — Quickly survey a technology, compare frameworks, or understand a new domain. Feed documentation and ask the model to summarize trade-offs.
Explaining difficult concepts — Use AI to break down complex topics (distributed consensus, event sourcing, Kubernetes networking) for different audiences. Specify the expertise level of your target audience for best results.
Consultation and design review — Share your architecture sketch, API design, or database schema and ask the model to critique it. Provide your constraints and standards as context.
Client interview preparation — Brush up on knowledge for the role, summarize a client's domain, generate likely technical questions.
Pair learning — Use AI as a study partner to learn difficult topics, explain concepts from multiple angles.
Official Prompt Engineering Guides
Every major AI provider publishes prompt engineering documentation — these are the key references:
| Provider | Guide | Focus |
|---|---|---|
| Anthropic | Prompting best practices | Comprehensive reference for Claude models: clarity, examples, XML structuring, thinking, agentic systems |
| Anthropic | Effective context engineering for AI agents | Context management strategies for building reliable agents — the evolution beyond prompt engineering |
| OpenAI | Prompt engineering guide | Strategies for GPT and reasoning models, including agentic and coding-specific patterns |
| Google | Prompt design strategies (Gemini) | Zero-shot, few-shot, system instructions, and multimodal prompting for Gemini models |

| Microsoft / Azure | Prompt engineering techniques (Azure OpenAI) | Practical prompt construction for enterprise Azure deployments |
Why this matters: The Core Principles document (principles 1-9) covers what to do. These official guides from the model providers cover exactly how to do it with their specific models. Techniques that work well with one model family may need adaptation for another.
Agent Skills Standard
Without a shared format for agent instructions, every team writes its own ad-hoc prompts for things like code review or test generation — the results vary between people, stay locked to one tool, and are hard to share across projects. The Agent Skills open standard solves this by defining a portable format for packaging reusable agent capabilities as skills.
A skill is a folder with a required SKILL.md file (the instructions the agent follows) plus optional scripts, references, and assets. For example, a "code-review" skill might contain review criteria, a checklist of common issues to flag, and a script that collects the diff — any agent that supports the standard can pick it up and run it the same way. Install skills per-user or per-project to share tested workflows across teams.
- Standard website: agentskills.io
- Anthropic implementation: Claude Agent Skills | Skill authoring best practices
- OpenAI implementation: Codex Skills (built into the Codex app and CLI)
2. Cost & Efficiency
Every AI interaction has a price tag. At team and client scale, knowing where the money goes matters more than individual awareness.
Subscription vs. API Pricing
Subscriptions (ChatGPT Plus, Claude Pro, Copilot Pro, Cursor Pro) charge a flat monthly fee with usage caps. Best for individual daily productivity. When evaluating plans, check: message/request limits and overage costs, which models are included, privacy guarantees (free tiers may train on your data), IP indemnity, and team management features. Client work requires business-tier plans for compliance — check current pricing on each provider's page.
API / pay-per-use charges per token (≈ 4 characters), with output tokens typically 3-5x more expensive than input. Cost per call: (input_tokens / 1M × input_price) + (output_tokens / 1M × output_price). Watch for hidden multipliers: oversized context (some providers double rates above 200K tokens), reasoning/thinking tokens (billed as output even when invisible), agentic loops (each cycle is a full API call), and verbose output.
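The cost formula translates directly into a helper function. A minimal sketch; the prices shown are placeholders, not current rates:

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """USD cost of one API call; prices are in USD per 1M tokens."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)

# Placeholder prices ($3/1M input, $15/1M output) -- check your provider's page:
print(f"${cost_per_call(2_000, 1_000, 3.00, 15.00):.4f} per call")
```

Note how the output side dominates even at a modest 3x token count: this is why reasoning tokens and verbose output matter so much.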
Cost Reduction Levers
| Lever | Impact |
|---|---|
| Model tiering | Highest impact. Frontier models for complex tasks only; mid-tier for daily work; lightweight for autocomplete, classification, simple generation |
| Prompt caching | ~90% discount on repeated context blocks (system prompts, project instructions, RAG knowledge bases) |
| Batch API | 50% discount for non-urgent workloads (nightly analysis, bulk generation) with 24-hour turnaround |
| Effort controls | Dial down reasoning depth per request — a frontier model at reduced effort can match mid-tier cost |
| Context window management | In agentic sessions, cost per message grows linearly with conversation length. Clear context between tasks, compact it proactively, delegate verbose operations to subagents |
| Prompt engineering | Shorter prompts, structured output formats, referencing instruction files instead of re-pasting context |
For teams: estimate token consumption before integrating any API-based workflow, set spending alerts on every provider dashboard, log AI costs per project for ROI tracking, and revisit model choices quarterly as pricing trends consistently downward.
Agentic Session Economics
A single agentic coding session is not a single API call — it is a loop of dozens to hundreds of calls, each carrying the full (and growing) conversation context. This cost multiplier matters most when you adopt agentic workflows at scale.
Subscription vs. API — the first decision:
For individual developers, a subscription plan is almost always more cost-effective than API billing. Agentic sessions consume far more tokens than single-call pricing suggests — every turn re-sends the growing conversation context, and active developers burn through tokens quickly. Most providers offer tiered subscription plans that absorb this cost at a flat rate, making them significantly cheaper than pay-per-token billing for daily agentic work.
Rule of thumb: If you use an agentic coding tool daily, compare subscription tiers against your estimated API spend — subscriptions typically offer a 5-25x cost advantage for active users. Reserve API billing for CI/CD automation, programmatic integrations, and team-wide deployments where you need fine-grained cost control and spending limits.
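The break-even point is simple to estimate. A sketch with assumed numbers (the plan price, blended token rate, and session size are all placeholders):

```python
# All figures are placeholders: check your provider's current plan pricing.

SUBSCRIPTION_USD_PER_MONTH = 100.0   # assumed flat subscription price
API_BLENDED_PRICE_PER_M = 5.0        # assumed $/1M tokens, input+output blended
TOKENS_PER_SESSION_M = 0.5           # ~500K tokens per agentic session

def api_cost(sessions_per_month: int) -> float:
    """What the same usage would cost on pay-per-token API billing."""
    return sessions_per_month * TOKENS_PER_SESSION_M * API_BLENDED_PRICE_PER_M

break_even = SUBSCRIPTION_USD_PER_MONTH / (
    TOKENS_PER_SESSION_M * API_BLENDED_PRICE_PER_M)
print(f"API billing overtakes the subscription at ~{break_even:.0f} sessions/month")
```

With these assumptions, a developer running a couple of agentic sessions per working day is already past break-even on API billing.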
Important caveat: Subscription plans typically have usage caps that reset periodically. During sustained heavy use (multi-agent teams, large refactors), you may hit rate limits and experience throttling. API billing has no such caps — only the rate limits you configure. For teams running automated pipelines or needing guaranteed throughput, API billing may still be the right choice despite higher per-token cost.
Check current pricing and plan details on official pages: Anthropic Claude · OpenAI · GitHub Copilot · Cursor · Google Gemini
How costs compound in agentic sessions:
Not every AI interaction costs the same. The gap between a quick question and a full agentic workflow is significant — knowing where your work falls on this spectrum drives your AI spend.
- Single API call — one request, one response. The baseline: you send a prompt, get an answer, done. Most pricing pages show this baseline, which barely reflects what agentic work actually costs.
- Interactive chat session — a multi-turn conversation where you and the model go back and forth (think ChatGPT or Claude web UI). Each turn re-sends the growing history, so costs accelerate as the conversation lengthens.
- Agentic coding session — the model works semi-autonomously: reading files, writing code, running tests, interpreting results, and looping back. Dozens to hundreds of API calls happen under the hood, each carrying the full conversation context plus tool outputs.
- Multi-agent team — several agents run in parallel (e.g., one plans, one implements, one reviews), each maintaining its own context window. Token consumption multiplies across agents, not just across turns.
| Interaction type | Typical token consumption | Cost multiplier vs. single call |
|---|---|---|
| Single API call | ~2K input + ~1K output | 1x |
| Interactive chat session (10-20 turns) | ~50K-200K input + ~10K-50K output | 25-75x |
| Agentic coding session (file reads, edits, test runs) | ~200K-500K input + ~50K-150K output | 100-400x |
| Multi-agent team (3-5 parallel agents) | ~1M-3M input + ~200K-500K output | 500-1500x |
Multi-agent teams consume roughly 7x more tokens than standard sessions because each agent maintains its own context window. At team scale, these costs become an architectural decision, not a rounding error. Use the pricing comparison tools listed below to estimate actual costs for your model and provider.
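To turn the multipliers above into dollar figures, plug the table's token ranges into placeholder prices. The midpoint figures and the $3/$15 rates below are assumptions, not quotes:

```python
# Placeholder prices: $3/1M input, $15/1M output. Token figures are rough
# midpoints of the ranges in the table above.

IN_PRICE, OUT_PRICE = 3.0, 15.0  # USD per 1M tokens (assumed)

def session_cost(input_m: float, output_m: float) -> float:
    """USD cost given input/output volume in millions of tokens."""
    return input_m * IN_PRICE + output_m * OUT_PRICE

scenarios = {
    "single API call":   (0.002, 0.001),
    "chat session":      (0.125, 0.030),
    "agentic session":   (0.350, 0.100),
    "multi-agent team":  (2.000, 0.350),
}
for name, (inp, out) in scenarios.items():
    print(f"{name:17s} ~${session_cost(inp, out):.2f}")
```

Run once with your own model's prices: the absolute figures change, but the shape of the curve (cents, then dollars, then tens of dollars) stays the same.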
Model tiering within a session:
You do not need to use the same model for every step. The most cost-effective pattern is to tier models by task complexity within a single workflow:
| Task phase | Recommended model tier* | Why | Relative cost |
|---|---|---|---|
| Architecture analysis, complex planning, multi-step reasoning | Frontier (e.g., Claude Opus, OpenAI GPT-5.x, Gemini 3.x Pro) | Highest intelligence — worth the premium for decisions that shape the entire feature | $$$ |
| Feature implementation, code generation, debugging, refactoring | Mid-tier (e.g., Claude Sonnet, OpenAI GPT-5.x-mini, Gemini 3.x Flash) | Best quality-to-cost ratio for the bulk of coding work | $$ |
| Subagent tasks, linting, simple lookups, classification | Lightweight (e.g., Claude Haiku, OpenAI GPT-5.x-nano, Gemini 3.x Flash Lite) | Fast and cheap — ideal for delegated operations where speed matters more than depth | $ |
* Specific model names change frequently — check your provider's current lineup for the latest options.
Most agentic coding tools let you switch models mid-session or set defaults per task type. In API-based workflows, route different pipeline stages to different models programmatically.
Rule of thumb: Plan with a frontier model, implement with mid-tier, delegate simple tasks to lightweight. This pattern can reduce session costs by 40-60% compared to using the frontier model for everything.
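The rule of thumb can be sanity-checked with a back-of-envelope split. The tier prices and the token split across phases below are assumptions chosen for illustration:

```python
# Back-of-envelope check of the tiering rule of thumb. The tier prices and
# the token split across phases are assumptions, not measured values.

TIER_PRICE = {"frontier": 15.0, "mid": 3.0, "light": 0.5}  # $/1M tokens

SPLIT = {"frontier": 0.3,   # planning and architecture
         "mid":      0.6,   # implementation
         "light":    0.1}   # delegated subagent tasks

tiered = sum(SPLIT[t] * TIER_PRICE[t] for t in TIER_PRICE)
all_frontier = TIER_PRICE["frontier"]  # every token at frontier price
print(f"Tiered spend is {tiered / all_frontier:.0%} of all-frontier, "
      f"a {1 - tiered / all_frontier:.0%} saving")
```

With this split the saving lands near the upper end of the quoted 40-60% range; a more planning-heavy workflow lands lower.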
Context window is your primary cost driver:
In agentic sessions, every message you send includes the entire conversation history — all previous turns, file contents, tool outputs, and thinking tokens. This means cost per message grows linearly with conversation length. A message near the end of a long session can cost 10-25x more than the same message at the start, simply because of accumulated context.
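The arithmetic behind this is worth seeing once: per-turn input grows linearly, so cumulative input over a session grows quadratically. A sketch with an assumed per-turn growth rate:

```python
# Assumed growth: each turn adds ~2K new tokens (your message, the reply,
# tool output) to a history that is re-sent in full on every turn.

TOKENS_PER_TURN = 2_000
INPUT_PRICE_PER_M = 3.0  # assumed $/1M input tokens, no caching

def turn_input_tokens(turn: int) -> int:
    """Input tokens sent on a given turn: the entire history so far."""
    return turn * TOKENS_PER_TURN

total = sum(turn_input_tokens(t) for t in range(1, 51))  # a 50-turn session
print(f"Turn 1 sends {turn_input_tokens(1):,} tokens; turn 50 sends "
      f"{turn_input_tokens(50):,}")
print(f"Cumulative input: {total:,} tokens "
      f"(~${total / 1_000_000 * INPUT_PRICE_PER_M:.2f} without caching)")
```

Turn 50 costs 50x turn 1, and the session as a whole has re-sent over 2.5M input tokens under these assumptions. Keeping the context lean attacks both numbers.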
Practical strategies to keep context lean:
- `/clear` between tasks — When switching to unrelated work, clear the conversation. Stale context wastes tokens on every subsequent message. Use `/rename` before clearing so you can `/resume` later
- Use `/compact` proactively — When context grows large but you are still mid-task, `/compact` summarizes the conversation while preserving key details. Add focus hints: `/compact Focus on code samples and API usage`
- Auto-compaction is your safety net, not your strategy — Claude Code automatically compacts when approaching context limits, but by that point you have already paid for the bloated context across many messages. Compact earlier, not later
- Delegate verbose operations to subagents — Running a full test suite, fetching documentation, or processing log files can dump thousands of lines into your context. Delegate these to subagents — only a summary returns to your main conversation
- Write specific prompts — "Improve this codebase" triggers broad scanning across many files. "Add input validation to the login function in `auth.ts`" lets Claude work efficiently with minimal file reads
- Move detailed instructions from CLAUDE.md to skills — Your CLAUDE.md is loaded into every session. If it contains detailed instructions for specific workflows (PR reviews, database migrations), those tokens are present even when you are doing unrelated work. Skills load on-demand only when invoked — aim to keep CLAUDE.md under ~500 lines
Prompt caching — the hidden cost saver:
Every agentic turn is a new API call that re-sends the system prompt, tool definitions, CLAUDE.md content, and conversation history. Without caching, you pay full price for this repeated content on every single turn. With caching, repeated content costs 90% less after the first send.
Most providers now offer prompt caching — Anthropic, OpenAI, and Google all support it in some form. Some tools enable it automatically; for API-based workflows, check your provider's documentation.
- How it works under the hood: Caching is prefix-based — the provider matches what you send against what you sent before, starting from the first token. The moment content diverges, everything after that point is billed as new input. This means stable content (system instructions, tool definitions, project files) should sit at the beginning of your prompt, and variable content (latest user message, tool results) at the end. If you build prompts via the API, keep this order — reordering or editing early sections invalidates the cache for everything that follows.
- What gets cached: System prompts, project instruction files, tool definitions, and the stable prefix of your conversation history. A cache hit typically costs a fraction of the standard input price (exact discount varies by provider)
- Cache lifetime: Usually a few minutes. As long as you send messages within this window, cached content stays warm. Some providers offer extended caching at higher write cost for batch workloads
- Cache-friendly habits: Keep project instruction files stable between sessions — even small edits early in the prompt invalidate everything downstream. Use the same system prompts across team members to maximize shared cache hits
- Where caching matters most: In a 50-turn agentic session, the system prompt and tool definitions are sent 50 times. Without caching, you pay full price each time. With caching, the savings compound significantly across long sessions and teams
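For API-based workflows, cache-friendly request ordering looks roughly like this. The sketch uses Anthropic's documented `cache_control` field; the model id and message contents are placeholders, and other providers use different mechanisms:

```python
# Placeholder model id and contents; the cache_control field follows
# Anthropic's documented prompt-caching API. Other providers differ.

request = {
    "model": "claude-sonnet-latest",  # placeholder -- use a real model id
    "system": [
        {
            "type": "text",
            "text": "<long, stable system prompt + project instructions>",
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        },
    ],
    "messages": [
        # Stable prefix: earlier turns, unchanged between calls (cache hits)
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
        # Variable suffix: only this part is billed as fresh input
        {"role": "user", "content": "<latest tool results and question>"},
    ],
}
```

The key design choice is the ordering: everything above the `cache_control` marker must be byte-identical between calls, so edits belong at the end of the message list, never at the top.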
Further reading: Prompt caching (Anthropic docs) | Prompt caching (OpenAI docs) | Context caching (Google docs) | Research paper: Don't Break the Cache: Prompt Caching for Long-Horizon Agentic Tasks
Budgeting agentic workflows for teams:
Agentic coding is not free-form — it requires the same financial discipline as any infrastructure cost. Treat AI token spend as a line item in your project budget.
| Budget lever | How to use it |
|---|---|
| Workspace/project spend limits | Most providers offer spending caps or budget alerts via their console or dashboard. Set these up before enabling agentic workflows — they prevent runaway costs from long-running agents or automation |
| Rate limits per team size | Scale tokens-per-minute allocation per user as the team grows. Fewer users are active concurrently in larger teams, so per-user allocation can decrease |
| Monitor usage actively | Use your tool's built-in usage tracking (session cost reports, token counters, usage dashboards). Review these regularly — a single runaway session can consume a week's budget |
| Reasoning/thinking token budget | Providers bill reasoning tokens (extended thinking, chain-of-thought) as output — the most expensive token type. For simpler tasks, reduce reasoning depth or effort level. Check your provider's docs for how to control this |
| Multi-agent governance | When running multiple agents in parallel, each maintains its own context window. Keep teams small, tasks focused, and clean up idle agents promptly |
Recommended team workflow for cost control:
- Start with a small pilot group (3-5 developers) to establish baseline usage patterns
- Set conservative workspace spend limits based on pilot data
- Log AI costs per project — track alongside other infrastructure costs
- Review monthly and adjust: model mix, context management habits, and automation scope
- Revisit quarterly as model pricing consistently trends downward
Pricing Comparison Resources
AI pricing changes constantly — do not rely on memorized numbers. Bookmark these:
- Artificial Analysis — Quality-vs-price scatter plots across all major models: artificialanalysis.ai
- Price Per Token — Daily-updated pricing for 300+ models with cost calculators: pricepertoken.com
- Helicone LLM Cost — Side-by-side cost comparison for 300+ models: helicone.ai/llm-cost
- Vellum LLM Cost Comparison — Visual cost comparison by input/output size: vellum.ai/llm-cost-comparison
Official pricing pages: Anthropic Claude · OpenAI · Google Gemini · GitHub Copilot · Cursor
3. AI Coding Assistants — Tools and Setup
An inline autocomplete tool and a full agentic system solve fundamentally different problems — picking the right category matters as much as picking the right model.
Categories of AI Coding Tools
| Category | How It Works | Examples | Best For |
|---|---|---|---|
| IDE-integrated autocomplete | Real-time suggestions as you type; tab to accept | GitHub Copilot (inline), Cursor (Tab), JetBrains AI | Fast completions, boilerplate, repetitive patterns |
| IDE chat / edit mode | Conversational coding within your editor; targeted edits | GitHub Copilot Chat, Cursor Ask/Chat, JetBrains AI Chat | Explanations, targeted refactors, Q&A about code |
| IDE agent mode | Autonomous multi-file editing with tool use and self-correction | GitHub Copilot Agent Mode, Cursor Agent, Windsurf, JetBrains Junie | Feature implementation, multi-file refactors, complex tasks |
| Terminal-based agents | CLI tools that understand your repo and execute commands | Claude Code, OpenAI Codex CLI, Google Gemini CLI, Aider | Full project work, git workflows, DevOps tasks, CI/CD |
| Cloud-based agents | Asynchronous agents that work in cloud sandboxes on assigned tasks | GitHub Copilot Coding Agent, OpenAI Codex (cloud), Cursor Cloud Agents | Parallelized work, issue-based delegation, background tasks |
Tool Profiles
| | GitHub Copilot | Cursor | Claude Code | OpenAI Codex |
|---|---|---|---|---|
| Form factor | IDE plugin + CLI + cloud | AI-native IDE (VS Code fork) | Terminal agent + IDE extensions | Cloud + CLI + IDE + desktop app |
| IDE support | VS Code¹, Visual Studio, JetBrains, Eclipse, Xcode, Neovim | Own IDE; JetBrains (Agent Client Protocol) | Terminal; VS Code¹, JetBrains | VS Code¹; desktop app |
| Inline completions | ✓ + Next Edit Suggestions | ✓ + predictive edits | — | — |
| Agent mode | ✓ | ✓ | ✓ | ✓ |
| Cloud / async agents | Coding agent (Issues → PR) | Cloud Agents | GitHub / GitLab CI | Cloud sandboxes; parallel tasks |
| Plan mode | ✓ | ✓ | ✓ | via `$create-plan` skill |
| Project config | `.github/copilot-instructions.md` | `.cursor/rules` | `CLAUDE.md` | `AGENTS.md` |
| MCP | ✓ | ✓ | ✓ | ✓ |
| Models | GPT-5.x, Claude, Gemini | Claude, GPT, Gemini, Cursor | Claude | GPT-5.x-Codex |
| Plans | Free · Pro · Pro+ · Business · Enterprise | Hobby (free) · Pro · Pro+ · Ultra · Business | Pro / Max subscription or API | Plus / Pro / Business / Edu / Enterprise |
¹ Including VS Code forks (Cursor, Windsurf)
GitHub Copilot
GitHub Copilot covers the full development lifecycle:
- Inline suggestions and Next Edit Suggestions (NES) — predicts not just the next line, but the next logical edit location
- Plan mode — analyzes your request, generates a step-by-step implementation plan, and lets you review and refine it before writing any code. Available across VS Code, JetBrains, Eclipse, and Xcode
- Agent mode — autonomously edits multiple files, runs terminal commands, and self-corrects errors
- Coding agent — asynchronous cloud-based agent that works on GitHub Issues, creates branches, and opens pull requests for review
- Code review — AI-generated review suggestions on pull requests
- Copilot CLI — terminal-native coding agent with plan mode, autopilot mode, built-in specialized agents, and MCP support
- Customization system — a layered configuration model (repository instructions, file-scoped rules, reusable prompts, custom agents, skills) for tailoring Copilot to your project
Documentation: docs.github.com/en/copilot | Copilot customization
IDE feature parity varies. VS Code has the most complete Copilot feature set. JetBrains IDEs support the main instructions file, prompt files, custom agents, and skills, but do not support multiple `.instructions.md` files with `applyTo` patterns. Always verify which customization features your IDE supports in the official docs.
Cursor
Cursor is an AI-native IDE (a fork of VS Code) designed around AI workflows:
- Codebase indexing — indexes your entire project so AI can reason across all files, not just the one you have open
- Agent mode — full agentic workflow that plans, edits, runs commands, and iterates
- `.cursorrules` / `.cursor/rules` — project-level configuration files to enforce coding standards, architectural patterns, and team conventions
- Cloud Agents — autonomous agents running in their own cloud VMs with full development environments, computer use, and the ability to test changes and produce artifacts (screenshots, recordings, logs). Launchable from web, mobile, Slack, or GitHub
- Automations — always-on cloud agents that run on a schedule or in response to events
- JetBrains integration — available in IntelliJ, PyCharm, WebStorm and other JetBrains IDEs via Agent Client Protocol (ACP)
- Composer model — Cursor's own frontier model optimized for code editing
- Privacy Mode — configures zero data retention with model providers when enabled
Documentation: docs.cursor.com
Claude Code
Claude Code is Anthropic's agentic coding tool that runs in your terminal. It understands your codebase, edits files, executes commands, and manages git workflows through natural language:
- Terminal-native — lives in your terminal, no IDE dependency required
- CLAUDE.md — project-level documentation files that teach Claude about your codebase, conventions, and architecture. The better these are maintained, the better the results
- Plan mode — a read-only research and planning phase (`Shift+Tab` × 2) where Claude analyzes the codebase, asks clarifying questions, and produces a structured plan before writing any code
- Agentic workflow — reads files, runs tests, commits code, creates PRs
- Multi-turn conversations — maintains context across a session for iterative work
- Sub-agents — can spawn focused sub-tasks in their own context
- Agent teams — coordinate multiple Claude Code instances working in parallel. One session acts as team lead, assigning tasks and synthesizing results, while teammates work independently in their own context windows and communicate directly with each other. Best suited for parallel code review, cross-layer changes, or debugging with competing hypotheses
- Multi-IDE integration — beyond the terminal, Claude Code offers official extensions for VS Code (and forks like Cursor) and JetBrains IDEs, with interactive diff viewing, selection context sharing, and diagnostic integration
- GitHub integration — tag `@claude` on GitHub issues and PRs for automated code review and implementation
- MCP support — extend with Model Context Protocol servers for additional tool access
Documentation: code.claude.com/docs
OpenAI Codex
OpenAI Codex is a software engineering agent platform available as a cloud agent, CLI tool, and IDE extension:
- Cloud agent — runs tasks in isolated cloud sandboxes, works on multiple tasks in parallel, and proposes pull requests
- Codex CLI — lightweight local agent for terminal-based coding workflows
- Desktop app — dedicated interface for managing multiple agents in parallel and collaborating on long-running tasks
- Skills — extensible task bundles that combine instructions, scripts, and resources (e.g., `$skill-creator`, `$create-plan`)
- AGENTS.md support — respects project-level agent instructions
Documentation: developers.openai.com/codex | GPT-5 prompting guide (OpenAI)
Other Tools
Beyond the tools profiled here, you will find IDE-integrated assistants, terminal/CLI-based agents, cloud provider-native tools (AWS, Google, Azure), and open-source multi-model agents. When evaluating any tool not listed here, the Use Approved Tools Only principle applies — check data handling policies, licensing, and get organizational sign-off first.
Project Instruction Files
Every major coding assistant uses a project-level instruction file — a document the agent loads automatically at the start of each session. These files tell the agent what the project is about, how it is structured, and what conventions to follow.
| Tool | Main instruction file | Secondary / scoped files |
|---|---|---|
| Claude Code | `CLAUDE.md` | Sub-directory `CLAUDE.md` files, `.claude/commands/`, Agent Skills |
| GitHub Copilot | `.github/copilot-instructions.md` | `.github/instructions/*.instructions.md` (with `applyTo` patterns), prompt files, custom agents, skills |
| Cursor | `.cursorrules` or `.cursor/rules/` | Multiple rule files in `.cursor/rules/` with glob patterns |
| OpenAI Codex | `AGENTS.md` | Sub-directory `AGENTS.md` files, Skills |
Auto-loaded entry files must stay concise. Because these are always in context, keep them short and focused: tech stack, key commands, critical conventions, and references to detailed docs. Put granular rules in secondary files that load only when relevant. Always verify how your specific assistant handles instruction loading — for example, Copilot's `.instructions.md` files require an `applyTo` pattern to activate, and not all features work the same across IDEs.
Model Context Protocol (MCP)
Coding assistants become more useful when they reach beyond your editor into issue trackers, databases, documentation wikis, and monitoring dashboards. The Model Context Protocol (MCP) is the open standard (created by Anthropic in 2024, now governed by the Linux Foundation) that makes this possible without bespoke integrations. You write one MCP server for Jira, and it works with every MCP-compatible coding assistant — no per-tool, per-agent glue code.
How it works: Your coding assistant runs an MCP client; each external integration (a database, Jira, a browser, Sentry) runs as an MCP server exposing tools over the protocol. The assistant discovers available tools at runtime and calls them as needed during coding workflows — querying a database schema while generating a migration, pulling a Jira ticket description into context for implementation, or checking Sentry logs while debugging.
Common MCP servers for development teams:
| Server | What It Provides |
|---|---|
| Filesystem | Read, write, and search files on disk |
| PostgreSQL / MySQL | Query databases, inspect schemas |
| GitHub | Create issues, read PRs, search repositories |
| Jira / Confluence | Read and create tickets; search documentation |
| Playwright / Browser | Control a browser for testing and scraping |
| Sentry | Access error logs and stack traces |
| Context7 | Connect to up-to-date third-party documentation |
Directory of servers: github.com/modelcontextprotocol/servers
Setting up MCP servers:
- Claude Code: Configure in `.mcp.json` at project root. Use `claude mcp add <name> -- <command>`.
- Codex CLI: Configure in `~/.codex/config.toml`. Use `codex mcp add <name> -- <command>`.
- Cursor: Add via Settings → MCP or in `.cursor/mcp.json`.
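As an illustration, a minimal `.mcp.json` for Claude Code might look like the following. The server package and environment variable are examples; verify the exact schema and environment-variable expansion support in your tool's documentation:

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_PAT}"
      }
    }
  }
}
```

Note the token is referenced from the environment rather than committed to the repository, in line with the least-privilege guidance in the security section below.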
MCP Security Risks
When you connect MCP servers to your coding assistant, you're giving an AI agent — one that processes untrusted input like code comments, issue descriptions, and web content — direct access to your databases, repositories, and internal tools, running with your permissions. Every major security research team that has examined MCP in coding workflows — Palo Alto Unit 42, Invariant Labs, JFrog, Checkmarx — has found serious, exploitable vulnerabilities in real-world scenarios.
| Risk | What Happens | Real Example | Mitigation |
|---|---|---|---|
| Tool poisoning | Malicious server embeds hidden instructions in tool descriptions that the coding assistant follows silently | Invariant Labs: a "random fact" server exfiltrated an entire WhatsApp history through a legitimate server connected to the same assistant session | Audit tool schema definitions in source code before installing — not just the README |
| Overprivileged tokens | Compromised server with broad token scopes leaks access to all connected services | GitHub MCP server with a broad Personal Access Token (PAT) allowed a prompt-injected agent to exfiltrate private repos into a public PR | Least privilege: narrowly scoped, short-lived, dedicated credentials per server — never personal all-access tokens |
| Rug pulls | Server silently changes tool definitions between sessions, adding capabilities you never approved | Documented by eSentire: tool approved on Day 1 can reroute API keys by Day 7 | Pin server versions; review changelogs before updating; prefer known publishers |
| Command injection | AI-generated input passed unsanitized to shell commands in server implementations | CVE-2025-6514 (CVSS 9.6) in mcp-remote — 437k+ downloads, affected Cloudflare and Hugging Face integrations | Never concatenate input into shell commands; use parameterized APIs; sandbox servers in containers |
| Lethal Trifecta via MCP | Your coding assistant gains private data access (database server) + external communication (HTTP tool); a prompt injection in a code comment or issue description chains them for exfiltration | Multiple demonstrations combining database servers with outbound HTTP tools | Don't combine sensitive-data and outbound-communication servers carelessly; require human approval for external actions |
| Supply chain | No centralized review for community servers; name impersonation (e.g., mcp-github mimicking github-mcp); tampered one-click installers | Academic researchers documented unofficial installers distributing tampered packages to coding tool users | Install only from trusted sources; verify publisher identity; review source code |
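The command-injection mitigation in the table comes down to one rule: never let model-generated text reach a shell. A minimal sketch in Python (assuming a Unix-like environment where `cat` is available):

```python
# Model-generated input containing a shell injection attempt.
import subprocess

user_input = "README.md; rm -rf ~/important"  # prompt-injected "filename"

# UNSAFE: concatenating into a shell string would execute the injected rm:
#   subprocess.run(f"cat {user_input}", shell=True)

# SAFER: pass an argument vector; no shell ever interprets the string,
# so the whole payload is treated as one (nonexistent) filename.
result = subprocess.run(["cat", user_input], capture_output=True, text=True)
print(result.returncode != 0)  # True: cat fails instead of running rm
```

The same principle applies to database access (parameterized queries, never string-built SQL) and to any API a custom MCP server wraps.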
MCP security checklist:
✓ Audit tool descriptions and source code before installing any MCP server
✓ Apply least-privilege: narrow token scopes, restricted filesystem/network access
✓ Pin server versions; review changes before updating
✓ Sandbox MCP servers in containers when possible
✓ Never combine sensitive-data and external-communication servers carelessly
✓ Run SAST (Static Application Security Testing) / SCA (Software Composition Analysis) on custom MCP servers; treat them as production code
✓ Require human approval for high-risk tool calls (data export, sending messages, file deletion)
✓ Keep servers and dependencies updated — check for CVEs regularly
Further reading: MCP Security Best Practices (official spec) | Red Hat: MCP Security Risks | Palo Alto Unit 42: MCP Attack Vectors
Documentation: modelcontextprotocol.io | SDKs: github.com/modelcontextprotocol
4. Privacy and Data Handling
Every AI tool you use has a data handling policy. Understanding these policies and configuring tools correctly is non-negotiable — client trust depends on it.
Core Principle reminder: Protect Sensitive Information and Control Data Usage — Understand your boundaries. Follow your organization's policies on what data can be shared with GenAI tools. Ensure providers don't use your inputs for training. Core Principles for Working with AI
Data Handling by Tool
Official data privacy documentation: Always verify current policies directly — they change frequently. Anthropic Privacy Center | OpenAI Enterprise Privacy | GitHub Copilot Trust Center | Cursor Privacy
| Tool | Data Used for Training? | Privacy/ZDR Option | Encryption | Compliance |
|---|---|---|---|---|
| Claude (API / Enterprise) | No | Zero data retention available | In transit + at rest | SOC 2 Type II |
| GitHub Copilot Business/Enterprise | No (code not used for training) | Business policy controls | In transit + at rest | SOC 2, FedRAMP |
| Cursor (Privacy Mode) | No (when Privacy Mode enabled) | Privacy Mode = zero data retention with providers | In transit + at rest | SOC 2 Type II |
| OpenAI Codex (ChatGPT Business/Enterprise) | No (Business/Enterprise) | Zero data retention | In transit + at rest | SOC 2 |
| ChatGPT Free/Plus | May be used for training | Opt-out available | In transit | Limited |
Rules for Xebia Teams
- Always verify client approval — Before using any AI tool on a client project, confirm that the client or project stakeholders have explicitly approved GenAI usage
- Use approved tiers only — Enterprise/Business tiers with proper data handling. Free tiers of consumer tools are prohibited for client work
- Sanitize all inputs — Remove credentials, API keys, PII, internal hostnames, and proprietary business logic before submitting to any AI tool
- Enable privacy modes — Turn on Privacy Mode in Cursor, use API-based access for Claude, use Business/Enterprise plans for Copilot and Codex
- Understand residency — Know where your data is processed. Some clients require data to stay within specific regions (EU, US)
- When in doubt, ask — Contact your project's security team or consult Xebia's compliance guidelines. Never assume it is okay
Data Sanitization Checklist
Before pasting anything into an AI tool, remove or replace:
✗ Passwords, API keys, tokens, secrets
✗ Database connection strings, SSH keys, certificates
✗ Names, emails, addresses, social security numbers
✗ Credit card numbers, financial account details
✗ Customer names, contract details, internal codenames
✗ Specific IPs, hostnames, internal URLs
✓ Use placeholders: "prod-db-01.company.com" → "[DB_HOST]"
✓ Use placeholders: "sk_live_a1b2c3..." → "[API_KEY]"
✓ Use placeholders: "Jan Kowalski" → "[CUSTOMER_NAME]"
✓ Use environment variable references: process.env.DB_PASSWORD
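The checklist above can be partially automated with a scrubbing pass before anything leaves your machine. A minimal sketch — the patterns below are illustrative, not exhaustive, and should be adapted to your project's naming conventions:

```python
# Hedged sketch: regex-based scrubbing before pasting text into an AI tool.
# Patterns are illustrative, NOT exhaustive -- adapt to your project.
import re

RULES = [
    (re.compile(r"sk_live_[A-Za-z0-9]+"), "[API_KEY]"),       # secret keys
    (re.compile(r"[\w.-]+@[\w.-]+\.\w+"), "[EMAIL]"),         # email addresses
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP_ADDR]"),  # IPv4 addresses
    (re.compile(r"prod-[\w.-]+\.company\.com"), "[DB_HOST]"),   # internal hostnames
]

def sanitize(text: str) -> str:
    """Apply each scrubbing rule in order, replacing matches with placeholders."""
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text

snippet = "Connect to prod-db-01.company.com as admin@corp.io with key sk_live_a1b2c3"
print(sanitize(snippet))
# Connect to [DB_HOST] as [EMAIL] with key [API_KEY]
```

Regex scrubbing catches the mechanical cases (keys, emails, hostnames); names, contract details, and proprietary business logic still need a human pass.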
5. Security Risks
Core Principles reminder: Review the full security section in Core Principles for Working with AI for Xebia's authoritative guidance on: verifying outputs, protecting sensitive data, IP protection, access controls, organizational policies, and approved tools.
OWASP Top 10 for LLM Applications (2025)
The OWASP Top 10 for LLM Applications is the standard reference for securing AI-powered applications — worth reading before you build, deploy, or integrate LLM-based tools.
Full reference: genai.owasp.org/llm-top-10
The Lethal Trifecta
Security researcher Simon Willison identified the "Lethal Trifecta" — a security pattern worth internalizing if you use AI agents. When an agent combines these three capabilities, attackers can exploit prompt injection to steal your data:
┌─────────────────────┐ ┌────────────────────────┐ ┌──────────────────────┐
│ 1. Access to │ │ 2. Exposure to │ │ 3. Ability to │
│ PRIVATE DATA │ + │ UNTRUSTED CONTENT │ + │ COMMUNICATE │
│ (emails, files, │ │ (web pages, uploaded │ │ EXTERNALLY │
│ databases) │ │ docs, user inputs) │ │ (send data out) │
└─────────────────────┘ └────────────────────────┘ └──────────────────────┘
↓
⚠️ LETHAL TRIFECTA — Data theft possible
Why it matters: An attacker plants malicious instructions in a document or web page. When the agent processes that content, the instructions trick it into reading your private data and sending it to the attacker's server (e.g., via a URL in a Markdown link). This is not theoretical — researchers have demonstrated it against ChatGPT, Google Gemini, Microsoft Copilot, GitHub Copilot Chat, Slack, and others.
How to protect yourself:
- Avoid combining all three — If possible, ensure your agent doesn't have all three capabilities simultaneously
- Restrict external communication — Limit the agent's ability to make arbitrary network requests or embed URLs in outputs
- Treat all external content as untrusted — Documents, web pages, emails, and user inputs can all contain hidden prompt injection
- Use human-in-the-loop — Require manual approval before the agent takes high-risk actions (sending data, making API calls, modifying files outside the workspace)
- Monitor and audit — Log all agent actions, especially data access and external communications
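The human-in-the-loop point can be made concrete with a small gate in front of tool dispatch. The tool names and the `approve` callback below are hypothetical, not from any real assistant:

```python
# Gate high-risk tool calls behind an explicit approval callback.
# Tool names here are illustrative, not from any real assistant.
from typing import Callable

HIGH_RISK = {"send_email", "http_post", "delete_file"}

def guarded_call(tool: str, args: dict, approve: Callable[[str, dict], bool]) -> str:
    """Execute low-risk tools directly; block high-risk ones unless approved."""
    if tool in HIGH_RISK and not approve(tool, args):
        return f"BLOCKED: {tool} needs human approval"
    return f"executed {tool}"

deny_all = lambda tool, args: False  # simulates a reviewer rejecting the action

print(guarded_call("read_file", {"path": "notes.md"}, deny_all))
# executed read_file
print(guarded_call("http_post", {"url": "https://example.com"}, deny_all))
# BLOCKED: http_post needs human approval
```

The key design choice is that the gate sits outside the agent: a prompt injection can influence what the agent asks for, but not whether the approval step runs.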
Full article: simonwillison.net/2025/Jun/16/the-lethal-trifecta/
6. Measuring AI Impact — Universal Metrics
AI tools change how you work, but "it feels faster" won't convince a client or a budget holder. You need numbers.
Focus these metrics on client value — where AI actually speeds up delivery, reduces defects, or lowers cost per feature.
Client Value & Cost Efficiency (ROI)
Most of our projects run Time and Materials (T&M), so every hour saved translates directly to client savings — more features in the same budget, or lower cost for the same scope.
To make the case, compare what AI tooling costs against what it saves:
ROI Formula:
Value of saved time = (Estimated saved hours per month) × (Hourly rate)
ROI = Value of saved time / Monthly AI subscription cost
Example: A tool subscription (e.g., Claude Max or GitHub Copilot Enterprise) costs around $100 per month. If the tool — by quickly generating boilerplate, tests, or assisting in debugging — saves the developer just 5 hours a month, and their rate is $50/h, the generated savings amount to $250. In this scenario, the tool pays for itself at 2.5x ROI — and 5 hours per month is a conservative estimate for active users. Track your own numbers to build a credible case.
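The worked example above is simple enough to script; plugging in your own team's numbers makes the case concrete:

```python
def roi(saved_hours_per_month: float, hourly_rate: float, monthly_cost: float) -> float:
    """ROI = value of saved time / monthly AI subscription cost."""
    value_of_saved_time = saved_hours_per_month * hourly_rate
    return value_of_saved_time / monthly_cost

# Numbers from the example: 5 h/month saved at $50/h against a $100 subscription.
print(roi(saved_hours_per_month=5, hourly_rate=50, monthly_cost=100))  # 2.5
```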
Developer Experience (DevEx)
Hard metrics capture what's measurable. The rest comes from the engineers themselves — how they experience the work day-to-day (the SPACE framework is a good lens for this).
- Cognitive Load: Does AI free engineers from repetitive work so they can spend more time on design and problem-solving? Ask them.
- Satisfaction: Short surveys on how comfortable engineers feel working with AI. Higher satisfaction correlates with better retention and more stable delivery — worth measuring even if it is harder to put a number on.
Role-specific playbooks may include additional metrics tailored to their domain.
This playbook evolves alongside the tools and practices it covers. Contributions, corrections, and suggestions are welcome.
Core Principles for Working with AI: Read more