AI-Assisted Code Reviews with Claude Code
I spent a few months experimenting with AI-assisted code reviews before I found a setup that actually works. Most of my early attempts produced either obvious findings I'd catch myself or confident-sounding hallucinations that wasted verification time. The difference turned out to be structure: the right tools, the right context, and a two-pass workflow that keeps the model honest. Here's the exact setup, prompts, and best practices I rely on for code reviews with Claude Code in 2026.
The Code Review Stack
```
Claude Code (Opus 4.6, 1M context)
├── Jira MCP → ticket context
├── GH CLI → PRs, diffs, file trees
├── Figma MCP → design context for UI reviews
├── Context7 MCP → up-to-date library docs
└── Skills (curated)
    ├── vercel-react-best-practices
    ├── vercel-composition-patterns
    ├── next-best-practices
    ├── nodejs-backend-patterns
    ├── nestjs-best-practices
    └── [+ any framework-specific skills for your stack]
```
Each piece is there for a reason. Let me walk through the why.
1M Context Window
This is what makes cross-file analysis possible. You can load an entire small-to-medium service (source files, configs, test suites) into a single session. The model sees relationships that humans miss when reading file-by-file: an unvalidated input in middleware that a controller three directories away trusts implicitly, a type assertion that silently breaks a downstream consumer, a race condition between two services that only appears when you read both.
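To make that concrete, here's a minimal hypothetical sketch (Express-style, with invented file names) of the kind of cross-file issue a single large-context pass can surface. Read file-by-file, each half looks fine; read together, it's a cross-tenant data leak.

```typescript
import type { Request, Response, NextFunction } from "express";

// Stand-in for whatever data layer the service actually uses.
declare const db: {
  reports: { findMany(args: { where: { tenantId: string } }): Promise<unknown[]> };
};

// middleware/tenant.ts (hypothetical): copies a tenant ID straight from a request header.
export function tenantMiddleware(req: Request, _res: Response, next: NextFunction) {
  // No validation and no check against the authenticated user: the value is attacker-controlled.
  (req as Request & { tenantId?: string }).tenantId = req.header("x-tenant-id");
  next();
}

// controllers/reports.ts (hypothetical): lives elsewhere in the tree and trusts that value.
export async function getReports(req: Request, res: Response) {
  const tenantId = (req as Request & { tenantId?: string }).tenantId ?? "";
  // Looks reasonable in isolation; combined with the middleware, it lets any caller
  // read another tenant's reports by setting the x-tenant-id header.
  const reports = await db.reports.findMany({ where: { tenantId } });
  res.json(reports);
}
```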
One caveat: auto-compaction degrades quality in long sessions. When context gets compressed, the model works from summaries instead of actual code. I'll cover how to deal with this in the workflow section.
Jira MCP
A code review isn't just "does this code have bugs." It's "does this code match what was asked for." Loading the Jira ticket directly gives the model acceptance criteria, edge cases discussed in comments, and scope boundaries. This catches scope drift ("this PR adds a feature nobody asked for") and missing requirements ("the ticket says handle the offline case, but there's no offline handling here").
GH CLI
gh pr view and gh pr diff run directly from Claude Code's bash. The model gets full awareness of what changed, what files were touched, and the PR description. Combined with Jira context, it can cross-reference "what was requested" against "what was implemented."
Context7 MCP
This one eliminates an entire class of false positives. Without current docs, the model might flag a perfectly valid API call as "deprecated" because it was trained on older documentation. Context7 fetches the actual, current documentation for the library version you're using. Critical for fast-moving frameworks like Next.js, React, and anything in the TanStack ecosystem.
Figma MCP
For frontend reviews, this changed how I work. The Figma MCP pulls design context (component specs, spacing, colors, layout) directly from Figma files. During a review, the model can compare the implementation against the actual design. Does the spacing match? Are the right design tokens used? Is the hover state implemented? It turns "does this match the mockup?" from a manual squint into a structured check.
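As a hypothetical example of the kind of mismatch this surfaces (component name, values, and token names are invented), the implementation below uses near-miss hard-coded values where the design file specifies tokens:

```tsx
import type { ReactNode } from "react";

export function PrimaryButton({ children }: { children: ReactNode }) {
  return (
    <button
      // Design spec (hypothetical): padding 12px 24px, background token brand-600, radius token md.
      // The hard-coded values below are close enough to pass a quick visual check,
      // which is exactly why a structured comparison against the Figma file helps.
      style={{ padding: "10px 20px", background: "#2f6fed", borderRadius: 6 }}
    >
      {children}
    </button>
  );
}
```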
Curated Skills
Skills are structured context files that tell the model what "good" looks like for a specific domain. This is the highest-leverage piece of the stack. A recent paper on arXiv showed that curated context files outperform LLM-generated ones. Hand-picked, opinionated guidance beats auto-generated boilerplate every time.
I source my skills from skills.sh and customize them per stack (I covered how I use skills for development in my AI-powered Next.js workflow post). These aren't generic linting rules. They encode real architectural opinions, the kind of things a senior engineer would catch in review but a linter never would.
My Recommended Skills
| Skill | Focus |
|---|---|
| vercel-react-best-practices | Performance optimization from Vercel Engineering: hooks discipline, memoization, server/client component boundaries, rendering efficiency |
| vercel-composition-patterns | Scalable component APIs: compound components, render props, slot patterns. Catches boolean prop proliferation and under-composed trees |
| next-best-practices | App Router file conventions, RSC boundaries, async APIs, data fetching and caching, metadata, error handling, route handlers, image/font optimization |
| nodejs-backend-patterns | Production-ready backend services: middleware patterns, authentication, database integration, API design, error handling, graceful shutdown |
| nestjs-best-practices | Architecture patterns for production NestJS: module structure, dependency injection, guards/interceptors/pipes, exception filters, security |
Pick the ones matching your stack and customize them. The default skills from skills.sh are a solid starting point, but the real value comes from tuning them to your team's conventions.
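For a flavor of what these skills encode, here's a hypothetical before/after of the boolean-prop proliferation that vercel-composition-patterns flags, next to the compound-component shape it pushes toward (component and prop names invented):

```tsx
import type { ReactNode } from "react";

// Before: each new variant grows the prop list, and the boolean combinations multiply.
export type CardProps = {
  title: string;
  children: ReactNode;
  withIcon?: boolean;
  withFooter?: boolean;
  compact?: boolean;
};

// After: a compound-component API lets callers compose only what they need.
export function Card({ children }: { children: ReactNode }) {
  return <section className="card">{children}</section>;
}

Card.Header = function CardHeader({ children }: { children: ReactNode }) {
  return <header className="card-header">{children}</header>;
};

Card.Footer = function CardFooter({ children }: { children: ReactNode }) {
  return <footer className="card-footer">{children}</footer>;
};
```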
The AI Review Workflow: Two Phases
I split every review into two distinct phases. This isn't arbitrary. It's the most reliable way to get accurate findings from a model with a large context window.
Phase 1: Deep Read
```
Read deeply:
- Tickets: [links]
- PRs: [links]
- Analyze context, create review plan using skills: [list]
```
This is the "load everything" phase. The model reads tickets, diffs, and source files, then builds a structured review plan before writing any findings. Skills constrain what "good" looks like. Instead of reviewing against some vague internal standard, it reviews against specific, documented patterns.
Don't ask for findings yet. Ask for a plan.
This forces the model to organize its analysis before committing to conclusions.
One useful trick after this phase: ask "Which skills have you used?" The model will list which skills it actually loaded into context. If the skill you specified isn't in the list, the review is running without those constraints and the findings will be generic. Better to catch that early than to wonder why the output feels shallow.
Phase 2: Cross-Validation
```
Cross-validate findings:
- Re-read flagged files (actual code, not your summary)
- Verify each finding exists as described
- Remove false positives
- Rank: Critical > High > Medium > Low
- Each finding: file path, line range, issue, impact, suggested fix
```
The first pass catches things; the second pass asks "did I get that right?" Without cross-validation, you get hallucinated findings: the model confidently describes a bug in code that doesn't exist, or references a line number from a compacted summary that's drifted from the actual file.
The output format is intentional. File paths make findings actionable (click, navigate, verify). "Impact" prevents dismissal ("yeah but who cares"). Suggested fixes turn the audit from a list of complaints into a PR-ready action plan.
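To make the format concrete, a hypothetical finding (file path and line numbers invented) reads: High. src/api/reports.ts, lines 41-58. Issue: the tenant ID is taken from an unvalidated request header in middleware and trusted by the reports query. Impact: any caller can read another tenant's data. Suggested fix: derive the tenant ID from the authenticated session and reject requests where the header disagrees.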
Code Review Lessons Learned
1. Auto-compaction will bite you
In long sessions, Claude Code compresses earlier context to fit new information. This means the model might be working from a summary of the code it read, not the actual code. That's why cross-validation is a separate prompt: it forces a fresh read of the actual files.
For critical reviews, I start a new session for the validation pass entirely. Belt and suspenders.
2. Skills beat generic prompts
"Review this code for best practices" produces generic findings. "Review using the nodejs-backend-patterns skill" produces grounded, specific findings tied to documented patterns. Night and day difference.
Start with skills.sh, pick skills for your stack, and customize them as you learn what your team cares about.
3. Demand file paths and line numbers
A finding without a file path means wasted verification time. Demanding exact locations also acts as an honesty check: if the model can't point to one, the finding is likely hallucinated. The cross-validation prompt enforces this, but it's worth stating explicitly in Phase 1 too.
4. Let Context7 handle library docs
Don't assume the model knows current APIs. I've seen reviews flag perfectly valid Next.js 16 patterns as "deprecated" because the model's training data included Next.js 13 docs (cutoff dates matter more than you think). Context7 eliminates this entire category of false positives.
5. The human still makes the call
In my experience, audits consistently find real issues. But they also flag intentional design choices as problems: a deliberate use of any for a plugin interface, a "missing" validation that was handled by a middleware upstream. Cross-validation catches some of these, but not all.
Bottom line: the goal is to catch what humans miss, not to replace human thinking.
Use the audit as a high-quality starting point. Apply your own judgment.
What's Next
I'm packaging this two-phase workflow into a reusable Claude Code skill. The goal is a one-liner: Review [repo/PR] using code-review skill. It's not there yet, but close.
I'm also experimenting with multi-agent orchestration: separate Claude Code instances acting as PM reviewer, frontend reviewer, backend reviewer, and QA, each with different skills and focus areas, merging outputs into a single report. It's promising but the coordination overhead is real. More on that when I have something worth sharing.
Tooling in this space evolves faster than you can write about it, and that's the fun part. If you've built a review workflow that works for you, I'd love to hear about it.