52 AI Engineer Interview Questions for 2026 (with Scoring Rubric)
52 real AI engineer interview questions across LLM API integration, RAG and retrieval, agent frameworks, evaluation and observability, and production operations. With scoring rubric and what good answers look like. Used at Cubitrek to pre-vet every staff-augmentation candidate.


AI engineer interview questions in 2026 split into five buckets that map to what the role actually involves in production: LLM API integration, RAG and retrieval, agent frameworks, evaluation and observability, and production operations. This guide gives you 52 real questions across those buckets, with what good answers look like, ranked by seniority. We have run hundreds of AI engineer interviews at Cubitrek; the questions below are the ones that separate operators who ship from candidates who interview well.
How to actually run the interview
Run a 90-minute loop split across past work, system design, a code task, and the candidate's questions. The 52 questions below feed into the first two segments. We covered the full loop structure in Hire an AI Engineer in 2026; this post is the question bank you draw from.
The right pacing: ask 4-6 questions in 15 minutes during the system-design segment. The goal is not to test trivia. The goal is to hear whether the candidate has actually shipped these patterns.
Bucket 1: LLM API integration (10 questions)
For LLM engineers building product features on top of frontier-model APIs.
1) Which LLM providers have you shipped to production, and what was the fallback strategy?
Good answer: names at least two providers with concrete production experience. Describes a fallback that fires on API timeout, model regression, or quota exhaustion. Mentions provider-version-pinning to avoid silent model upgrades.
Red flag: "We just use OpenAI." Single-provider stacks fail when GPT has a bad week.
2) How do you handle rate limits and retries in production?
Good answer: exponential backoff with jitter, request queueing, per-tenant rate buckets. Mentions distinguishing transient failures (retry) from permanent failures (do not retry). Names a specific library or middleware.
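To make the retry pattern concrete, here is a minimal sketch of exponential backoff with full jitter that retries transient failures and surfaces permanent ones immediately. The exception classes and the `call_with_backoff` helper are hypothetical names for illustration, not part of any provider SDK; a real client would map timeouts and HTTP status codes onto them.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, HTTP 429/503)."""


class PermanentError(Exception):
    """Hypothetical marker for failures retrying cannot fix (HTTP 400/401)."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except PermanentError:
            raise  # retrying will not help; surface immediately
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts
            # Full jitter: sleep a random duration up to the capped exponential delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

The `sleep` parameter is injectable so the function can be tested without waiting. Request queueing and per-tenant buckets would sit in front of this helper and are not shown.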
3) What's your strategy for prompt versioning and deployment?
Good answer: prompts are versioned alongside code, deploys go through code review, every prompt change runs against the eval suite before merge. Some candidates use prompt-specific tooling (LangSmith, Helicone, PromptLayer); fine if they explain why.
4) How do you decide between GPT-4o, Claude 4, Gemini, and an open-source model for a specific task?
Good answer: names criteria (latency budget, cost per query, accuracy on the eval set, context window, structured output reliability). Avoids absolutes; picks per task. Senior candidates name specific cost-per-million numbers for the major providers.
5) Walk me through a production LLM incident you triaged in the last 6 months.
Good answer: concrete story with timeline, blast radius, root cause, fix, postmortem. Mentions monitoring that caught it or, honestly, did not catch it. Junior candidates say "we had a hallucination problem once."
6) How do you enforce structured output from an LLM?
Good answer: JSON schema with validation, function calling or tool use APIs, retry on parse failure, fallback to a more reliable model. Some mention specific libraries (Instructor, Outlines, Pydantic AI). Senior candidates also discuss the latency tradeoff vs raw text generation.
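The validate-and-retry part of that answer can be sketched with the standard library alone. This is an illustrative outline, assuming a hypothetical `call_model(prompt)` function that returns raw text; libraries such as Instructor or Pydantic AI do considerably more than this.

```python
import json


def parse_structured(raw, required):
    """Parse raw model text as JSON and check required keys and their types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"wrong type for key: {key}")
    return data


def generate_structured(call_model, prompt, required, max_attempts=3):
    """Call the model until its output validates, feeding the error back each time."""
    last_error = None
    for _ in range(max_attempts):
        suffix = "" if last_error is None else f"\nPrevious output was invalid: {last_error}"
        raw = call_model(prompt + suffix)
        try:
            return parse_structured(raw, required)
        except ValueError as exc:
            last_error = exc
    raise ValueError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Each retry is another full model call, which is where the latency tradeoff senior candidates mention comes from.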
7) What's the difference between temperature, top-p, and top-k? When do you change them?
Good answer: temperature controls how flat or peaked the token probability distribution is, top-p restricts sampling to the smallest set of tokens whose cumulative probability reaches p, and top-k restricts sampling to the k most likely tokens. Most production code locks temperature at 0 or 0.1 for deterministic tasks and raises it for creative tasks. Junior candidates confuse the three.
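A small sketch shows how the three knobs act on the same set of logits. The function name and the order of operations (temperature, then top-k, then top-p) are assumptions for illustration; providers and inference libraries differ in the details.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token_index: probability} after temperature, top-k, then top-p filtering."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for prob, idx in ranked:
            kept.append((prob, idx))
            cumulative += prob
            if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
                break
        ranked = kept
    mass = sum(prob for prob, _ in ranked)
    return {idx: prob / mass for prob, idx in ranked}  # renormalise the survivors
```

Lowering the temperature concentrates probability on the top token, while top-k and top-p only remove tokens from the tail before renormalising.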
8) How do you handle PII in prompts going to a third-party LLM?
Good answer: PII detection at the input layer (Presidio, Microsoft's data loss prevention tooling, or a custom NER pipeline), redaction or tokenisation before the LLM call, and optionally a provider tier that offers a zero-data-retention API mode (Anthropic, OpenAI enterprise tier). Compliance-minded candidates mention HIPAA BAAs or GDPR DPAs.
9) What does an LLM API call cost in your production system per request?
Good answer: a number, with the math. Junior candidates do not know. Senior candidates have a dashboard that tracks it per feature and per tenant.
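The math a candidate should be able to do on a whiteboard is short. In this sketch the prices are hypothetical placeholders, not any provider's actual rates; substitute current numbers from your provider's pricing page.

```python
def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one call given token counts and prices in USD per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices for illustration only (assumed, not real rates).
EXAMPLE_INPUT_PRICE = 3.00    # USD per million input tokens
EXAMPLE_OUTPUT_PRICE = 15.00  # USD per million output tokens

# 2,000 input tokens and 500 output tokens:
# (2000 * 3.00 + 500 * 15.00) / 1,000,000 = 0.0135 USD per request.
cost = request_cost_usd(2_000, 500, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE)
```

Multiplying that figure by requests per day, per feature and per tenant, gives the dashboard numbers senior candidates tend to quote.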
10) Have you fine-tuned a model? When and why?
Good answer: honest about when fine-tuning beats prompting and when it does not. Fine-tuning is the right answer for style consistency, structured output, latency reduction (smaller model fine-tuned on a narrower task), and cost reduction at high volume. Wrong answer for "we need the model to know our new product launch" (that is a RAG problem).
Bucket 2: RAG and retrieval (12 questions)
For engineers building anything grounded on proprietary data.
11) Walk me through your RAG architecture end to end.
Good answer: document ingestion, chunking strategy, embedding model, vector store, retrieval (hybrid or dense), re-ranking, prompt assembly, response generation, evaluation. Names a vector DB (Pinecone, Weaviate, Qdrant, pgvector, Vespa). Senior candidates also name a hybrid search pattern.
12) What's your chunking strategy and why?
Good answer: depends on document type. Fixed-size with overlap for unstructured prose, semantic chunking for documents with clear sections, parent-document retrieval for documents where you need both narrow chunks for search and broader context for answering. Senior candidates name specific token sizes (150-300 for retrieval chunks in 2026, down from 300-500 in 2024).
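Fixed-size chunking with overlap, the simplest of those strategies, can be sketched as follows. The function operates on a pre-tokenised list; how you tokenise (provider tokenizer, whitespace split) is left as an assumption, and the default sizes are illustrative.

```python
def chunk_with_overlap(tokens, chunk_size=200, overlap=40):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks
```

For example, `chunk_with_overlap(text.split(), chunk_size=200, overlap=40)` yields chunks whose last 40 tokens repeat at the start of the next chunk, so a sentence cut at a boundary still appears whole in one of them.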
13) Why hybrid search instead of pure dense retrieval?
Good answer: dense vectors miss exact-match queries (SKUs, error codes, named entities). BM25 catches them. Hybrid combines both signals via reciprocal rank fusion or weighted scoring. See hybrid search optimization for the full pattern.
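Reciprocal rank fusion is simple enough to sketch in a few lines. The document IDs below are made up, and `k=60` is the commonly used smoothing constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; each list adds 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["sku-123", "doc-a", "doc-b"]    # keyword ranking (hypothetical)
dense_hits = ["doc-a", "doc-c", "sku-123"]   # vector ranking (hypothetical)
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that appear in both lists rise to the top, while the exact-match hit that only BM25 ranked first still stays near the top instead of being lost.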
14) Which embedding model do you use and why?
Good answer: OpenAI text-embedding-3-large for general English, Voyage voyage-large-2 for cost-sensitive workloads, BGE-M3 for self-hosted multilingual. Domain-tuned (FinBERT for finance, ClinicalBERT for medical) for narrow verticals. Junior candidates use ada-002 because they have not updated since 2023.
15) How do you evaluate retrieval quality?
Good answer: labeled relevance dataset, metrics like NDCG, recall at K, MRR, plus end-to-end answer-quality evaluation. Mentions the chicken-and-egg problem of building the labeled set in the first place and how they solved it. Senior candidates measure retrieval quality separately from generation quality so they can debug each independently.
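For reference, the three retrieval metrics named above can be computed per query like this. The sketch assumes binary relevance labels and a retrieved list without duplicates; averaging across queries (which turns reciprocal rank into MRR) is left to the caller.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Running these on retrieval output alone, before any generation step, is what lets you debug the retriever independently of the model.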
16) What happens when your knowledge base updates?
Good answer: re-embedding strategy (full re-index, incremental updates, change-data-capture), version control on the document corpus, A/B testing the new index against the old. Some candidates mention strategies for handling deletions (tombstones, sparse re-indexing).
17) How do you handle long documents that exceed the context window?
Good answer: parent-document retrieval, multi-step retrieval (retrieve broad first, narrow second), recursive summarisation, or long-context models for the final synthesis. Senior candidates pick per workload.
18) What's late-interaction retrieval and when do you use it?
Good answer: token-level retrieval (ColBERTv2, JaColBERT, ColPali for visual RAG) where each query token interacts with each document token independently. Better grounding than chunk-level retrieval. Used when you need extreme accuracy and can afford 1.5-2x the compute cost of dense-only.
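The scoring step behind late interaction can be illustrated with the MaxSim operator used by ColBERT-style models: each query token takes its best match among the document tokens, and the per-token maxima are summed. The toy two-dimensional vectors below are made up; real systems use learned, normalised token embeddings and an index to avoid scoring every document.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the max similarity with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]   # two query token embeddings (toy values)
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]   # covers only the first query token
```

Here `maxsim_score(query, doc_a)` is 2.0 and `maxsim_score(query, doc_b)` is 1.0. The extra cost comes from storing and comparing one vector per token rather than one per chunk.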
19) How do you handle multi-modal RAG (text plus images plus video)?
Good answer: multi-modal embedding models (CLIP, SigLIP, Cohere Embed v3), separate retrieval paths for each modality, fusion at the re-ranking stage. Or a single multi-modal embedding space. Honest about the tradeoff: multi-modal is harder to evaluate and tune than text-only.
20) How do you prevent your RAG system from hallucinating?
Good answer: (a) require citations in the response, (b) verify citations exist in the retrieved context with a separate model or rule, (c) refuse to answer if retrieval confidence is below a threshold, (d) eval the system on adversarial prompts that try to elicit hallucinations. The honest answer includes "you cannot prevent it completely; you reduce the rate to acceptable levels and design fallbacks."
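Step (b) can start as a plain rule before you reach for a second model. This sketch assumes a hypothetical citation format in which the answer cites chunk IDs in square brackets, such as `[doc-12]`; it only checks that cited IDs were actually retrieved, not that the cited text supports the claim.

```python
import re


def check_citations(answer, retrieved_ids):
    """Flag answers that cite nothing or cite chunk IDs that were never retrieved."""
    cited = set(re.findall(r"\[(\w[\w-]*)\]", answer))
    missing = cited - set(retrieved_ids)
    return {"cited": cited, "missing": missing, "ok": bool(cited) and not missing}


report = check_citations(
    "Refunds take 5 days [doc-12]. A fee applies [doc-99].",
    retrieved_ids={"doc-12", "doc-3"},
)
```

In this example `report["ok"]` is false because `doc-99` was never in the retrieved context, so the response would be routed to a fallback or refusal path.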
21) What vector database have you used in production and why?
Good answer: a concrete choice with reasoning. Pinecone for managed simplicity, Weaviate for hybrid search and self-host, Qdrant for cost, pgvector for "we already have Postgres," Vespa for >1B vector scale.
22) How big is your largest production RAG index?
Good answer: a number, plus the operating cost. Junior candidates do not know either. Senior candidates know both and have a story about how they scaled it.
Bucket 3: Agent frameworks (10 questions)
For engineers building autonomous agents.
23) Which agent frameworks have you shipped to production?
Good answer: at least two of LangChain, CrewAI, AutoGen, OpenClaw, or hand-rolled. Picks per workload. Honest about where each one breaks.
24) What's the difference between an agent and a workflow?
Good answer: a workflow follows scripted branches. An agent makes decisions about its next step based on context. The line is fuzzy; what matters is reasoning at runtime vs deterministic routing. Senior candidates also name when a workflow is actually the right answer (deterministic logic does not need an LLM).
25) How do you handle multi-turn state in an agent?
Good answer: checkpoint-based state (LangGraph), conversation memory with summarisation, or external state in a database. Discusses tradeoffs between context-window memory (cheap, capped by token limit) and external memory (richer, slower).
26) What's Model Context Protocol and when do you use it?
Good answer: Anthropic's open standard for letting agents discover and call external tools and resources over JSON-RPC. Used when your agent needs to call third-party services or when you want to expose your services to other agents. See Best MCP Servers in 2026 for the ecosystem.
27) How do you prevent prompt injection in an agent that processes user input?
Good answer: multi-layer defence. Input sanitisation, separating system prompts from user content via structured templates, tool-use guardrails that prevent the agent from calling destructive tools without explicit confirmation, output filtering. Honest that no defence is perfect.
28) Walk me through your multi-agent orchestration pattern.
Good answer: supervisor pattern (one agent picks which specialist agent to call), parallel execution (multiple agents work on separable subtasks), or sequential pipelines (agent A passes to agent B). Names a specific implementation (LangGraph supervisor, CrewAI crew, custom code). Senior candidates discuss when multi-agent is actually worse than a single well-prompted agent.
29) How do you debug an agent that is misbehaving in production?
Good answer: tracing tooling (LangSmith, Langfuse, Phoenix, OpenTelemetry), per-step reasoning logs, replay capability, eval set of the failure modes. Distinguishes between "the LLM picked the wrong tool" and "the tool worked but the LLM misinterpreted the output."
30) How do you handle tool-call failures?
Good answer: retry with exponential backoff for transient failures, graceful degradation for permanent failures, a fallback path that escalates to a human, explicit logging so the failure mode is visible in postmortems.
31) What's the cost profile of an agent vs a single LLM call?
Good answer: agents typically cost 5-30x more per task than a single LLM call because the reasoning loop calls the LLM multiple times. Names cost-optimisation strategies (smaller models for reasoning, larger models for synthesis, caching repeated tool calls, capping max iterations).
32) When have you decided NOT to build an agent?
Good answer: a real story where the candidate looked at the workload and said "this should be deterministic code with one LLM call, not an agent." Self-awareness about when agents are overkill. Junior candidates think every problem is an agent problem.
Bucket 4: Evaluation and observability (10 questions)
The discipline that separates production engineers from prototype builders.
33) How do you measure whether a prompt change improved your AI feature?
Good answer: labeled eval set, automated scoring (LLM-as-judge, exact match, custom rubric depending on the task), gate the deploy on the score. Junior candidates say "I tested it manually."
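A deploy gate of this kind can be as small as two functions. The sketch uses exact match as the scorer, and the threshold values are illustrative assumptions; in practice the scorer and thresholds depend on the task.

```python
def exact_match_score(predictions, labels):
    """Share of predictions that match the label, ignoring case and outer whitespace."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and the same length")
    matches = sum(p.strip().lower() == l.strip().lower() for p, l in zip(predictions, labels))
    return matches / len(labels)


def passes_gate(candidate_score, baseline_score, min_score=0.85, max_regression=0.01):
    """Allow the deploy only above an absolute floor and without a real regression."""
    return candidate_score >= min_score and candidate_score >= baseline_score - max_regression
```

Wired into CI, `passes_gate` compares the new prompt's eval score with the score of the prompt currently in production and fails the build when it returns false.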
34) How big is your eval set and how did you build it?
Good answer: 200-2,000 examples for most workloads. Built from real production data with manual labelling, synthetic data generation for edge cases, adversarial examples for failure modes. Senior candidates describe a continuous labelling pipeline where new production examples become eval candidates.
35) What's LLM-as-judge and when does it work?
Good answer: using a strong LLM to score outputs from another LLM. Works for subjective quality dimensions (helpfulness, tone, completeness) where rule-based scoring fails. Does not work for factual accuracy without grounding. Calibrated against human labels to verify the judge's reliability.
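One common way to calibrate a judge against human labels is a chance-corrected agreement statistic such as Cohen's kappa. The sketch below is one option among several, with made-up labels; the source does not prescribe a specific statistic.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Agreement between judge and human labels, corrected for chance agreement."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts, human_counts = Counter(judge_labels), Counter(human_labels)
    categories = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[c] * human_counts[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(judge, human)  # observed 0.75, expected 0.5, kappa 0.5
```

Raw agreement of 75 percent looks fine here, but the kappa of 0.5 shows that half of it could be chance, which is the kind of gap calibration is meant to expose.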
36) How do you handle model-version regressions?
Good answer: pin model versions in production, eval the new version against the same test set before upgrading, A/B the new version on a small traffic slice, monitor key metrics during rollout. Senior candidates also discuss model deprecation timelines from the major providers.
37) What metrics do you track in production for an LLM feature?
Good answer: latency (p50, p95, p99), cost per request, error rate, eval score on a continuous eval sample, user-reported quality (thumbs up/down, escalation rate), and business metric (conversion, deflection, task completion). Junior candidates only track latency and error rate.
38) How do you monitor for drift in production?
Good answer: compare current production output distribution against a baseline (embedding-space drift, classifier-based drift, simple metric drift). Alert on rate-of-change. Some candidates mention input drift (the user inputs themselves are changing) separately from output drift.
39) What's your observability stack for LLM features?
Good answer: tracing (LangSmith, Langfuse, Phoenix, Helicone), metrics (Prometheus, Datadog), logs (whatever your existing stack uses). Senior candidates discuss the cost of capturing full traces in production and how they sample.
40) How do you handle PII in your eval set?
Good answer: synthetic generation or PII redaction at the labelling stage, separate compliance tier for the eval set, access controls. Junior candidates have not thought about it.
41) Walk me through evaluating a RAG system specifically.
Good answer: retrieval quality (NDCG, recall at K, MRR) measured independently from generation quality (faithfulness, answer relevance, completeness). End-to-end task success measured separately again. Three distinct eval surfaces, three distinct fix paths.
42) How do you build an eval set when you have no labels yet?
Good answer: bootstrap with manual labelling of 50-100 real production examples, augment with synthetic data, validate the synthetic data with human spot-check, expand the labelled set over time as production traffic grows. Senior candidates also discuss the failure mode of synthetic-data-only eval sets.
Bucket 5: Production operations (10 questions)
The discipline you discover you need after the first production incident.
43) What is your incident response process for an LLM feature in production?
Good answer: pager rotation, runbook, model-version rollback capability, kill-switch to disable the feature, postmortem template. Names a real incident and walks through the response.
44) How do you handle a model provider's API outage?
Good answer: multi-provider fallback (provider A primary, provider B fallback), circuit breaker that detects the outage, graceful degradation (cached responses, simpler heuristic answers, "service degraded" message). Senior candidates have a tested fallback that has actually fired in production.
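The circuit breaker plus fallback can be sketched as below. `primary` and `fallback` are hypothetical callables that wrap two providers, and the thresholds are illustrative; a production version would also need thread safety and per-error-type handling.

```python
import time


class CircuitBreaker:
    """Open after repeated failures, then allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Once the cool-down has passed, permit a trial call to the primary.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open (or re-open) the circuit


def complete(prompt, primary, fallback, breaker):
    """Try the primary provider unless the breaker is open; otherwise use the fallback."""
    if breaker.allow():
        try:
            result = primary(prompt)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback(prompt)
```

While the circuit is open, requests skip the failing provider entirely, which avoids paying its timeout on every call during an outage.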
45) What's your strategy for cost runaway?
Good answer: per-user and per-tenant rate limits, query cost estimation before LLM call, budget alerts at multiple thresholds, hard cap at runtime. Mentions a real incident where cost would have run away without the cap.
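A runtime hard cap with alert thresholds might look like the following sketch. The class name, cap, and alert fractions are assumptions for illustration; a real system would persist spend, reset it per billing period, and page someone when an alert fires.

```python
class BudgetGuard:
    """Track estimated spend per tenant, record threshold alerts, enforce a hard cap."""

    def __init__(self, hard_cap_usd, alert_fractions=(0.5, 0.8)):
        self.hard_cap_usd = hard_cap_usd
        self.alert_fractions = sorted(alert_fractions)
        self.spent = {}
        self.alerts = []

    def authorize(self, tenant, estimated_cost_usd):
        """Return True and record the spend if the call fits under the tenant's cap."""
        before = self.spent.get(tenant, 0.0)
        after = before + estimated_cost_usd
        if after > self.hard_cap_usd:
            return False  # hard cap: refuse the call before it reaches the LLM
        for fraction in self.alert_fractions:
            threshold = fraction * self.hard_cap_usd
            if before < threshold <= after:
                self.alerts.append((tenant, fraction))  # crossed this threshold just now
        self.spent[tenant] = after
        return True
```

The important property is that the check runs before the LLM call using an estimated cost, so a runaway loop is stopped by the cap rather than discovered on the invoice.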
46) How do you handle context-window overflow in production?
Good answer: token counting before the LLM call, truncation strategy that preserves the most important content, summarisation of older context, fallback to a model with a larger context window. Junior candidates wait for the API error and handle it after.
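Counting before the call and truncating oldest-first can be sketched like this. The whitespace-based `count_tokens` is a stand-in assumption for the provider's tokenizer, and keeping the system prompt plus the most recent messages is just one possible definition of "most important content."

```python
def count_tokens(text):
    """Assumption: whitespace word count stands in for the provider's tokenizer."""
    return len(text.split())


def fit_to_budget(system_prompt, messages, max_tokens):
    """Keep the system prompt and as many of the newest messages as fit the budget."""
    budget = max_tokens - count_tokens(system_prompt)
    if budget < 0:
        raise ValueError("system prompt alone exceeds the token budget")
    kept = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        budget -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order
```

The dropped messages are the natural input for the summarisation step mentioned above, so older context is compressed rather than lost outright.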
47) How do you do red-teaming on LLM features?
Good answer: adversarial test set (prompt injection attempts, jailbreaks, off-topic queries, malicious inputs), automated red-team prompts generated by another LLM, manual security review for high-stakes features. Senior candidates separate red-teaming for safety (does the model say bad things) from red-teaming for security (can attackers exfiltrate data or hijack the agent).
48) What's your strategy for caching LLM responses?
Good answer: semantic cache (cache by embedding similarity, not exact-match) for repeated similar queries, response cache for deterministic prompts, no caching for personalised or stateful queries. Discusses cache invalidation strategy.
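A semantic cache reduces to a nearest-neighbour lookup with a similarity threshold. In this sketch `embed` is a hypothetical function that returns an embedding vector for a query, the threshold is illustrative, and the linear scan would be replaced by a vector index at any real scale. Invalidation is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored response when a new query is close enough to a cached one."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # hypothetical: maps a query string to a vector
        self.threshold = threshold
        self.entries = []           # list of (vector, response) pairs

    def get(self, query):
        vector = self.embed(query)
        best_score, best_response = 0.0, None
        for stored_vector, response in self.entries:
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

Setting the threshold too low returns answers to questions the user did not ask, so it is worth tuning against a labelled set of query pairs.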
49) How do you handle on-call for an AI feature?
Good answer: on-call engineer, runbook for common incidents (model regression, API outage, cost runaway), escalation path, postmortem culture. Distinguishes between "the model is wrong" (probably not pageable) and "the model is unreachable" (pageable).
50) What's the difference between an LLM-feature outage and a regular software outage?
Good answer: LLM features usually fail softly (slower responses, lower quality) rather than as hard outages, which makes the failures harder to see. The hard part is detecting quality degradation, not detecting outages. Quality degradation needs continuous eval in production, not just uptime monitoring.
51) How do you handle PII or sensitive data leaving your infrastructure to a third-party LLM?
Good answer: enterprise tier APIs with zero data retention, on-premise or VPC deployment for the most sensitive workloads, PII detection and redaction at the application layer before the LLM call, contractual coverage (BAAs for HIPAA, DPAs for GDPR). Honest about what each option costs and what residual risk remains.
52) What part of your production AI feature scares you most?
Good answer: a real concrete answer. "Cost runaway if a customer hits us with a long-tail of unusual queries." Or "Model deprecation on three months notice that breaks our eval baseline." Or "Prompt injection through user input that exfiltrates internal context." Self-aware about specific risks; not blanket reassurance.
How to score the answers
A scoring rubric we use at Cubitrek across the 52 questions above:
- 3 points if the candidate gives a specific answer grounded in concrete production experience (named tools, real numbers, real stories).
- 2 points if the candidate gives a correct general answer but cannot tie it to a specific production case.
- 1 point if the candidate gives a partial answer that hints at the right concept but misses key details.
- 0 points if the candidate guesses or admits they have not done it.
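Senior AI engineers should score 90+ out of 156 (about 58 percent) across the full 52-question bank. Below 80 they are mid-level. Below 60 they are junior dressed up. If you ask only a sample of 20 questions from the buckets that match the role, the maximum is 60 points, so scale the thresholds proportionally (roughly 35, 31, and 23).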
The non-obvious thing this scoring catches: candidates who have read the documentation but never shipped. They score 2s on most questions (correct general answer) and 0s on the "walk me through a production incident" questions because they cannot make up a story they have not lived.
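Tallying the rubric is simple arithmetic, sketched below. The function name and the sample scores are made up for illustration; the only rule taken from the rubric is that each question scores 0 to 3.

```python
def score_interview(scores):
    """Total a list of per-question rubric scores (each 0, 1, 2, or 3)."""
    if any(s not in (0, 1, 2, 3) for s in scores):
        raise ValueError("each score must be 0, 1, 2, or 3")
    total = sum(scores)
    maximum = 3 * len(scores)
    percent = round(100 * total / maximum, 1) if maximum else 0.0
    return {"total": total, "max": maximum, "percent": percent}


# Hypothetical 20-question interview: ten 3s, six 2s, two 1s, two 0s.
sample = score_interview([3] * 10 + [2] * 6 + [1] * 2 + [0] * 2)
```

Working in percent of the maximum keeps results comparable whether you ask 20 questions or the full bank.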
Frequently asked questions
1) How long should an AI engineer technical interview be?
90 minutes for a senior loop. 60 minutes for mid-level. Split into past-work walkthrough (15 min), system design (30 min), code task (30 min), candidate questions (15 min). Skip any segment that the candidate's resume already proves they can pass.
2) Should I ask coding or system-design questions for AI engineers?
Both, weighted toward system design at senior level. AI engineering is heavily about architecture, eval, and operating LLM features at scale, not about implementing transformer attention from scratch. Use a 30-minute applied coding task (build a classifier, evaluate it, output a confusion matrix) rather than a Leetcode-style algorithm round.
3) What's the single most important AI engineer interview question?
"Walk me through a production AI feature you shipped in the last 12 months, including the eval methodology and the biggest production failure." This question alone filters out 60-70 percent of candidates who have not actually shipped. Lead with it. Save time on candidates who cannot answer it concretely.
4) How do I interview for an AI engineer when I am not an AI engineer myself?
Pair with a senior engineer (internal or contracted) for the technical interview. You run the past-work walkthrough and the candidate questions; they run the system design and code task. Or use a staff-augmentation provider that has already done the technical vetting (Cubitrek runs the 52 questions above against every candidate before they reach you).
5) Should I require an AI engineer to know all 5 sub-specialisations (LLM, RAG, agents, prompt engineering, MLOps)?
No. A senior AI engineer is typically deep on 2-3 of the 5 and competent on the others. Most teams need LLM-plus-RAG or LLM-plus-agents; pure single-specialisation engineers are rare. Write the job spec around the 2-3 sub-specialisations your workload actually needs.
6) Are these AI engineer interview questions hire-or-no-hire on individual questions?
No, the scoring is aggregate. A senior candidate can miss 2-3 of the 20 questions you ask and still be a strong hire. The question is whether the overall score signals real production experience or just confident reading.
7) What if the candidate uses AI assistants during the coding task?
Let them. Production AI engineers use Cursor, Claude, and ChatGPT every day. What you are testing is whether they make good engineering decisions, not whether they can write code from memory. The risk is candidates who only know how to prompt an assistant, not how to debug the assistant's output. Watch for that during the task.
8) Does Cubitrek pre-vet AI engineers using these questions?
Yes. Every engineer in our staff-augmentation pool has passed a technical interview built from the question bank above. You skip the loop and interview only candidates who have scored top quartile against this rubric. See the staff augmentation program for how it works.
Want this run for you?
Cubitrek's staff augmentation program ships pre-vetted senior AI, ML, RAG, and agent-framework engineers in 7 days. We run the 52 questions above against every candidate before they reach you. $2,000 to $5,000 per month per engineer, no recruiter margin, replace anytime. Talk to a delivery lead via contact to scope the role.
Key takeaways
- Eval discipline is the fastest filter. Candidates who cannot describe their labeled test set methodology have not shipped to production.
- Senior AI engineers should be deep on 2-3 of the 5 sub-specialisations, not all 5. Pure single-specialisation candidates are rare.
- Code tasks at senior level should be applied tasks (build a classifier, eval it, output confusion matrix) not Leetcode algorithm rounds.
- Let candidates use AI assistants during the coding task. Production AI engineers use them daily. Watch for candidates who only know how to prompt, not how to debug the output.
- Past-work walkthrough is more diagnostic than synthetic system design. Real shipped features have stories; bluffers do not.

Faizan Ali Khan
Founder of Cubitrek. Ships agentic AI systems that automate sales, marketing, and operations for SaaS, e-commerce, and real estate companies. Coined the term 'single-player agency' in 2026.
