
AI Engineering

Choosing the right model. Building beyond the prompt.

Most of what I ship runs on Anthropic's Claude because, on the work I do, it's the strongest model for nuanced reasoning today. That doesn't make me a Claude developer. It makes me an engineer who picks the right model for the job. Sometimes that means a different API. Sometimes that means a vector database doing the retrieval. Sometimes that means running an open-weight model locally. This page is the technical depth behind those choices.

Why this page exists

The AI engineer job is more than prompt-writing.

The market is awash with people who can write a prompt. The actual engineering work is everything before and after the prompt. Choosing which model. Feeding the right context. Validating the output. Catching the failure modes. Keeping the cost down.

What follows is the technical work I bring to a build, beyond the prompts you'd write into ChatGPT.

Capability 01

Model selection.

The right model depends on the job. Claude Sonnet leads on nuanced reasoning and tool use. Claude Haiku is fast and cheap for routing and classification. GPT-4.1 sometimes wins on long-context exact-match tasks. Llama 3 and Mistral are credible open-weight options where you need to run on your own infrastructure. Gemma is what I reach for when I'm prototyping locally with Ollama. Qwen and DeepSeek are worth watching for coding-heavy workloads.

Most production builds I deliver today are Claude builds. That's a deliberate choice based on output quality and tool-use reliability, not a default. I review which model fits each project at scoping, not after writing code against the wrong API.
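As a rough illustration, the scoping decision above can be written down as a lookup table. This is a minimal sketch, not my production tooling: the task labels, the `pick_model` helper and the model strings are all hypothetical, and the strings are family names for illustration rather than real API model IDs.

```python
# Hypothetical routing table reflecting the preferences described above.
# Values are family labels for illustration, not real API model identifiers.
ROUTES = {
    "reasoning": "claude-sonnet",
    "tool_use": "claude-sonnet",
    "classification": "claude-haiku",
    "routing": "claude-haiku",
    "long_context_exact_match": "gpt-4.1",
    "local_prototype": "gemma via Ollama",
}


def pick_model(task_type: str, must_self_host: bool = False) -> str:
    """Return a model family for a task; self-hosting overrides everything."""
    if must_self_host:
        return "open-weight (Llama, Mistral, Qwen or Gemma)"
    # Assumption: fall back to Claude Sonnet when the task type is unknown.
    return ROUTES.get(task_type, "claude-sonnet")
```

The point of the table is that the choice is explicit and reviewable at scoping, rather than buried in whichever SDK happened to be imported first.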

Capability 02

Retrieval-augmented generation (RAG).

An AI model trained on the public internet doesn't know your contracts, your policies or your client history. RAG is how you give it that knowledge without retraining anything. The model does the language work. A retrieval system finds the relevant chunks of your documents, hands them to the model in context, and the answer comes back grounded in what was retrieved.
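The retrieve-then-answer flow can be sketched in a few lines. This is a toy version under stated assumptions: the three-number vectors stand in for real embeddings, the in-memory list stands in for a vector store, and `retrieve` and `build_prompt` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=2):
    """index is a list of (chunk_text, vector) pairs; return the top-k texts."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Put the retrieved chunks in context and ask the model to stay grounded."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy vectors standing in for real embeddings.
index = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Office hours are 9 to 5.", [0.0, 0.2, 0.9]),
    ("Refund requests need a receipt.", [0.8, 0.3, 0.1]),
]
top = retrieve([1.0, 0.2, 0.0], index, k=2)
prompt = build_prompt("How long do refunds take?", top)
```

Only the two refund chunks reach the prompt. The office-hours chunk never enters the context window, which is what keeps the answer grounded and the token bill small.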

The interesting work isn't the RAG concept. It's the engineering choices around it. Embedding model selection. Chunk size and overlap. Whether to use Postgres pgvector for small corpora or a dedicated store like Qdrant for larger ones. Hybrid search combining vector similarity with keyword match. Reranking the top results before they hit the model. Each choice changes accuracy and cost.
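Two of those choices are easy to show in code. The sketch below assumes word-based chunking (real builds often chunk by tokens or by document structure) and uses reciprocal rank fusion as one common way to merge a vector ranking with a keyword ranking. The function names and default values are illustrative assumptions, not tuned settings.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk so sentences on a boundary are not lost."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.
    Each list contributes 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a vector search and a keyword search.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that appear in both lists rise to the top of the fused ranking, which is the behaviour hybrid search is after. A reranker would then reorder only this short fused list before anything is sent to the model.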

Where it fits: policy lookup tools, contract Q&A, internal knowledge search, any workflow where the answer is somewhere in a document the model has never seen.

Capability 03

Private and on-premises AI.

For sensitive workloads the data can't leave the client's infrastructure. Healthcare records, legal documents under privilege, trade secrets, financial models. The fix is to run an open-weight model on infrastructure the client controls.

The standard stack: Ollama for development and small deployments, vLLM for larger ones, open-weight models from the Llama, Mistral, Qwen or Gemma families. I run Gemma locally with Ollama for prototyping and experimentation. I haven't deployed an on-prem model to production yet. Most client builds today don't need to leave a managed cloud, and the answer is usually a properly scoped data-processing agreement with Anthropic or OpenAI plus zero-retention mode rather than a self-hosted model. Where compliance does require it, the stack above is the plan.
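For the local prototyping case, a call to Ollama looks roughly like this. The sketch assumes an Ollama server running on its default local port with a Gemma model already pulled. The model tag `gemma3` is a placeholder for whichever tag is installed, and `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

# Ollama's default local endpoint for single-prompt generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="gemma3"):
    """Build the HTTP request; `model` is a placeholder tag, adjust to your install."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="gemma3", timeout=60):
    """Send the prompt to the local server and return the generated text.
    Requires a running Ollama instance; nothing leaves the machine."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Because the request only ever targets localhost, the prompt and the documents in it stay on hardware the client controls.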

Capability 04

What makes the difference: evals, observability, chaining.

A working prototype isn't a production build. The gap is in three things.

Evaluation. A regression test suite for AI outputs that catches when a prompt change breaks a previously-working case. Without this you're flying blind every time you tweak the system.
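A regression suite of this kind can start very small. The sketch below is an assumption about shape, not a specific framework: `EvalCase`, `run_suite` and the phrase-matching check are hypothetical, and `fake_model` stands in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    must_contain: list[str]  # phrases a correct answer has to include


def run_suite(model: Callable[[str], str], cases: list[EvalCase]) -> list[str]:
    """Return the names of cases whose output misses a required phrase."""
    failures = []
    for case in cases:
        output = model(case.prompt).lower()
        if not all(phrase.lower() in output for phrase in case.must_contain):
            failures.append(case.name)
    return failures


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    if "refund" in prompt.lower():
        return "Refunds are issued within 14 days."
    return "I don't know."


CASES = [
    EvalCase("refund_window", "How long do refunds take?", ["14 days"]),
    EvalCase("office_hours", "When is the office open?", ["9 to 5"]),
]
failures = run_suite(fake_model, CASES)
```

Here `failures` lists `office_hours`, which is the signal you want before a prompt change ships. Phrase matching is the crudest possible check. Real suites usually add structured-output validation or a graded rubric on top.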

Observability. Every model call logged, traced and inspectable, so when an answer is wrong you can see exactly which prompt, which context window, which retrieved chunks produced it.
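One way to get there is to wrap every model call so it leaves a trace record behind. This is a minimal sketch under assumptions: `traced_call` is a hypothetical helper, the record fields are illustrative, and the `sink` list stands in for a log file or tracing backend.

```python
import json
import time
import uuid


def traced_call(model, prompt, chunk_ids, sink):
    """Call `model(prompt)` and append one JSON trace record to `sink`,
    whether the call succeeds or raises."""
    started = time.perf_counter()
    record = {
        "trace_id": str(uuid.uuid4()),
        "prompt": prompt,
        "chunk_ids": list(chunk_ids),  # which retrieved chunks were in context
    }
    try:
        record["output"] = model(prompt)
        record["error"] = None
        return record["output"]
    except Exception as exc:
        record["output"] = None
        record["error"] = repr(exc)
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - started) * 1000, 2)
        sink.append(json.dumps(record))
```

With the prompt, the chunk IDs and the output in one record, a wrong answer can be traced back to either bad retrieval or bad generation without guessing.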

Chaining. Deterministic logic between AI steps so the system doesn't depend on the model making the same decision twice. Pure-AI pipelines are fragile. AI for the language work plus code for the routing is what holds up under real traffic.
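In practice that split can look like the sketch below: the model proposes a label, and plain code validates it and decides what happens next. The label set, the handlers and the function names are hypothetical, chosen only to illustrate the pattern.

```python
ALLOWED = {"billing", "technical", "other"}


def classify(model, message):
    """Ask the model for a label, then normalise and validate it in code.
    Anything outside the allowed set falls back to 'other'."""
    raw = model(f"Classify as billing, technical or other: {message}")
    label = raw.strip().lower()
    return label if label in ALLOWED else "other"


# Deterministic routing: the model never chooses the handler directly.
HANDLERS = {
    "billing": lambda m: f"billing queue <- {m}",
    "technical": lambda m: f"support queue <- {m}",
    "other": lambda m: f"human review <- {m}",
}


def handle(model, message):
    return HANDLERS[classify(model, message)](message)
```

If the model returns something unexpected, the message lands in human review instead of crashing the pipeline or being routed on a guess.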

This is the work that turns a clever demo into something a business can actually rely on.

Common questions

What people actually want to know.

Is AI engineering different from AI implementation?

AI implementation is the outcome. Scope, build, deliver a working system that replaces manual work. AI engineering is the craft you bring to that project beyond writing prompts. Choosing models, designing retrieval, evaluating outputs, instrumenting for failure modes. The two travel together. You cannot have a reliable implementation without the engineering depth.

Why do most of your builds use Claude?

Claude Sonnet currently leads on the kind of work most clients need: nuanced reasoning over real documents, reliable tool use, instruction-following without rambling. GPT is competitive on a few specific tasks like very long-context exact match. Open-weight models are gaining fast but mostly have not caught up on tool use and structured output yet. Model choice gets revisited at scoping on every project. If a different model wins on cost or quality for your workload, that is what gets built.

Can you run AI on our own servers?

Yes, with caveats. The standard stack for on-premises is Ollama for smaller deployments and vLLM for larger ones, running open-weight models from the Llama, Mistral, Qwen or Gemma families. I have used Ollama with Gemma locally for prototyping. I have not deployed an on-premises model to production yet. For most builds the answer is a managed-cloud API with a properly scoped data-processing agreement and zero-retention mode rather than self-hosted. Where compliance genuinely requires self-hosted, the stack above is the plan.

What is RAG and do I need it?

Retrieval-augmented generation. Instead of relying on what the model learned at training time, you give the model your documents at query time. The model still does the language work. A search system finds the relevant chunks first. You need it whenever the right answer is in your data rather than the model's training data. Internal knowledge bases, policy lookup, contract Q&A, document search at scale.

How long does an AI engineering build take?

Most engineering-heavy builds run four to eight weeks. The first week is scoping and stack selection. The next two to four weeks are the build. The last week or two is evaluation, observability setup and handover. Smaller targeted engagements like a single workflow or single model integration can run two to four weeks total.

Got a specific technical question?

Book a 30-minute call. I'd rather talk through the actual constraints of your project than write generic copy about how I'd approach it in the abstract.
