Close

LLM Architecture: What Technical Founders Need to Know

by 
Team CRV
June 16, 2026

Table of Contents

You ship the prototype on a Friday and watch the first real users log in over the weekend. For founders building artificial intelligence (AI) products, that weekend shapes the next year more than any architecture review ever could. A handful of decisions in the first 90 days separate products that grow with the team from products that need rewriting every other quarter.

This guide covers transformer mechanics that drive your inference bills, the customization tradeoffs you'll face and the evaluation infrastructure that holds everything together.

What Is LLM Architecture?

LLM architecture refers to the full set of design choices that determine how a large language model (LLM) processes input tokens, produces predictions and serves results in production. Three layers shape the system: the underlying model itself, the customization techniques that adapt that model to a specific product and the serving infrastructure that runs inference at scale. Each layer carries its own cost and flexibility trade offs, and the decisions made early shape what's affordable to change later.

The architectural choices made in a product's first few months drive most of what later shows up as inference cost, latency, accuracy and the flexibility to swap base models. Founders who benefit most from these choices keep them reversible from the start, then revisit each layer as usage patterns become clear.

How Transformer Architecture Drives LLM Inference Costs

As of 2025, every major production LLM is built on a transformer architecture. Claude, DeepSeek, Gemini, GPT, Llama and Mistral all share the same foundational mechanism: self-attention, where every token can attend to every previous token when generating predictions. Two specific properties of this architecture will directly affect your inference bills and graphics processing unit (GPU) requirements.

Attention Scales Quadratically With Context Length

The single most cost-relevant architectural fact for founders: attention scales quadratically with sequence length at training time, and each new token generated at inference time costs linearly with context size. A key-value (KV) cache makes inference tractable by storing computed keys and values, so each new token only computes its own representations rather than reprocessing every prior token.

The tradeoff is memory: for products with context windows above 100,000 tokens, the KV cache alone can consume hundreds of gigabytes of GPU memory. Your product's context window requirements directly dictate your GPU memory needs, your choice of inference provider and whether architectural alternatives belong in your evaluation. Benchmarking typical context lengths before selecting infrastructure prevents committing to GPU tiers that either waste budget or bottleneck throughput.

Mixture of Experts Changes Cost Math

Mixture of experts (MoE) has become a leading architecture for frontier models. Its core principle is conditional computation: only a sparse subset of expert sub-networks activates for any given input. Total parameter count is now a misleading proxy for inference cost because a large MoE model activates only a fraction of its parameters per token, and the active parameter count determines what you actually pay.

Large frontier MoE models can activate only a fraction of their total parameters per forward pass, so their serving cost sits closer to a much smaller dense model than raw parameter counts suggest. Verifying active parameter count before building cost projections prevents cost model errors that can be off by an order of magnitude.

LLM Architecture Choices: RAG vs. Fine-Tuning vs. Prompt Engineering

Retrieval-augmented generation (RAG), fine-tuning and prompt engineering each solve a different problem: RAG addresses knowledge access, fine-tuning shapes model behavior and prompt engineering tests whether either is necessary at all. Choosing the right approach has direct infrastructure and cost implications, and most teams pick wrong by reaching for fine-tuning when the underlying problem is actually retrieval.

Prompt Engineering as Your First Architecture

Prompt engineering should come first because you're testing whether the base model can do what you need before committing to heavier infrastructure. Application programming interface (API) pricing varies widely, from $0.10 per million input tokens for budget models like GPT-4.1 Nano to $30 per million input tokens for frontier reasoning models like GPT-5.4 Pro.

One underused technique, prompt caching, can cut input-side inference costs by 80 percent or more by reducing re-read costs to $0.10 to $0.50 per million tokens, depending on the model. Prompt engineering can't address the model's knowledge cutoff, nor can it guarantee consistent behavioral compliance when few-shot examples repeatedly fail.

RAG for Knowledge Access Problems

RAG is the right choice when your model needs proprietary data, information that changes frequently or the ability to cite specific sources. RAG also offers a compliance advantage for products that handle European Union (EU) personal data, because removing specific entity information no longer requires retraining the entire model.

In most RAG pipelines, the generation phase accounts for the bulk of the pipeline's latency, while vector database retrieval contributes a small share of the total execution time. Your engineering effort belongs on the generation step rather than the retrieval infrastructure. The architectural decisions that deserve your attention include chunking strategy, embedding model selection, retrieval method (dense, sparse or hybrid) and whether to add a re-ranking step. Each affects retrieval quality more than most founders expect.

Fine-Tuning for Behavioral Problems

Fine-tuning becomes the right answer when you need consistent structured output, domain-specific format requirements or response styles so particular that few-shot examples can't capture them. CRV-backed CodeRabbit builds an AI code review for engineering teams, a domain where output structure and behavioral consistency can't tolerate the variance introduced by prompting alone.

Fine-tuned models carry a stranded investment risk because they're tightly coupled to their base architecture. When a new model generation supersedes the one you customized, your fine-tuning doesn't transfer. Maintaining your fine-tuning data pipeline from day one makes retraining on a new base tractable rather than catastrophic. Combining approaches, such as fine-tuning over an RAG layer, can improve results by addressing different model limitations together.

LLM Architecture Deployment: API-First vs. Self-Hosted

The infrastructure question for early stage startups is at what scale self-hosting becomes economically justified. A common starting point is to begin with serverless inference APIs, then migrate workloads as utilization patterns become clear. This decision ranks among the most common early architectural choices, and getting the sequencing right avoids locking in costs prematurely.

Break-Even Timelines by Scale

Break-even timelines vary widely by scale across 54 deployment scenarios: small deployments hit break-even around three months, medium deployments take six to 24 months and large deployments often extend beyond two years before self-hosting pays for itself. The gap between API providers also creates meaningful cost variation. Provider selection becomes a significant cost decision at scale rather than merely an operational detail.

Compliance requirements can override pure economics, since organizations operating under Health Insurance Portability and Accountability Act (HIPAA), Payment Card Industry (PCI) and General Data Protection Regulation (GDPR) often evaluate self-hosting earlier than cost analysis alone would suggest. Privacy, security and cross-border data transfer risks tend to drive that calculus rather than formal data residency requirements. Teams handling sensitive data should map compliance requirements alongside cost analysis from the start, since retrofitting data residency controls into an API-dependent architecture is far more expensive than planning for them upfront.

Inference Techniques That Reduce Cost

Several inference techniques shape your cost profile at both self-hosted and managed deployments. Four are worth understanding in detail before infrastructure decisions:

  1. Quantization: Reduces memory footprint and increases throughput by lowering numerical precision, with each step from 16-bit floating-point (FP16) to 8-bit floating-point (FP8) to 4-bit floating-point (FP4), trading marginal quality for resource savings.
  2. Continuous batching: Lets new requests fill GPU slots as prior sequences complete, replacing static batching that stalls on the longest request in a batch.
  3. PagedAttention: Divides the KV cache into fixed-size blocks stored non-contiguously in GPU memory, freeing resources from completed sequences immediately.
  4. Speculative decoding: Uses a small draft model to propose multiple tokens that the main model verifies in parallel, accelerating inference materially in practice.

These techniques work best when paired with provider abstraction from the start. If your inference layer is tightly coupled to a single provider, switching when economics shift requires rewriting product code rather than updating a configuration. In practice, that means implementing a routing layer or adapter pattern that isolates your application logic from any single provider's software development kit (SDK) or endpoint format.

Evaluation Infrastructure for LLM Architecture

We've watched founders across the AI companies we've backed learn the same lesson: early investment in evaluation pays increasing returns, while skipping it accumulates debugging costs that grow with every deployment. Getting evaluation right early is the highest-impact infrastructure investment you can make.

Product Evaluation vs. Model Evaluation

You need two distinct evaluation systems, and conflating them leads to building the wrong measurement infrastructure. LLM product evaluation tests your full system (prompt chains, guardrails, integrations) against custom scenario-based datasets for your specific task. LLM model evaluation compares the underlying capabilities of different LLMs using general benchmarks when new versions are released. Most founders over-invest in benchmark comparisons and under-invest in product evaluation: your product evaluation system runs continuously against every deployment, while model evaluation runs periodically when you assess new model releases or provider changes.

The Three-Layer Evaluation Stack

Three evaluation layers compose an operational system, each operating at a different cost-coverage trade-off. The quality of the human judgments that ground any automated metric fundamentally limits it, so skipping human evaluation yields automated evaluators you can't trust. Together, these layers catch failures progressively, from the cheapest and fastest checks to the most authoritative:

  1. Deterministic checks: Rule-based assertions, format validation and regex matching run on every inference call at zero per-call cost. These catch structural failures before anything else.
  2. LLM-as-judge evaluation: A stronger LLM scores or compares outputs against quality criteria. Scalable, but each judge needs calibration against human labels before you can trust its assessments.
  3. Human evaluation: Subject-matter experts label outputs to set ground truth. Expensive and slow, but it's the calibration mechanism that makes the automated layers reliable.

The stack works because each layer covers a different failure mode and cost point. Run deterministic checks everywhere, use LLM-as-judge for scalable coverage and reserve human review for the judgments that anchor the rest of the system.

Five Architecture Mistakes That Burn Runway

The failure patterns in production LLM systems cluster around a predictable set of architectural choices. Working closely with CRV-backed Vercel and other deployment infrastructure teams gives us a close view of how AI-native products succeed and fail at scale. The mistakes below show up repeatedly across early stage teams:

  1. Fine-tuning when you need RAG: Treating fine-tuning as the default approach when the underlying problem is knowledge access, not model behavior. Fine-tuning fits when outputs have the wrong format or style, while RAG fits when the model lacks required information.
  2. Treating the LLM as a source of truth: Building systems where the model recalls facts rather than reasons over retrieved data creates a structural flaw, not a prompting problem. In postmortem drafting, teams have found that LLMs recall events from logs and text well, but their causal reasoning still requires careful human review.
  3. Context window mismanagement: Expanding context silently truncates system-level constraints placed in early messages, which causes behavior drift with no warning signal. Agents often fail after many steps because context consumption compounds and coherence degrades.
  4. Prompt injection through retrieved content: Placing retrieved documents in the system role or without explicit delimiters lets adversarial content override system behavior. Retrieved content belongs inside delimiters that identify it as untrusted external data.
  5. Shipping on vibes instead of evaluation: Making deployment decisions based on manual spot-checking provides no regression signal. Changes that improve one behavior silently degrade others, and human intuition about what constitutes a good LLM system design is systematically miscalibrated.

Each of these patterns shows up early enough in a system's life to be caught with the right instrumentation, which is why the architectural choices that surround them carry as much weight as the mistakes themselves.

Building LLM Architecture That Adapts

These mistakes share the same shape: they look like model problems from the outside, but usually come from system design choices. The LLM architecture decisions you make in the first few months shape your cost structure and your ability to swap components as the field moves. We've seen the strongest AI-native founders treat these choices as reversible experiments and build abstraction layers from the start so they can adapt as models and economics shift. 

In practice, that means starting with APIs, instrumenting everything from day one and migrating workloads only after utilization patterns justify the operational overhead. If you're an early stage founder looking for support with these technical tradeoffs, reach out to us to see if we'd be a good fit.

Frequently Asked Questions

Should I build on proprietary APIs or open-weight models at the seed stage?

Proprietary APIs give you the fastest path to a working product without having to manage GPU infrastructure. Open-weights models offer better unit economics at scale and remove dependency on a single provider's pricing decisions. Most seed stage teams should start with APIs to validate the product, then evaluate open-weights migration once they understand usage patterns and can sustain high GPU utilization.

How much does fine-tuning actually cost for a small team?

Supervised fine-tuning with low-rank adaptation (LoRA) on models like Llama 4 Scout is advertised by some hosted services at around $3 per one million tokens, while GPU-time pricing varies by provider. The greater cost concern is the stranded-investment risk when base models are superseded. Maintaining your training data pipeline has a more long-term impact than minimizing the per-job expense.

When should I add RAG instead of stuffing context into the prompt?

RAG becomes the better choice when your data changes frequently, when you need to cite specific sources, when your context exceeds what the model reliably attends to or when compliance requires the ability to delete specific records. Even when a model's context window is technically large enough, RAG often outperforms context stuffing because model attention degrades on information positioned early in long contexts. A practical signal: if your team finds itself manually updating prompts with new information weekly, RAG belongs in the stack.

What evaluation tooling should a small team start with?

The right tooling depends on your team's current stage and goals, but some priorities apply broadly. Tracing on every LLM call comes first, using tools like Langfuse or LangSmith for instrumentation. After that, a labeled evaluation dataset, built through daily manual review of production traces, provides the ground-truth data that automated evaluators and continuous integration and continuous deployment (CI/CD) regression testing need to calibrate against.

No items found.