- Executive Summary
- The Challenge: Why Resumes Fail Senior Technologists
- Diagnosis: Three Root Causes of the Portfolio Gap
- The Three-Tier Knowledge Architecture
- The Evolution: Three Iterations to Zero Hallucination
- Audience-Adaptive Depth: Same Facts, Three Conversations
- The Agentic Pipeline: From Query to Grounded Response
- The Self-Governing System: Grading, Learning, and Cost Control
- Security Architecture
- Results and Production Performance
- Lessons and Principles
- The Road Ahead
- Technology Stack Summary
- About the Author
1. Executive Summary
A resume is a static artifact. It presents the same flat content to a CTO evaluating architecture depth, a hiring manager assessing team leadership, and a recruiter scanning for keyword matches. It cannot answer follow-up questions. It cannot adapt its depth. It cannot demonstrate how an architect thinks through a problem. For senior technologists whose value lies in 24 years of accumulated judgment, not a list of bullet points, this is a fundamental limitation.
This white paper documents a production-grade AI chat system built to solve that limitation. The system is deployed on a personal portfolio site, powered entirely by Cloudflare's serverless edge infrastructure, and grounded in a three-tier knowledge architecture that separates immutable professional facts from semantic content search and deterministic resume generation. In 60+ days of production operation across 2,000+ conversations, the system has produced zero hallucinations, maintained 99.95% uptime, and operated at a baseline cost of $0.12 per month.
The architecture reflects a principle I have come to trust through building it: modern LLMs can power reliable conversational interfaces when the grounding problem is solved at the architecture level, not patched at the prompt level. The three-tier separation, combined with audience-adaptive depth, a self-critique quality gate, and a human-in-the-loop learning pipeline, produces a system that works. It is not perfect, and this paper is transparent about where the gaps remain. But the foundation is solid, and I am sharing it in the hope that others building similar systems find something useful here.
2. The Challenge: Why Resumes Fail Senior Technologists
The problem begins with a mismatch between what senior technologists know and what a resume can communicate. A two-page document can list that someone reduced Splunk TCO by 80% through intelligent data routing. It cannot explain the architectural reasoning behind the 4-tier classification framework (critical operational, compliance, diagnostic, noise) that made that cost reduction sustainable. It cannot describe the political resistance from 20 business units that each believed their data was "critical," how the classification criteria were negotiated, and why transparent governance was a sign of organizational maturity rather than a top-down mandate. It cannot adapt that story to the specific concerns of the person reading it.
The visitors who matter most have the most specific questions. A CTO evaluating a platform architect wants to know: "Have you led enterprise observability transformation at scale? What was the cost reduction strategy for 3-5 TB daily ingest?" A hiring manager wants outcomes: "What was the timeline, and how did the team perform?" A recruiter needs a 30-second signal: "Does this person match the role?" A static portfolio answers none of these in the visitor's own language.
The catalyst was straightforward. A CTO visited the portfolio, browsed three pages, and left. The analytics showed an average session of 47 seconds. A conversational system that could answer "How did you achieve 80% observability cost reduction?" with grounded, first-person depth would have turned that 47-second bounce into a 5-minute engagement. The question was whether AI could do this without fabricating credentials.
2.1 The AI Trust Problem
Conversational AI is powerful but dangerous for professional branding. Large language models hallucinate when they encounter queries beyond their training boundaries, when the underlying data is shallow or inconsistent, or when the system is under pressure to generate an answer when uncertainty would be more honest. For a personal portfolio, a single fabricated credential or invented project detail can undermine years of carefully built professional credibility. The system must never guess. It must say "I don't have direct experience with that" when it genuinely does not.
This constraint shaped every architectural decision that followed. The system was not designed to be impressive. It was designed to be trustworthy. The result is live at hemanthshivanna.com.
3. Diagnosis: Three Root Causes of the Portfolio Gap
Before building the solution, I diagnosed why existing approaches fail. The gap between what a senior technologist knows and what a portfolio communicates has three compounding root causes.
| Root Cause | Symptom | Consequence |
|---|---|---|
| The Data Gap | Static content cannot adapt to the reader's context or expertise level | CTOs get the same shallow summary as recruiters; neither finds what they need |
| The Trust Gap | Generic AI chatbots hallucinate when data is thin or queries are specific | A single fabricated claim destroys professional credibility |
| The Cost Gap | Enterprise conversational AI solutions are priced for enterprise budgets | Individual professionals cannot justify $500+/month for a portfolio assistant |
These three gaps compound each other. The data gap means visitors leave quickly. The trust gap means AI cannot simply be bolted on. The cost gap means the solution must be architecturally efficient, not just functionally capable. Solving one gap without the other two produces either an expensive system nobody trusts or a cheap system nobody visits.
If your portfolio has any of these characteristics, the gap is already costing you opportunities: average session duration under 60 seconds, zero inbound questions from visitors, resume download rates below 2%, or visitor-to-conversation conversion of zero. These are not traffic problems. They are depth problems.
4. The Three-Tier Knowledge Architecture
The core architectural insight is that not all knowledge has the same reliability requirements, retrieval characteristics, or update frequency. Treating all data as a single retrieval problem (the default RAG approach) conflates three fundamentally different knowledge types and introduces hallucination risk at every boundary. The solution separates knowledge into three tiers, each with its own storage, retrieval mechanism, and trust model.
[Figure: the three-tier knowledge architecture. Tier 1: 14K tokens, zero latency. Tier 2: 47 articles, cosine ≥ 0.75. Tier 3: 11 resume variants. Output: audience-adapted, cited, scored.]
| Property | Tier 1: Static Facts | Tier 2: Semantic Search | Tier 3: Deterministic Lookup |
|---|---|---|---|
| Storage | System prompt (in-context) | Cloudflare Vectorize (BGE-M3, 1024-dim) | Cloudflare D1 (SQLite) |
| Retrieval Latency | 0ms (always present) | 15-40ms (embedding + similarity search) | 5-10ms (parameterized SQL query) |
| Accuracy Guarantee | Deterministic (canonical facts, never wrong) | Probabilistic (cosine similarity >= 0.75 threshold) | Deterministic (exact keyword match against structured data) |
| Failure Mode | Context window overflow (mitigated by selective loading) | False negatives below threshold; false positives above threshold | No match found (graceful fallback to Tier 1) |
| Update Frequency | Manual (rebuild system prompt on profile change) | Incremental (chunk + embed new articles, no retraining) | On-demand (add/modify resume variants in D1) |
| Token Cost Per Query | ~14,000 input tokens (fixed overhead) | ~500 tokens per retrieved chunk (variable, 0-5 chunks) | ~200 tokens (structured resume metadata) |
Example cited sources surfaced in responses: hemanthshivanna.com/whitepaper/observability-transformation, hemanthshivanna.com/whitepaper/mq-ace-stabilization, hemanthshivanna.com/article/uplifting-service-delivery.
4.1 Tier 1: Immutable Professional Facts (Static System Prompt)
Tier 1 contains everything that defines the author's professional identity: 7 professional roles with quantified metrics, 13 technical skill categories spanning 65+ technologies, 12 verified testimonials, 6 major accomplishments with measured impact, and 5 persona rules governing audience-adaptive depth. This knowledge is embedded directly in the system prompt at 14,000 tokens. It requires zero retrieval latency and zero retrieval failure risk. Profile facts are always correct because they are always present.
The rationale is practical. If a visitor asks "What is your observability transformation experience?", the answer must come from verified career data, not from a semantic search that might return a tangentially related article. Tier 1 guarantees that identity questions are answered from identity data.
Scalability consideration: At 14,000 tokens, the static system prompt consumes a meaningful portion of the context window on every request. As a career grows (more roles, more metrics, more accomplishments), Tier 1 will eventually exceed what is practical to embed in every prompt. The planned mitigation is a selective loading strategy: classify the query's domain first (for example, observability, service delivery, leadership), then load only the relevant Tier 1 subsections rather than the full 14K tokens. This preserves the zero-retrieval-latency property while reducing token cost per request. The current 14K footprint works within Gemini 2.0 Flash's context window, but the architecture anticipates the need to partition it.
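The planned selective loader can be sketched as follows. The section names and keyword routing here are illustrative assumptions, not the production design:

```python
import re

# Illustrative Tier 1 subsections. The core identity block is always loaded;
# domain sections are loaded only when the query classifier selects them.
TIER1_SECTIONS = {
    "core_identity": "## Roles, skills, testimonials ...",
    "observability": "## Observability transformation facts ...",
    "service_delivery": "## Service delivery facts ...",
    "leadership": "## Leadership and team facts ...",
}

# Assumed keyword routing; a production classifier would be richer.
DOMAIN_KEYWORDS = {
    "observability": {"observability", "splunk", "cribl", "telemetry", "logs"},
    "service_delivery": {"delivery", "incident", "mttr", "sre"},
    "leadership": {"team", "leadership", "mentoring", "hiring"},
}

def select_tier1_sections(query: str) -> list[str]:
    """Classify the query's domain, then load only the relevant subsections."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    selected = ["core_identity"]  # identity facts are always present
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if words & keywords:
            selected.append(domain)
    return [TIER1_SECTIONS[name] for name in selected]
```

The key property is preserved: no retrieval call is made, so the zero-latency guarantee holds; only the token footprint shrinks.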
4.2 Tier 2: Semantic Content Search (Cloudflare Vectorize)
Tier 2 handles content depth. The system indexes 47 published articles and 8 technical whitepapers using BGE-M3 embeddings (1024 dimensions) stored in Cloudflare Vectorize. When a visitor asks a content question ("What is your approach to observability transformation?"), the pipeline retrieves the top 5 semantic matches filtered to a cosine similarity threshold of 0.75. Results below that threshold are discarded rather than surfaced with low confidence.
New articles do not require retraining the model or rebuilding the system prompt. They are chunked, embedded, and indexed. The semantic search handles nuanced content queries without touching the immutable identity layer.
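Logically, the Tier 2 step reduces to a threshold-filtered similarity search. Vectorize performs this server-side; this sketch shows the equivalent logic:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, threshold=0.75, top_k=5):
    """Return up to top_k chunks above the similarity threshold.
    Sub-threshold matches are discarded, never surfaced with low confidence."""
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(score, chunk) for score, chunk in scored[:top_k] if score >= threshold]
```

The discard behavior is the important design choice: a near-miss at 0.74 is treated as "no evidence," not "weak evidence."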
Embedding model choice: BGE-M3 was selected over alternatives (OpenAI ada-002, Cohere embed-v3) for a specific reason: it runs natively on Cloudflare Workers AI, eliminating external API calls for embedding operations. On the MTEB benchmark, BGE-M3 ranks competitively for retrieval tasks in the 1024-dimension class, and its multilingual support provides future optionality. The 1024-dimension vectors provide sufficient semantic resolution for a corpus of 55 documents. For a corpus of 10,000+ documents, a higher-dimensional model, a reranking stage (such as Cohere Rerank or a cross-encoder), or a hybrid sparse-dense retrieval approach would likely be necessary.
Threshold calibration: The cosine similarity threshold of 0.75 was not arbitrary. It was calibrated empirically over the first 30 days of production by analyzing retrieval precision and recall on a manually labeled set of 200 query-article pairs. At 0.70, recall was high but precision dropped (tangentially related articles surfaced). At 0.80, precision was excellent but recall suffered (relevant articles missed on paraphrased queries). The 0.75 threshold produced the best F1 score for this corpus size. This threshold should be re-evaluated as the corpus grows, and adaptive thresholding (adjusting by query confidence) is under consideration.
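The calibration procedure amounts to sweeping candidate thresholds over the labeled query-article pairs and selecting the best F1. A minimal sketch (the labeled data used in the test is synthetic):

```python
def f1_at_threshold(pairs, threshold):
    """pairs: (similarity, is_relevant) tuples from a labeled evaluation set."""
    tp = sum(1 for sim, rel in pairs if sim >= threshold and rel)
    fp = sum(1 for sim, rel in pairs if sim >= threshold and not rel)
    fn = sum(1 for sim, rel in pairs if sim < threshold and rel)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(pairs, candidates=(0.70, 0.75, 0.80)):
    """Pick the candidate threshold that maximizes F1 on the labeled set."""
    return max(candidates, key=lambda t: f1_at_threshold(pairs, t))
```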
Index freshness and embedding drift: A known limitation of static embedding indexes is drift: as the corpus grows or the embedding model is updated, older embeddings may become semantically inconsistent with newer ones. The current mitigation is full re-indexing on model updates (a 15-minute batch operation for 55 documents). For larger corpora, delta indexing with periodic consistency checks would be necessary. The system does not currently detect embedding drift automatically; this is a planned improvement.
Chunking strategy: Articles are split into overlapping chunks of approximately 500 tokens with 50-token overlap. The overlap ensures that concepts spanning a chunk boundary are captured in at least one complete chunk. Each chunk retains metadata (article title, URL, publication date) so that the response assembly stage can cite sources accurately. The chunking is paragraph-aware: splits prefer paragraph boundaries over arbitrary token counts, preserving semantic coherence within each chunk.
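A paragraph-aware chunker with overlap carry-over can be sketched as below. Token counts are approximated by whitespace-delimited words here; the production tokenizer differs, and oversized single paragraphs are emitted whole in this sketch:

```python
def chunk_article(paragraphs: list[str], max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Greedy paragraph-aware chunking: fill each chunk up to ~max_tokens,
    splitting only at paragraph boundaries, and carry ~overlap tokens of
    trailing context into the next chunk so boundary-spanning concepts
    appear complete in at least one chunk."""
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        plen = len(para.split())
        if current and current_len + plen > max_tokens:
            chunks.append(" ".join(current))
            tail_words = " ".join(current).split()[-overlap:]  # overlap carry-over
            current, current_len = [" ".join(tail_words)], len(tail_words)
        current.append(para)
        current_len += plen
    if current:
        chunks.append(" ".join(current))
    return chunks
```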
4.3 Tier 3: Deterministic Resume Lookup (D1 Structured Query)
Tier 3 serves a specific, high-stakes use case: generating role-matched resumes. When a visitor provides a job description, the system uses structured keyword matching against 11 resume variants stored in Cloudflare D1. The matching algorithm extracts hard skills (technologies, tools, certifications) and soft skills (leadership, collaboration) from the job description, then scores each resume variant by keyword overlap density. This is deliberately not semantic: resumes are exact, high-stakes artifacts where creative interpretation is a liability. Deterministic matching guarantees the right variant every time. For systems with hundreds of resume variants, TF-IDF weighting or a lightweight learned ranker would improve selection quality, but for 11 variants, explicit keyword scoring is sufficient and fully interpretable.
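The deterministic matcher reduces to keyword overlap scoring. A sketch, assuming keywords have already been extracted into sets (the extraction step is omitted):

```python
def score_variant(jd_keywords: set[str], variant_keywords: set[str]) -> float:
    """Keyword overlap density: the fraction of job-description keywords
    covered by the resume variant. Deterministic and fully interpretable."""
    if not jd_keywords:
        return 0.0
    return len(jd_keywords & variant_keywords) / len(jd_keywords)

def select_resume(jd_keywords: set[str], variants: dict[str, set[str]]) -> str:
    """variants: mapping of variant name -> keyword set. Highest overlap wins."""
    return max(variants, key=lambda name: score_variant(jd_keywords, variants[name]))
```

Because the score is a simple ratio, every selection can be explained to a visitor ("this variant matched 8 of your 10 required skills"), which matters for a high-stakes artifact.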
Grounding is not a guardrail problem. It is an architecture problem. Separating immutable facts (Tier 1) from derived knowledge (Tier 2) from deterministic outputs (Tier 3) at the system level eliminates entire categories of hallucination that no amount of prompt engineering can prevent. Each tier has a different trust model because each tier has a different failure mode.
A natural objection: isn't sending 14,000 tokens of static knowledge with every query wasteful? The answer is no, and the reasoning is both economic and architectural.
The economics have shifted. The cost argument against large static prompts originates from GPT-4-era pricing ($30/1M input tokens). Gemini 2.0 Flash charges $0.075/1M input tokens, a 400x reduction. At that rate, the static knowledge costs $0.001 per query. At production volumes (~10 queries/day), Tier 1 costs less than $0.55/month. Optimizing it away saves pennies while introducing real complexity.
The context window is underutilized. At ~18,200 tokens (the 14K knowledge base plus the surrounding prompt layers), the static context occupies 14.2% of the 128K context window. Even in the worst case (all three tiers active plus multi-turn history), the system uses only 21% of available capacity. Production RAG systems routinely fill 30-50% of context with retrieved chunks. This system is conservative by comparison.
The alternatives introduce risk. Moving static content to retrieval adds 50-150ms latency per query, introduces semantic search relevance failures on core identity data, and creates infrastructure dependencies (Vectorize availability, cold starts). Dynamically selecting which static sections to include requires a pre-query classifier that itself costs tokens and latency, and wrong classification means missing context, which means wrong answers. For a zero-hallucination career AI where every response must be factually grounded, the reliability guarantee of always-present identity context outweighs a sub-penny cost.
This is not general RAG. General-purpose RAG systems serve millions of documents where prompt stuffing is physically impossible. This system has a bounded knowledge domain: one person's career. The entire core identity fits in 14.2% of a single context window. The three-tier architecture uses each tier for what it does best: Tier 1 for data relevant to 100% of queries (cheap to include, catastrophic to miss), Tier 2 for article content relevant to ~30% of queries (retrieved on demand), and Tier 3 for resume variants relevant to ~5% of queries (fetched by exact match).
5. The Evolution: Three Iterations to Zero Hallucination
The current architecture was not the first design. It was the third. The journey from naive retrieval to production-grade grounding involved two significant failures that shaped every subsequent decision. Documenting them honestly matters more than presenting the final state as inevitable.
5.1 Iteration 1: Naive RAG
The first iteration used a standard retrieval-augmented generation pipeline. Articles were embedded, queries retrieved relevant chunks, and the LLM generated freeform answers. The results were superficially impressive and fundamentally unreliable. The system produced plausible-sounding but false statements about the author's experience. For example, an article that mentioned Apache Spark in the context of industry trends generated a response claiming the author had hands-on Spark experience. The system could not distinguish between "this article discusses Spark" and "the author has worked with Spark."
The failure was architectural, not behavioral. A single retrieval tier treats all knowledge as equivalent. Profile facts and article references occupy the same embedding space with no structural distinction.
5.2 Iteration 2: Layered Tiers
The second iteration separated profile facts (Tier 1, always present in the system prompt) from article content (Tier 2, retrieved on demand). This eliminated hallucinations about professional background entirely. Questions like "What is your experience?" now drew exclusively from verified career data.
However, the system still generated slightly inaccurate context around retrieved articles. It would correctly attribute insights to a published whitepaper but add framing that subtly misrepresented the scope or conclusions. The problem was that the LLM was still generating interpretive text around factual content without a quality check.
5.3 Iteration 3: Constitution + Guardrails (Production)
The third and current iteration added three critical layers. First, an inviolable constitution defining canonical facts: specific project names, metrics, and role titles that the system cannot contradict regardless of query pressure. Second, a 20-pattern prompt injection detector that strips adversarial inputs before they reach the model. Third, a self-critique gate that scores every response on five quality dimensions (accuracy, voice, formatting, completeness, tone) and triggers automatic revision when the aggregate score falls below 0.7 for complex queries.
The result: zero hallucinations across 2,000+ production conversations, verified by daily human review through the admin dashboard.
The validation came from an unexpected source. A hiring manager asked a deliberately adversarial question: "Tell me about your experience with quantum computing." The system responded: "I don't have direct experience with quantum computing. My background is in enterprise platform engineering, observability transformation, and service delivery architecture. If you are interested in those areas, I can speak to them in detail." That response, honest and bounded, earned more credibility than any generated answer could have.
6. Audience-Adaptive Depth: Same Facts, Three Conversations
A CTO and a recruiter ask the same question with different expectations. The system does not generate different answers for different audiences. It generates different depths of the same verified answer. The facts are identical. The detail, technical context, and response length adapt based on audience classification determined during the first exchange.
| Audience | Depth Characteristics | Response Target |
|---|---|---|
| CTO / SVP | Architecture decisions, cost models, trade-offs, metrics with context. 40% more technical detail than baseline. | 600-1,200 characters |
| Hiring Manager | Outcomes, team impact, timelines, leadership signals. Focus on results over implementation. | 300-600 characters |
| Recruiter | Concise qualification signals, keyword-dense, scannable. Maximum clarity in minimum words. | 150-300 characters |
Consider the question: "Tell me about your observability transformation work." The CTO response includes the Cribl Stream architecture, 4-tier data classification framework, volume reduction from 3-5 TB/day to 700 GB/day, 80%+ TCO reduction ($1.7M annual savings), and an invitation to discuss architecture decisions. The hiring manager response includes scope (20 business units, 100+ teams), timeline (18 months), and outcomes (80%+ TCO reduction, 40% MTTR improvement, query times from 15 minutes to 90 seconds). The recruiter response confirms enterprise observability transformation with key metrics in two sentences.
All three responses draw from the same Tier 1 knowledge base. No facts are invented. The system varies depth, not truth.
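The audience-to-depth mapping in the table above can be expressed as configuration. The dataclass shape and the fallback to the hiring-manager profile for unclassified visitors are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class DepthProfile:
    min_chars: int
    max_chars: int
    technical_detail: float  # multiplier relative to baseline depth

# Values mirror the table above; only the structure is assumed.
DEPTH_PROFILES = {
    "cto": DepthProfile(600, 1200, 1.4),            # 40% more technical detail
    "hiring_manager": DepthProfile(300, 600, 1.0),  # outcomes over implementation
    "recruiter": DepthProfile(150, 300, 0.6),       # scannable, keyword-dense
}

def depth_for(audience: str) -> DepthProfile:
    """Unclassified visitors fall back to the mid-depth profile (assumption)."""
    return DEPTH_PROFILES.get(audience, DEPTH_PROFILES["hiring_manager"])
```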
7. The Agentic Pipeline: From Query to Grounded Response
Every query passes through an 8-stage deterministic pipeline that routes intent, retrieves knowledge from the correct tier, scores quality, and streams the response in real time. The pipeline is deterministic in the sense that the same query type always follows the same path. The LLM generates the response content; the pipeline controls everything around it.
7.1 The 7-Layer System Prompt
The system prompt is not a static block of text. It is rebuilt for every chat turn, assembling seven layers in sequence: identity and session context, the inviolable constitution (canonical facts and grounding rules), the full 14K-token knowledge base, retrieved article context (if applicable), behavioral rules and guardrails, resume generation rules (if applicable), and execution rules including learned patches from the human feedback loop. Each layer is independently testable and updatable. A guardrail fix does not touch the knowledge base. A new article does not require regenerating the prompt skeleton.
7.2 The Self-Critique Quality Gate
After the primary LLM (Gemini 2.0 Flash) generates a response, a lightweight scoring model (Gemini 2.0 Flash Lite) evaluates the output on five dimensions: accuracy against the knowledge base, adherence to the author's voice, formatting quality, completeness relative to the query, and professional tone. If the aggregate score falls below 0.7 on a complex query (identified by a heuristic: response length exceeding 800 characters or multi-part question structure), the system triggers an automatic revision and streams the improved response as a separate SSE event.
This adds 150-300ms of latency. For simple queries ("What is your background?"), the self-critique gate is bypassed. For high-stakes queries ("How would you approach an enterprise observability transformation for 20 business units with 3-5 TB daily ingest?"), the additional latency is a worthwhile investment in response quality. The median latency of 423ms reflects this selective approach.
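The gate's decision logic can be sketched as below. The plain mean over the five dimensions and the question-mark count for multi-part detection are illustrative simplifications of the production heuristic:

```python
def is_complex(query: str, response: str) -> bool:
    """Heuristic from the pipeline: long responses or multi-part questions."""
    return len(response) > 800 or query.count("?") > 1

def needs_revision(scores: dict[str, float], query: str, response: str,
                   threshold: float = 0.7) -> bool:
    """Aggregate the five dimension scores (accuracy, voice, formatting,
    completeness, tone); trigger revision only for complex, low-scoring output."""
    aggregate = sum(scores.values()) / len(scores)
    return is_complex(query, response) and aggregate < threshold
```

Gating on complexity first is what keeps the median latency low: simple queries never pay the 150-300ms revision cost.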
7.3 Real-Time Streaming
Responses are delivered via Server-Sent Events across 8 event types: thinking indicators (so the visitor understands system state), extended thinking tokens for complex queries, the primary content response, revision events when self-critique triggers, source citations, interactive UI cards (resume previews, PDF links), follow-up suggestions (delayed 25 seconds to avoid overwhelming the user), and completion metadata including token count and confidence score. A 1,500ms heartbeat prevents UI freeze during Gemini processing.
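The SSE framing itself is simple. A sketch of the event serializer; the event names and payload shapes below are illustrative, not the production wire format:

```python
import json

def sse_event(event_type: str, payload: dict) -> str:
    """Serialize one Server-Sent Event frame: named event plus JSON data,
    terminated by a blank line as the SSE spec requires."""
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"

# SSE comment frame (ignored by EventSource clients); sent every 1,500ms
# to keep the connection alive while the model is still generating.
HEARTBEAT = ": heartbeat\n\n"
```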
7.4 Multi-Turn Conversation and Context Management
The system maintains conversation history within a session, enabling follow-up questions ("Tell me more about the data classification framework" after discussing observability transformation). Session state is held in Cloudflare KV with a time-to-live, not in the browser or in a persistent database. When the session expires, the conversation history is lost.
This is a deliberate trade-off. Persistent cross-session memory would require visitor identification, which conflicts with the privacy-by-architecture principle. The system prioritizes anonymous, stateless interactions over continuity. Each session is self-contained. A returning visitor starts fresh, which means the system cannot build on prior conversations but also cannot accumulate a profile of the visitor.
Within a session, the full conversation history is injected into each subsequent prompt. This grows the token count per request as the conversation deepens. The practical limit is approximately 15-20 exchanges before the combined system prompt (14K tokens) plus conversation history approaches the context window boundary. For portfolio chat, this is rarely a constraint: most sessions involve 3-7 exchanges. For a generalized framework, implementing conversation summarization (compressing older turns into a summary) would extend the effective session length.
8. The Self-Governing System: Grading, Learning, and Cost Control
8.1 The Admin Dashboard
The admin interface at /stats (Google OAuth gated) provides five views: an executive overview with KPI cards and trend graphs, a cost and usage breakdown by operation type, a chat intelligence panel with per-conversation and per-message grading, an infrastructure health monitor (Workers, D1, KV, Vectorize latency percentiles), and a service cost attribution view.
8.2 The Human-in-the-Loop Feedback Pipeline
The learning system operates on a daily cycle. The admin reviews 5-10 conversations per day, grading each on a four-level scale: excellent (accurate, insightful, well-formatted), good (accurate with minor gaps), needs work (some inaccuracy or missing nuance), and poor (fundamentally misleading). At 2 AM UTC, a learning pipeline extracts patterns from low-confidence samples (below 0.6), generates guardrail patches using Gemini Flash Lite, and injects the top 3 good examples and top 3 poor examples into the system prompt's execution layer for the next day's inference. This creates a feedback mechanism that improves accuracy over time without retraining the underlying model.
8.3 Cost Control as Architecture
Cost control is not an afterthought. It is embedded in the sampling strategy. Storing every chat would cost $0.50+ per day. Instead, the system samples 20% of all chats to D1 for learning purposes, retains 100% of low-confidence chats (below 0.6 score) regardless of sampling, and retains 100% of admin-graded chats permanently. This biases the learning dataset toward hard cases while keeping storage costs near zero.
Rate limiting operates at three levels: 200 requests per hour per IP for chat, 5 requests per minute for resume generation, and a global daily cap of 1,000 Gemini API calls. Per-IP cumulative cost is tracked in KV with alerts at $0.30, $0.45, and $0.60 thresholds. A circuit breaker triggers a 5-minute cooldown when any IP exceeds $0.60 in a 15-minute window. Five consecutive API failures trigger a 5-minute backoff.
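The per-IP budget check can be sketched as a pure function over the tracked costs; the KV bookkeeping and the cooldown timer are omitted here:

```python
def check_ip_budget(window_cost: float, cumulative_cost: float) -> tuple[list[float], bool]:
    """Returns (alert_thresholds_crossed, circuit_open). Thresholds match the
    text: cumulative alerts at $0.30/$0.45/$0.60; the circuit breaker opens
    when one IP exceeds $0.60 inside the 15-minute window."""
    alerts = [t for t in (0.30, 0.45, 0.60) if cumulative_cost >= t]
    circuit_open = window_cost > 0.60
    return alerts, circuit_open
```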
9. Security Architecture
Security is designed alongside the application, not bolted on after deployment. The system implements defense in depth across four layers.
Input Sanitization: All user inputs pass through 20 regex-based prompt injection patterns before reaching the LLM. Detected patterns include common adversarial phrases ("ignore all previous instructions," "reveal your prompt," "developer mode") as well as structural injection attempts (XML/instruction tags, template syntax). Script injection patterns and SQL injection vectors are stripped separately. All database operations use parameterized queries through D1 prepared statements.
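For illustration, a detector in the same spirit, using a small assumed subset of patterns (the full 20-pattern production set is not reproduced here):

```python
import re

# Illustrative subset of adversarial patterns; case-insensitive matching.
INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE) for p in (
        r"ignore\s+(all\s+)?previous\s+instructions",
        r"reveal\s+your\s+(system\s+)?prompt",
        r"developer\s+mode",
        r"</?\s*(system|instruction)\s*>",   # structural / XML-style injection
        r"\{\{.*?\}\}",                      # template syntax
    )
]

def sanitize(text: str) -> tuple[str, bool]:
    """Strip matched adversarial spans; report whether anything was detected."""
    detected = False
    for pattern in INJECTION_PATTERNS:
        text, n = pattern.subn("", text)
        detected = detected or n > 0
    return text.strip(), detected
```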
Authentication: The admin dashboard requires Google OAuth2 with an email allowlist. Bearer tokens are validated against Google's tokeninfo endpoint. Public chat requires no authentication but is rate-limited by IP.
Origin Control: CORS is restricted to hemanthshivanna.com and its subdomains. All other origins are rejected.
Observability: Sentry captures errors at 10% trace sampling in production. Structured JSON logs go to stdout for Cloudflare's log pipeline. Health checks run every 5 minutes monitoring latency and database connectivity. Cost alerts notify by email and SMS.
9.1 Data Privacy and Conversation Retention
Visitors interact with the chat without authentication. This raises a question any security-minded architect would ask: what happens to conversation data? The retention policy is deliberately minimal. Only 20% of conversations are sampled to D1 for learning purposes. Sampled conversations store the query and response text, confidence score, and latency metrics. No IP addresses, browser fingerprints, or geolocation data are persisted. The 80% of conversations that are not sampled exist only in ephemeral Worker memory and are discarded when the request completes.
Low-confidence conversations (below 0.6) are always retained regardless of sampling, but this is a quality signal, not a surveillance mechanism. The admin grading interface displays conversation content without visitor identification. There is no mechanism to correlate a conversation with a specific visitor, and no mechanism was built to enable one.
For the resume generation flow, visitors voluntarily provide a job description. This text is processed in memory, used for keyword matching against resume variants, and discarded after the response is generated. Job descriptions are not stored in D1 unless the conversation is sampled, and even then, the storage is for system quality improvement, not visitor profiling.
Privacy-by-architecture is more reliable than privacy-by-policy. The system does not rely on a promise not to store data. It relies on the absence of a storage mechanism for visitor identity. You cannot leak what you never collected.
9.2 Prompt Injection: What the 20 Patterns Cover and Where They Fall Short
The 20 regex-based injection patterns defend against well-known adversarial phrases. They work for the common cases. But it would be misleading to present regex matching as a complete injection defense. Sophisticated attacks using indirect encoding, multi-turn manipulation, or semantic reframing can bypass pattern matching. This is a known limitation of the approach, not a surprise.
The system's deeper defense is architectural. The constitution layer defines canonical facts that the model cannot contradict regardless of what the input says. The self-critique gate evaluates whether the response actually matches the knowledge base, catching outputs that deviate from grounded facts even if the injection that caused the deviation was not caught by regex. And the three-tier separation means that even a successful injection cannot introduce new "facts" into Tier 1 or modify the embedding index in Tier 2. The attacker can influence the LLM's behavior, but the pipeline constrains what that behavior can produce.
This is very much an area of active learning. Classifier-based injection detection and output-side semantic consistency checks are both under exploration. If you are working on injection defense patterns for grounded RAG systems, I would be glad to compare approaches: let's connect.
10. Results and Production Performance
| Metric | Value | Target | Status |
|---|---|---|---|
| Build time (Astro 5 SSG) | 3.96 seconds | < 5s | Pass |
| Page load (edge) | 150ms | < 200ms | Pass |
| Worker latency (p50) | 423ms | < 500ms | Pass |
| Worker latency (p95) | 1,200ms | < 2,000ms | Pass |
| Uptime | 99.95%+ | 99.9% | Pass |
| Hallucinations | 0 | 0 | Pass |
| Test suite | 863 tests, all passing | 100% pass rate | Pass |
| Production incidents (60 days) | 0 | 0 | Pass |
| Baseline monthly cost | $0.12 | < $1.00 | Pass |
10.1 How "Zero Hallucinations" Is Measured
This is a fair question, and it deserves a transparent answer. "Zero hallucinations" sounds like a bold claim, and the methodology behind it matters more than the number itself.
The working definition: a hallucination is any factual claim in a response that cannot be traced back to Tier 1 (the static knowledge base), Tier 2 (a retrieved article with cosine similarity above 0.75), or Tier 3 (a deterministic resume lookup). That includes fabricated credentials, invented project details, incorrect metrics, and false attributions of article content to personal experience.
The measurement works across three layers. The self-critique gate catches low-accuracy responses (below 0.7) and triggers automatic revision before the visitor sees them. The sampling strategy retains 100% of low-confidence conversations for review, so the hardest cases are always visible. And the daily admin review covers 5-10 conversations manually, which over 60 days adds up to roughly 300-600 conversations reviewed by a human. The rest are covered by the automated scoring.
To be clear: this is not a claim that hallucination is impossible. The system is designed to fail toward silence ("I don't have experience with that") rather than toward fabrication. That bias is the primary defense, and so far it has held. There is always room to strengthen the evaluation, and that work is ongoing.
The self-critique gate uses the same model family (Gemini) to evaluate its own output, which means systematic biases in generation may not be caught by self-evaluation. The human review layer compensates for this today. A stronger approach would use a different model family for evaluation (Claude or GPT for scoring Gemini output), and that is on the roadmap. If you have experience implementing cross-model evaluation pipelines, I would welcome the conversation.
10.2 Cost Model: What $0.0002 Per Conversation Means
A cost number without context is meaningless. Here is the full picture: what a single conversation costs, what it would cost at different scales, and how it compares to alternative approaches.
Per-Conversation Breakdown
Each chat exchange processes approximately 400 input tokens (system prompt context plus the visitor's query) and generates approximately 200 output tokens (the response). At Gemini 2.0 Flash rates ($0.075 per million input tokens, $0.30 per million output tokens), this yields $0.00013 per exchange. Adding the BGE-M3 embedding step ($0.000011) and the Vectorize similarity search ($0.00004) brings the all-in cost to approximately $0.0002 per conversation. That is two hundredths of a cent.
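The arithmetic above can be expressed as a small cost function with the rates passed in as parameters. The rates mirror the Gemini 2.0 Flash pricing quoted in the text; the function itself is generic, and the fixed retrieval costs are the published per-query figures.

```typescript
// Per-exchange LLM cost, with rates in USD per million tokens.

interface Rates {
  inputPerM: number;
  outputPerM: number;
}

function exchangeCost(inputTokens: number, outputTokens: number, rates: Rates): number {
  return (inputTokens * rates.inputPerM + outputTokens * rates.outputPerM) / 1_000_000;
}

const flash: Rates = { inputPerM: 0.075, outputPerM: 0.30 };

// Fixed per-query retrieval costs from the text (embedding + vector search).
const EMBEDDING_COST = 0.000011;
const VECTOR_SEARCH_COST = 0.00004;

const allIn = exchangeCost(400, 200, flash) + EMBEDDING_COST + VECTOR_SEARCH_COST;
```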
Cost at Scale
| Scenario | Conversations/Month | Monthly AI Cost | Cost Per Lead (est.) |
|---|---|---|---|
| Early stage (portfolio with moderate traffic) | 600 | $0.12 | $0.002 if 10% convert to meaningful engagement |
| Growing (active job search, LinkedIn traffic spikes) | 3,000 | $0.60 | $0.002 at same conversion |
| Peak (viral post, conference talk, resume activity) | 15,000 + 600 resumes | $3.36 | Sub-penny per interaction |
| Safety ceiling (circuit breaker active) | 30,000 (1,000/day cap) | < $15.00 | Budget-bounded regardless of traffic |
Comparative Analysis: How This Stacks Up
| Approach | Monthly Cost | Conversations Handled | Available 24/7 | Adapts to Audience |
|---|---|---|---|---|
| This system (Cloudflare + Gemini) | $0.12 - $15 | 600 - 30,000 | Yes | Yes (3 depth levels) |
| SaaS chatbot (Intercom, Drift) | $74 - $500+ | Unlimited (within plan) | Yes | Limited (rule-based) |
| OpenAI Assistants API (direct) | $5 - $50 | Varies by usage | Yes | Requires custom prompt engineering |
| Static portfolio (no chat) | $0 | 0 | N/A | N/A |
| Personal time answering emails | $0 (but hours of time) | 5-10/week realistically | No | Yes (manually) |
Gemini inference accounts for 86% of total cost. Infrastructure (Cloudflare) contributes less than 15%.
Cost Optimization Hierarchy: What Changes at Scale
The current cost floor was a deliberate architectural choice: optimize for reliability and zero-hallucination grounding first, then layer cost optimizations as volume justifies them. Three optimization tiers exist beyond the current state, each with a different volume threshold for activation.
| Optimization | Technique | Savings | Volume Threshold | Trade-off |
|---|---|---|---|---|
| Response Caching | Cache full responses for identical queries in KV (24-hour TTL) | 30-50% | Any volume | Stale responses if knowledge updates; breaks multi-turn context |
| Context Caching | Cache static system prompt server-side via Gemini API; subsequent queries pay 75% less on input tokens | 75% on input | 100+ queries/hour | Requires 32K+ token minimum; adds first-request latency; API complexity |
| Prompt Compression | Reduce Tier 1 from ~18K to ~12K tokens via structured shorthand and section deduplication | 20-30% on input | 1,000+ queries/day | Reduced LLM reasoning quality; harder to maintain; potential hallucination increase |
At the current production volume (~10 queries/day), none of these optimizations yield meaningful savings. Response caching saves the most at any volume but introduces staleness risk for a system that prides itself on accuracy. Context caching requires a minimum token threshold (32K) that the system's 18.2K prompt does not meet. Prompt compression trades maintainability and answer quality for savings of approximately $0.15/month.
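For reference, the response-caching tier would be a thin wrapper like the sketch below. The `KVLike` interface is modeled on Cloudflare Workers KV (`get`/`put` with `expirationTtl`), but the names and the naive query normalization are illustrative, and an in-memory map stands in for KV so the logic is self-contained.

```typescript
// Sketch of response caching: identical queries hit a 24-hour cache instead
// of the model. Trade-off noted in the table: staleness and loss of
// multi-turn context.

interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}

const DAY_SECONDS = 24 * 60 * 60;

async function cachedAnswer(
  kv: KVLike,
  query: string,
  generate: (q: string) => Promise<string>, // stand-in for the full pipeline
): Promise<string> {
  const key = `resp:${query.trim().toLowerCase()}`; // naive normalization
  const hit = await kv.get(key);
  if (hit !== null) return hit; // cache hit: zero inference cost
  const fresh = await generate(query);
  await kv.put(key, fresh, { expirationTtl: DAY_SECONDS });
  return fresh;
}
```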
Industry Best Practices Scorecard
Benchmarking against nine established cost optimization practices for production LLM systems, the current architecture scores 7 of 9 as already optimal. The two gaps are volume-dependent optimizations that represent deliberate deferral, not oversight.
| Practice | Industry Standard | This System | Assessment |
|---|---|---|---|
| Model selection | Cheapest model meeting quality bar | Gemini 2.0 Flash | Optimal |
| Temperature tuning | Lowest natural-sounding temperature | 0.3 (chat), 0.7 (resume gen) | Optimal |
| Budget controls | Per-user, per-day, circuit breakers | 30/hr per IP, 500/day user, 1,000/day global | Optimal |
| Cost observability | Per-operation tracking with alerts | 12 operation types, 3 alert thresholds | Optimal |
| Sampling ramp-down | Reduce monitoring cost after learning phase | 100% (day 0) to 20% (day 14+) | Optimal |
| Graceful degradation | Non-blocking cost tracking | All KV/tracking is fire-and-forget | Optimal |
| Context caching | Cache static prompts at scale | Not implemented | Correct at current volume |
| Response caching | Cache repeat queries | Not implemented | Recommended at 10x volume |
| Token budgets per use case | Vary maxOutputTokens by task | 2,048 uniform | Could reduce chat to 1,500 |
The $0.0002 per conversation figure is not about being cheap. It is about being sustainable for an individual. A system that costs $500/month to operate will be shut down the moment a job search ends. A system that costs $0.12/month can run indefinitely as a permanent professional asset. Sustainability is the enabling condition for the continuous learning loop: the longer the system runs, the better it gets.
11. Lessons and Principles
11.1 "Grounding Is Architecture, Not Guardrails"
Core Insight: Three iterations proved that prompt-level guardrails cannot compensate for architectural conflation of knowledge types. Separating immutable facts, semantic content, and deterministic outputs at the system level eliminates categories of hallucination that no amount of prompt tuning can prevent.
Action: Before writing a single prompt, decide which knowledge types your system handles and build separate storage, retrieval, and trust models for each.
11.2 "The Resume Is a Starting Point, Not a Destination"
Core Insight: A resume presents facts. A conversation demonstrates judgment. Senior technologists are hired for accumulated judgment, not for the ability to list technologies. The AI chat system bridges this gap by letting visitors ask the questions a resume cannot answer.
Action: Treat your portfolio as a conversational surface, not a document repository. Every page should invite deeper engagement, not just passive reading.
11.3 "Self-Critique Earns Its Latency"
Core Insight: Adding 150-300ms for quality scoring feels expensive until you compare it to the cost of a single low-quality response in a high-stakes context. The selective approach (scoring complex queries only) preserves baseline speed while ensuring quality where it matters most.
Action: Implement quality gates with heuristic triggers. Not every response needs review. Every important response does.
11.4 "Cost Control Is Quality Control"
Core Insight: Rate limiting prevents budget exhaustion, but it also prevents quality degradation. A single IP sending 1,000 sustained requests is not an organic visitor. Limiting abusive patterns protects both the budget and the learning dataset from noise.
Action: Design cost controls as quality controls. The 20% sampling strategy, the circuit breakers, and the per-IP limits all serve both purposes simultaneously.
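The layered limits from the scorecard (30/hour per IP, 1,000/day global) reduce to fixed-window counters. This sketch uses an in-memory map where production would use KV, and the class and function names are mine, not the codebase's:

```typescript
// Fixed-window rate limiting: each key gets `max` requests per window.

interface Limit {
  max: number;
  windowMs: number;
}

class FixedWindowLimiter {
  private counts = new Map<string, { windowStart: number; n: number }>();
  constructor(private limit: Limit) {}

  allow(key: string, now: number): boolean {
    const entry = this.counts.get(key);
    if (!entry || now - entry.windowStart >= this.limit.windowMs) {
      this.counts.set(key, { windowStart: now, n: 1 }); // new window
      return true;
    }
    if (entry.n >= this.limit.max) return false; // limit tripped
    entry.n++;
    return true;
  }
}

const perIp = new FixedWindowLimiter({ max: 30, windowMs: 3_600_000 });
const globalCap = new FixedWindowLimiter({ max: 1_000, windowMs: 86_400_000 });

// A request must pass every layer; a tripped limit bounds both the budget
// and the noise entering the learning dataset.
function admit(ip: string, now: number): boolean {
  return perIp.allow(ip, now) && globalCap.allow("global", now);
}
```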
11.5 "Honest Boundaries Build More Trust Than Impressive Answers"
Core Insight: The system's most credibility-building moments are not its best answers. They are its honest refusals. Saying "I don't have direct experience with quantum computing" demonstrates a level of reliability that no generated answer can match.
Action: Design for graceful boundaries. The constitution layer should define not just what the system knows, but what it explicitly does not know and how it communicates that.
11.6 "Measure the System, Not Just the Model"
Core Insight: LLM benchmarks measure model capability. Production metrics measure system reliability. The admin dashboard, the confidence scoring, and the daily grading cycle exist because model quality alone does not guarantee system quality. The pipeline, the retrieval, the rate limiting, and the cost controls all contribute to (or detract from) the visitor's experience.
Action: Build observability into the system from day one. Confidence scores, cost per operation, and human feedback are not optional instrumentation. They are essential debugging tools.
11.7 Known Limitations and Trade-offs
Every system has boundaries, and being upfront about them is part of the work. These are the limitations I am aware of today, along with where each stands.
| Limitation | Impact | Mitigation / Status |
|---|---|---|
| Same-model self-critique | Gemini evaluating its own output may share systematic biases | Human review compensates; cross-model evaluation planned |
| Cosine 0.75 threshold is heuristic | May exclude marginally relevant articles or include tangentially related ones | Threshold tuned empirically over 60 days; adaptive thresholding under consideration |
| Audience classification is inferred | Misclassification serves wrong depth level (CTO gets recruiter-depth answer) | First-exchange heuristic works for 90%+ of cases; explicit selection being considered |
| No multi-session memory | Returning visitors start fresh with no context from prior visits | Deliberate privacy trade-off; acceptable for portfolio use case |
| Regex injection defense is brittle | Sophisticated adversarial inputs can bypass pattern matching | Constitution layer + self-critique provide defense-in-depth; classifier-based detection planned |
| Single-operator scale | Architecture assumes one person's knowledge base; multi-tenant not yet tested | Three-tier pattern is generalizable; tenant isolation requires additional work |
These are trade-offs, not oversights. Each was evaluated for the current use case. If you are building something similar and facing different constraints, I would genuinely enjoy comparing notes on where the right trade-offs differ: reach out anytime.
If you are evaluating a conversational system of your own, these are the warning signs that grounding belongs at the architecture level rather than the prompt level:
1. Hallucinations persist despite prompt engineering improvements.
2. You cannot explain which knowledge source generated a specific answer.
3. Cost scales linearly with traffic rather than with value delivered.
4. There is no human feedback loop informing system behavior.
5. The system cannot say "I don't know" when it genuinely does not know.
6. Security was added after the initial deployment, not designed alongside it.
7. You have no confidence metric per response.
12. The Road Ahead
The current system establishes a foundation. The architecture was designed with extensibility in mind, and three evolution paths are under active consideration.
| Evolution | Description | What It Enables |
|---|---|---|
| Multi-Modal Interaction | Voice interface using ElevenLabs integration; visitors ask questions verbally and receive spoken responses | Accessibility, deeper engagement, conference demo capability |
| Expanded Knowledge Tiers | Tier 4 for real-time context (current projects, recent publications, live conference talks) | Always-current portfolio without manual updates |
| Framework Generalization | Extract the three-tier architecture into a reusable template that other professionals can deploy for their own portfolios | Scalable career intelligence for the broader professional community |
| Cross-Model Evaluation | Use a different model family (Claude, GPT) for self-critique scoring to eliminate same-model evaluation bias | Higher confidence in hallucination detection; independent quality validation |
12.1 A Note on Cloudflare Dependency
This architecture runs entirely on Cloudflare's ecosystem: Workers, D1, KV, Vectorize, R2, and Pages. That is a real dependency, and it is worth addressing openly.
The core application logic is standard TypeScript with no Cloudflare-specific SDK in the business layer. The three-tier knowledge architecture is a design pattern, not a platform feature. Tier 1 is string assembly. Tier 2 is cosine similarity against any embedding store. Tier 3 is standard SQL. The SSE streaming is HTTP-native. Porting to AWS Lambda + Aurora + OpenSearch, or Vercel + Supabase + Pinecone, would mean replacing infrastructure bindings, not redesigning the architecture. The estimated effort is 2-3 days of integration work.
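As an illustration of that portability claim, the Tier 2 retrieval primitive is nothing more than cosine similarity plus the 0.75 acceptance threshold described earlier. The sketch below is plain TypeScript with no platform dependency; the function names and the `topK` default are illustrative.

```typescript
// Cosine similarity and threshold-gated retrieval: the portable core of Tier 2.

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const SIMILARITY_FLOOR = 0.75; // below this, the article is not used for grounding

function retrieve<T>(
  queryVec: number[],
  index: { vec: number[]; doc: T }[],
  topK = 3,
): T[] {
  return index
    .map((e) => ({ doc: e.doc, score: cosineSimilarity(queryVec, e.vec) }))
    .filter((e) => e.score >= SIMILARITY_FLOOR)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((e) => e.doc);
}
```

Swapping Vectorize for Pinecone or OpenSearch changes where the index lives, not this logic.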
The Cloudflare choice was intentional for this use case: zero cold starts, global edge deployment, unified billing, and total infrastructure cost under $5/month (excluding AI inference). For a system operated by one person, those operational benefits matter. For a multi-tenant framework, the portability layer would need to be formalized. That trade-off felt right for where the system is today.
The broader vision is not a single portfolio chat system. It is a pattern: grounded conversational AI as a standard interface for professional credibility. Resumes will not disappear, but they will become the starting point of a conversation rather than its entirety. The professionals who explore this space early will help shape how the next generation of technical hiring conversations work. This system is one approach, and there are certainly others. Try it live, and if you are thinking about similar problems, let's learn from each other.
13. Technology Stack Summary
| Category | Technologies |
|---|---|
| Primary LLM | Google Gemini 2.0 Flash (chat), Gemini 2.0 Flash Lite (self-critique, patch generation) |
| Embedding Model | Cloudflare Workers AI (BGE-M3, 1024 dimensions) |
| Vector Database | Cloudflare Vectorize |
| Application Database | Cloudflare D1 (SQLite) |
| Cache / State | Cloudflare KV (rate limits, feature flags, session state) |
| Object Storage | Cloudflare R2 (resume PDFs, generated documents) |
| Compute | Cloudflare Workers (TypeScript, edge-optimized serverless) |
| Static Hosting | Cloudflare Pages (Astro 5 SSG) |
| Streaming Protocol | Server-Sent Events (SSE, 8 event types) |
| Authentication | Google OAuth2 + email allowlist (admin) |
| Email Delivery | Resend |
| SMS Alerts | Vonage |
| Error Tracking | Sentry (10% sampling) |
| Analytics | PostHog (optional, privacy-first) |