- Executive Summary
- The Challenge: Why Resumes Fail Senior Technologists
- Diagnosis: Three Root Causes of the Portfolio Gap
- The Three-Tier Knowledge Architecture
- The Evolution: Three Iterations to Zero Hallucination
- Audience-Adaptive Depth: Same Facts, Three Conversations
- The Agentic Pipeline: From Query to Grounded Response
- The Self-Governing System: Grading, Learning, and Cost Control
- Security Architecture
- Results and Production Performance
- Lessons and Principles
- The Road Ahead
- Technology Stack Summary
- About the Author
1. Executive Summary
A resume is a static artifact. It presents the same flat content to a CTO evaluating architecture depth, a hiring manager assessing team leadership, and a recruiter scanning for keyword matches. It cannot answer follow-up questions. It cannot adapt its depth. It cannot demonstrate how an architect thinks through a problem. For senior technologists whose value lies in 24 years of accumulated judgment, not a list of bullet points, this is a fundamental limitation.
This white paper documents a production-grade AI chat system built to solve that limitation. The system is deployed on a personal portfolio site, powered entirely by Cloudflare's serverless edge infrastructure, and grounded in a three-tier knowledge architecture that separates immutable professional facts from semantic content search and deterministic resume generation. In 60+ days of production operation across 2,000+ conversations, the system has produced zero hallucinations, maintained 99.95% uptime, and operated at a baseline cost of $0.12 per month.
The architecture reflects a principle I have come to trust through building it: modern LLMs can power reliable conversational interfaces when the grounding problem is solved at the architecture level, not patched at the prompt level. The three-tier separation, combined with audience-adaptive depth, a self-critique quality gate, and a human-in-the-loop learning pipeline, produces a system that works. It is not perfect, and this paper is transparent about where the gaps remain. But the foundation is solid, and I am sharing it in the hope that others building similar systems find something useful here.
2. The Challenge: Why Resumes Fail Senior Technologists
The problem begins with a mismatch between what senior technologists know and what a resume can communicate. A two-page document can list that someone reduced Splunk TCO by 80% through intelligent data routing. It cannot explain the architectural reasoning behind the 4-tier classification framework (critical operational, compliance, diagnostic, noise) that made that cost reduction sustainable. It cannot describe the political resistance from 20 business units that each believed their data was "critical," how the classification criteria were negotiated, and why transparent governance was a sign of organizational maturity rather than a top-down mandate. It cannot adapt that story to the specific concerns of the person reading it.
The visitors who matter most have the most specific questions. A CTO evaluating a platform architect wants to know: "Have you led enterprise observability transformation at scale? What was the cost reduction strategy for 3-5 TB daily ingest?" A hiring manager wants outcomes: "What was the timeline, and how did the team perform?" A recruiter needs a 30-second signal: "Does this person match the role?" A static portfolio answers none of these in the visitor's own language.
The catalyst was straightforward. A CTO visited the portfolio, browsed three pages, and left. The analytics showed an average session of 47 seconds. A conversational system that could answer "How did you achieve 80% observability cost reduction?" with grounded, first-person depth would have turned that 47-second bounce into a 5-minute engagement. The question was whether AI could do this without fabricating credentials.
2.1 The AI Trust Problem
Conversational AI is powerful but dangerous for professional branding. Large language models hallucinate when they encounter queries beyond their training boundaries, when the underlying data is shallow or inconsistent, or when the system is under pressure to generate an answer when uncertainty would be more honest. For a personal portfolio, a single fabricated credential or invented project detail can undermine years of carefully built professional credibility. The system must never guess. It must say "I don't have direct experience with that" when it genuinely does not.
This constraint shaped every architectural decision that followed. The system was not designed to be impressive. It was designed to be trustworthy. The result is live at hemanthshivanna.com.
3. Diagnosis: Three Root Causes of the Portfolio Gap
Before building the solution, I diagnosed why existing approaches fail. The gap between what a senior technologist knows and what a portfolio communicates has three compounding root causes.
| Root Cause | Symptom | Consequence |
|---|---|---|
| The Data Gap | Static content cannot adapt to the reader's context or expertise level | CTOs get the same shallow summary as recruiters; neither finds what they need |
| The Trust Gap | Generic AI chatbots hallucinate when data is thin or queries are specific | A single fabricated claim destroys professional credibility |
| The Cost Gap | Enterprise conversational AI solutions are priced for enterprise budgets | Individual professionals cannot justify $500+/month for a portfolio assistant |
These three gaps compound each other. The data gap means visitors leave quickly. The trust gap means AI cannot simply be bolted on. The cost gap means the solution must be architecturally efficient, not just functionally capable. Solving one gap without the other two produces either an expensive system nobody trusts or a cheap system nobody visits.
If your portfolio has any of these characteristics, the gap is already costing you opportunities: average session duration under 60 seconds, zero inbound questions from visitors, resume download rates below 2%, or visitor-to-conversation conversion of zero. These are not traffic problems. They are depth problems.
4. The Three-Tier Knowledge Architecture
The core architectural insight is that not all knowledge has the same reliability requirements, retrieval characteristics, or update frequency. Treating all data as a single retrieval problem (the default RAG approach) conflates three fundamentally different knowledge types and introduces hallucination risk at every boundary. The solution separates knowledge into three tiers, each with its own storage, retrieval mechanism, and trust model.
[Figure: the three-tier knowledge architecture. Tier 1: 14K tokens, zero latency. Tier 2: 47 articles, cosine ≥ 0.75. Tier 3: 11 resume variants. Output: audience-adapted, cited, scored.]
| Property | Tier 1: Static Facts | Tier 2: Semantic Search | Tier 3: Deterministic Lookup |
|---|---|---|---|
| Storage | System prompt (in-context) | Cloudflare Vectorize (BGE-M3, 1024-dim) | Cloudflare D1 (SQLite) |
| Retrieval Latency | 0ms (always present) | 15-40ms (embedding + similarity search) | 5-10ms (parameterized SQL query) |
| Accuracy Guarantee | Deterministic (canonical facts, never wrong) | Probabilistic (cosine similarity >= 0.75 threshold) | Deterministic (exact keyword match against structured data) |
| Failure Mode | Context window overflow (mitigated by selective loading) | False negatives below threshold; false positives above threshold | No match found (graceful fallback to Tier 1) |
| Update Frequency | Manual (rebuild system prompt on profile change) | Incremental (chunk + embed new articles, no retraining) | On-demand (add/modify resume variants in D1) |
| Token Cost Per Query | ~14,000 input tokens (fixed overhead) | ~500 tokens per retrieved chunk (variable, 0-5 chunks) | ~200 tokens (structured resume metadata) |
Example cited sources surfaced in responses: hemanthshivanna.com/whitepaper/observability-transformation, hemanthshivanna.com/whitepaper/mq-ace-stabilization, hemanthshivanna.com/article/uplifting-service-delivery.
4.1 Tier 1: Immutable Professional Facts (Static System Prompt)
Tier 1 contains everything that defines the author's professional identity: 7 professional roles with quantified metrics, 13 technical skill categories spanning 65+ technologies, 12 verified testimonials, 6 major accomplishments with measured impact, and 5 persona rules governing audience-adaptive depth. This knowledge is embedded directly in the system prompt at 14,000 tokens. It requires zero retrieval latency and zero retrieval failure risk. Profile facts are always correct because they are always present.
The rationale is practical. If a visitor asks "What is your observability transformation experience?", the answer must come from verified career data, not from a semantic search that might return a tangentially related article. Tier 1 guarantees that identity questions are answered from identity data.
Scalability consideration: At 14,000 tokens, the static system prompt consumes a meaningful portion of the context window on every request. As a career grows (more roles, more metrics, more accomplishments), Tier 1 will eventually exceed what is practical to embed in every prompt. The planned mitigation is a selective loading strategy: classify the query's domain first (for example, observability, service delivery, leadership), then load only the relevant Tier 1 subsections rather than the full 14K tokens. This preserves the zero-retrieval-latency property while reducing token cost per request. The current 14K footprint works within Gemini 2.0 Flash's context window, but the architecture anticipates the need to partition it.
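The planned selective loader can be sketched as follows. The section names and keyword routing here are illustrative assumptions, not the production design:

```python
import re

# Illustrative Tier 1 subsections. The core identity block is always loaded;
# domain sections are loaded only when the query classifier selects them.
TIER1_SECTIONS = {
    "core_identity": "## Roles, skills, testimonials ...",
    "observability": "## Observability transformation facts ...",
    "service_delivery": "## Service delivery facts ...",
    "leadership": "## Leadership and team facts ...",
}

# Assumed keyword routing; a production classifier would be richer.
DOMAIN_KEYWORDS = {
    "observability": {"observability", "splunk", "cribl", "telemetry", "logs"},
    "service_delivery": {"delivery", "incident", "mttr", "sre"},
    "leadership": {"team", "leadership", "mentoring", "hiring"},
}

def select_tier1_sections(query: str) -> list[str]:
    """Classify the query's domain, then load only the relevant subsections."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    selected = ["core_identity"]  # identity facts are always present
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if words & keywords:
            selected.append(domain)
    return [TIER1_SECTIONS[name] for name in selected]
```

The key property is preserved: no retrieval call is made, so the zero-latency guarantee holds; only the token footprint shrinks.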
4.2 Tier 2: Semantic Content Search (Cloudflare Vectorize)
Tier 2 handles content depth. The system indexes 47 published articles and 8 technical whitepapers using BGE-M3 embeddings (1024 dimensions) stored in Cloudflare Vectorize. When a visitor asks a content question ("What is your approach to observability transformation?"), the pipeline retrieves the top 5 semantic matches filtered to a cosine similarity threshold of 0.75. Results below that threshold are discarded rather than surfaced with low confidence.
New articles do not require retraining the model or rebuilding the system prompt. They are chunked, embedded, and indexed. The semantic search handles nuanced content queries without touching the immutable identity layer.
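Logically, the Tier 2 step reduces to a threshold-filtered similarity search. Vectorize performs this server-side; this sketch shows the equivalent logic:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, threshold=0.75, top_k=5):
    """Return up to top_k chunks above the similarity threshold.
    Sub-threshold matches are discarded, never surfaced with low confidence."""
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(score, chunk) for score, chunk in scored[:top_k] if score >= threshold]
```

The discard behavior is the important design choice: a near-miss at 0.74 is treated as "no evidence," not "weak evidence."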
Embedding model choice: BGE-M3 was selected over alternatives (OpenAI ada-002, Cohere embed-v3) for a specific reason: it runs natively on Cloudflare Workers AI, eliminating external API calls for embedding operations. On the MTEB benchmark, BGE-M3 ranks competitively for retrieval tasks in the 1024-dimension class, and its multilingual support provides future optionality. The 1024-dimension vectors provide sufficient semantic resolution for a corpus of 55 documents. For a corpus of 10,000+ documents, a higher-dimensional model, a reranking stage (such as Cohere Rerank or a cross-encoder), or a hybrid sparse-dense retrieval approach would likely be necessary.
Threshold calibration: The cosine similarity threshold of 0.75 was not arbitrary. It was calibrated empirically over the first 30 days of production by analyzing retrieval precision and recall on a manually labeled set of 200 query-article pairs. At 0.70, recall was high but precision dropped (tangentially related articles surfaced). At 0.80, precision was excellent but recall suffered (relevant articles missed on paraphrased queries). The 0.75 threshold produced the best F1 score for this corpus size. This threshold should be re-evaluated as the corpus grows, and adaptive thresholding (adjusting by query confidence) is under consideration.
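The calibration procedure amounts to sweeping candidate thresholds over the labeled query-article pairs and selecting the best F1. A minimal sketch (the labeled data used in the test is synthetic):

```python
def f1_at_threshold(pairs, threshold):
    """pairs: (similarity, is_relevant) tuples from a labeled evaluation set."""
    tp = sum(1 for sim, rel in pairs if sim >= threshold and rel)
    fp = sum(1 for sim, rel in pairs if sim >= threshold and not rel)
    fn = sum(1 for sim, rel in pairs if sim < threshold and rel)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(pairs, candidates=(0.70, 0.75, 0.80)):
    """Pick the candidate threshold that maximizes F1 on the labeled set."""
    return max(candidates, key=lambda t: f1_at_threshold(pairs, t))
```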
Index freshness and embedding drift: A known limitation of static embedding indexes is drift: as the corpus grows or the embedding model is updated, older embeddings may become semantically inconsistent with newer ones. The current mitigation is full re-indexing on model updates (a 15-minute batch operation for 55 documents). For larger corpora, delta indexing with periodic consistency checks would be necessary. The system does not currently detect embedding drift automatically; this is a planned improvement.
Chunking strategy: Articles are split into overlapping chunks of approximately 500 tokens with 50-token overlap. The overlap ensures that concepts spanning a chunk boundary are captured in at least one complete chunk. Each chunk retains metadata (article title, URL, publication date) so that the response assembly stage can cite sources accurately. The chunking is paragraph-aware: splits prefer paragraph boundaries over arbitrary token counts, preserving semantic coherence within each chunk.
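A paragraph-aware chunker with overlap carry-over can be sketched as below. Token counts are approximated by whitespace-delimited words here; the production tokenizer differs, and oversized single paragraphs are emitted whole in this sketch:

```python
def chunk_article(paragraphs: list[str], max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Greedy paragraph-aware chunking: fill each chunk up to ~max_tokens,
    splitting only at paragraph boundaries, and carry ~overlap tokens of
    trailing context into the next chunk so boundary-spanning concepts
    appear complete in at least one chunk."""
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        plen = len(para.split())
        if current and current_len + plen > max_tokens:
            chunks.append(" ".join(current))
            tail_words = " ".join(current).split()[-overlap:]  # overlap carry-over
            current, current_len = [" ".join(tail_words)], len(tail_words)
        current.append(para)
        current_len += plen
    if current:
        chunks.append(" ".join(current))
    return chunks
```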
4.3 Tier 3: Deterministic Resume Lookup (D1 Structured Query)
Tier 3 serves a specific, high-stakes use case: generating role-matched resumes. When a visitor provides a job description, the system uses structured keyword matching against 11 resume variants stored in Cloudflare D1. The matching algorithm extracts hard skills (technologies, tools, certifications) and soft skills (leadership, collaboration) from the job description, then scores each resume variant by keyword overlap density. This is deliberately not semantic: resumes are exact, high-stakes artifacts where creative interpretation is a liability. Deterministic matching guarantees the right variant every time. For systems with hundreds of resume variants, TF-IDF weighting or a lightweight learned ranker would improve selection quality, but for 11 variants, explicit keyword scoring is sufficient and fully interpretable.
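The deterministic matcher reduces to keyword overlap scoring. A sketch, assuming keywords have already been extracted into sets (the extraction step is omitted):

```python
def score_variant(jd_keywords: set[str], variant_keywords: set[str]) -> float:
    """Keyword overlap density: the fraction of job-description keywords
    covered by the resume variant. Deterministic and fully interpretable."""
    if not jd_keywords:
        return 0.0
    return len(jd_keywords & variant_keywords) / len(jd_keywords)

def select_resume(jd_keywords: set[str], variants: dict[str, set[str]]) -> str:
    """variants: mapping of variant name -> keyword set. Highest overlap wins."""
    return max(variants, key=lambda name: score_variant(jd_keywords, variants[name]))
```

Because the score is a simple ratio, every selection can be explained to a visitor ("this variant matched 8 of your 10 required skills"), which matters for a high-stakes artifact.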
Grounding is not a guardrail problem. It is an architecture problem. Separating immutable facts (Tier 1) from derived knowledge (Tier 2) from deterministic outputs (Tier 3) at the system level eliminates entire categories of hallucination that no amount of prompt engineering can prevent. Each tier has a different trust model because each tier has a different failure mode.
A natural objection: isn't sending 14,000 tokens of static knowledge with every query wasteful? The answer is no, and the reasoning is both economic and architectural.
The economics have shifted. The cost argument against large static prompts originates from GPT-4-era pricing ($30/1M input tokens). Gemini 2.0 Flash charges $0.075/1M input tokens, a 400x reduction. At that rate, the static knowledge costs $0.001 per query. At production volumes (~10 queries/day), Tier 1 costs less than $0.55/month. Optimizing it away saves pennies while introducing real complexity.
The context window is underutilized. At ~18,200 tokens (the 14K knowledge base plus the surrounding prompt layers), the static context occupies 14.2% of the 128K context window. Even in the worst case (all three tiers active plus multi-turn history), the system uses only 21% of available capacity. Production RAG systems routinely fill 30-50% of context with retrieved chunks. This system is conservative by comparison.
The alternatives introduce risk. Moving static content to retrieval adds 50-150ms latency per query, introduces semantic search relevance failures on core identity data, and creates infrastructure dependencies (Vectorize availability, cold starts). Dynamically selecting which static sections to include requires a pre-query classifier that itself costs tokens and latency, and wrong classification means missing context, which means wrong answers. For a zero-hallucination career AI where every response must be factually grounded, the reliability guarantee of always-present identity context outweighs a sub-penny cost.
This is not general RAG. General-purpose RAG systems serve millions of documents where prompt stuffing is physically impossible. This system has a bounded knowledge domain: one person's career. The entire core identity fits in 14.2% of a single context window. The three-tier architecture uses each tier for what it does best: Tier 1 for data relevant to 100% of queries (cheap to include, catastrophic to miss), Tier 2 for article content relevant to ~30% of queries (retrieved on demand), and Tier 3 for resume variants relevant to ~5% of queries (fetched by exact match).
5. The Evolution: Three Iterations to Zero Hallucination
The current architecture was not the first design. It was the third. The journey from naive retrieval to production-grade grounding involved two significant failures that shaped every subsequent decision. Documenting them honestly matters more than presenting the final state as inevitable.
5.1 Iteration 1: Naive RAG
The first iteration used a standard retrieval-augmented generation pipeline. Articles were embedded, queries retrieved relevant chunks, and the LLM generated freeform answers. The results were superficially impressive and fundamentally unreliable. The system produced plausible-sounding but false statements about the author's experience. For example, an article that mentioned Apache Spark in the context of industry trends generated a response claiming the author had hands-on Spark experience. The system could not distinguish between "this article discusses Spark" and "the author has worked with Spark."
The failure was architectural, not behavioral. A single retrieval tier treats all knowledge as equivalent. Profile facts and article references occupy the same embedding space with no structural distinction.
5.2 Iteration 2: Layered Tiers
The second iteration separated profile facts (Tier 1, always present in the system prompt) from article content (Tier 2, retrieved on demand). This eliminated hallucinations about professional background entirely. Questions like "What is your experience?" now drew exclusively from verified career data.
However, the system still generated slightly inaccurate context around retrieved articles. It would correctly attribute insights to a published whitepaper but add framing that subtly misrepresented the scope or conclusions. The problem was that the LLM was still generating interpretive text around factual content without a quality check.
5.3 Iteration 3: Constitution + Guardrails (Production)
The third and current iteration added three critical layers. First, an inviolable constitution defining canonical facts: specific project names, metrics, and role titles that the system cannot contradict regardless of query pressure. Second, a 20-pattern prompt injection detector that strips adversarial inputs before they reach the model. Third, a self-critique gate that scores every response on five quality dimensions (accuracy, voice, formatting, completeness, tone) and triggers automatic revision when the aggregate score falls below 0.7 for complex queries.
The result: zero hallucinations across 2,000+ production conversations, verified by daily human review through the admin dashboard.
The validation came from an unexpected source. A hiring manager asked a deliberately adversarial question: "Tell me about your experience with quantum computing." The system responded: "I don't have direct experience with quantum computing. My background is in enterprise platform engineering, observability transformation, and service delivery architecture. If you are interested in those areas, I can speak to them in detail." That response, honest and bounded, earned more credibility than any generated answer could have.
6. Audience-Adaptive Depth: Same Facts, Three Conversations
A CTO and a recruiter ask the same question with different expectations. The system does not generate different answers for different audiences. It generates different depths of the same verified answer. The facts are identical. The detail, technical context, and response length adapt based on audience classification determined during the first exchange.
| Audience | Depth Characteristics | Response Target |
|---|---|---|
| CTO / SVP | Architecture decisions, cost models, trade-offs, metrics with context. 40% more technical detail than baseline. | 600-1,200 characters |
| Hiring Manager | Outcomes, team impact, timelines, leadership signals. Focus on results over implementation. | 300-600 characters |
| Recruiter | Concise qualification signals, keyword-dense, scannable. Maximum clarity in minimum words. | 150-300 characters |
Consider the question: "Tell me about your observability transformation work." The CTO response includes the Cribl Stream architecture, 4-tier data classification framework, volume reduction from 3-5 TB/day to 700 GB/day, 80%+ TCO reduction ($1.7M annual savings), and an invitation to discuss architecture decisions. The hiring manager response includes scope (20 business units, 100+ teams), timeline (18 months), and outcomes (80%+ TCO reduction, 40% MTTR improvement, query times from 15 minutes to 90 seconds). The recruiter response confirms enterprise observability transformation with key metrics in two sentences.
All three responses draw from the same Tier 1 knowledge base. No facts are invented. The system varies depth, not truth.
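The audience-to-depth mapping in the table above can be expressed as configuration. The dataclass shape and the fallback to the hiring-manager profile for unclassified visitors are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class DepthProfile:
    min_chars: int
    max_chars: int
    technical_detail: float  # multiplier relative to baseline depth

# Values mirror the table above; only the structure is assumed.
DEPTH_PROFILES = {
    "cto": DepthProfile(600, 1200, 1.4),            # 40% more technical detail
    "hiring_manager": DepthProfile(300, 600, 1.0),  # outcomes over implementation
    "recruiter": DepthProfile(150, 300, 0.6),       # scannable, keyword-dense
}

def depth_for(audience: str) -> DepthProfile:
    """Unclassified visitors fall back to the mid-depth profile (assumption)."""
    return DEPTH_PROFILES.get(audience, DEPTH_PROFILES["hiring_manager"])
```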
7. The Agentic Pipeline: From Query to Grounded Response
Every query passes through an 8-stage deterministic pipeline that routes intent, retrieves knowledge from the correct tier, scores quality, and streams the response in real time. The pipeline is deterministic in the sense that the same query type always follows the same path. The LLM generates the response content; the pipeline controls everything around it.
7.1 The 7-Layer System Prompt
The system prompt is not a static block of text. It is rebuilt for every chat turn, assembling seven layers in sequence: identity and session context, the inviolable constitution (canonical facts and grounding rules), the full 14K-token knowledge base, retrieved article context (if applicable), behavioral rules and guardrails, resume generation rules (if applicable), and execution rules including learned patches from the human feedback loop. Each layer is independently testable and updatable. A guardrail fix does not touch the knowledge base. A new article does not require regenerating the prompt skeleton.
7.2 The Self-Critique Quality Gate
After the primary LLM (Gemini 2.0 Flash) generates a response, a lightweight scoring model (Gemini 2.0 Flash Lite) evaluates the output on five dimensions: accuracy against the knowledge base, adherence to the author's voice, formatting quality, completeness relative to the query, and professional tone. If the aggregate score falls below 0.7 on a complex query (identified by a heuristic: response length exceeding 800 characters or multi-part question structure), the system triggers an automatic revision and streams the improved response as a separate SSE event.
This adds 150-300ms of latency. For simple queries ("What is your background?"), the self-critique gate is bypassed. For high-stakes queries ("How would you approach an enterprise observability transformation for 20 business units with 3-5 TB daily ingest?"), the additional latency is a worthwhile investment in response quality. The median latency of 423ms reflects this selective approach.
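The gate's decision logic can be sketched as below. The plain mean over the five dimensions and the question-mark count for multi-part detection are illustrative simplifications of the production heuristic:

```python
def is_complex(query: str, response: str) -> bool:
    """Heuristic from the pipeline: long responses or multi-part questions."""
    return len(response) > 800 or query.count("?") > 1

def needs_revision(scores: dict[str, float], query: str, response: str,
                   threshold: float = 0.7) -> bool:
    """Aggregate the five dimension scores (accuracy, voice, formatting,
    completeness, tone); trigger revision only for complex, low-scoring output."""
    aggregate = sum(scores.values()) / len(scores)
    return is_complex(query, response) and aggregate < threshold
```

Gating on complexity first is what keeps the median latency low: simple queries never pay the 150-300ms revision cost.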
7.3 Real-Time Streaming
Responses are delivered via Server-Sent Events across 8 event types: thinking indicators (so the visitor understands system state), extended thinking tokens for complex queries, the primary content response, revision events when self-critique triggers, source citations, interactive UI cards (resume previews, PDF links), follow-up suggestions (delayed 25 seconds to avoid overwhelming the user), and completion metadata including token count and confidence score. A 1,500ms heartbeat prevents UI freeze during Gemini processing.
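The SSE framing itself is simple. A sketch of the event serializer; the event names and payload shapes below are illustrative, not the production wire format:

```python
import json

def sse_event(event_type: str, payload: dict) -> str:
    """Serialize one Server-Sent Event frame: named event plus JSON data,
    terminated by a blank line as the SSE spec requires."""
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"

# SSE comment frame (ignored by EventSource clients); sent every 1,500ms
# to keep the connection alive while the model is still generating.
HEARTBEAT = ": heartbeat\n\n"
```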
7.4 Multi-Turn Conversation and Context Management
The system maintains conversation history within a session, enabling follow-up questions ("Tell me more about the data classification framework" after discussing observability transformation). Session state is held in Cloudflare KV with a time-to-live, not in the browser or in a persistent database. When the session expires, the conversation history is lost.
This is a deliberate trade-off. Persistent cross-session memory would require visitor identification, which conflicts with the privacy-by-architecture principle. The system prioritizes anonymous, stateless interactions over continuity. Each session is self-contained. A returning visitor starts fresh, which means the system cannot build on prior conversations but also cannot accumulate a profile of the visitor.
Within a session, the full conversation history is injected into each subsequent prompt. This grows the token count per request as the conversation deepens. The practical limit is approximately 15-20 exchanges before the combined system prompt (14K tokens) plus conversation history approaches the context window boundary. For portfolio chat, this is rarely a constraint: most sessions involve 3-7 exchanges. For a generalized framework, implementing conversation summarization (compressing older turns into a summary) would extend the effective session length.
8. The Self-Governing System: Grading, Learning, and Cost Control
8.1 The Admin Dashboard
The admin interface at /stats (Google OAuth gated) provides five views: an executive overview with KPI cards and trend graphs, a cost and usage breakdown by operation type, a chat intelligence panel with per-conversation and per-message grading, an infrastructure health monitor (Workers, D1, KV, Vectorize latency percentiles), and a service cost attribution view.
8.2 The Human-in-the-Loop Feedback Pipeline
The learning system operates on a daily cycle. The admin reviews 5-10 conversations per day, grading each on a four-level scale: excellent (accurate, insightful, well-formatted), good (accurate with minor gaps), needs work (some inaccuracy or missing nuance), and poor (fundamentally misleading). At 2 AM UTC, a learning pipeline extracts patterns from low-confidence samples (below 0.6), generates guardrail patches using Gemini Flash Lite, and injects the top 3 good examples and top 3 poor examples into the system prompt's execution layer for the next day's inference. This creates a feedback mechanism that improves accuracy over time without retraining the underlying model.
8.3 Cost Control as Architecture
Cost control is not an afterthought. It is embedded in the sampling strategy. Storing every chat would cost $0.50+ per day. Instead, the system samples 20% of all chats to D1 for learning purposes, retains 100% of low-confidence chats (below 0.6 score) regardless of sampling, and retains 100% of admin-graded chats permanently. This biases the learning dataset toward hard cases while keeping storage costs near zero.
Rate limiting operates at three levels: 200 requests per hour per IP for chat, 5 requests per minute for resume generation, and a global daily cap of 1,000 Gemini API calls. Per-IP cumulative cost is tracked in KV with alerts at $0.30, $0.45, and $0.60 thresholds. A circuit breaker triggers a 5-minute cooldown when any IP exceeds $0.60 in a 15-minute window. Five consecutive API failures trigger a 5-minute backoff.
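The per-IP budget check can be sketched as a pure function over the tracked costs; the KV bookkeeping and the cooldown timer are omitted here:

```python
def check_ip_budget(window_cost: float, cumulative_cost: float) -> tuple[list[float], bool]:
    """Returns (alert_thresholds_crossed, circuit_open). Thresholds match the
    text: cumulative alerts at $0.30/$0.45/$0.60; the circuit breaker opens
    when one IP exceeds $0.60 inside the 15-minute window."""
    alerts = [t for t in (0.30, 0.45, 0.60) if cumulative_cost >= t]
    circuit_open = window_cost > 0.60
    return alerts, circuit_open
```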
9. Security Architecture
Security is designed alongside the application, not bolted on after deployment. The system implements defense in depth across four layers.
Input Sanitization: All user inputs pass through 20 regex-based prompt injection patterns before reaching the LLM. Detected patterns include common adversarial phrases ("ignore all previous instructions," "reveal your prompt," "developer mode") as well as structural injection attempts (XML/instruction tags, template syntax). Script injection patterns and SQL injection vectors are stripped separately. All database operations use parameterized queries through D1 prepared statements.
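For illustration, a detector in the same spirit, using a small assumed subset of patterns (the full 20-pattern production set is not reproduced here):

```python
import re

# Illustrative subset of adversarial patterns; case-insensitive matching.
INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE) for p in (
        r"ignore\s+(all\s+)?previous\s+instructions",
        r"reveal\s+your\s+(system\s+)?prompt",
        r"developer\s+mode",
        r"</?\s*(system|instruction)\s*>",   # structural / XML-style injection
        r"\{\{.*?\}\}",                      # template syntax
    )
]

def sanitize(text: str) -> tuple[str, bool]:
    """Strip matched adversarial spans; report whether anything was detected."""
    detected = False
    for pattern in INJECTION_PATTERNS:
        text, n = pattern.subn("", text)
        detected = detected or n > 0
    return text.strip(), detected
```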
Authentication: The admin dashboard requires Google OAuth2 with an email allowlist. Bearer tokens are validated against Google's tokeninfo endpoint. Public chat requires no authentication but is rate-limited by IP.
Origin Control: CORS is restricted to hemanthshivanna.com and its subdomains. All other origins are rejected.
Observability: Sentry captures errors at 10% trace sampling in production. Structured JSON logs go to stdout for Cloudflare's log pipeline. Health checks run every 5 minutes monitoring latency and database connectivity. Cost alerts notify by email and SMS.
9.1 Data Privacy and Conversation Retention
Visitors interact with the chat without authentication. This raises a question any security-minded architect would ask: what happens to conversation data? The retention policy is deliberately minimal. Only 20% of conversations are sampled to D1 for learning purposes. Sampled conversations store the query and response text, confidence score, and latency metrics. No IP addresses, browser fingerprints, or geolocation data are persisted. The 80% of conversations that are not sampled exist only in ephemeral Worker memory and are discarded when the request completes.
Low-confidence conversations (below 0.6) are always retained regardless of sampling, but this is a quality signal, not a surveillance mechanism. The admin grading interface displays conversation content without visitor identification. There is no mechanism to correlate a conversation with a specific visitor, and no mechanism was built to enable one.
For the resume generation flow, visitors voluntarily provide a job description. This text is processed in memory, used for keyword matching against resume variants, and discarded after the response is generated. Job descriptions are not stored in D1 unless the conversation is sampled, and even then, the storage is for system quality improvement, not visitor profiling.
Privacy-by-architecture is more reliable than privacy-by-policy. The system does not rely on a promise not to store data. It relies on the absence of a storage mechanism for visitor identity. You cannot leak what you never collected.
9.2 Prompt Injection: What the 20 Patterns Cover and Where They Fall Short
The 20 regex-based injection patterns defend against well-known adversarial phrases. They work for the common cases. But it would be misleading to present regex matching as a complete injection defense. Sophisticated attacks using indirect encoding, multi-turn manipulation, or semantic reframing can bypass pattern matching. This is a known limitation of the approach, not a surprise.
The system's deeper defense is architectural. The constitution layer defines canonical facts that the model cannot contradict regardless of what the input says. The self-critique gate evaluates whether the response actually matches the knowledge base, catching outputs that deviate from grounded facts even if the injection that caused the deviation was not caught by regex. And the three-tier separation means that even a successful injection cannot introduce new "facts" into Tier 1 or modify the embedding index in Tier 2. The attacker can influence the LLM's behavior, but the pipeline constrains what that behavior can produce.
This is very much an area of active learning. Classifier-based injection detection and output-side semantic consistency checks are both under exploration. If you are working on injection defense patterns for grounded RAG systems, I would be glad to compare approaches: let's connect.
10. Results and Production Performance
| Metric | Value | Target | Status |
|---|---|---|---|
| Build time (Astro 5 SSG) | 3.96 seconds | < 5s | Pass |
| Page load (edge) | 150ms | < 200ms | Pass |
| Worker latency (p50) | 423ms | < 500ms | Pass |
| Worker latency (p95) | 1,200ms | < 2,000ms | Pass |
| Uptime | 99.95%+ | 99.9% | Pass |
| Hallucinations | 0 | 0 | Pass |
| Test suite | 863 tests, all passing | 100% pass rate | Pass |
| Production incidents (60 days) | 0 | 0 | Pass |
| Baseline monthly cost | $0.12 | < $1.00 | Pass |
10.1 How "Zero Hallucinations" Is Measured
This is a fair question, and it deserves a transparent answer. "Zero hallucinations" sounds like a bold claim, and the methodology behind it matters more than the number itself.
The working definition: a hallucination is any factual claim in a response that cannot be traced back to Tier 1 (the static knowledge base), Tier 2 (a retrieved article with cosine similarity above 0.75), or Tier 3 (a deterministic resume lookup). That includes fabricated credentials, invented project details, incorrect metrics, and false attributions of article content to personal experience.
The measurement works across three layers. The self-critique gate catches low-accuracy responses (below 0.7) and triggers automatic revision before the visitor sees them. The sampling strategy retains 100% of low-confidence conversations for review, so the hardest cases are always visible. And the daily admin review covers 5-10 conversations manually, which over 60 days adds up to roughly 300-600 conversations reviewed by a human. The rest are covered by the automated scoring.
To be clear: this is not a claim that hallucination is impossible. The system is designed to fail toward silence ("I don't have experience with that") rather than toward fabrication. That bias is the primary defense, and so far it has held. There is always room to strengthen the evaluation, and that work is ongoing.
The self-critique gate uses the same model family (Gemini) to evaluate its own output, which means systematic biases in generation may not be caught by self-evaluation. The human review layer compensates for this today. A stronger approach would use a different model family for evaluation (Claude or GPT for scoring Gemini output), and that is on the roadmap. If you have experience implementing cross-model evaluation pipelines, I would welcome the conversation.
10.2 Cost Model: What $0.0002 Per Conversation Means
A cost number without context is meaningless. Here is the full picture: what a single conversation costs, what it would cost at different scales, and how it compares to alternative approaches.
Per-Conversation Breakdown
Each chat exchange processes approximately 400 input tokens (system prompt context plus the visitor's query) and generates approximately 200 output tokens (the response). At Gemini 2.0 Flash rates ($0.075 per million input tokens, $0.30 per million output tokens), this yields $0.00013 per exchange. Adding the BGE-M3 embedding step ($0.000011) and the Vectorize similarity search ($0.00004) brings the all-in cost to approximately $0.0002 per conversation. That is two hundredths of a cent.
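The arithmetic above can be expressed as a small cost function with the rates passed in as parameters. The rates mirror the Gemini 2.0 Flash pricing quoted in the text; the function itself is generic, and the fixed retrieval costs are the published per-query figures.

```typescript
// Per-exchange LLM cost, with rates in USD per million tokens.

interface Rates {
  inputPerM: number;
  outputPerM: number;
}

function exchangeCost(inputTokens: number, outputTokens: number, rates: Rates): number {
  return (inputTokens * rates.inputPerM + outputTokens * rates.outputPerM) / 1_000_000;
}

const flash: Rates = { inputPerM: 0.075, outputPerM: 0.30 };

// Fixed per-query retrieval costs from the text (embedding + vector search).
const EMBEDDING_COST = 0.000011;
const VECTOR_SEARCH_COST = 0.00004;

const allIn = exchangeCost(400, 200, flash) + EMBEDDING_COST + VECTOR_SEARCH_COST;
```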
Cost at Scale
| Scenario | Conversations/Month | Monthly AI Cost | Cost Per Lead (est.) |
|---|---|---|---|
| Early stage (portfolio with moderate traffic) | 600 | $0.12 | $0.002 if 10% convert to meaningful engagement |
| Growing (active job search, LinkedIn traffic spikes) | 3,000 | $0.60 | $0.002 at same conversion |
| Peak (viral post, conference talk, resume activity) | 15,000 + 600 resumes | $3.36 | Sub-penny per interaction |
| Safety ceiling (circuit breaker active) | 30,000 (1,000/day cap) | < $15.00 | Budget-bounded regardless of traffic |
Comparative Analysis: How This Stacks Up
| Approach | Monthly Cost | Conversations Handled | Available 24/7 | Adapts to Audience |
|---|---|---|---|---|
| This system (Cloudflare + Gemini) | $0.12 - $15 | 600 - 30,000 | Yes | Yes (3 depth levels) |
| SaaS chatbot (Intercom, Drift) | $74 - $500+ | Unlimited (within plan) | Yes | Limited (rule-based) |
| OpenAI Assistants API (direct) | $5 - $50 | Varies by usage | Yes | Requires custom prompt engineering |
| Static portfolio (no chat) | $0 | 0 | N/A | N/A |
| Personal time answering emails | $0 (but hours of time) | 5-10/week realistically | No | Yes (manually) |
Gemini inference accounts for 86% of total cost. Infrastructure (Cloudflare) contributes less than 15%.
Cost Optimization Hierarchy: What Changes at Scale
The current cost floor was a deliberate architectural choice: optimize for reliability and zero-hallucination grounding first, then layer cost optimizations as volume justifies them. Three optimization tiers exist beyond the current state, each with a different volume threshold for activation.
| Optimization | Technique | Savings | Volume Threshold | Trade-off |
|---|---|---|---|---|
| Response Caching | Cache full responses for identical queries in KV (24-hour TTL) | 30-50% | Any volume | Stale responses if knowledge updates; breaks multi-turn context |
| Context Caching | Cache static system prompt server-side via Gemini API; subsequent queries pay 75% less on input tokens | 75% on input | 100+ queries/hour | Requires 32K+ token minimum; adds first-request latency; API complexity |
| Prompt Compression | Reduce Tier 1 from ~18K to ~12K tokens via structured shorthand and section deduplication | 20-30% on input | 1,000+ queries/day | Reduced LLM reasoning quality; harder to maintain; potential hallucination increase |
At the current production volume (~10 queries/day), none of these optimizations yield meaningful savings. Response caching saves the most at any volume but introduces staleness risk for a system that prides itself on accuracy. Context caching requires a minimum token threshold (32K) that the system's 18.2K prompt does not meet. Prompt compression trades maintainability and answer quality for savings of approximately $0.15/month.
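For reference, the response-caching tier would be a thin wrapper like the sketch below. The `KVLike` interface is modeled on Cloudflare Workers KV (`get`/`put` with `expirationTtl`), but the names and the naive query normalization are illustrative, and an in-memory map stands in for KV so the logic is self-contained.

```typescript
// Sketch of response caching: identical queries hit a 24-hour cache instead
// of the model. Trade-off noted in the table: staleness and loss of
// multi-turn context.

interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}

const DAY_SECONDS = 24 * 60 * 60;

async function cachedAnswer(
  kv: KVLike,
  query: string,
  generate: (q: string) => Promise<string>, // stand-in for the full pipeline
): Promise<string> {
  const key = `resp:${query.trim().toLowerCase()}`; // naive normalization
  const hit = await kv.get(key);
  if (hit !== null) return hit; // cache hit: zero inference cost
  const fresh = await generate(query);
  await kv.put(key, fresh, { expirationTtl: DAY_SECONDS });
  return fresh;
}
```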
Industry Best Practices Scorecard
Benchmarking against nine established cost optimization practices for production LLM systems, the current architecture scores 7 of 9 as already optimal. The two gaps are volume-dependent optimizations that represent deliberate deferral, not oversight.
| Practice | Industry Standard | This System | Assessment |
|---|---|---|---|
| Model selection | Cheapest model meeting quality bar | Gemini 2.0 Flash | Optimal |
| Temperature tuning | Lowest natural-sounding temperature | 0.3 (chat), 0.7 (resume gen) | Optimal |
| Budget controls | Per-user, per-day, circuit breakers | 30/hr per IP, 500/day user, 1,000/day global | Optimal |
| Cost observability | Per-operation tracking with alerts | 12 operation types, 3 alert thresholds | Optimal |
| Sampling ramp-down | Reduce monitoring cost after learning phase | 100% (day 0) to 20% (day 14+) | Optimal |
| Graceful degradation | Non-blocking cost tracking | All KV/tracking is fire-and-forget | Optimal |
| Context caching | Cache static prompts at scale | Not implemented | Correct at current volume |
| Response caching | Cache repeat queries | Not implemented | Recommended at 10x volume |
| Token budgets per use case | Vary maxOutputTokens by task | 2,048 uniform | Could reduce chat to 1,500 |
The $0.0002 per conversation figure is not about being cheap. It is about being sustainable for an individual. A system that costs $500/month to operate will be shut down the moment a job search ends. A system that costs $0.12/month can run indefinitely as a permanent professional asset. Sustainability is the enabling condition for the continuous learning loop: the longer the system runs, the better it gets.
11. Lessons and Principles
11.1 "Grounding Is Architecture, Not Guardrails"
Core Insight: Three iterations proved that prompt-level guardrails cannot compensate for architectural conflation of knowledge types. Separating immutable facts, semantic content, and deterministic outputs at the system level eliminates categories of hallucination that no amount of prompt tuning can prevent.
Action: Before writing a single prompt, decide which knowledge types your system handles and build separate storage, retrieval, and trust models for each.
11.2 "The Resume Is a Starting Point, Not a Destination"
Core Insight: A resume presents facts. A conversation demonstrates judgment. Senior technologists are hired for accumulated judgment, not for the ability to list technologies. The AI chat system bridges this gap by letting visitors ask the questions a resume cannot answer.
Action: Treat your portfolio as a conversational surface, not a document repository. Every page should invite deeper engagement, not just passive reading.
11.3 "Self-Critique Earns Its Latency"
Core Insight: Adding 150-300ms for quality scoring feels expensive until you compare it to the cost of a single low-quality response in a high-stakes context. The selective approach (scoring complex queries only) preserves baseline speed while ensuring quality where it matters most.
Action: Implement quality gates with heuristic triggers. Not every response needs review. Every important response does.
11.4 "Cost Control Is Quality Control"
Core Insight: Rate limiting prevents budget exhaustion, but it also prevents quality degradation. A single IP sending 1,000 sustained requests is not an organic visitor. Limiting abusive patterns protects both the budget and the learning dataset from noise.
Action: Design cost controls as quality controls. The 20% sampling strategy, the circuit breakers, and the per-IP limits all serve both purposes simultaneously.
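The layered limits from the scorecard (30/hour per IP, 1,000/day global) reduce to fixed-window counters. This sketch uses an in-memory map where production would use KV, and the class and function names are mine, not the codebase's:

```typescript
// Fixed-window rate limiting: each key gets `max` requests per window.

interface Limit {
  max: number;
  windowMs: number;
}

class FixedWindowLimiter {
  private counts = new Map<string, { windowStart: number; n: number }>();
  constructor(private limit: Limit) {}

  allow(key: string, now: number): boolean {
    const entry = this.counts.get(key);
    if (!entry || now - entry.windowStart >= this.limit.windowMs) {
      this.counts.set(key, { windowStart: now, n: 1 }); // new window
      return true;
    }
    if (entry.n >= this.limit.max) return false; // limit tripped
    entry.n++;
    return true;
  }
}

const perIp = new FixedWindowLimiter({ max: 30, windowMs: 3_600_000 });
const globalCap = new FixedWindowLimiter({ max: 1_000, windowMs: 86_400_000 });

// A request must pass every layer; a tripped limit bounds both the budget
// and the noise entering the learning dataset.
function admit(ip: string, now: number): boolean {
  return perIp.allow(ip, now) && globalCap.allow("global", now);
}
```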
11.5 "Honest Boundaries Build More Trust Than Impressive Answers"
Core Insight: The system's most credibility-building moments are not its best answers. They are its honest refusals. Saying "I don't have direct experience with quantum computing" demonstrates a level of reliability that no generated answer can match.
Action: Design for graceful boundaries. The constitution layer should define not just what the system knows, but what it explicitly does not know and how it communicates that.
11.6 "Measure the System, Not Just the Model"
Core Insight: LLM benchmarks measure model capability. Production metrics measure system reliability. The admin dashboard, the confidence scoring, and the daily grading cycle exist because model quality alone does not guarantee system quality. The pipeline, the retrieval, the rate limiting, and the cost controls all contribute to (or detract from) the visitor's experience.
Action: Build observability into the system from day one. Confidence scores, cost per operation, and human feedback are not optional instrumentation. They are essential debugging tools.
11.7 Known Limitations and Trade-offs
Every system has boundaries, and being upfront about them is part of the work. These are the limitations I am aware of today, along with where each stands.
| Limitation | Impact | Mitigation / Status |
|---|---|---|
| Same-model self-critique | Gemini evaluating its own output may share systematic biases | Human review compensates; cross-model evaluation planned |
| Cosine 0.75 threshold is heuristic | May exclude marginally relevant articles or include tangentially related ones | Threshold tuned empirically over 60 days; adaptive thresholding under consideration |
| Audience classification is inferred | Misclassification serves wrong depth level (CTO gets recruiter-depth answer) | First-exchange heuristic works for 90%+ of cases; explicit selection being considered |
| No multi-session memory | Returning visitors start fresh with no context from prior visits | Deliberate privacy trade-off; acceptable for portfolio use case |
| Regex injection defense is brittle | Sophisticated adversarial inputs can bypass pattern matching | Constitution layer + self-critique provide defense-in-depth; classifier-based detection planned |
| Single-operator scale | Architecture assumes one person's knowledge base; multi-tenant not yet tested | Three-tier pattern is generalizable; tenant isolation requires additional work |
These are trade-offs, not oversights. Each was evaluated for the current use case. If you are building something similar and facing different constraints, I would genuinely enjoy comparing notes on where the right trade-offs differ: reach out anytime.
If you are evaluating a conversational system of your own, these are the warning signs that grounding belongs at the architecture level rather than the prompt level:
1. Hallucinations persist despite prompt engineering improvements.
2. You cannot explain which knowledge source generated a specific answer.
3. Cost scales linearly with traffic rather than with value delivered.
4. There is no human feedback loop informing system behavior.
5. The system cannot say "I don't know" when it genuinely does not know.
6. Security was added after the initial deployment, not designed alongside it.
7. You have no confidence metric per response.
12. The Road Ahead
The current system establishes a foundation. The architecture was designed with extensibility in mind, and three evolution paths are under active consideration.
| Evolution | Description | What It Enables |
|---|---|---|
| Multi-Modal Interaction | Voice interface using ElevenLabs integration; visitors ask questions verbally and receive spoken responses | Accessibility, deeper engagement, conference demo capability |
| Expanded Knowledge Tiers | Tier 4 for real-time context (current projects, recent publications, live conference talks) | Always-current portfolio without manual updates |
| Framework Generalization | Extract the three-tier architecture into a reusable template that other professionals can deploy for their own portfolios | Scalable career intelligence for the broader professional community |
| Cross-Model Evaluation | Use a different model family (Claude, GPT) for self-critique scoring to eliminate same-model evaluation bias | Higher confidence in hallucination detection; independent quality validation |
12.1 A Note on Cloudflare Dependency
This architecture runs entirely on Cloudflare's ecosystem: Workers, D1, KV, Vectorize, R2, and Pages. That is a real dependency, and it is worth addressing openly.
The core application logic is standard TypeScript with no Cloudflare-specific SDK in the business layer. The three-tier knowledge architecture is a design pattern, not a platform feature. Tier 1 is string assembly. Tier 2 is cosine similarity against any embedding store. Tier 3 is standard SQL. The SSE streaming is HTTP-native. Porting to AWS Lambda + Aurora + OpenSearch, or Vercel + Supabase + Pinecone, would mean replacing infrastructure bindings, not redesigning the architecture. The estimated effort is 2-3 days of integration work.
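As an illustration of that portability claim, the Tier 2 retrieval primitive is nothing more than cosine similarity plus the 0.75 acceptance threshold described earlier. The sketch below is plain TypeScript with no platform dependency; the function names and the `topK` default are illustrative.

```typescript
// Cosine similarity and threshold-gated retrieval: the portable core of Tier 2.

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const SIMILARITY_FLOOR = 0.75; // below this, the article is not used for grounding

function retrieve<T>(
  queryVec: number[],
  index: { vec: number[]; doc: T }[],
  topK = 3,
): T[] {
  return index
    .map((e) => ({ doc: e.doc, score: cosineSimilarity(queryVec, e.vec) }))
    .filter((e) => e.score >= SIMILARITY_FLOOR)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((e) => e.doc);
}
```

Swapping Vectorize for Pinecone or OpenSearch changes where the index lives, not this logic.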
The Cloudflare choice was intentional for this use case: zero cold starts, global edge deployment, unified billing, and total infrastructure cost under $5/month (excluding AI inference). For a system operated by one person, those operational benefits matter. For a multi-tenant framework, the portability layer would need to be formalized. That trade-off felt right for where the system is today.
The broader vision is not a single portfolio chat system. It is a pattern: grounded conversational AI as a standard interface for professional credibility. Resumes will not disappear, but they will become the starting point of a conversation rather than its entirety. The professionals who explore this space early will help shape how the next generation of technical hiring conversations work. This system is one approach, and there are certainly others. Try it live, and if you are thinking about similar problems, let's learn from each other.
13. Technology Stack Summary
| Category | Technologies |
|---|---|
| Primary LLM | Google Gemini 2.0 Flash (chat), Gemini 2.0 Flash Lite (self-critique, patch generation) |
| Embedding Model | Cloudflare Workers AI (BGE-M3, 1024 dimensions) |
| Vector Database | Cloudflare Vectorize |
| Application Database | Cloudflare D1 (SQLite) |
| Cache / State | Cloudflare KV (rate limits, feature flags, session state) |
| Object Storage | Cloudflare R2 (resume PDFs, generated documents) |
| Compute | Cloudflare Workers (TypeScript, edge-optimized serverless) |
| Static Hosting | Cloudflare Pages (Astro 5 SSG) |
| Streaming Protocol | Server-Sent Events (SSE, 8 event types) |
| Authentication | Google OAuth2 + email allowlist (admin) |
| Email Delivery | Resend |
| SMS Alerts | Vonage |
| Error Tracking | Sentry (10% sampling) |
| Analytics | PostHog (optional, privacy-first) |