Whitepaper

Strategic Growth & "Intelligent Failure"

From Physical Data Center to Cloud-Native at Scale: Migrating 370+ Enterprise Integrations to AWS EKS

370+ Integration Workloads Migrated
9-Month Delivery (vs. 36-Month Estimate)
Zero Data Loss During Cutover
2 Production Rollbacks Recovered
3 Platform Generations: AIX → OCP → EKS

1. Executive Summary

370+ IBM integration workloads. From a physical AIX data center to cloud-native Kubernetes on AWS EKS. Delivered in 9 months against an original 36-month estimate. This paper is a practitioner's account of the architectural decisions, operational realities, and human moments that made that outcome possible.

The strategic arc of this migration was not a single project. It was a three-generation platform evolution spanning nearly a decade: from on-premises IBM Message Broker on AIX physical servers, through Red Hat OpenShift Container Platform as an intermediate containerization step, to a production-grade AWS EKS architecture with a forward-looking path to fully cloud-native services. Each generation built on the organizational and technical maturity developed by the previous one.

The organizational stakes were not abstract. Every integration migrated touched revenue-generating processes across multiple lines of business, and a failed cutover would have disrupted all of them simultaneously. That weight was present in every architectural decision, every wave planning session, and every production cutover night.

Framing Note

This paper is Part 2 of the MQ/ACE Transformation Series. Part 1 covers the foundational work: stabilizing the platform on-premises, decoupling audit processing, and building the organizational credibility that made this migration possible. Read that paper first if you want the full arc.

2. The Starting Point: Platform Archaeology

Before you can migrate a platform, you have to understand what you are actually migrating. On paper, the inventory was IBM Integration Bus version 9 on AIX physical servers. In practice, it was a decade of accumulated configuration, implicit dependencies, and institutional knowledge spread across teams that had long since moved on to other projects.

The enterprise integration hub, internally known as IHUB, supported more than 370 integration flows across every line of business. Each BAR application (the deployable unit for IBM App Connect Enterprise) had its own dependency graph: IBM MQ queue managers it relied on, databases it connected to through ODBC, external endpoints it called over TLS, and file shares it read from or wrote to. None of those dependencies were centrally documented. They had to be excavated.

The infrastructure underneath was equally entrenched. IBM MQ queue managers ran as dedicated processes on AIX hosts. Storage was NFS, mounted from the data center fabric. Deployment was manual: RPM installs, CLI configuration of execution groups, hand-edited queue manager configs. Promoting a change from development to production required a human being to type commands into a terminal on the right server in the right order. That was not a shortcoming of the team; it was the expected operational model of its era.

The organizational reality compounding the technical one was the classic barrier to modernization: "It works, do not touch it." Every system that powered revenue earned a protective instinct from the people closest to it. The platform team had spent years stabilizing IHUB. They were right to be cautious. The challenge was channeling that caution into disciplined migration rather than paralysis.

Platform: IBM IIB v9 on AIX physical servers
Integration Count: 370+ BAR applications
MQ Version: IBM MQ v7, later v9
Storage: NFS (data center fabric)
Deployment Model: Manual RPM installs, CLI configuration
Lines of Business: All enterprise business units

3. The Stopgap: OpenShift Container Platform

Before the organization moved to AWS EKS, it took an intermediate step: Red Hat OpenShift Container Platform. Understanding why that step happened, and what it cost and delivered, is important context for the EKS architecture that followed.

OpenShift was the right move at that point in time. The organization needed to containerize its IBM middleware before it could reason clearly about cloud migration. Running IBM App Connect Enterprise and IBM MQ in containers on OpenShift gave the platform team its first real exposure to operator-based deployment patterns, namespace isolation, and Helm-based package management. Those were not small skills to acquire. Container orchestration requires a different mental model than server-based operations, and OpenShift provided a relatively guided path to that mental model through its operator framework.

The IBM operators for ACE and MQ were particularly instructive. They introduced the concept of declarative configuration: describe the desired state of a queue manager or integration server, and let the operator reconcile reality to that description. This was a significant shift from the imperative CLI model on AIX. It also surfaced the integration team's first real experience with CRD-based resource management, which became central to the EKS architecture.

OpenShift was not the final destination. Three factors drove the move to EKS: licensing cost relative to AWS-native alternatives, operational overhead of a separate control plane, and strategic misalignment with the organization's deepening AWS commitment. The right conclusion from the OpenShift chapter was not that OpenShift failed; it was that it served as the bridge that made EKS viable.

4. The Target Architecture: AWS EKS

4.1 Infrastructure Foundation

The EKS architecture began with an AWS-managed control plane and worker nodes sized to the actual workload profile of ACE integration servers. Node groups were split into two categories: general-purpose nodes for lightweight routing flows, and higher-memory nodes for JVM-heavy ACE runtimes running complex transformation logic. This distinction mattered because ACE integration servers are not homogeneous; a flow doing simple MQ-to-MQ message routing has a different resource footprint than a flow doing XSLT transformation against a 50KB payload.

Networking used the AWS VPC CNI plugin, giving every pod a real VPC IP address. This was a deliberate choice over overlay networking: it simplified security group rules, made network tracing predictable, and avoided the performance overhead of an additional encapsulation layer. For integration workloads that mix internal MQ connectivity with external HTTPS calls to third-party systems, having a clear network identity for each pod simplified both routing and security auditing.

Istio was deployed as the service mesh, providing three specific capabilities the team could not reasonably build themselves: VirtualService-based host and path routing for ACE services, ServiceEntry and DestinationRule objects for controlling external HTTPS egress, and mTLS between services within the cluster. The egress control piece was particularly important: dozens of integration flows made outbound calls to external systems, and the team needed a consistent, auditable way to govern which services could reach which external endpoints without embedding that logic into application configuration.
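The egress pattern described above can be sketched with a ServiceEntry that registers one external endpoint with the mesh. The hostname, namespace, and resource name below are hypothetical; the actual routing objects would carry the organization's own naming conventions.

```yaml
# Sketch: register one external partner endpoint for governed egress.
# Host and names are illustrative, not the actual platform values.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: partner-payments-api
  namespace: integration-prod        # assumed namespace
spec:
  hosts:
    - api.partner.example.com        # hypothetical external endpoint
  location: MESH_EXTERNAL            # outside the mesh, reached via egress
  resolution: DNS
  ports:
    - number: 443
      name: https
      protocol: TLS
```

Because every permitted external host exists as a declarative object like this one, the set of allowed egress destinations is auditable from Git rather than from application configuration.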

4.2 The Separation of Concerns Model

This is the architectural decision that made migrating 370 BAR applications feasible in 9 months. The team decomposed the platform into three independently deployable layers. Changing one layer did not require touching the others. That independence was what allowed the team to move fast without breaking things.

Core Architectural Insight

We did not migrate 370 servers. We migrated 370 BAR applications into a smaller fleet of standardized ACE 12 Integration Servers on Kubernetes. The three-layer separation model is what made that consolidation possible without losing configurability.

Layer 1: Runtime. The Integration Server Helm chart defined the Deployment, Service, NetworkPolicy, and Istio objects for each ACE runtime. Two environment variables governed what each runtime loaded: ACE_CONTENT_SERVER_URL pointed to the Artifactory URL serving the BAR file, and ACE_CONFIGURATIONS listed the configuration CRD names the server should fetch at startup. The runtime chart was identical across every integration server; only the parameter values differed.
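A minimal values-file sketch illustrates the parameterization: the chart is shared, and only values like these differ per integration server. The chart structure, server name, and Artifactory path shown here are hypothetical; the two environment variables are the ones named above.

```yaml
# Sketch of a per-server Helm values file; chart layout is assumed.
integrationServer:
  name: orders-routing-is            # hypothetical server name
  replicas: 2
  env:
    # BAR file served from Artifactory (illustrative URL)
    ACE_CONTENT_SERVER_URL: "https://artifactory.example.com/ace-bars/orders-routing-1.4.2.bar"
    # Configuration CRD names fetched at startup (Layer 2 objects)
    ACE_CONFIGURATIONS: "orders-odbc,orders-truststore,orders-barauth"
  resources:
    requests:
      memory: "1Gi"                  # JVM-heavy flows land on higher-memory nodes
      cpu: "500m"
```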

Layer 2: Configuration. A separate chart and pipeline existed for each configuration type: ODBC datasource definitions, policy project archives, truststore certificates, BarAuth credentials, and SetDBParms entries. These were deployed as Kubernetes CRDs, independently of the runtime. This meant a database password rotation or truststore update required deploying a new configuration CRD. The integration server's startup sequence fetched current config from the CRD store, so the update took effect on the next pod restart. No Helm upgrade to the runtime chart. No redeployment of the BAR file. Just the config.
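A Layer 2 object of this kind might look like the following, assuming the IBM App Connect operator's Configuration resource; the name and namespace are hypothetical, and field details should be checked against the operator version in use.

```yaml
# Sketch: an ODBC configuration deployed independently of the runtime.
apiVersion: appconnect.ibm.com/v1beta1
kind: Configuration
metadata:
  name: orders-odbc                  # referenced from ACE_CONFIGURATIONS
  namespace: integration-prod        # assumed namespace
spec:
  type: odbc                         # other types: policyproject, truststore, barauth, setdbparms
  contents: <base64-encoded odbc.ini>   # datasource definitions; credentials live elsewhere
```

Rotating a datasource definition means applying a new version of this object and restarting the pod; the runtime chart and BAR file are untouched.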

Layer 3: Infrastructure. Terraform owned the EKS cluster, VPC, IAM roles, S3 bucket for state, and DynamoDB table for state locking. ArgoCD managed the platform controllers: the External Secrets Operator, Istio control plane, cluster autoscaler, and the IBM operators. Changes to infrastructure went through Terraform pull requests with plan output in the review. Changes to platform controllers went through ArgoCD sync. Neither layer touched application runtime or configuration.

Migration: Before and After

Before (manual, server-coupled deployment on AIX): Developer → manual RPM install → AIX broker execution group → manual MQ config → NFS storage.

After (GitOps-driven, three-layer EKS architecture): Developer → BAR to Artifactory → ADO pipeline → Helm upgrade → EKS pod, with config CRDs + Istio routing, and Storage Gateway (S3-backed) storage.

Figure 1. Platform deployment model before and after EKS migration.

4.3 MQ in Kubernetes

Running IBM MQ in Kubernetes required deliberate choices about how Kubernetes's stateless-first design principles applied to stateful messaging infrastructure. A standard Deployment was not appropriate: MQ queue managers require stable identity and persistent storage. The team used StatefulSets, which provide both through stable pod names and persistent volume claims that survive pod restarts.

The configuration model was equally important. Queue manager configuration via qm.ini, and MQ object definitions (queues, channels, listeners) via MQSC scripts, were managed through a dedicated pipeline rather than applied manually via CLI. This was a direct lesson from the AIX era: when queue manager configuration lived in someone's head or in a shared runbook, drift was inevitable. When it lived in a Git repository and was applied through a pipeline, drift became detectable and correctable.
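Under the operator model this pipeline-managed configuration can be expressed declaratively, with MQSC scripts held in a Git-sourced ConfigMap. The sketch below assumes the IBM MQ operator's QueueManager resource; names are hypothetical and the license identifier is elided.

```yaml
# Sketch: queue manager with MQSC definitions supplied from Git via ConfigMap.
apiVersion: mq.ibm.com/v1beta1
kind: QueueManager
metadata:
  name: ihub-qm1                     # hypothetical queue manager resource
spec:
  version: 9.3.0.0
  license:
    accept: true
    license: "<license-id>"          # version-specific license ID, elided here
  queueManager:
    name: IHUBQM1
    mqsc:
      - configMap:
          name: ihub-qm1-mqsc        # MQSC scripts versioned in Git, applied by pipeline
          items:
            - queues.mqsc
    storage:
      queueManager:
        type: persistent-claim       # stable storage surviving pod restarts
```

Any drift between the running queue manager and the Git-held MQSC is now a diff, not a forensic exercise.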

ACE-to-MQ connectivity used the default policy project CRD, deployed as a Layer 2 configuration object. This meant the MQ connection parameters (host, port, channel, queue manager name) could be updated independently of the ACE runtime, following the same lifecycle as any other configuration object in the platform.

4.4 Secrets Evolution

The platform went through two generations of secret management before arriving at the production approach. The first generation used Kamus encryption: secrets were encrypted in Git using a cluster-specific key, and decrypted at deploy time by the Kamus operator. This solved the immediate problem of secrets in source control, but it created operational overhead: rotating a secret required re-encrypting the value and committing the result. It also created a dependency on the Kamus operator's availability during deployment.

The second generation replaced Kamus with External Secrets Operator backed by AWS Secrets Manager. Under this model, secret values lived in AWS. The Kubernetes cluster fetched them directly through an ExternalSecret resource, which described which AWS secret to read and how to map its fields into a Kubernetes Secret. Terraform provisioned the IAM roles with the minimum permissions needed for each namespace's ServiceAccount to read only the secrets it required. Secret values were never in tfvars files, never in Terraform state, and never in Git. The pipeline that applied Terraform only managed the IAM policy and the ExternalSecret resource definitions. The actual credentials existed only in AWS Secrets Manager.
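An ExternalSecret under this model might look like the following; the store, secret, and field names are hypothetical, but the shape reflects the External Secrets Operator API.

```yaml
# Sketch: map one AWS Secrets Manager secret into a Kubernetes Secret.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: orders-db-credentials
  namespace: integration-prod        # assumed namespace
spec:
  refreshInterval: 1h                # re-sync cadence from AWS
  secretStoreRef:
    name: aws-secrets-manager        # SecretStore configured per namespace via IRSA
    kind: SecretStore
  target:
    name: orders-db-credentials      # Kubernetes Secret created and kept in sync
  data:
    - secretKey: password
      remoteRef:
        key: prod/ihub/orders-db     # hypothetical Secrets Manager secret name
        property: password
```

Rotation now happens in AWS Secrets Manager alone; the operator propagates the new value on the next refresh, with nothing re-encrypted or committed.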

4.5 Storage Evolution

Storage turned out to be one of the most expensive lessons of the migration. The on-premises NFS mounts that AIX-era integration servers relied on for file-based flows had to be replaced, and the replacement choices carried real cost and operational implications.

The first cloud storage solution was Amazon FSx for NetApp ONTAP, chosen for its NFS compatibility and enterprise feature set. It worked, but it was costly: FSx for NetApp allocates capacity in fixed provisioned increments, and the actual usage of file-based integration flows did not justify the provisioned size. The team was paying for white space.

The third-generation solution (after on-premises NFS and FSx for NetApp) was AWS Storage Gateway in file gateway mode, backed by Amazon S3. This approach aligned cost to actual usage: data written through Storage Gateway landed in S3, where lifecycle policies kept objects in S3 Standard for the first three years, moved them to S3 Glacier Flexible Retrieval for years three through five, and to S3 Glacier Deep Archive beyond year five. For file-based integration flows that produced output documents read infrequently after initial processing, this tiering model delivered substantially lower per-GB costs than any provisioned storage alternative.

5. The Migration Factory: 370 in 9 Months

The number "370 integrations in 9 months" sounds like it required an army. The actual team was small. What scaled was the tooling and the method, not the headcount.

Key Insight

The "370" was not 370 unique pipelines. It was a few templates reused with parameters. The migration factory's entire velocity came from standardizing the platform first, then parameterizing the workload. Teams that try to migrate individual workloads without standardizing the platform first do not get a factory; they get 370 one-off projects.

Each BAR application went through the same migration unit: capture the source execution group settings, map all inbound and outbound dependencies, identify the BAR file's entrypoints and the config types it needed, upload the BAR to Artifactory, create or update the Helm values file for the target integration server, create the required configuration CRDs, and expose the service via an Istio VirtualService. That unit of work was repeatable by anyone who understood the template.

The wave strategy organized the 370 applications by risk and dependency profile, not just by business unit or team. The logic was that each wave should validate a specific platform capability, so that by the time high-risk applications migrated, every platform component they depended on had been exercised in production.

Wave A — Pure routing / MQ only: Baseline ACE runtime stability, MQ connectivity, Helm chart correctness
Wave B — ODBC + database flows: ODBC configuration pipeline, SetDBParms CRD, secret lifecycle in production
Wave C — Complex TLS / external HTTP calls: Truststore pipeline, Istio ServiceEntry for egress TLS, certificate rotation
Wave D — File share / batch flows: PVC mounts, Storage Gateway connectivity, SMB/NFS permission mapping

The pipeline that executed migrations was built on two reusable Azure DevOps YAML templates. The first, a generic configuration deploy template, accepted a configuration type and a set of values and emitted the appropriate Helm install or upgrade command for any CRD type. The second, a runtime deploy template, accepted an integration server name, a BAR URL, and a list of configuration references and ran helm upgrade --install against the integration server chart. Every one of the 370 migrations used one or both of these templates. The templates did not change. The parameters did.
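A caller pipeline for one migration might compose the two templates like this. The template paths, parameter names, and values below are illustrative, not the actual repository layout; the point is that a migration is a parameter set, not new pipeline code.

```yaml
# Sketch: one migration expressed as parameters against shared ADO templates.
jobs:
  - job: migrate_orders_routing      # hypothetical job for one BAR application
    steps:
      # Template 1: deploy a configuration CRD of the given type
      - template: templates/deploy-config.yml
        parameters:
          configType: odbc
          configName: orders-odbc
          valuesFile: config/orders/odbc-values.yaml
      # Template 2: helm upgrade --install against the shared runtime chart
      - template: templates/deploy-runtime.yml
        parameters:
          serverName: orders-routing-is
          barUrl: https://artifactory.example.com/ace-bars/orders-routing-1.4.2.bar
          configurations: orders-odbc,orders-truststore
```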

The team tracked migration status in a shared board with three states: archaeology complete, EKS deployed and verified in non-production, and EKS promoted to production. At peak velocity, multiple applications were moving through the board each week. The pace was possible because each migration was a data-entry exercise against a known template, not a unique engineering problem.

6. Production Night: The Honest Account

Every migration has a production night. This platform had three: two rollbacks and one successful cutover. Documenting all three honestly is the point.

6.1 Cutover Protocol

The cutover process was not improvised. Code freeze went into effect 72 hours before the window. Every line of business signed off on pre-production validation and accepted the rollback risk. URL smoke testing ran per LOB for a minimum of 30 minutes before declaring readiness. Every team had a rollback trigger defined before the window opened: if any metric crossed its threshold, rollback was the default decision. No one had to argue for rollback; the threshold did it automatically.

6.2 Rollback One: Storage and Active Directory

The first production attempt failed on two related issues. The Active Directory configuration in production differed from what had been configured in the stage environment; the Storage Gateway needed to authenticate to AD to serve file shares, and the AD group policies in production had additional restrictions that stage had not enforced. That alone might have been survivable, but it was compounded by the second issue: copying file data from the on-premises NFS storage to the Storage Gateway was taking longer than planned. The team had a choice between proceeding with incomplete data and risking file corruption in flows that expected to find files that had not yet been copied, or rolling back, completing the data copy in a controlled window, and re-validating the AD configuration alignment before attempting again. The decision took less than five minutes. Roll back. Complete the copy cleanly. Fix AD. Rebook the window.

This was the right call. A rollback taken from a position of clarity is an operational success, not a failure. The team knew exactly why they were rolling back, exactly what needed to happen before the next attempt, and exactly how to verify that those conditions were met. That kind of structured retreat is what disciplined operations looks like.

6.3 Rollback Two: The MQ PubSub Configuration Miss

The second rollback was harder. Everything from rollback one had been addressed. AD was aligned. Storage data had been pre-copied and verified. The production window opened cleanly. Most flows came up correctly. Then one of the power applications, a high-traffic pub/sub flow that had run as a single instance on-premises, started failing as the team scaled it to four replicas on EKS.

With one replica, the flow worked. With two or more replicas, IBM MQ rejected the additional subscriber connections. The application was using a model queue in pub/sub mode, and when the on-premises configuration was captured, the shared subscription flag had not been explicitly noted because it had not needed to be: one instance, one subscriber, no contention. The move from MQ 9.1 to MQ 9.3 introduced a behavioral change that required explicitly enabling the SharedSub and SharedInQueueManager flags before multiple subscribers could attach to the same durable subscription. That flag combination was not in the migrated queue manager configuration.

The debugging session took four to five hours. IBM documentation on this specific combination of model queues, durable subscriptions, and the 9.1-to-9.3 behavioral change was not organized for someone who did not already know the answer. IBM support was engaged but slow: requests for logs, suggestions to upgrade to newer fix packs, suggestions to open PMRs. The team ultimately found the resolution by reading the MQ 9.3 release notes line by line and cross-referencing with the MQSC command reference. The interim decision, agreed to with the business, was to run the application with a single pod until the fix was validated in non-production and could be promoted cleanly. The second rollback recovered everything else. Only that one application remained on-premises, temporarily.

The fix was applied, validated, and promoted within the week. The application ran correctly at four replicas in production the following sprint. But the lesson took four to five hours of debugging and a second production rollback to learn.

Human Moment

After two rollback nights and a four-hour debugging session reading IBM documentation line by line, the moment the PubSub subscribers connected cleanly across all four replicas was when the team knew: the platform had arrived. Not the architecture review. Not the Terraform plan. The moment when a behavior that had defeated them twice ran correctly, in production, without anyone holding their breath.

6.4 The Successful Cutover

The third attempt was clean. All configuration had been verified against the production environment, not against the closest available approximation of it. Storage data had been pre-copied and checksummed. AD group policy alignment had been confirmed with the identity team. The PubSub flags had been set, tested under replica counts of two, three, and four in non-production, and the queue manager configuration had been committed to Git through the standard pipeline. The window opened. The flows came up. The smoke tests passed. The LOBs signed off. Minimal residual issues surfaced in the hours following cutover, each resolved the same day. The platform was live on EKS.

The two rollbacks were not failures of planning. They were the cost of production reality revealing what non-production had not. The organization had accepted that cost as part of the rollback protocol. What transformed those rollbacks from failures into operational maturity was the structured approach to each one: clear triggers, clear decisions, clear remediation steps, and no ambiguity about who owned each action.

7. Security Architecture

The security architecture was not an afterthought applied at the end of the migration. It was designed alongside the infrastructure and evolved as the platform matured. The foundational layer was AWS Organizations with a structured OU hierarchy: production, non-production, security, shared services, and business-unit-specific OUs. Each OU had Service Control Policies that defined the permission ceiling for every account within it, regardless of what IAM policies granted at the account level. SCPs denied disabling CloudTrail, GuardDuty, or Security Hub in any account. They denied creating public S3 buckets. They denied IAM policies with wildcard actions on sensitive services. These denials were not audited; they were enforced by the platform itself.

Cross-account IAM followed a central assume-role pattern for CI/CD. The Azure DevOps pipelines ran with a role in a shared services account, and assumed deployment roles in target accounts for each environment. The deployment roles had narrow permissions scoped to the resources they needed to create or modify. The separation meant that a compromised pipeline credential could not escalate privileges beyond what the assumed role permitted in the target account.

Within the EKS cluster, IRSA (IAM Roles for Service Accounts) via OIDC federation provided pod-level AWS identity without instance metadata access. Each integration server's Kubernetes ServiceAccount was annotated with the ARN of a specific IAM role. That role had only the permissions needed by the flows running in that integration server: access to specific S3 paths, access to specific Secrets Manager secrets, nothing more. The External Secrets Operator used its own dedicated role to read secrets and write them into Kubernetes Secrets objects; application pods never accessed Secrets Manager directly.
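The IRSA binding described above is a single annotation on the ServiceAccount; the account ID, role name, and namespace below are hypothetical.

```yaml
# Sketch: pod-level AWS identity via IRSA, no instance metadata access.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-routing-is            # hypothetical integration server SA
  namespace: integration-prod        # assumed namespace
  annotations:
    # Role scoped to only the S3 paths and secrets this server's flows need
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/orders-routing-is
```

Pods running under this ServiceAccount receive temporary credentials for that one role via OIDC federation, so a compromised pod cannot reach beyond its own narrow grant.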

RBAC within Kubernetes followed the same least-privilege principle. Each integration namespace had a namespace-scoped Role and RoleBinding that granted only the verbs needed for pipeline operations: create and update for Deployments and ConfigMaps, read for Secrets (which were populated by the External Secrets Operator, not application code). Cluster-wide RoleBindings were avoided except for the platform controllers that genuinely required cluster scope.
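The namespace-scoped grant described above can be sketched as a Role and RoleBinding; the verbs mirror the pipeline operations listed, while the names are hypothetical.

```yaml
# Sketch: least-privilege RBAC for pipeline operations in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pipeline-deployer
  namespace: integration-prod        # assumed namespace
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list"]           # read-only; External Secrets Operator owns writes
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pipeline-deployer
  namespace: integration-prod
subjects:
  - kind: ServiceAccount
    name: ado-deployer               # hypothetical pipeline service account
    namespace: integration-prod
roleRef:
  kind: Role
  name: pipeline-deployer
  apiGroup: rbac.authorization.k8s.io
```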

Security Hub and GuardDuty operated as the continuous compliance and threat detection layers. Security Hub findings were triaged by severity; high-confidence findings triggered an EventBridge rule that invoked a Lambda function to open a Terraform pull request with the remediation change. GuardDuty findings were investigated manually for the first several months. An InstanceCredentialExfiltration finding in the first quarter post-cutover required correlating the GuardDuty alert with CloudTrail API calls and VPC Flow Logs to determine whether it represented actual exfiltration or legitimate service behavior that needed to be whitelisted. It was the latter, but working through that determination built the team's investigation capability for future findings.

8. The Road Ahead

EKS was not the destination. It was the platform from which the next generation of decisions could be made with better data. By the time the migration completed, the team had something it did not have before: a clear view of every integration's resource consumption, traffic pattern, and business criticality. That visibility made the next set of decisions possible.

Strategic Principle

The integration hub was the right answer for the on-premises era. In the cloud-native era, the right answer is often no hub at all. The hub exists because individual applications could not communicate directly. Cloud-native services remove that constraint. The migration question is not "how do we move the hub to the cloud?" but "which integrations still need a hub, and which can be replaced by direct service communication?"

Three destination patterns are driving the strategic roadmap beyond EKS.

Apache Pulsar is the target for event streaming workloads that today use IBM MQ pub/sub. Pulsar's multi-tenancy model, topic namespace isolation, and tiered storage align with the enterprise's governance requirements better than Kafka for this use case. Flows that are fundamentally publish-subscribe, with multiple consumers and durable subscription semantics, are candidates for Pulsar migration as the next platform generation stabilizes.

AWS Lambda is the target for integration logic that is stateless, event-triggered, and does not require the full ACE message flow runtime. Many of the simpler routing and transformation flows on EKS are candidates for Lambda rewrites. A Lambda-based integration eliminates the ACE license cost, the container runtime overhead, and the Helm chart lifecycle for that workload. The rewrite cost is real, but for flows with clear bounded logic and low transformation complexity, the operational simplification justifies it.

Decentralized AWS services is the pattern for flows that bridge two systems that now both have AWS-native APIs. EventBridge for event routing between services that emit CloudWatch Events. SQS for point-to-point queue-based decoupling between services that do not need the full MQ feature set. SNS for fan-out notification patterns. Step Functions for multi-step orchestration that would otherwise require a stateful ACE message flow. API Gateway for request-response integrations between internal services. Each of these replaces a category of ACE flows with a managed service that requires no container to run, no license to maintain, and no cluster node to right-size.

What stays on EKS, at least for the foreseeable future, is the long tail of mission-critical legacy integrations with complex transformation logic, regulatory compliance requirements, or deep IBM MQ dependencies that cannot be migrated cheaply. The EKS platform continues to serve these workloads well. It is stable, governed, and increasingly self-service for the teams that depend on it. The goal is not to eliminate EKS; it is to make EKS the platform of last resort for the integrations that genuinely need it, not the default landing zone for everything.

2015
Platform Stabilization
Decoupled audit processing, restored GAL reliability, rebuilt cross-team trust. Foundation for migration credibility. (See Part 1)
2018 - 2019
OCP Containerization
Containerized IBM ACE and MQ on Red Hat OpenShift. Built operator experience, Helm fluency, and namespace isolation patterns.
2020 - 2022
EKS Migration (370+ Workloads)
Full migration from OCP/AIX to AWS EKS. Three-layer separation model, two production rollbacks, zero data loss, 9-month delivery.
2023 - 2024
Secrets Modernization and Storage Optimization
Replaced Kamus with External Secrets Operator. Migrated from FSxN to Storage Gateway (S3-backed). Introduced Karpenter for node provisioning efficiency.
2025+
Selective Decommission
Targeted migration of eligible flows to Lambda, Pulsar, and decentralized AWS services. EKS becomes the governed platform for mission-critical legacy integrations.

9. Lessons and Principles

Six principles shaped every significant decision in this migration. They are not theoretical; each one came from a specific moment where the team had to choose between two paths and lived with the consequences of that choice.

Fail Forward, Publicly
Core insight: Two rollbacks turned into proof of operational maturity. The team that can execute a structured rollback in production is more trustworthy than the team that never needs one because it never ships anything difficult.
Action: Build rollback protocols into every migration plan before the production window opens. Define the trigger criteria. Assign the decision owner. Remove the ambiguity.

Standardize the Platform, Parameterize the Workload
Core insight: One Helm chart, one pipeline template, parameterized inputs. The factory velocity came entirely from this decision. Teams that skip it migrate 370 unique problems instead of 370 instances of a standard solution.
Action: Invest in platform tooling before starting migrations. The time spent building reusable templates is not overhead; it is the velocity multiplier for everything that follows.

Separate Runtime from Configuration from Infrastructure
Core insight: The three-layer model enabled independent deployability at each layer. If changing a database password requires redeploying the application, you have a coupling problem. The coupling will always cost you at the worst possible moment.
Action: Map your dependency graph before designing your deployment model. Identify which changes should propagate to which layers, and design the deployment system to enforce that separation.

Treat Storage as a First-Class Migration Concern
Core insight: Storage caused the first rollback. NFS to FSxN to Storage Gateway was its own multi-year evolution with real cost implications at each step. Storage is never a detail.
Action: Map every storage dependency before committing to a migration timeline. Include data volume, access patterns, and authentication requirements in the dependency capture for every integration.

Legacy Does Not Mean Wrong
Core insight: IBM middleware served reliably for a decade before this migration. The decision to modernize was correct, but it acknowledged changed context, not past failure. Framing matters: "unlocking new capability" is different from "fixing old mistakes."
Action: Evaluate modernization decisions on forward ROI, not on contempt for what came before. The team that built the AIX platform made the right decisions for their constraints. Honor that while moving forward.

The Best Migration is the One You Do Not Do
Core insight: Some integrations are perfectly served by the stable EKS platform. Migrating them to Lambda or Pulsar because it is architecturally fashionable, not because it delivers measurable value, wastes resources and introduces unnecessary risk.
Action: Score every workload on migration ROI before committing resources. The EKS platform is not technical debt; it is a governed, stable runtime for integrations that benefit from its capabilities. Use it intentionally.

10. About the Author

Hemanth Shivanna
Co-Founder & AI Solutions Architect, Elite Technology Solutions | Enterprise Product & Platform Security Leader

Across 19 years at publicly traded and Fortune 500 enterprises in automotive, fleet services, and financial services, Hemanth built and led a 35+ person global team spanning platform engineering, SRE, observability, and integration modernization. He leads from the front: writing runbooks alongside the engineers who use them, joining bridge calls at 2 AM, and coaching global teams to own outcomes rather than follow scripts.

Hemanth has led initiatives that achieved $1.7M in annual savings through Splunk optimization, a 40% reduction in mean time to resolution for critical incidents, and the compression of a 3-year IBM IIB-to-ACE modernization program into 9 months. He has managed cross-functional teams across the United States, Canada, and India, supporting platforms that serve $6B+ in enterprise operations.

His subject matter expertise includes IBM MQ and Message Broker (App Connect Enterprise), Splunk, Cribl Stream, MuleSoft, AWS cloud infrastructure, and Terraform-based Infrastructure as Code.

AWS Solutions Architect Associate | Salesforce Agentic AI Specialist | ITIL Certified | Microsoft Certified | Cisco Certified | MBA

11. Technology Stack Summary

Container Orchestration: AWS EKS, Kubernetes, Helm, ArgoCD
Integration Runtime: IBM App Connect Enterprise (ACE) v12
Messaging: IBM MQ v9.x (StatefulSet, HA replicas)
Infrastructure as Code: Terraform (S3/DynamoDB state), Azure DevOps YAML pipelines
Service Mesh: Istio (VirtualService, ServiceEntry, DestinationRule)
Secrets Management: External Secrets Operator, AWS Secrets Manager
Storage: AWS Storage Gateway (S3-backed), PV/PVC, S3 lifecycle tiering
Security: AWS Organizations, SCPs, Security Hub, GuardDuty, IRSA/OIDC
Registry: Artifactory
Autoscaling: Cluster Autoscaler, Karpenter, HPA (selective)
Networking: AWS VPC CNI, NetworkPolicy, Istio egress control
Future State: Apache Pulsar, AWS Lambda, EventBridge, SQS/SNS, Step Functions
