
Untangling the Middleware Knot

Solving the Unsolvable: How Decoupling Business-Critical Integrations from Audit Logs Improved Reliability by 73%

73% Reliability Improvement
24h → 4h Audit Delay Reduction
200+ Integrations Stabilized
20+ Business Units Impacted

1. Executive Summary

73% reliability improvement. Audit delays reduced from 24 hours to 4 hours. Over 200 integrations stabilized across 20+ business units. These were the measurable outcomes of a single architectural change: decoupling business-critical integrations from audit log processing on an enterprise middleware platform.

At a major automotive remarketing and fleet services enterprise, IBM Message Broker served as the central integration hub (internally known as "IHUB") connecting more than 200 integrations across 20+ business units. A critical capability of this platform, the Global Audit Log (GAL), had suffered from significant processing delays for an extended period. Millions of audit messages were accumulating in a growing backlog, and data intended to be available in near-real-time was delayed by up to 24 hours, effectively rendering the capability unusable for its original purpose.

The prevailing assumption was that the delay could not be improved further. This paper documents how a single engineer challenged that assumption, built a working proof of concept on personal time, brought cross-functional stakeholders together to validate the approach, and delivered the fix. The key architectural change was decoupling business-critical integrations from audit log processing, ensuring that non-critical audit volume could no longer impede time-sensitive business operations.

Key Takeaway

Technical problems that persist long enough become organizational blind spots. The most impactful engineering work often begins not with a new requirement, but with someone refusing to accept the status quo as a permanent constraint.

2. The Platform: IBM Message Broker as Enterprise Integration Hub

2.1 Role of IHUB in Enterprise Operations

IBM Message Broker (now IBM App Connect Enterprise) is an enterprise-grade integration middleware platform that enables applications to communicate through structured message flows. At this organization, the platform was deployed under the internal name "IHUB," serving as the connective tissue between business-critical systems across financial services, vehicle logistics, auction operations, fleet management, and dealer-facing applications.

The platform supported more than 200 active integrations, each carrying transaction data, event notifications, and operational messages essential to daily business operations. Given the regulated nature of the automotive financial services industry, the reliability and auditability of these message flows were not optional considerations; they were operational requirements tied directly to business continuity and compliance.

2.2 The Global Audit Log (GAL/GEH): Purpose and Design Intent

The Global Audit Log, also known internally as GAL/GEH (Generic Error Handler), was a built-in capability within the IHUB platform designed to provide application development teams with a comprehensive audit trail. Its intended function was straightforward: capture message flow activity so that any team, at any point in time, could verify transaction integrity, debug integration failures, or satisfy compliance requirements.

When functioning as designed, GAL/GEH acted as a centralized verification layer across all integrations. Development teams relied on it to confirm that messages were processed correctly, identify where failures occurred in complex multi-step flows, and produce audit records for internal and external review. However, the audit application had a tight dependency on the underlying middleware platform. This coupling meant that any performance degradation in the audit subsystem did not remain isolated; it created bottlenecks that affected handoffs across the broader middleware layer, directly impacting service delivery to downstream business operations.

Figure 1. IHUB platform architecture and GAL's role as the enterprise audit layer: 20+ business units and 100+ application teams connect through IBM Message Broker (IHUB) across 200+ integrations, with the Global Audit Log (GAL) capturing the audit trail across all message flows.

3. The Problem: Global Audit Log Degradation at Scale

Understanding the platform's architecture is essential to grasping why this problem was so consequential. The tight coupling between audit processing and business-critical message flows meant that degradation in one subsystem cascaded across the entire platform.

3.1 Symptoms

The audit application (GAL/GEH) had become a bottleneck on the middleware platform. Millions of messages were backing up in the audit queues, and the audit process was delayed by up to 24 hours at peak. The performance degradation was not confined to the audit subsystem alone. Because the audit application was tightly coupled to the middleware, the growing message backlog and processing delays created bottlenecks that affected handoffs between systems, directly impacting audit reporting and delaying service delivery to the business.

The consequences extended beyond the audit function. Development teams, unable to rely on GAL/GEH for timely audit data, began building workaround solutions. Some teams stopped using the audit capability entirely. The practical result was that an enterprise capability intended to provide unified visibility across 200+ integrations had become a system that people worked around rather than worked with. Customer-facing operations were affected, and trust in the platform's ability to support business-critical workflows was eroding.

3.2 Impact on the IHUB Platform and Service Delivery

The root cause was both architectural and operational: the audit application had tight dependencies on the middleware, and business-critical transaction messages shared the same IBM MQ processing queues with non-business-critical audit messages. As audit volume grew into millions of messages, these competed directly with business-critical traffic for queue depth, processing cycles, and I/O bandwidth. The result was that audit processing not only fell behind on its own workload but actively impeded the throughput of business-critical message flows.

Dimension | Before Optimization | Risk to the Business
Audit Processing Time | Up to 24 hours; millions of messages backlogged | Unable to verify transactions in time-sensitive scenarios
Queue Architecture | Business and audit messages on shared MQ queues | Non-critical audit volume directly delaying business-critical processing
Customer Turnaround | Sale and post-sale workflows affected by queue contention | Increased arbitration processing time; reduced dealer satisfaction
Queue Stability | Audit queue susceptible to corruption under heavy load | Unpredictable disruptions requiring manual intervention
Team Adoption | Declining usage; workaround solutions emerging | Loss of centralized audit visibility
Compliance Posture | Audit records available only after multi-day delay | Increased exposure during audit windows
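
To make the shared-queue contention concrete, the sketch below shows one way a growing backlog on a shared queue might be observed. It assumes the open-source pymqi client library for IBM MQ; the queue manager, channel, and queue names are hypothetical placeholders, and the sketch is illustrative rather than a reproduction of the platform's actual monitoring.

```python
# Minimal backlog check on a shared MQ queue (all names are hypothetical).
# Assumes the pymqi client library; connection details are placeholders.
import pymqi

QMGR = 'IHUB.QM1'                 # hypothetical queue manager name
CHANNEL = 'SVRCONN.CHANNEL'       # hypothetical client channel
CONN_INFO = 'ihub-host(1414)'     # hypothetical host(port)
SHARED_QUEUE = 'IHUB.SHARED.IN'   # queue carrying both business and audit traffic

def current_depth(qmgr, queue_name):
    """Return the current depth of a queue (messages waiting to be processed)."""
    queue = pymqi.Queue(qmgr, queue_name, pymqi.CMQC.MQOO_INQUIRE)
    try:
        return queue.inquire(pymqi.CMQC.MQIA_CURRENT_Q_DEPTH)
    finally:
        queue.close()

if __name__ == '__main__':
    qmgr = pymqi.connect(QMGR, CHANNEL, CONN_INFO)
    try:
        # A depth that keeps climbing across samples signals the kind of
        # audit-driven backlog described in Section 3.2.
        print(f'{SHARED_QUEUE} depth: {current_depth(qmgr, SHARED_QUEUE)}')
    finally:
        qmgr.disconnect()
```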

4. Why the Problem Persisted: Organizational Inertia and Accepted Constraints

The technical problem was significant, but the organizational dynamics that allowed it to persist were equally instructive. This section examines a pattern that recurs in enterprise environments: the normalization of technical limitations.

4.1 The Normalization of Technical Limitations

One of the most instructive aspects of this case is not the technical solution itself but the organizational dynamics that allowed the problem to persist as long as it did. The GAL/GEH performance issue was well-known. It had been discussed in various forums. Yet the prevailing conclusion, arrived at gradually over time rather than through any single decision, was that the situation either could not be improved further or that the business value of improvement was insufficient to justify the effort.

This pattern is common in enterprise environments. When a limitation persists long enough, the organization adapts around it. Teams develop compensating behaviors. Expectations are lowered. What began as a recognized deficiency becomes an accepted constraint, rarely questioned because everyone involved has already incorporated the limitation into their planning and workflows.

Pattern Recognition

In mature enterprise environments, the most costly technical problems are often those that no one is actively trying to solve. They persist not because they are unsolvable, but because the organization has stopped recognizing them as problems.

4.2 The Cost of Accepting the Status Quo

The unstated cost of the GAL/GEH delay was significant. Every application team that could not rely on GAL/GEH for timely verification was absorbing that cost individually: through manual verification steps, through additional logging at the application layer, through delayed incident resolution, and through increased risk during audit periods. The total organizational cost, distributed across 20+ business units and 100+ application teams, was far higher than what would have been required to investigate and address the root cause.

This is a recurring pattern in large technology organizations: distributed impact is invisible impact. When no single team bears the full cost of a platform limitation, no single team has the incentive to pursue a fix. The problem falls into an ownership gap between the platform team (which may view GAL/GEH as functioning within its known parameters) and the application teams (which treat the limitation as someone else's infrastructure concern).

5. The Turning Point: A Self-Initiated Proof of Concept

Because the organizational dynamics would not resolve themselves, the path forward required a different approach: one that bypassed the debate entirely and let evidence lead.

5.1 Taking Ownership of a Platform-Wide Problem

The decision to investigate the GAL/GEH performance issue was not driven by a management directive or a formal project request. It was driven by the recognition that the degradation trajectory posed a real and growing risk to the IHUB platform and to the customers who depended on it. The audit application's tight middleware dependencies meant that the problem would only worsen as transaction volumes grew. If left unaddressed, the bottleneck could eventually affect the stability of production integrations, creating a business continuity issue rather than just an operational inconvenience.

Taking ownership of this problem required looking beyond the immediate symptoms and understanding how the audit subsystem interacted with the broader platform at scale. It required the willingness to question whether the accepted constraint was actually a technical limitation or simply an unexplored problem that had fallen into an ownership gap between teams.

5.2 Building the Case Through Action

Rather than submitting a proposal and waiting for approval, the engineer built a proof of concept on personal time, outside of normal work responsibilities. This was a deliberate choice: by producing a working demonstration rather than a theoretical proposal, the conversation shifted from "should we invest time investigating this?" to "here is evidence that a significantly better outcome is achievable."

The proof of concept introduced a fundamentally different approach to how the platform handled message processing. The core architectural change was to decouple the critical audit application from its tight middleware dependencies. Rather than continuing to process all messages through shared queues, the POC demonstrated that building and implementing additional MQ clusters would provide the capacity and reliability needed to separate business-critical and non-business-critical workloads entirely. Business-critical integrations would no longer share queue resources with audit traffic, eliminating the bottleneck at its source.
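
As an illustration of the separation the POC demonstrated, the sketch below routes each message to a different destination queue depending on whether it is business-critical or audit traffic. The queue names, the msg_type field, and the put_message stand-in are all hypothetical assumptions; this is a minimal sketch of the routing principle, not the production message flow.

```python
# Minimal sketch of the decoupled routing principle: business-critical and audit
# messages land on separate destination queues so they no longer compete for the
# same resources. All names below are hypothetical placeholders.

BUSINESS_QUEUE = 'IHUB.BUSINESS.CLUSTER.IN'   # dedicated cluster queue (hypothetical)
AUDIT_QUEUE = 'IHUB.AUDIT.ISOLATED.IN'        # isolated audit cluster queue (hypothetical)

def classify(message: dict) -> str:
    """Classify a message as 'audit' or 'business' based on an assumed type field."""
    return 'audit' if message.get('msg_type') == 'AUDIT' else 'business'

def route(message: dict, put_message) -> str:
    """Send the message to the queue matching its classification.

    put_message is a stand-in for an MQ client put call: (queue_name, payload).
    """
    destination = AUDIT_QUEUE if classify(message) == 'audit' else BUSINESS_QUEUE
    put_message(destination, message)
    return destination

# Example: audit traffic lands on the isolated queue, business traffic on the
# dedicated one, so a backlog on one lane cannot delay the other.
if __name__ == '__main__':
    sent = []
    route({'msg_type': 'AUDIT', 'payload': 'audit record'}, lambda q, m: sent.append(q))
    route({'msg_type': 'SALE', 'payload': 'transaction'}, lambda q, m: sent.append(q))
    print(sent)  # ['IHUB.AUDIT.ISOLATED.IN', 'IHUB.BUSINESS.CLUSTER.IN']
```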

Leadership Principle

A working prototype is worth more than a hundred slides. When an engineer brings a proof of concept to a meeting instead of a proposal, the burden of proof shifts from "why should we try this" to "why would we not pursue this."

5.3 Challenging the Constraints

With the proof of concept in hand, the next step was to bring the relevant stakeholders together. This required persistence. The assumption that GAL/GEH could not be improved had become entrenched, and overcoming that assumption meant creating enough evidence and urgency to get the right people into a room.

The initial proof of concept served as the catalyst. When presented with data showing a dramatically different performance profile under the proposed approach, stakeholders who had previously accepted the limitation as permanent were willing to engage in a deeper technical discussion about what was possible.

6. Collaborative Problem-Solving: From Prototype to Production

The proof of concept opened the door. Translating that prototype into a production-grade solution required a coordinated effort across multiple teams and disciplines.

6.1 Cross-Team Engagement and Controlled Environment Testing

Moving from a personal proof of concept to a production-grade solution required close collaboration across multiple teams. The engineer worked directly with app dev teams, the internal ICE team (responsible for integration and configuration engineering), operations, vendors, and IBM's professional services organization to build, test, and optimize the solution in a controlled environment before promoting it to production.

This collaborative approach was essential. The proof of concept demonstrated that the performance gap was addressable, but the production environment had constraints (including data volumes, concurrency requirements, and fault tolerance expectations) that the prototype had not fully accounted for. The additional MQ clusters needed to be sized, configured, and validated against real-world traffic patterns. Working closely with app dev teams ensured that the decoupled architecture would integrate cleanly with existing integration designs without requiring application-level changes.
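
To give a sense of the sizing exercise described above, the sketch below works through the kind of back-of-the-envelope arithmetic involved: given a peak backlog, an arrival rate, and a target processing window, what sustained throughput does the isolated audit lane need? The figures are illustrative assumptions, not the actual traffic numbers from this engagement.

```python
# Back-of-the-envelope sizing for an isolated audit processing lane.
# All input figures are illustrative assumptions, not measured values.

def required_throughput(peak_backlog_msgs: int,
                        arrival_rate_per_hour: float,
                        target_window_hours: float,
                        headroom_factor: float = 1.5) -> float:
    """Messages/hour the audit lane must sustain to clear a peak backlog
    plus ongoing arrivals within the target window, with safety headroom."""
    workload = peak_backlog_msgs + arrival_rate_per_hour * target_window_hours
    return headroom_factor * workload / target_window_hours

if __name__ == '__main__':
    # Illustrative: a 3M-message backlog, 250k new audit messages per hour,
    # and a 4-hour processing target (the post-optimization goal).
    rate = required_throughput(peak_backlog_msgs=3_000_000,
                               arrival_rate_per_hour=250_000,
                               target_window_hours=4)
    print(f'Sustained throughput needed: {rate:,.0f} messages/hour')
```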

6.2 Performance Testing and Validation

The engineer led the performance testing effort, designing test scenarios that reflected real-world audit volumes and processing patterns. Each configuration variant was tested against the same baseline, producing quantitative comparisons that removed ambiguity from the evaluation process.

The testing methodology was structured around three criteria:

  • Processing Throughput: How much audit data could be processed per unit of time under the proposed configuration?
  • Resource Impact: What was the effect on overall IHUB platform performance during audit processing?
  • Data Integrity: Did the new approach maintain complete and accurate audit records with zero data loss?

The results were presented to all involved stakeholders, providing a clear, evidence-based comparison between the existing approach and the proposed solution. The performance difference was not incremental; it was substantial enough to justify immediate implementation.
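
A minimal harness for the throughput and data-integrity criteria above might look like the sketch below. The get_message callable and the per-record id field are placeholders for whatever client and payload format the platform actually uses; this illustrates the measurement approach, not the team's test suite.

```python
# Minimal sketch of a throughput and integrity check for an audit lane.
# get_message is a placeholder for an MQ client get call; it should return
# a message dict, or None once the queue is drained.
import time

def drain_and_measure(get_message, expected_ids):
    """Drain a queue, measuring throughput and verifying no records were lost."""
    seen = set()
    start = time.monotonic()
    while True:
        msg = get_message()
        if msg is None:
            break
        seen.add(msg['id'])            # assumes each audit record carries an id
    elapsed = time.monotonic() - start
    processed = len(seen)
    missing = set(expected_ids) - seen  # must be empty for zero data loss
    throughput = processed / elapsed if elapsed > 0 else float('inf')
    return {'processed': processed,
            'throughput_per_sec': throughput,
            'missing': missing}

# Example with an in-memory stand-in for the queue:
if __name__ == '__main__':
    backlog = [{'id': i} for i in range(10_000)]
    it = iter(backlog)
    result = drain_and_measure(lambda: next(it, None), expected_ids=range(10_000))
    print(result['processed'], len(result['missing']))
```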

Figure 2. MQ cluster re-architecture: isolating business-critical and audit processing lanes. Before: business and audit messages competed on a single shared MQ queue, producing resource contention, business delays, and an audit backlog. After: business-critical messages flow through a dedicated MQ cluster queue with priority processing for fast sale and post-sale turnaround, while non-critical audit messages flow through an isolated MQ cluster queue processed independently, keeping audits stable at under 4 hours.

7. Results and Business Impact

The production deployment delivered immediate, measurable results that validated the architectural approach and exceeded initial expectations.

7.1 Immediate Performance and Reliability Gains

The results following production deployment were immediate and measurable. Audit delays dropped from 24 hours to 4 hours at peak, and overall audit processing reliability improved by 73%. The improvement was visible from the first day the new configuration was deployed. The additional MQ clusters provided the capacity headroom needed to handle volume spikes without degrading either audit or business-critical processing.

73% Reliability Improvement
24h → 4h Audit Delay Reduction
200+ Integrations Benefiting

7.2 Business Impact: Sale and Post-Sale Turnaround

The architectural separation of business-critical and audit message processing delivered benefits that extended well beyond the audit subsystem. By eliminating the resource contention caused by audit volume competing with transaction processing, the platform's handling of business-critical messages improved measurably. Customer turnaround time for both sale and post-sale operations was reduced, directly impacting revenue-generating workflows.

One of the most significant downstream improvements was in post-sale arbitration processing. Prior to the optimization, the shared queue architecture meant that high-volume audit processing could slow down arbitration workflows, adding delays to a process that directly affected customer experience and dealer satisfaction. With business-critical messages now flowing through dedicated, isolated MQ cluster queues, arbitration processing operated at full throughput without interference from audit traffic.

7.3 Eliminating Queue Corruption

The queue corruption events that had required periodic manual intervention stopped entirely. By isolating audit processing onto dedicated queues with appropriate capacity and throughput characteristics, the conditions that had made the audit queue susceptible to corruption under heavy load were removed. The audit subsystem moved from periodic instability to consistent, reliable operation.

7.4 Regaining Customer Trust and Cross-Team Collaboration

Perhaps the most meaningful outcome was the restoration of trust. Prior to the optimization, the relationship between the platform team and the app dev teams it served had been strained by persistent audit delays and middleware instability. Teams had learned to expect poor performance from GAL/GEH and had adjusted their behavior accordingly, either by avoiding the audit capability or by escalating issues that the platform team had no effective answer for.

After the deployment demonstrated sustained, reliable performance at the 4-hour processing mark, trust began to rebuild. App dev teams re-engaged with GAL/GEH as a dependable capability. The collaborative process of building, testing, and optimizing the solution had also strengthened working relationships across app dev, operations, vendors, and platform teams.

7.5 Restoring GAL/GEH to Its Intended Purpose

The most significant outcome was not the speed improvement in isolation but what it enabled. With audit data processing completing in 4 hours instead of 24, application development teams across the enterprise could use GAL/GEH for its originally intended purpose: timely verification of integration data. Teams that had abandoned the audit capability in favor of workaround solutions began returning to the centralized platform, restoring the visibility that it was designed to provide.

7.6 Changing How Developers Designed Solutions

An important secondary effect was a shift in how development teams approached GAL/GEH usage in their integration designs. Prior to the optimization, teams had learned to minimize their reliance on the audit capability because of the performance overhead. Integration designs were crafted to limit audit usage to the absolute minimum, often at the expense of audit coverage.

After the optimization demonstrated that GAL/GEH could process volumes efficiently, teams began redesigning their integrations with appropriate audit coverage. GAL/GEH usage became selective and intentional rather than avoidant. Development teams could now make design decisions based on business requirements rather than platform limitations.

Outcome

The optimization did not merely fix a performance problem. By decoupling business-critical integrations from audit logs, it restored an enterprise capability that teams had stopped trusting, eliminated queue corruption events that had caused unpredictable disruptions, reduced audit delays from 24 hours to 4 hours, improved reliability by 73%, reduced customer turnaround time for sale and post-sale arbitration, and removed a growing source of operational risk from the platform.

7.7 Platform Stability and Risk Reduction

With the millions-message audit processing backlog eliminated, the IHUB platform reclaimed the compute and I/O resources that had been consumed by prolonged GAL/GEH processing cycles. This translated directly into improved headroom for production message flows and reduced the risk of platform-level performance degradation during peak periods. For a system supporting 200+ integrations across the enterprise, this stability improvement was a material reduction in operational risk.

8. Lessons for Enterprise Engineering Leaders

The technical solution delivered measurable results, but the principles underlying the approach have broader applicability. These four lessons distill the strategic thinking that made this initiative possible.

Principle 1: Challenge Accepted Constraints

The most consequential decision in this initiative was the simplest one: refusing to accept that a known problem was unsolvable. Enterprise environments are full of performance limitations, architectural compromises, and operational workarounds that persist not because they are necessary but because no one has recently questioned whether they remain valid. The longer a constraint persists, the more deeply it becomes embedded in organizational planning and expectations.

Action: Cultivate a practice of periodically revisiting long-standing limitations, particularly those that affect platform-level capabilities. The question to ask is not "has anything changed since the last time we looked at this?" but rather "if we were designing this system today, would we accept this performance profile?" If the answer is no, the constraint deserves fresh investigation.

Principle 2: Let Evidence Lead the Conversation

The proof-of-concept approach, building a working demonstration before requesting organizational buy-in, proved to be the most effective path to action. Proposals generate debate. Prototypes generate decisions.

This is particularly true when the proposal challenges an established organizational belief. If the prevailing consensus is that a problem cannot be solved, no amount of theoretical argument will shift that belief. A working prototype provides tangible evidence that changes the conversation from speculation to evaluation.

Action: Allocate time for engineering-led proof of concepts. Create a culture where building a small prototype is a valid first step toward solving a large problem.

Principle 3: Persistence Creates Opportunities for Collaboration

Bringing the right stakeholders to the table required sustained effort. The engineer's persistence in raising the issue, backed by proof-of-concept evidence, eventually created the conditions for productive cross-team collaboration. Without that persistence, the technical solution alone would not have been sufficient. Enterprise problems require enterprise coordination, and that coordination rarely happens without someone actively making the case for action.

Action: Identify and empower platform advocates within teams. Give engineers the latitude to pursue investigations that cross organizational boundaries.

Principle 4: Platform Health is Business Health

The GAL/GEH issue was initially perceived as a developer inconvenience rather than a business risk. This perception allowed the problem to persist without urgency. By reframing the issue in terms of platform stability and business continuity risk, the conversation shifted from "nice to fix" to "necessary to fix."

Action: Consistently frame platform health issues in terms of their business impact. Measure the organizational cost of platform limitations, particularly for middleware platforms that serve as shared infrastructure across multiple business units.

Principle | Core Insight | Actionable Application
Challenge accepted constraints | Long-standing limitations deserve periodic reassessment | Schedule quarterly reviews of known platform constraints
Let evidence lead | Evidence shifts conversations faster than proposals | Allocate time for engineering-led proof of concepts
Persistence enables collaboration | Cross-team engagement requires sustained advocacy | Identify and empower platform advocates within teams
Platform health is business health | Distributed impact is invisible without explicit quantification | Measure the organizational cost of platform limitations

9. About the Author

Hemanth Shivanna
Co-Founder & AI Solutions Architect, Elite Technology Solutions | Enterprise Product & Platform Security Leader

Across 19 years at publicly traded and Fortune 500 enterprises in automotive, fleet services, and financial services, Hemanth built and led a 35+ person global team spanning platform engineering, SRE, observability, and integration modernization. He leads from the front: writing runbooks alongside the engineers who use them, joining bridge calls at 2 AM, and coaching global teams to own outcomes rather than follow scripts.

Hemanth has led initiatives that achieved $1.7M in annual savings through Splunk optimization, a 40% reduction in mean time to resolution for critical incidents, and the compression of a 3-year IBM IIB-to-ACE modernization program into 9 months. He has managed cross-functional teams across the United States, Canada, and India, supporting platforms that serve $6B+ in enterprise operations.

His subject matter expertise includes IBM MQ and Message Broker (App Connect Enterprise), Splunk, Cribl Stream, MuleSoft, AWS cloud infrastructure, and Terraform-based Infrastructure as Code.

AWS Solutions Architect Associate | Salesforce Agentic AI Specialist | ITIL Certified | Microsoft Certified | Cisco Certified | MBA
Part 1

The Leadership Story

How trust was built, evidence replaced opinion, and a single engineer's initiative transformed an enterprise-wide constraint

Part 1 Disclaimer

This project was completed in 2015. The specific technology details (IBM Message Broker v7/v8, MQ Clustering configurations, AIX platform specifics) reflect the tooling of that era and are not the focus of this narrative. What endures, and what this Part 1 explores, is the leadership approach: how trust was built across skeptical teams, how evidence replaced opinion, and how a single engineer's initiative transformed an enterprise-wide constraint into a solved problem. The technical principles of decoupling, isolation, and evidence-based architecture remain universally applicable.

Part 1, Section 1: The Organizational Blind Spot

Section 4 documented the organizational dynamics that allowed the 24-hour audit delay to persist. This Part 1 narrative goes deeper into the leadership story: what it looked like on the ground, how the constraint was challenged, and what other leaders can apply from the experience.

Pattern Recognition

When everyone accepts a problem as permanent, the problem compounds. The 24-hour audit delay did not remain a static inconvenience. It became the justification for downstream delays, design compromises, and eroded trust. "That is just how IHUB works" functioned as an organizational permission slip for further degradation. The constraint was no longer just a technical limitation. It had become a cultural one.

The actual cost of this normalization was distributed and therefore invisible in aggregate. Each of the 100+ application teams absorbed its share individually: manual verification steps added to developer workflows, additional logging built at the application layer to compensate for unreliable audit data, delayed incident resolution when teams could not quickly verify whether a transaction had been processed. No single team bore enough of the cost to justify a platform-wide initiative. So no single team championed one.

Meanwhile, the underlying dynamics were worsening. As integration volume across 200+ active connections grew, audit message volume grew proportionally. The shared MQ queues that served both business-critical and audit traffic were absorbing more pressure. The backlog of millions of messages was not a stable condition. It was an expanding one. The platform was drifting toward a point where the audit subsystem's resource consumption would begin visibly degrading the business-critical flows that the organization depended on.

The problem was not going to solve itself through organizational patience. It required someone to look at the constraint not as a given, but as a hypothesis worth testing.

Part 1, Section 2: Taking Ownership Without Permission

The decision to investigate the GAL/GEH performance problem did not originate from a project charter or a management directive. It originated from the recognition that the platform's trajectory was unsustainable. The audit application's tight coupling to IHUB's message processing layer meant that as integration volumes continued to grow, the bottleneck would not stay contained within the audit subsystem. It would migrate upstream into business-critical transaction flows. That migration had already begun, quietly, in the form of slightly elevated processing times and occasional queue depth warnings that had not yet crossed incident thresholds.

The engineer who recognized this pattern made a choice that sits at the core of effective platform leadership: taking ownership of a problem that was not assigned to them. The GAL/GEH performance issue fell into the gap between platform ownership and application ownership. The platform team could reasonably say that the audit application functioned within its documented parameters. Application teams could reasonably say that queue architecture was not their domain. In that gap, the problem had persisted for years.

Rather than submitting a proposal that would enter an approval process against competing priorities, the engineer built a proof of concept on personal time, outside of formal work responsibilities. The scope was deliberately narrow: demonstrate that MQ cluster isolation could provide the capacity separation needed to decouple audit message processing from business-critical traffic. Not a production solution. Not a fully engineered architecture. A working demonstration of the principle.

Leadership Principle

A working prototype is worth more than a hundred slides. When an engineer brings a proof of concept to a meeting instead of a proposal, the burden of proof shifts from "why should we try this" to "why would we not pursue this." The prototype reframes the organizational question entirely. It does not ask for permission to explore. It presents a conclusion and asks for resources to confirm it.

The POC produced a measurable result under controlled conditions: audit processing that had taken 24 hours in the shared queue architecture completed in a fraction of that time when given isolated queue resources. The performance gap was not subtle. It was categorical. The constraint that the organization had accepted as permanent was, under a different architecture, not a constraint at all. It was an artifact of a design choice that could be revisited.

The significance of this moment extended beyond the technical finding. The POC changed the organizational question. Before it existed, any effort to address GAL/GEH performance had to first overcome the burden of proof: convince leadership and peer teams that improvement was possible, that the investment was warranted, that the risk of changing a production system was justified by an uncertain payoff. After the POC existed, the burden of proof inverted. The question was no longer "can this be improved?" The question became "given that we can clearly improve this, why would we not?"

That inversion is the primary strategic value of evidence-based advocacy. It does not win arguments. It ends them.

Part 1, Section 3: Bringing People to the Table

A proof of concept that stays in a lab changes nothing. The engineer understood that the POC's value was as a forcing function, not as a finished deliverable. Its purpose was to create conditions under which the right stakeholders would engage seriously with a problem they had previously written off. That meant bringing people to the table who had reasons to be skeptical, reasons to protect their existing workloads and priorities, and institutional memory of past attempts to improve a platform that had not delivered on its promises.

The engagement approach was systematic. Application development teams, particularly the ICE team responsible for integration and configuration engineering, were among the first audiences. These were the teams whose developers had daily contact with GAL/GEH limitations and who had built the workaround solutions that had become part of their standard practice. They had the most direct experience of the problem's costs and the most credible voice in assessing whether a proposed solution addressed the real pain points.

IBM vendor support was engaged early and collaboratively. The MQ cluster architecture required configuration choices that fell within IBM's domain of expertise, and proceeding without vendor alignment would have created both technical risk and significant relationship friction. Rather than presenting IBM with a fait accompli, the engineer brought them into the testing and validation process, framing the engagement as a shared technical exploration rather than a demand for support on an uncommitted direction.

Platform operations and service owners from each line of business were included in the evidence review. These audiences had different concerns from the development teams. They were less interested in technical architecture and more focused on operational risk, rollback capability, and the reliability of any new configuration under production load conditions. The testing data was presented in terms that addressed their specific concerns directly.

The trust-building approach across all these groups shared three consistent elements: all performance data was shared transparently, without filtering for favorable results; skeptics were invited to observe testing sessions directly rather than receiving summarized findings; and the solution was consistently framed as a shared outcome rather than an individual contribution. The language used throughout the engagement was "what we can achieve together" rather than "what I have built."

The Human Moment

There was a specific moment during the stakeholder engagement process that marked the turning point in organizational buy-in. A service owner who had managed one of the most heavily impacted lines of business, and who had submitted audit delay escalations that went unresolved for years, observed a test run in which audit processing completed in under an hour. They had come to the session skeptical. The shift in their posture during that session, from arms-crossed patience to genuine engagement, was visible to everyone in the room. When that service owner became an advocate, other service owners followed. The technical case had already been made. This moment made it real.

Navigating the IBM vendor relationship deserves specific attention. Vendor relationships in enterprise environments carry their own politics. IBM had deep familiarity with the existing MQ configuration and a natural interest in any architectural changes being implemented correctly. By engaging IBM as a technical partner in the validation process rather than as a gatekeeper or a rubber-stamp, the relationship remained collaborative throughout. IBM's input on cluster configuration tuning proved genuinely valuable during subsequent testing iterations.

Part 1, Section 4: Executing with Rigor Under Scrutiny

Moving from proof of concept to production-grade solution required shifting from demonstration mode to engineering mode. The POC had established the principle. The execution phase required establishing the evidence: comprehensive performance testing, documented configuration decisions, phased rollout with rollback capability, and transparent reporting at each stage to the stakeholder group that had been assembled during the alignment phase.

The performance testing effort was led by the engineer and structured around three criteria that had been agreed upon with stakeholders in advance. This advance agreement was not procedural formality. It was a trust mechanism. When the criteria for success are defined before testing begins, the results carry a credibility that post-hoc criteria cannot achieve.

Test Criterion | What Was Measured | Why It Mattered to Stakeholders
Processing Throughput | Volume of audit messages processed per unit of time under the proposed MQ cluster configuration, compared against baseline shared-queue performance | Service owners needed to see that audit processing would reliably complete within an acceptable window, not just perform better in isolation
Resource Impact | Effect on overall IHUB platform performance during sustained audit processing, including CPU, I/O, and queue depth on business-critical channels | Platform operations needed confirmation that the new architecture would not trade one form of contention for another
Data Integrity | Completeness and accuracy of audit records under the isolated queue configuration, with zero data loss as the acceptance threshold | Compliance and application teams required assurance that the architectural change did not degrade audit record quality in exchange for speed

The MQ cluster re-architecture introduced dedicated queue resources for business-critical message processing, completely isolated from the audit processing lane. Business-critical integrations routed to dedicated MQ cluster queues where they would no longer compete with audit traffic for depth, processing cycles, or I/O bandwidth. The audit subsystem received its own isolated queue cluster, sized to handle peak audit volumes without creating backpressure on the broader platform.

The rollout followed a phased approach. Validation in lower environments preceded any production deployment, and each phase included defined rollback procedures. This was not caution for its own sake. It was what a stakeholder group that had been burned by platform instability in the past required before extending trust to a new configuration.

The execution did not proceed without friction. Tuning the cluster configuration to perform correctly under production-representative load conditions required multiple iterations with IBM's technical team. Not every configuration variant worked as expected on the first attempt. Some combinations produced unexpected behavior under specific load patterns that had not been present in the lower-environment testing. Each failure was documented and shared with the stakeholder group transparently, with the specific tuning change being tested next. This transparency, particularly in moments of difficulty, was what maintained the stakeholder trust that had been built during the alignment phase. Teams that are kept informed through setbacks stay engaged. Teams that receive only good news become skeptical when problems eventually surface.

The discipline of sharing unfavorable data as readily as favorable data was a deliberate choice and one of the more consequential decisions made during the execution phase. It would have been organizationally easier to share only the successful test runs. The commitment to sharing everything, including the configurations that failed and why, created a standard of transparency that later made the final results more credible to every stakeholder who had observed the process.

Part 1, Section 5: The Results That Changed Behavior

Production deployment confirmed what the testing had indicated, and then went further. The performance improvement was immediate and visible from the first operational day. The audit processing time that had measured in 24-hour cycles dropped to under 4 hours. The millions of messages that had accumulated in the backlog cleared systematically as the isolated queue cluster processed them at a rate the shared architecture had never achieved. The platform reclaimed compute and I/O resources that audit processing had consumed for years.

73% Reliability Improvement
24h → 4h Audit Delay Reduced
Zero Queue Corruption Incidents

The queue corruption events that had required periodic manual intervention stopped occurring entirely. The conditions that had made the audit queue susceptible to corruption under heavy load, specifically the resource pressure created by millions of backlogged messages competing for shared queue capacity, had been eliminated by the isolation architecture. The audit subsystem moved from periodic instability requiring reactive intervention to consistent, uninterrupted operation.

Customer-facing operations registered the improvement in measurable terms. Vehicle sale processing and post-sale arbitration workflows, which had experienced elevated turnaround times when audit processing was consuming shared queue resources at peak load, now ran at full throughput without audit traffic interference. For the service owners who managed these workflows, the improvement was not an abstract platform metric. It was visible in the operational data they reviewed daily.

The most significant secondary effect was behavioral. Development teams that had designed their integrations to minimize GAL/GEH usage, specifically because the audit capability's performance overhead had made heavy usage counterproductive, began reconsidering those design choices. The constraint that had shaped their integration architectures for years was no longer present. Developers could now make audit coverage decisions based on business requirements and compliance needs rather than platform limitations. GAL/GEH usage became intentional and appropriate rather than minimal and avoidant.

Outcome

The optimization did not merely fix a performance metric. It restored a capability that the enterprise had written off. Service owners who had submitted escalations for years and received no actionable resolution began working with GAL/GEH as a reliable tool again. The trust deficit that had accumulated over years of audit delays did not disappear overnight, but it began to reverse, grounded in the sustained reliability that followed deployment. Platform confidence is rebuilt through consistent performance, not through promises. That consistency was what the architecture change made possible.

For platform engineering leaders, this outcome carries a specific lesson. The 73% reliability improvement was the headline metric. The behavioral change in how 100+ application teams approached integration design was the compound return. When a platform constraint is removed, the downstream effects on how developers build solutions accumulate over time. The full value of this architectural change could not be measured on the day it deployed. It accrued through every integration designed correctly in the years that followed.

Part 1, Section 6: Principles That Transfer

The technical context of this initiative (IBM Message Broker, MQ clustering, AIX platform infrastructure) belonged to 2015. The leadership principles that made it possible belong to any year. Each of the following principles is distilled from a specific decision or behavior that shaped the initiative's outcome, not from retrospective theorizing about what might have worked.

Principle | Core Insight | Actionable Recommendation
Challenge Accepted Constraints Periodically | Long-standing limitations deserve re-examination. What was technically true three years ago may not be true today. Organizational memory is long; platform capabilities evolve faster than assumptions about them. | Schedule quarterly constraint audits where the team asks: "If we were designing this integration layer today, would we accept this limitation?" If the answer is no, the constraint earns an investigation slot.
Prototype Before You Propose | Evidence moves organizations faster than arguments. A working prototype shifts the question from "should we investigate?" to "why would we not pursue this?" Proposals generate debate. Prototypes generate decisions. | Protect time for POC work on problems that matter, even when that time must come from outside formal assignments. The organizational value of a working demonstration exceeds its engineering cost by a significant margin.
Share Data, Not Just Conclusions | Transparency builds trust faster than any presentation. When stakeholders see raw data, including unfavorable results, they draw their own conclusions. Those conclusions are more durable than ones handed to them. | Make all performance data and test results accessible to stakeholders, including skeptics. Share failures and their explanations as readily as successes. The habit of transparent reporting under scrutiny creates the credibility that makes final results believable.
Design Solutions That Outlast You | The best engineering work is invisible after it ships. A solution that requires its original author's continued presence to function is not a finished solution. Durability requires documentation of the "why," not just the "how." | Document the reasoning behind every significant architectural decision alongside the implementation. Ensure the team that inherits the solution understands the constraints it was designed to address, so they can adapt it intelligently when those constraints change.
Measure the Human Impact | Spreadsheet metrics tell the board story. The moment a service owner's skepticism becomes advocacy tells the real story. Platform trust is a business asset, and its restoration is measurable in behavioral change, not just in performance graphs. | After every major delivery, ask service owners directly: "Do you trust this platform more today than before?" The answer to that question is the leading indicator for whether the technical improvement will produce durable organizational value.

Part 1: Reflective Close

Looking back at this initiative from the vantage point of a decade later, the most instructive element is not the technical solution. The MQ cluster re-architecture was a sound approach to a well-understood class of resource contention problem. In a different environment, with different tooling, a similar pattern would produce similar results. The technical work was significant, but it was not the hard part.

The hard part was persuading an organization that had accepted a constraint as permanent to invest the engineering attention and organizational trust required to challenge it. That persuasion did not happen through argument. It happened through evidence, through the sustained effort of bringing skeptical stakeholders into direct contact with data that contradicted their assumptions, and through the discipline of maintaining transparency about both the successes and the failures along the way.

The problems worth solving are often the ones nobody assigned to you. Formal project pipelines are efficient at routing resources toward known priorities. They are inefficient at identifying constraints that have become invisible through normalization. The engineer who recognized the GAL/GEH trajectory as a growing risk, rather than a stable inconvenience, was operating from a system-level view that formal assignment structures do not automatically produce. Cultivating that view, the willingness to see organizational blind spots and take ownership of them without waiting for permission, is one of the most transferable capabilities in platform engineering leadership.

The principles from this 2015 initiative have continued to surface across every platform transformation engagement since: in observability modernization programs, in cloud migration programs, in incident response redesigns. The specific technology changes. The pattern of normalized constraints, evidence-based advocacy, and trust rebuilt through sustained performance does not.

Looking Ahead

A decade later, the same integration platform would face a far larger transformation: migrating 370+ workloads from on-premises AIX infrastructure to cloud-native Kubernetes. The scale was different. The timeline was compressed. The stakes were higher. But the foundational approach (build evidence, bring the right people into the room, execute with transparency, design for durability) carried directly from this 2015 initiative into that one. Part 2 tells that story.
