White Paper

Taming the Integration Beast

MQ-ACE Platform Stabilization: From 36 Outages to 3

36 → 3 Critical Outages YoY
60% Stability Improvement vs. Baseline
99.9%+ Platform Uptime
Zero Queue Corruption

Executive Summary

At a multi-billion-dollar B2B services enterprise, the IBM MQ and App Connect Enterprise (ACE) platform processed every business-critical transaction across 20 business units and 100+ application teams. Queue corruption was destroying persistent messages. Critical outages hit 36 times per year. The modernization needed to fix it had been postponed three consecutive years because no one trusted the platform to survive the change.

Through systematic root-cause engineering, process governance, and a deliberate trust-rebuilding campaign with 18+ application development teams, the platform was stabilized to 99.9%+ uptime with zero queue corruption. Critical outages dropped from 36 to 3 per year, a 92% reduction. Performance and stability improved 60% against the newly established baseline. The three-year migration deadlock was broken, and the subsequent modernization program migrated 370+ integrations from IIB to ACE on EKS, compressing a three-year plan into nine months.

This white paper documents the situation-task-action-result arc of the stabilization, the technical and organizational playbook that made it possible, and the leadership principles that transformed a platform in crisis into a foundation for enterprise modernization.

Headline Metric | Impact
Critical outages per year | 36 → 3 (92% reduction)
Performance & stability vs. baseline | 60% improvement in 9 months
Queue corruption events | Recurring → Zero (eliminated)
Platform uptime | Fragile → 99.9%+ (stabilized)
Migration status | 3 years deferred → Completed
Integrations modernized | 370+ IIB to ACE on EKS
Annual cost savings | $2M+ realized

1. Situation: An Integration Platform in Crisis

The MQ-ACE platform connected 20 business units, supported over 100 application development teams, and processed the transactional data that kept a $6B business running around the clock. When it worked, it was invisible. When it failed, everything stopped.

It was failing constantly.

Chronic Technical Failures

The MQ-ACE environment had become the most fragile platform in the enterprise. Queue corruption was not an occasional anomaly; it was a recurring pattern that resulted in the loss of persistent messages: actual data loss in a system designed for guaranteed delivery. The corruption stemmed from a combination of aging backend data stores, inadequate capacity planning, and infrastructure-level issues including MQ lock contention and disk latency that created cascading failures under load.

Every corruption event triggered the same painful cycle: an immediate P1 incident, hours of triage, escalation to IBM support, First-Failure Data Capture (FFDC) analysis, and the opening of a Problem Management Record (PMR). These PMRs would escalate to IBM's specialist teams in the UK, and then silence. Resolutions would not materialize for weeks. The team spent more time managing vendor escalations than engineering solutions.

What We Were Up Against

The platform was seen as fragile, with frequent critical outages and business-integration downtime. App Dev teams had lost confidence in the product, its support, and its operations.

The Numbers That Told the Story

Indicator | State
Critical outages per year | 36
Performance & stability vs. baseline | Baselined pre-migration
Queue corruption events | Recurring (persistent message data loss)
Consecutive years migration postponed | 3
Business units dependent on platform | 20
Application teams affected | 100+
Business revenue dependent on platform | $6B

Organizational Paralysis

The technical instability had infected the organizational culture. The enterprise needed to migrate and modernize the MQ-ACE platform, upgrading operating systems, storage layers, and MQ/Integration versions, but the migration had been postponed three consecutive years because no one believed the platform was stable enough to survive the change. The previous platform lead had departed, leaving a gap in both expertise and stakeholder confidence.

A deeply entrenched mindset had taken root across the organization: "Don't rock the boat." The prevailing logic was simple: if the platform barely holds together as-is, any change risks making it worse. Teams had normalized failure. Application developers had lost faith not just in the technology, but in the operations team's ability to manage it. No one had established formal performance baselines, so there was no objective way to measure improvement or justify change. Critical integrations were running on hope, not engineering.

2. Task: Restore Trust, Prove Stability, Unblock Migration

The assignment was clear but formidable. It was not simply a technical remediation project; it was a trust-rebuilding exercise with measurable outcomes required at every stage.

The Dual Mandate

Immediate: Stabilize the MQ-ACE platform to stop the bleeding. Eliminate queue corruption, reduce critical outages, establish measurable performance baselines, and restore reliability. Every improvement had to be measurable and demonstrable to skeptical stakeholders.

Strategic: Earn enough organizational trust to unblock the migration that had been deferred for three years. This meant not just fixing the platform, but proving, with data, that the team could manage change safely in a production environment supporting a $6B business. The migration would upgrade OS, storage, and MQ/Integration versions via lift and shift, with zero-downtime targets on a 24x7x365 platform.

The task required overcoming a deeply entrenched "don't rock the boat" culture. Any proposed change would face intense scrutiny from 18+ application development teams, each of which needed to sign off on baseline test results before agreeing to migrate. Trust would not be granted; it would have to be earned through evidence.

The Real Problem

This was not a technology problem to be solved. It was a trust deficit to be overcome, and the only currency accepted was evidence.

3. Action: A Systematic Stabilization Playbook

The approach was methodical and evidence-driven. Rather than attempting a single heroic fix, the work was structured across four parallel workstreams: technical stabilization, process governance, stakeholder trust-building, and migration readiness. Each workstream reinforced the others.

Pillar 1 Technical Root-Cause Engineering
Pillar 2 Process Governance & Change Control
Pillar 3 Stakeholder Trust Rebuilding
Pillar 4 Migration Execution

3.1 Technical Root-Cause Engineering

The first priority was understanding why the platform kept failing, rather than simply reacting to each incident. The team shifted from incident management, fighting fires as they erupted, to problem management with root-cause fixes.

Capacity Tuning

Comprehensive capacity analysis was performed across the MQ-ACE environment. Queue depths, message throughput rates, memory allocation, and thread pool configurations were baselined and right-sized. The platform had been running without capacity tuning for years, with configurations that reflected historical load patterns rather than current demand.
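To make the idea of right-sizing concrete, the sketch below derives depth percentiles from monitoring samples and flags queues whose configured maximum depth no longer matches observed demand. The queue name, the 2x headroom policy, and the oversize factor are illustrative assumptions, not the tooling or thresholds actually used.

```python
# Hypothetical capacity-baselining sketch: derive queue-depth percentiles from
# monitoring samples and flag queues whose configured maximum depth is far
# from observed demand. Names, headroom, and thresholds are assumptions.
from statistics import quantiles

def depth_baseline(samples):
    """Return p50/p95/p99 of observed queue depth samples."""
    q = quantiles(samples, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

def right_size(queue_name, samples, configured_max_depth, headroom=2.0):
    """Flag queues whose configured depth no longer matches observed load."""
    b = depth_baseline(samples)
    recommended = int(b["p99"] * headroom)    # assumed 2x headroom policy
    if configured_max_depth < b["p99"]:
        status = "undersized"                 # queue-full risk under load
    elif configured_max_depth > recommended * 5:
        status = "oversized"                  # config reflects historical load
    else:
        status = "ok"
    return {"queue": queue_name, **b, "configured": configured_max_depth,
            "recommended": recommended, "status": status}

# Example: a queue sized for historical load rather than current demand
print(right_size("ORDERS.IN",
                 samples=[120, 300, 450, 900, 1500, 2200, 2600, 3100],
                 configured_max_depth=50000))
```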

Backend Data Store Migration

The recurring queue corruption was traced, in part, to the underlying backend data store. A migration to a stable backend data store was completed, which directly addressed the persistence layer failures that caused message data loss. This was the single most impactful change in eliminating queue corruption.

Critical Insight

The backend data store migration was the single most impactful change in eliminating queue corruption, addressing the persistence layer failures that had caused actual message data loss in a system designed for guaranteed delivery.

MQ Lock and Disk Latency Resolution

MQ lock contention and disk I/O latency were identified as contributing factors to cascading failures. Infrastructure-level optimizations addressed storage performance and reduced the lock contention windows that had been triggering queue corruption under sustained load.

Correlation Rules

Event correlation rules were built that linked monitoring signals across the MQ-ACE stack. Rather than alerting on individual symptoms, the correlation engine identified failure patterns before they escalated to P1 incidents. This reduced alert noise and surfaced actionable intelligence.
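The correlation logic itself can be pictured with a small sketch. The rule, alert fields, and five-minute window below are illustrative assumptions, not the production rule set; the point is that a pattern fires on a combination of related signals inside a window rather than on any single symptom.

```python
# Illustrative event-correlation sketch: a rule fires only when a combination
# of related signals appears within a short time window, instead of paging
# on every individual symptom. Rule contents and field names are assumptions.
from dataclasses import dataclass

@dataclass
class Alert:
    ts: float          # epoch seconds
    source: str        # e.g. "mq", "ace", "storage"
    signal: str        # e.g. "lock_wait_high", "disk_latency_high"

# Hypothetical rule: lock waits plus disk latency plus rising depth in the
# same window is treated as the precursor pattern for corruption under load.
CORRUPTION_PRECURSOR = {"lock_wait_high", "disk_latency_high", "queue_depth_rising"}
WINDOW_SECONDS = 300

def correlate(alerts):
    """Yield (window_start, matched_signals) for windows matching the pattern."""
    alerts = sorted(alerts, key=lambda a: a.ts)
    for i, anchor in enumerate(alerts):
        window = [a for a in alerts[i:] if a.ts - anchor.ts <= WINDOW_SECONDS]
        signals = {a.signal for a in window}
        if CORRUPTION_PRECURSOR <= signals:     # all precursor signals present
            yield anchor.ts, signals

alerts = [
    Alert(1000, "mq", "lock_wait_high"),
    Alert(1090, "storage", "disk_latency_high"),
    Alert(1200, "mq", "queue_depth_rising"),
]
for start, signals in correlate(alerts):
    print("escalate before P1:", start, sorted(signals))
```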

3.2 Process Governance & Change Control

Technical fixes alone would not sustain stability. The team introduced operational discipline that had been absent from the platform's management.

Risk-Based Change Gating

Not all changes carry equal risk. A classification framework was implemented that assessed every proposed change against risk criteria and gated deployments accordingly. High-risk changes required additional validation, rollback plans, and stakeholder sign-off. Evidence was required in every change record, no exceptions. This approach was critical in a regulated environment.
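A minimal sketch of this kind of classification is shown below. The risk factors, weights, and gate thresholds are assumptions chosen for illustration, not the framework actually used; what matters is that the gate is derived from declared risk attributes rather than negotiated case by case.

```python
# Minimal risk-classification sketch: score a proposed change against a few
# factors and gate it accordingly. Factors, weights, and thresholds here are
# illustrative assumptions, not the production framework.

RISK_FACTORS = {
    "touches_persistence_layer": 5,   # queue managers, storage, data stores
    "peak_hours_window": 3,
    "no_prior_rehearsal": 3,
    "shared_infrastructure": 2,
    "new_component_version": 2,
}

def classify(change):
    score = sum(weight for factor, weight in RISK_FACTORS.items() if change.get(factor))
    if score >= 8:
        gate = "high: rollback plan + stakeholder sign-off + evidence in change record"
    elif score >= 4:
        gate = "medium: rollback plan + evidence in change record"
    else:
        gate = "low: standard change record with evidence"
    return score, gate

change = {
    "touches_persistence_layer": True,
    "no_prior_rehearsal": True,
    "shared_infrastructure": True,
    "new_component_version": True,
}
print(classify(change))   # (12, 'high: ...')
```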

Problem Management Discipline

A structured problem management practice was established, moving beyond incident workarounds to permanent root-cause fixes. Each recurring incident pattern was tracked to a problem record with defined ownership, timeline, and resolution criteria. This discipline is what ultimately drove the outage reduction from 36 to 3.
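As a simplified illustration (the field names and grouping key are assumptions), recurring incidents can be folded into problem records by failure signature, so the same failure mode is tracked once with an owner and resolution criteria rather than re-triaged on every occurrence.

```python
# Simplified sketch: fold recurring incidents into problem records keyed by a
# failure signature. Field names and the grouping key are assumptions.
from collections import defaultdict

def signature(incident):
    # Assumed grouping key: affected component plus error class
    return (incident["component"], incident["error_class"])

def build_problem_records(incidents, recurrence_threshold=2):
    grouped = defaultdict(list)
    for inc in incidents:
        grouped[signature(inc)].append(inc)
    records = []
    for sig, hits in grouped.items():
        if len(hits) >= recurrence_threshold:
            records.append({
                "signature": sig,
                "occurrences": len(hits),
                "owner": None,   # assigned when the problem record is opened
                "resolution_criteria": "root cause fixed and verified in production",
            })
    return records

incidents = [
    {"component": "queue_manager", "error_class": "persistence_failure"},
    {"component": "queue_manager", "error_class": "persistence_failure"},
    {"component": "broker", "error_class": "thread_pool_exhaustion"},
]
print(build_problem_records(incidents))
```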

Pattern Recognition

The shift from incident management to problem management, from fighting fires to eliminating the conditions that caused them, is what drove the 92% reduction in critical outages.

Compliance Partnership

The team partnered with product security and compliance teams to align stabilization and migration activities with SOX validation requirements. Every change carried auditable evidence and traceability.

3.3 Stakeholder Trust Rebuilding

The organizational challenge was as significant as the technical one. Overcoming three years of institutional skepticism required a deliberate trust-building campaign.

Baseline-First Methodology

Before proposing any migration, the team completed a proof-of-concept that captured comprehensive baseline performance results on the existing platform. Every key metric (queue throughput, message latency, error rates, resource utilization) was documented. Post-upgrade data was gathered and compared against these baselines. This gave stakeholders empirical evidence, not promises.
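A sketch of the comparison step is shown below; the metric names and tolerance bands are assumptions, but it captures the idea of an objective pass/fail check against the recorded baseline rather than a subjective judgement.

```python
# Baseline-comparison sketch: post-upgrade measurements are checked against
# the recorded baseline with explicit tolerances. Metric names and tolerance
# bands are illustrative assumptions.

# "higher_is_better" metrics may regress by at most `tol` (relative);
# "lower_is_better" metrics may increase by at most `tol` (relative).
TOLERANCES = {
    "throughput_msgs_per_sec": ("higher_is_better", 0.05),
    "p95_latency_ms":          ("lower_is_better",  0.10),
    "error_rate":              ("lower_is_better",  0.00),
}

def compare(baseline, post_upgrade):
    rows = {}
    for metric, (direction, tol) in TOLERANCES.items():
        before, after = baseline[metric], post_upgrade[metric]
        if direction == "higher_is_better":
            ok = after >= before * (1 - tol)
        else:
            ok = after <= before * (1 + tol)
        rows[metric] = {"baseline": before, "post": after, "pass": ok}
    sign_off_ready = all(row["pass"] for row in rows.values())
    return rows, sign_off_ready

baseline = {"throughput_msgs_per_sec": 4200, "p95_latency_ms": 38, "error_rate": 0.001}
post     = {"throughput_msgs_per_sec": 4350, "p95_latency_ms": 35, "error_rate": 0.001}
print(compare(baseline, post))
```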

App Dev Partnership

The team partnered directly with application service owners, application architects, and IBM contractors across all 18+ App Dev teams. Baseline tests were run using pre-deployed integrations on the new platform, and each team was given the data to compare results against their existing environment. Sign-off was earned team by team, not dictated from above.

IBM Professional Services Validation

The stabilization and migration approach was planned and validated with IBM Professional Services, providing an independent third-party endorsement of the methodology. This was critical for stakeholders who had lost trust in both the platform and the internal team's ability to manage it.

The Currency of Trust

Trust was not granted. It was earned, one baseline test, one stakeholder conversation, one data point at a time.

3.4 Migration Execution

With stability proven and trust earned, the migration that had been deferred for three years was finally executed.

Zero-Downtime HA Design

A high-availability migration architecture was designed and executed that maintained continuous service during the OS, storage, and MQ/Integration version upgrades. On a platform supporting a $6B business with 24x7x365 operations, any downtime was unacceptable.

Automated Pipelines & Rehearsed Runbooks

The migration leveraged automated CI/CD pipelines, standardized containers, rehearsed runbooks with defined rollback plans, and audit-ready change evidence at every stage. The same rigor applied to migration preparation was applied to execution.
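One way to picture the rehearsed-runbook discipline is a step executor that records evidence for each action and rolls back completed steps in reverse order on failure. The step names below are hypothetical placeholders, not the actual migration runbook.

```python
# Runbook-executor sketch: every step has a paired rollback, each action is
# logged as audit evidence, and a failure unwinds completed steps in reverse
# order. Step names here are hypothetical, not the actual runbook.
import time

def timestamp():
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())

def run(steps):
    evidence, completed = [], []
    for name, apply_fn, rollback_fn in steps:
        try:
            apply_fn()
            completed.append((name, rollback_fn))
            evidence.append((timestamp(), name, "applied"))
        except Exception as exc:
            evidence.append((timestamp(), name, f"failed: {exc}"))
            for done_name, rollback in reversed(completed):   # unwind what succeeded
                rollback()
                evidence.append((timestamp(), done_name, "rolled back"))
            break
    return evidence

# Placeholder steps standing in for real migration actions
steps = [
    ("quiesce queue manager",  lambda: None, lambda: None),
    ("switch storage backend", lambda: None, lambda: None),
    ("restart and verify",     lambda: None, lambda: None),
]
for entry in run(steps):
    print(entry)
```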

370+ Integrations Modernized

The broader modernization program that followed migrated 370+ IBM IIB integrations to ACE on EKS, compressing what had been a three-year modernization plan into nine months through standardized containers and automated CI/CD. The MQ-ACE stabilization made this acceleration possible. Without the trust earned during stabilization, the modernization would never have been approved.

Leadership Principle

A working prototype is worth more than a hundred slides. The stabilization proved the team could manage change safely, and that proof unlocked everything that followed.

4. Result: From 36 Outages to 3. From Paralysis to Progress.

The stabilization delivered measurable, sustained results across every dimension that had been failing.

Core Stabilization Outcomes

Metric | Before | After | Impact
Critical outages/year | 36 | 3 | 92% reduction
Stability vs. baseline | Baselined | +60% | Achieved in 9 months
Queue corruption | Recurring | Zero | Eliminated
Processing backlog | Present | None | Eliminated
Platform uptime | Fragile | 99.9%+ | Stabilized
Migration status | Deferred 3 yrs | Completed | Trust restored

Extended Business Impact

The stabilization created a foundation of trust that enabled the broader modernization program:

  • 370+ integrations modernized (IIB to ACE on EKS) - three-year plan compressed to nine months
  • 40% MTTR reduction - correlation rules, structured incident response, observability
  • Zero downtime during migration - HA architecture and rehearsed runbooks validated
  • 99.9% SLA sustained post-migration - stability proved durable, not temporary
  • $2M+ annual cost savings - platform optimization and reduced incident overhead

The Moment That Defined Success

The moment that defined success was not a dashboard metric. It was the first time an App Dev team leader approved a migration window without hesitation, because the data had already made the case.

5. Takeaway: What This Story Teaches

This was not a technology story. It was a trust story that happened to involve technology.

The MQ-ACE platform was not inherently broken beyond repair. It was under-governed, under-monitored, and under-invested. The failures were systemic: inadequate capacity planning, absent problem management, reactive operations, and a vendor support model that was slow to resolve deep infrastructure issues. The organizational response, deferring change to avoid risk, made the situation worse with every passing year.

The stabilization succeeded because it addressed all three dimensions simultaneously:

Principle 1: Technical Precision

Every action was targeted at a documented root cause, not a symptom. Capacity tuning, backend data store migration, and MQ lock resolution each addressed specific failure modes. The discipline of tracing symptoms to causes, and fixing causes rather than working around them, is what separated this effort from the years of reactive firefighting that preceded it.

Actionable Application

Fix the cause, not the symptom. If you find yourself building the same workaround twice, you are managing incidents, not solving problems.

Principle 2: Process Discipline

Risk-based change gating and structured problem management created a framework where improvements could be made safely and at pace. This was the key to sustaining results, not just achieving them. Without process discipline, technical fixes become isolated victories that erode over time.

Actionable Application

Stability without governance is luck. Governance without evidence is bureaucracy. The combination of both is engineering discipline.

Principle 3: Evidence Over Promises

Baseline-first methodology, POC validation, and team-by-team sign-off replaced organizational skepticism with data-backed confidence. Trust was earned, not declared. In environments where trust has been broken, the only path forward is to let the evidence speak, and to make that evidence available to everyone who needs to see it.

Actionable Application

A working prototype is worth more than a hundred slides. When trust is broken, the only currency that works is evidence, shared openly, measured objectively, and validated independently.

Diagnostic Checklist: Red Flags of Platform Instability

  • Recurring incidents are managed through workarounds rather than root-cause fixes
  • Migration or modernization has been deferred more than once due to stability concerns
  • No formal performance baselines exist to measure improvement
  • Vendor escalations consume more engineering time than internal solution development
  • The prevailing culture is "don't rock the boat," where change is seen as risk, not opportunity
  • Application teams have lost confidence in the platform team's ability to deliver
  • Compliance and security reviews are treated as obstacles rather than partners

Looking Forward

Greenfield programs excite me because you design not just the platform but also the culture and habits that make it thrive.

6. About the Author

Hemanth Shivanna

Across 19 years at publicly traded and Fortune 500 enterprises in automotive, fleet services, and financial services, Hemanth built and led a 35+ person global team spanning platform engineering, SRE, observability, and integration modernization. His work consistently sits at the intersection of technical depth and organizational change - stabilizing platforms, building trust with stakeholders, and creating the conditions for sustainable modernization.

Hemanth holds multiple industry certifications across cloud architecture, ITSM, and enterprise tooling, and has led cross-functional teams supporting platforms that underpin billions in business revenue.
