Executive Summary
At a multi-billion-dollar B2B services enterprise, the IBM MQ and App Connect Enterprise (ACE) platform processed every business-critical transaction across 20 business units and 100+ application teams. Queue corruption was destroying persistent messages. Critical outages hit 36 times per year. The modernization needed to fix it had been postponed three consecutive years because no one trusted the platform to survive the change.
Through systematic root-cause engineering, process governance, and a deliberate trust-rebuilding campaign with 18+ application development teams, the platform was stabilized to 99.9%+ uptime with zero queue corruption. Critical outages dropped from 36 to 3 per year, a 92% reduction. Performance and stability improved 60% against the newly established baseline. The three-year migration deadlock was broken, and the subsequent modernization program migrated 370+ integrations from IIB to ACE on EKS, compressing a three-year plan into nine months.
This white paper documents the situation-task-action-result arc of the stabilization, the technical and organizational playbook that made it possible, and the leadership principles that transformed a platform in crisis into a foundation for enterprise modernization.
| Headline Metric | Impact |
|---|---|
| Critical outages per year | 36 → 3 (92% reduction) |
| Performance & stability vs. baseline | 60% improvement in 9 months |
| Queue corruption events | Recurring → Zero (eliminated) |
| Platform uptime | Fragile → 99.9%+ (stabilized) |
| Migration status | 3 years deferred → Completed |
| Integrations modernized | 370+ IIB to ACE on EKS |
| Annual cost savings | $2M+ realized |
1. Situation: An Integration Platform in Crisis
The MQ-ACE platform connected 20 business units, supported over 100 application development teams, and processed the transactional data that kept a $6B business running around the clock. When it worked, it was invisible. When it failed, everything stopped.
It was failing constantly.
Chronic Technical Failures
The MQ-ACE environment had become the most fragile platform in the enterprise. Queue corruption was not an occasional anomaly; it was a recurring pattern that resulted in the loss of persistent messages: actual data loss in a system designed for guaranteed delivery. The corruption stemmed from a combination of aging backend data stores, inadequate capacity planning, and infrastructure-level issues including MQ lock contention and disk latency that created cascading failures under load.
Every corruption event triggered the same painful cycle: an immediate P1 incident, hours of triage, escalation to IBM support, First-Failure Data Capture (FFDC) analysis, and the opening of a Problem Management Record (PMR). These PMRs would escalate to IBM's specialist teams in the UK, and then silence. Resolutions would not materialize for weeks. The team spent more time managing vendor escalations than engineering solutions.
The platform was seen as fragile, with frequent critical and business integration downtime. The App Dev team had lost confidence in the product, support, and operations.
The Numbers That Told the Story
| Indicator | State |
|---|---|
| Critical outages per year | 36 |
| Formal performance baselines | None established |
| Queue corruption events | Recurring (persistent message data loss) |
| Consecutive years migration postponed | 3 years |
| Business units dependent on platform | 20 |
| Application teams affected | 100+ |
| Business revenue dependent on platform | $6B |
Organizational Paralysis
The technical instability had infected the organizational culture. The enterprise needed to migrate and modernize the MQ-ACE platform, upgrading operating systems, storage layers, and MQ/Integration versions, but the migration had been postponed three consecutive years because no one believed the platform was stable enough to survive the change. The previous platform lead had departed, leaving a gap in both expertise and stakeholder confidence.
A deeply entrenched mindset had taken root across the organization: "Don't rock the boat." The prevailing logic was simple: if the platform barely holds together as-is, any change risks making it worse. Teams had normalized failure. Application developers had lost faith not just in the technology, but in the operations team's ability to manage it. No one had established formal performance baselines, so there was no objective way to measure improvement or justify change. Critical integrations were running on hope, not engineering.
2. Task: Restore Trust, Prove Stability, Unblock Migration
The assignment was clear but formidable. It was not simply a technical remediation project; it was a trust-rebuilding exercise with measurable outcomes required at every stage.
The Dual Mandate
Immediate: Stabilize the MQ-ACE platform to stop the bleeding. Eliminate queue corruption, reduce critical outages, establish measurable performance baselines, and restore reliability. Every improvement had to be measurable and demonstrable to skeptical stakeholders.
Strategic: Earn enough organizational trust to unblock the migration that had been deferred for three years. This meant not just fixing the platform, but proving, with data, that the team could manage change safely in a production environment supporting a $6B business. The migration would upgrade OS, storage, and MQ/Integration versions via lift and shift, with zero-downtime targets on a 24x7x365 platform.
The task required overcoming a deeply entrenched "don't rock the boat" culture. Any proposed change would face intense scrutiny from 18+ application development teams, each of which needed to sign off on baseline test results before agreeing to migrate. Trust would not be granted; it would have to be earned through evidence.
This was not a technology problem to be solved. It was a trust deficit to be overcome, and the only currency accepted was evidence.
3. Action: A Systematic Stabilization Playbook
The approach was methodical and evidence-driven. Rather than attempting a single heroic fix, the work was structured across four parallel workstreams: technical stabilization, process governance, stakeholder trust-building, and migration readiness. Each workstream reinforced the others.
3.1 Technical Root-Cause Engineering
The first priority was understanding why the platform kept failing, rather than simply reacting to each incident. The team shifted from incident management, fighting fires as they erupted, to problem management with root-cause fixes.
Capacity Tuning
Comprehensive capacity analysis was performed across the MQ-ACE environment. Queue depths, message throughput rates, memory allocation, and thread pool configurations were baselined and right-sized. The platform had been running without capacity tuning for years, with configurations that reflected historical load patterns rather than current demand.
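The right-sizing logic can be sketched as follows. This is a minimal illustration, not the engagement's actual tooling: the headroom factor, alert fraction, and sample values are assumptions chosen for the example.

```python
# Hypothetical sketch: derive a queue's max-depth limit and alert threshold
# from observed depth samples, so configuration reflects current demand
# rather than historical load patterns. Factors here are illustrative.

def right_size(depth_samples, headroom=2.0, alert_fraction=0.8):
    """Right-size a queue from observed load: capacity for twice the
    observed peak, with an alert before the limit is reached."""
    peak = max(depth_samples)
    max_depth = int(peak * headroom)            # headroom above observed peak
    alert_at = int(max_depth * alert_fraction)  # warn well before the limit
    return {"max_depth": max_depth, "alert_at": alert_at, "observed_peak": peak}

# Example: queue depths sampled over one week for a single queue
samples = [120, 340, 95, 410, 280, 510, 190]
print(right_size(samples))
# {'max_depth': 1020, 'alert_at': 816, 'observed_peak': 510}
```

The same exercise was repeated across throughput, memory, and thread-pool settings, with each derived value recorded as part of the baseline.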
Backend Data Store Migration
The recurring queue corruption was traced, in part, to the underlying backend data store. A migration to a stable backend data store was completed, directly addressing the persistence layer failures that caused message data loss.
The backend data store migration was the single most impactful change in eliminating queue corruption, addressing the persistence layer failures that had caused actual message data loss in a system designed for guaranteed delivery.
MQ Lock and Disk Latency Resolution
MQ lock contention and disk I/O latency were identified as contributing factors to cascading failures. Infrastructure-level optimizations addressed storage performance and reduced the lock contention windows that had been triggering queue corruption under sustained load.
Correlation Rules
Event correlation rules were built that linked monitoring signals across the MQ-ACE stack. Rather than alerting on individual symptoms, the correlation engine identified failure patterns before they escalated to P1 incidents. This reduced alert noise and surfaced actionable intelligence.
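The core idea behind the correlation rules can be sketched in a few lines: group events that occur close together in time across the stack into one candidate incident, rather than paging on each symptom. The event schema and window size below are illustrative assumptions, not the production rule set.

```python
def correlate(events, window_s=300):
    """Group events occurring within `window_s` seconds of each other into
    one candidate incident, instead of alerting on each symptom separately."""
    events = sorted(events, key=lambda e: e["ts"])
    groups, current = [], []
    for ev in events:
        # A gap larger than the window closes the current group
        if current and ev["ts"] - current[-1]["ts"] > window_s:
            groups.append(current)
            current = []
        current.append(ev)
    if current:
        groups.append(current)
    return groups

# Hypothetical signals: disk latency, MQ lock waits, and rising queue depth
# arrive within minutes of each other; a later flow restart is unrelated.
events = [
    {"ts": 0,    "source": "disk",  "msg": "latency spike"},
    {"ts": 40,   "source": "mq",    "msg": "lock wait"},
    {"ts": 90,   "source": "queue", "msg": "depth rising"},
    {"ts": 5000, "source": "ace",   "msg": "flow restart"},
]
print(len(correlate(events)))  # 2: one correlated pattern, one isolated event
```

In production, grouping by time window would be combined with topology (which queue manager, which integration server) so a disk latency spike and the lock waits it causes surface as a single actionable pattern.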
3.2 Process Governance & Change Control
Technical fixes alone would not sustain stability. The team introduced operational discipline that had been absent from the platform's management.
Risk-Based Change Gating
Not all changes carry equal risk. A classification framework was implemented that assessed every proposed change against risk criteria and gated deployments accordingly. High-risk changes required additional validation, rollback plans, and stakeholder sign-off. Evidence was required in every change record, no exceptions. This approach was critical in a regulated environment.
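A classification framework of this kind can be expressed as a simple scoring function. The criteria, weights, and gate descriptions below are assumptions standing in for the enterprise's actual framework; they show the shape of the mechanism, not its exact rules.

```python
# Illustrative risk-gating sketch. Criteria and weights are hypothetical.

def gate_change(change):
    """Score a proposed change against risk criteria and return the gate
    it must pass. Missing keys default to the lower-risk interpretation,
    except rollback_plan, which is penalized when absent."""
    score = 0
    if change.get("touches_persistence"):    score += 3  # queue / data-store layer
    if change.get("during_peak_hours"):      score += 2
    if not change.get("rollback_plan"):      score += 3  # no rollback = high risk
    if change.get("prod"):                   score += 1
    if score >= 5:
        return "high: extra validation + stakeholder sign-off + rehearsed rollback"
    if score >= 2:
        return "medium: standard validation + rollback plan on record"
    return "low: standard pipeline gates"

# A production change to the persistence layer during peak hours gates high,
# even with a rollback plan in place.
print(gate_change({"touches_persistence": True, "during_peak_hours": True,
                   "prod": True, "rollback_plan": True}))
```

The value of the framework is less in the exact weights than in forcing every change record to carry the evidence the gate demands.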
Problem Management Discipline
A structured problem management practice was established, moving beyond incident workarounds to permanent root-cause fixes. Each recurring incident pattern was tracked to a problem record with defined ownership, timeline, and resolution criteria. This discipline is what ultimately drove the outage reduction from 36 to 3.
The shift from incident management to problem management, from fighting fires to eliminating the conditions that caused them, is what drove the 92% reduction in critical outages.
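The mechanism for promoting incidents to problem records can be sketched simply: any incident signature seen repeatedly is a candidate for a problem record with an owner and a root-cause fix. The signature field and threshold are illustrative assumptions.

```python
from collections import Counter

def recurring_patterns(incidents, threshold=3):
    """Surface incident signatures seen `threshold` or more times:
    candidates for a problem record with defined ownership and a
    permanent root-cause fix, not another workaround."""
    counts = Counter(i["signature"] for i in incidents)
    return [sig for sig, n in counts.items() if n >= threshold]

# Hypothetical month of incidents: queue corruption recurs, disk latency does not (yet)
incidents = ([{"signature": "queue-corruption"}] * 4
             + [{"signature": "disk-latency"}] * 2)
print(recurring_patterns(incidents))  # ['queue-corruption']
```

The point is the forcing function: once a pattern crosses the threshold, it gets a problem record, an owner, a timeline, and resolution criteria, and it stops being firefought.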
Compliance Partnership
The team partnered with product security and compliance teams to align stabilization and migration activities with SOX validation requirements. Every change carried auditable evidence and traceability.
3.3 Stakeholder Trust Rebuilding
The organizational challenge was as significant as the technical one. Overcoming three years of institutional skepticism required a deliberate trust-building campaign.
Baseline-First Methodology
Before proposing any migration, the team completed a proof-of-concept that captured comprehensive baseline performance results on the existing platform. Every metric (queue throughput, message latency, error rates, resource utilization) was documented. Post-upgrade data was gathered and compared against these baselines. This gave stakeholders empirical evidence, not promises.
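A baseline comparison of this kind reduces to a regression check per metric against an agreed tolerance. The metric names, values, and 10% tolerance below are hypothetical; the mechanism (direction-aware comparison, explicit pass/fail per metric) is the point.

```python
def compare(baseline, post, tolerance=0.10):
    """Flag any metric that regressed more than `tolerance` vs. baseline.
    Latency and error rate are lower-is-better; throughput is higher-is-better."""
    lower_is_better = {"latency_ms", "error_rate"}
    report = {}
    for name, base in baseline.items():
        new = post[name]
        if name in lower_is_better:
            regressed = new > base * (1 + tolerance)
        else:
            regressed = new < base * (1 - tolerance)
        report[name] = "REGRESSED" if regressed else "ok"
    return report

# Hypothetical pre-migration baseline vs. post-upgrade results
baseline = {"throughput_msg_s": 1200, "latency_ms": 45, "error_rate": 0.002}
post     = {"throughput_msg_s": 1350, "latency_ms": 41, "error_rate": 0.001}
print(compare(baseline, post))  # every metric "ok": post-upgrade met or beat baseline
```

A report like this, generated per App Dev team from their own pre-deployed integrations, is what turned sign-off from a leap of faith into a data review.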
App Dev Partnership
The team partnered directly with application service owners, application architects, and IBM contractors across all 18+ App Dev teams. Baseline tests were run using pre-deployed integrations on the new platform, and each team was given the data to compare results against their existing environment. Sign-off was earned team by team, not dictated from above.
IBM Professional Services Validation
The stabilization and migration approach was planned and validated with IBM Professional Services, providing an independent third-party endorsement of the methodology. This was critical for stakeholders who had lost trust in both the platform and the internal team's ability to manage it.
Trust was not granted. It was earned, one baseline test, one stakeholder conversation, one data point at a time.
3.4 Migration Execution
With stability proven and trust earned, the migration that had been deferred for three years was finally executed.
Zero-Downtime HA Design
A high-availability migration architecture was designed and executed that maintained continuous service during the OS, storage, and MQ/Integration version upgrades. On a platform supporting a $6B business with 24x7x365 operations, any downtime was unacceptable.
Automated Pipelines & Rehearsed Runbooks
The migration leveraged automated CI/CD pipelines, standardized containers, rehearsed runbooks with defined rollback plans, and audit-ready change evidence at every stage. The same rigor applied to migration preparation was applied to execution.
370+ Integrations Modernized
The broader modernization program that followed migrated 370+ IBM IIB integrations to ACE on EKS, compressing what had been a three-year modernization plan into nine months through standardized containers and automated CI/CD. The MQ-ACE stabilization made this acceleration possible. Without the trust earned during stabilization, the modernization would never have been approved.
A working prototype is worth more than a hundred slides. The stabilization proved the team could manage change safely, and that proof unlocked everything that followed.
4. Result: From 36 Outages to 3. From Paralysis to Progress.
The stabilization delivered measurable, sustained results across every dimension that had been failing.
Core Stabilization Outcomes
| Metric | Before | After | Impact |
|---|---|---|---|
| Critical outages/year | 36 | 3 | 92% reduction |
| Stability vs. baseline | Baselined | +60% | Achieved in 9 months |
| Queue corruption | Recurring | Zero | Eliminated |
| Processing backlog | Present | None | Eliminated |
| Platform uptime | Fragile | 99.9%+ | Stabilized |
| Migration status | Deferred 3 yrs | Completed | Trust restored |
Extended Business Impact
The stabilization created a foundation of trust that enabled the broader modernization program:
- 370+ integrations modernized (IIB to ACE on EKS) - three-year plan compressed to nine months
- 40% MTTR reduction - correlation rules, structured incident response, observability
- Zero downtime during migration - HA architecture and rehearsed runbooks validated
- 99.9% SLA sustained post-migration - stability proved durable, not temporary
- $2M+ annual cost savings - platform optimization and reduced incident overhead
The moment that defined success was not a dashboard metric. It was the first time an App Dev team leader approved a migration window without hesitation, because the data had already made the case.
5. Takeaway: What This Story Teaches
This was not a technology story. It was a trust story that happened to involve technology.
The MQ-ACE platform was not inherently broken beyond repair. It was under-governed, under-monitored, and under-invested. The failures were systemic: inadequate capacity planning, absent problem management, reactive operations, and a vendor support model that was slow to resolve deep infrastructure issues. The organizational response, deferring change to avoid risk, made the situation worse with every passing year.
The stabilization succeeded because it addressed all three dimensions simultaneously:
Principle 1: Technical Precision
Every action was targeted at a documented root cause, not a symptom. Capacity tuning, backend data store migration, and MQ lock resolution each addressed specific failure modes. The discipline of tracing symptoms to causes, and fixing causes rather than working around them, is what separated this effort from the years of reactive firefighting that preceded it.
Fix the cause, not the symptom. If you find yourself building the same workaround twice, you are managing incidents, not solving problems.
Principle 2: Process Discipline
Risk-based change gating and structured problem management created a framework where improvements could be made safely and at pace. This was the key to sustaining results, not just achieving them. Without process discipline, technical fixes become isolated victories that erode over time.
Stability without governance is luck. Governance without evidence is bureaucracy. The combination of both is engineering discipline.
Principle 3: Evidence Over Promises
Baseline-first methodology, POC validation, and team-by-team sign-off replaced organizational skepticism with data-backed confidence. Trust was earned, not declared. In environments where trust has been broken, the only path forward is to let the evidence speak, and to make that evidence available to everyone who needs to see it.
A working prototype is worth more than a hundred slides. When trust is broken, the only currency that works is evidence, shared openly, measured objectively, and validated independently.
Diagnostic Checklist: Red Flags of Platform Instability
- Recurring incidents are managed through workarounds rather than root-cause fixes
- Migration or modernization has been deferred more than once due to stability concerns
- No formal performance baselines exist to measure improvement
- Vendor escalations consume more engineering time than internal solution development
- The prevailing culture is "don't rock the boat," where change is seen as risk, not opportunity
- Application teams have lost confidence in the platform team's ability to deliver
- Compliance and security reviews are treated as obstacles rather than partners
Greenfield programs excite me because you design not just the platform; you design the culture and habits that make it thrive.