Software update crippled Australia’s fast-payments network: RBA post-mortem report


Australia’s fast payments network, the New Payments Platform (NPP), lost a significant share of its transaction-handling capacity for several hours last month because of an incorrect software setting, delaying some payments by up to five days, a new report by the Reserve Bank of Australia (RBA) has revealed.

As a result of the error, around 500,000 unique NPP payments sent by the public (17 per cent of the daily average volume for a typical Wednesday, the day of the outage) were delayed by at least four hours, with some reaching their intended recipients only five days after the outage, according to the post-mortem ‘Final Incident Report’.
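As a rough check on scale, the two figures quoted from the report imply a typical Wednesday volume of just under three million payments. The sketch below simply divides one by the other; both inputs are rounded in the report, so the result is approximate, and the variable names are illustrative only.

```python
# Figures quoted from the RBA report; both are rounded, so the result is approximate.
delayed_payments = 500_000        # payments delayed by at least four hours
share_of_daily_average = 0.17     # 17 per cent of a typical Wednesday's volume

# Implied daily average = delayed payments / share they represent
implied_daily_average = delayed_payments / share_of_daily_average

print(f"Implied typical Wednesday volume: about "
      f"{implied_daily_average / 1e6:.1f} million payments")
```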

While nearly all relevant payments systems were back online within five hours, the bulk of the recovery work, which required sending institutions to reconcile and reprocess payments, was only completed by Monday 17 October, five days after the initial outage, according to the RBA.

The outage was found to be the result of an “incorrect setting” applied during a planned update to the software that manages the RBA’s virtual servers.

These servers support two critical payments settlement systems: the Reserve Bank Information Transfer System (RITS), the central bank’s interbank settlement system; and the Fast Settlement Service (FSS), a component of RITS that serves the NPP and supports round-the-clock, real-time payment settlement.

“On 12 October at around 19:00, an operational error occurred during a planned [Reserve] Bank-wide change using the software that provisions the RBA’s virtual servers,” the RBA wrote in its report.

“This error triggered a process that disrupted a significant number of servers in a random pattern over a period of approximately 25 minutes.”

The RBA added: “The scale of servers affected was caused by a failure to comply with the RBA’s Technology Change Management policy and control gaps associated with the virtual server solution design contributed to the rapid propagation of the error.”

“While the strong redundancy features of RITS and FSS enabled parts of the system to continue operating normally, some services became unavailable and the resilience of the system was severely degraded. The scale and haphazard pattern of disruption significantly complicated the incident response.”

Indeed, the RBA initially appeared to downplay the extent of the impact on its services: at 20:33, internal monitoring suggested “that the number and extent of [transaction] aborts was reasonably low and likely due to decreased processing capacity of the FSS”.

Only after an NPP Incident Response Group (NPP IRG) meeting was convened at 23:15 was the RBA made aware of the full extent of the server outage’s impact, and that a “far greater percentage of transactions” were aborting.

Payments messaging disrupted

In the case of the FSS, the software configuration error “prevented FSS from sending out notifications of successful settlement, which caused widespread disruption to NPP payment processing and had severe repercussions for members, including significant inconvenience and delays to customers,” the report noted.

The wider RITS network was prevented from sending and receiving Low Value Clearing and Low Value Settlement Services (LVCS and LVSS) files – these provide instructions from members to the RITS advising of their settlement obligations from low-value clearing exchanges.

By around 23:45 – just over four and a half hours after the servers were inadvertently taken offline – affected servers were restored and the FSS immediately began successfully processing incoming and outgoing messages.

“From this point, all settlement notifications were successfully sent out within the allowable response times.”

However, NPP participants were required to work through a backlog of around 500,000 aborted transactions, both inbound and outbound, whose settlement state was unknown, the report said.

“LVSS inbound file transfers were restored at around 00:14 and outbound file transfers at around 00:54 on 13 October.

“Settlement response files for the manual FSIs and the earlier 19:15 multilateral run were sent to members from 00:54.”
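The timestamps quoted from the report can be lined up to show how long each stage took relative to the initial error. The following is a minimal sketch using only the times given above, all of which the report describes as approximate; the event labels and the helper `at` are illustrative, not taken from the report.

```python
from datetime import timedelta

def at(day_offset, hour, minute):
    """Time elapsed since midnight at the start of 12 October."""
    return timedelta(days=day_offset, hours=hour, minutes=minute)

# Operational error during the planned change (around 19:00 on 12 October)
error = at(0, 19, 0)

# Later events; a day_offset of 1 means 13 October
events = {
    "internal monitoring assessment": at(0, 20, 33),
    "NPP IRG meeting convened": at(0, 23, 15),
    "FSS servers restored": at(0, 23, 45),
    "LVSS inbound transfers restored": at(1, 0, 14),
    "LVSS outbound transfers restored": at(1, 0, 54),
}

# Elapsed time from the error to each event
elapsed = {name: t - error for name, t in events.items()}

for name, delta in elapsed.items():
    hours, rem = divmod(int(delta.total_seconds()), 3600)
    print(f"{name}: {hours}h {rem // 60:02d}m after the error")
```

On these figures, FSS message processing resumed roughly 4 hours 45 minutes after the error, and the last LVSS file transfers resumed a little under six hours after it.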

Post-mortem actions

Following its review of the incident, the RBA said it will take a comprehensive set of actions. The central bank will:

  1. Undertake a further review of the governance and control arrangements relevant to the incident, including other contributing factors
  2. Investigate improvements to broaden monitoring of end-to-end message flow for FSS
  3. Review RBA system recovery procedures for FSS
  4. Review the clarity of RBA notifications and updates to Members about RBA system issues
  5. Discuss procedures on the timing of industry incident meetings with NPPA and AusPayNet
  6. Review and clarify communication roles between RBA and NPPA
  7. Seek guidance from NPPA about industry capability to replay NPP payments at scale