Fantom network incident analysis

On Thursday, February 25, 2021, at 3:04 PM UTC, the Fantom Opera mainnet halted new block confirmations, resulting in a temporary outage.

The core developer team and Fantom validator community immediately responded and successfully resumed the network within 7 hours.

This feat required consensus across 39 validators in many different time zones; while the community quickly diagnosed the issue, most of the outage time was spent waiting for sufficient stake weight to come back online.

To be clear, no funds were at risk as a result of the network halt.

Cause

One of the largest validators slowed its block emission, which caused a second large validator to slow down as well. The other validators kept producing blocks, but the two lagging nodes were unable to catch up. Together, these two validators represent more than 1/3 of the total stake, so the network could no longer reach the quorum needed to confirm new blocks, and confirmations halted.
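The arithmetic behind the halt can be sketched with a standard BFT quorum rule, where blocks finalize only when validators holding more than 2/3 of the total stake participate. This is a simplified illustration with made-up stake numbers, not Fantom's exact Lachesis logic:

```python
# Sketch of a standard BFT quorum check (illustrative, not Fantom's
# actual implementation): finality requires strictly more than 2/3
# of total stake to be online and in sync.

def can_finalize(online_stake: int, total_stake: int) -> bool:
    """True if online stake strictly exceeds 2/3 of total stake."""
    return 3 * online_stake > 2 * total_stake

total = 3_000    # hypothetical total stake
lagging = 1_001  # two large validators holding just over 1/3 together

print(can_finalize(total - lagging, total))  # False: quorum unreachable, halt
print(can_finalize(total - 999, total))      # True: under 1/3 lagging is tolerated
```

This is why a single pair of oversized validators falling behind was enough to stop confirmations, even though every other node was healthy.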

Recovery

Our core developer team coordinated with the validators to apply a temporary patch that slowed the event emission rate, allowing the lagging nodes to catch up. Once all the nodes were synced again, they resumed confirming blocks and finalizing epochs.

Validator node size

The incident has highlighted that the network needs to make some changes regarding the validator nodes' requirements and their staking power distribution.

We identified two issues:

  1. The recent increase in the value of FTM made creating new nodes prohibitive. While this is not directly related to the incident, it accentuated point 2.
  2. Most of the staking power is concentrated among a small number of large validators, which have far more delegations than other nodes.

We'll make a proposal for stakers and delegators to vote on via on-chain governance. We're still defining the parameters, but the proposal will likely involve lowering the minimum amount of FTM required to run a node and putting a cap on node size.

Resolution

  1. Submit a governance proposal to change the requirements for running a validator node and to incentivize the creation of new high-quality nodes. We'll make the proposal in the short term.
  2. Deploy the go-opera upgrade, in which we completely rewrote the p2p code to optimize CPU time and memory usage. More changes to the emitter and to general network performance are coming. In early tests, the new code achieved 10k TPS on medium/low-spec nodes without transaction execution. The new code still needs to go through extensive compatibility and stress tests and will be available later on.

Conclusion

We want to publicly thank all Fantom validators who coordinated and worked together to resolve the issue and bring the network back online as soon as possible. A big thank you to the community as well for being incredibly supportive and understanding during those hours. While we're not happy about the incident, the silver lining is that this growing pain happened sooner rather than later, making us aware of these issues and giving us time to prepare the proposed fixes.