Review Downtime Jailing Parameters

This proposal would modify the jailing parameters in the slashing module to make consensus more responsive to validator downtime and degraded performance.

Background

Validators are penalized for failing to participate in block validation: if they consistently miss blocks, they are automatically jailed. When a validator misses a certain number of blocks within a defined window (determined by SignedBlocksWindow), they are considered to have crossed below the liveness threshold and are jailed. A jailed validator is temporarily removed from the active validator set and cannot participate in block validation.

Validators who have been automatically jailed due to downtime can rectify the situation and rejoin the active validator set by sending a MsgUnjail message after the jail duration, indicating their readiness to address the issue that led to the missed blocks. They can then resume their validator duties and receive block and epoch rewards.

Jailing is essential because it keeps block production consistently fast rather than forcing consensus to wait for a validator that is repeatedly too late to sign blocks due to technical issues.

For example, after the recent v24 software upgrade, some validators only came online up to a day after the upgrade. Block production during this period was slower than anticipated. After all validators began to sign blocks or were jailed, block times quickly returned to 3.2 seconds.

At the current Signed Blocks Window of 30,000, a validator is jailed after:

SignedBlocksWindow * (1-MinSignedPerWindow) * Block Time

Which is currently ~22 hours.
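
A minimal worked example of that formula (the ~2.7-second block time is an assumption taken from the post-v25 figure cited below; the exact result shifts with block time):

    # Time before a validator is jailed, using the formula above.
    # Assumes ~2.7 s blocks (post-v25 figure cited below); adjust as block times change.
    signed_blocks_window = 30_000
    min_signed_per_window = 0.05
    block_time_s = 2.7

    missed_blocks_allowed = signed_blocks_window * (1 - min_signed_per_window)  # 28,500 blocks
    hours_to_jail = missed_blocks_allowed * block_time_s / 3600
    print(f"{hours_to_jail:.1f} hours")  # ~21.4 hours, i.e. the ~22 hours quoted above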

Proposal

This proposal asks that MinSignedPerWindow be increased to 80%. A validator would then be jailed after missing 20% of the blocks in the SignedBlocksWindow rather than 95%, making the jailing mechanism far more responsive.

  • Validators that have gone offline will remain in the consensus round for a much shorter period before being jailed.
  • Validators that have degraded performance, causing consensus to be regularly delayed, will also be jailed for downtime.

To compensate for this shorter time before jailing, the SignedBlocksWindow should also be lengthened to allow for a reasonable response time; this proposal would increase it to 80,000.

This combination would lead to jailing after ~12 hours of downtime at the 2.7-second block times seen after v25, and after ~6 hours 40 minutes at the eventual target of 1.5-second block times if the parameters remain unchanged when that target is reached. These parameters should therefore be reviewed as block times continue to fall.
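
Using the same formula, a quick sketch of the proposed parameters; the 2.7 s and 1.5 s block times are the figures quoted above, not guarantees:

    # Proposed parameters: SignedBlocksWindow = 80,000, MinSignedPerWindow = 80%.
    missed_blocks_allowed = 80_000 * (1 - 0.80)          # 16,000 blocks may be missed

    print(round(missed_blocks_allowed * 2.7 / 3600, 1))  # 12.0 hours at ~2.7 s blocks (post-v25)
    print(round(missed_blocks_allowed * 1.5 / 3600, 1))  # ~6.7 hours (~6 h 40 m) at the 1.5 s target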

Current Slashing Module Parameters
Signed Blocks Window: 30,000
Min Signed Per Window: 5.00%
Downtime Jail Duration: 1 min
Slash Fraction Doublesign: 5.00%
Slash Fraction Downtime: 0.00%

Proposed Slashing Module Parameters
Signed Blocks Window: 80,000
Min Signed Per Window: 80.00%
Downtime Jail Duration: 1 min
Slash Fraction Doublesign: 5.00%
Slash Fraction Downtime: 0.00%

Target Onchain Date: 28th May 2024

Although I surely agree with these changes, going to 90% might be a bit harsh to start with :woozy_face:

It is surely unacceptable that validators are offline for nearly a day after an upgrade before they get their act together. It should indeed be possible to remove them from the active set much earlier.

Is the mentioned 4 hours a target? I can imagine that in some cases a validator outage is legitimate, and it would be a waste to hurt those operators.
On the other hand, you can do a lot in 4 hours (unless it is during night-time of course; even validators have to sleep sometimes).

So maybe we just have to roll with it and see how it goes. We can always adjust the parameters along the way if needed.

1 Like

We’re supportive of this change. Having validators down for 8+ hours is pretty bad, and likely indicates a complete lack of a monitoring system.

Given that downtime slashing is 0%, 4 hours doesn’t seem too harsh to us.

1 Like

I don’t think anything lower than 90% will have any impact on the degraded-performance side.

https://analytics.smartstake.io/osmosis/ is the best source I know of for missed blocks over time for all validators.
There look to be 13 validators who may have had an incident in the last 30 days (or are consistently missing some blocks).
Of those, only 6 had instances that would have triggered jailing under this new mechanism - all of which were prolonged downtime.

Can extend the block window further depending on what validators think is a reasonable response time.

1 Like

Hi,

Min | Komikuri uses an external signer (Horcrux), which can introduce some delay as block speed increases. The good news is that the Horcrux devs are updating their mechanism so it can keep up with faster block speeds.

However, there have been several incidents with the new IAVL system and the v25 upgrade that accidentally blocked our external signer when using the autoconfig.

Our apologies for slowing down block production because of this.

We do hope that if we want to tighten the jailing parameters, we can proceed once we have better and more stable timing.

And if the window is only 4 hours, I think we will all have serious problems if we hit a hardware failure. So far, the 30,000-block window has been very tolerant of situations such as hardware failures or database problems during an upgrade.

Thank You,
Min | Komikuri

Overall, I am in support of modifying the jailing parameters, but I feel these proposed ones might be a bit too strict. It is important to acknowledge the challenges this proposal may pose, especially for smaller validators. These operators, like me, often run near break-even points and may not have the resources to maintain these stricter requirements. For example, there is a CometBFT bug that impacts anyone using a remote signer. This frequently requires operator intervention. The proposed changes could inadvertently force dedicated validators to incur losses, simplify their operations by switching to software signers, or leave the network, which would be detrimental to the diversity and decentralization of the Osmosis community.

Recommendations:

  1. Flexible Downtime Allowance: Consider implementing a flexible downtime allowance that accounts for the reality of validators’ operational circumstances, such as sleep (8 hours please) and travel (maybe longer for international Cosmos events), without compromising the network’s security and performance.
  2. Gradual Implementation: Phase in the new parameters gradually, allowing validators to adapt their operations and avoid abrupt disruptions.

Phase 1

  • Signed Blocks Window: 55,000
  • Min Signed Per Window: 60.00%
  • Downtime Jail Duration: 8 hours
  • Slash Fraction Doublesign: 5.00%
  • Slash Fraction Downtime: 0.05%

Phase 2

  • Signed Blocks Window: 55,000
  • Min Signed Per Window: 70.00%
  • Downtime Jail Duration: 16 hours
  • Slash Fraction Doublesign: 5.00%
  • Slash Fraction Downtime: 0.05%

Phase 3

  • Signed Blocks Window: 55,000
  • Min Signed Per Window: 80.00%
  • Downtime Jail Duration: 24 hours
  • Slash Fraction Doublesign: 5.00%
  • Slash Fraction Downtime: 0.05%
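
As a rough comparison, the time-to-jail formula from earlier in the thread applied to each phase above; the ~2.7 s block time is an assumption from the post-v25 figure, and this sketch is illustrative rather than part of the recommendation:

    # Downtime before jailing for each proposed phase (55,000-block window in all three).
    # The ~2.7 s block time is an assumption carried over from the post-v25 figure above.
    block_time_s = 2.7
    window = 55_000

    for phase, min_signed in [("Phase 1", 0.60), ("Phase 2", 0.70), ("Phase 3", 0.80)]:
        hours = window * (1 - min_signed) * block_time_s / 3600
        print(f"{phase}: ~{hours:.1f} hours of downtime before jailing")
    # Phase 1: ~16.5 h, Phase 2: ~12.4 h, Phase 3: ~8.2 h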

Conclusion

While the proposal to modify the jailing parameters is well-intentioned and aims to enhance network performance, it is vital to balance these changes with the practical realities faced by validators. By considering the needs of all validators, especially the smaller ones who have historically contributed to the network’s decentralization, we can ensure a more inclusive and resilient network.

2 Likes

We are against downtime jailing under 10 hours. Teams traveling to conferences across the world cannot yet be expected to fix nodes while in the sky.

Reducing the downtime window to 4 hours or less does not allow for even casual air travel, considering security, air time, and delays.

With the Cosmos tendency to concentrate voting power and rewards in a few validators, this is asking a lot from teams that do not receive a considerable share of inflation commissions.

Referencing data above, most validators with significant vote power already have high uptime. This proposal is trying to solve a problem that Osmosis is barely affected by.

Implementing these rules will have almost no positive effect since those smaller validators with low uptime rarely propose blocks.

The downsides covered in the next message show how they far exceed the improvement to the network.

Upon deeper review, Chill Validation strongly disagrees with the mechanics proposed.

The proposed changes will form the basis for a full attack on all Osmosis assets.

The entire theory of having many validators is that enough will be honest to keep your funds safe. In Cosmos, as long as 34% of validators are honest, the blockchain will halt even if 66% of validators attack.

For security, many chains like Osmosis rely on a wide variety of validators to have the best chance of having 34% minimum honest validators.

An attack vector to completely steal all assets on chain is easier when it is possible to eliminate honest validators. This will make attacks impossible to stop.

Most chains allow for roughly 10 hours of validator downtime before a validator is jailed for a few minutes to an hour. This prevents validators from quickly rejoining the chain without proper repairs and serves as a degree of punishment.

What this proposal aims to establish is, eventually, a 2.33-hour downtime threshold leading to a 24-hour jail time.

The idea is to force a validator to miss epoch rewards, but let’s see how this puts the chain at high risk of attack.

In the past, CosmWasm smart contracts have been used to halt chains. Imagine a scenario where a smart contract can selectively take any 15% of validators offline. After 2.33 hours, 15% of honest validators can be eliminated and locked out for the next 24 hours.

Assuming the chain was already at risk with near 66% attacking validators, it would only take 2.33 hours to compromise all assets by decreasing the number of honest validators.

Where the number of attackers is low and vote power is equally distributed, this attack can be repeated up to 10 times in the same day to conclude an attack in 23.3 hours.

By eliminating 20% of validators every 2.3 hours 10 times in the same day, you can eliminate 135/150 validators.

By eliminating 30% of validators every 2.3 hours 10 times in the same day, you can eliminate 147/150 validators.
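
To make the compounding arithmetic explicit, a small sketch of the repeated-elimination scenario described above; per-round rounding of fractional validators is ignored here, so the counts only approximately match the figures cited:

    # Compounding effect of knocking a fraction of the remaining validators offline
    # each round. Rounding only at the end is an assumption, so the results land
    # close to (not identical to) the 135/150 and 147/150 figures above.
    def eliminated_after(total, fraction_per_round, rounds):
        remaining = total * (1 - fraction_per_round) ** rounds
        return round(total - remaining)

    print(eliminated_after(150, 0.20, 10))  # ~134 of 150 validators eliminated
    print(eliminated_after(150, 0.30, 10))  # ~146 of 150 validators eliminated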

Since vote power is concentrated near the top, it only requires attacking 43/150 Osmosis nodes to eliminate 67% of vote power. In just 4.6 hours, eliminating 20% of validators twice removes 60 validators. More than enough to cut the cost of attack by 67%.

The proposal to jail validators for a 24 hour period makes it impossible to defend against such an attack.

The tendency to whitelist “trusted teams” to upload new WASM contracts without review only makes this risk even greater. In practice, the halts on Juno Network were just a warning for what is to come.

We can understand the desire to punish a validator for offline time by removing a day of epoch rewards. However the extended jail period puts one of the highest value chains in the Cosmos at increased risk of low cost attacks.

Add a flag to remove a day of rewards, but do not jail validators for extended periods of time.

A jail time closer to the downtime period allows honest validators to rejoin and protect the network.

4 Likes

This is a very, very extensive post. Thanks for this!

I’ve also posted this idea on Discord:
"Maybe a better route would be even better if you would like to stimulate online time.

What if the rewards in the epoch are distributed based on signed percentage of blocks from epoch to epoch? That way you directly stimulate being online as much as possible, because otherwise it directly affects the income of the validator."

That way, missing blocks would hurt the validator 1:1.
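
For illustration only, a minimal sketch of what such uptime-weighted epoch payouts could look like; the function and field names here are hypothetical and not part of any existing Osmosis module:

    # Hypothetical sketch of the idea quoted above: scale each validator's epoch
    # reward by the share of blocks it signed. Names and values are illustrative only.
    def uptime_weighted_rewards(base_rewards, signed_fraction):
        # base_rewards: validator -> reward for the epoch
        # signed_fraction: validator -> fraction of epoch blocks signed (0.0 - 1.0)
        return {val: reward * signed_fraction.get(val, 0.0)
                for val, reward in base_rewards.items()}

    print(uptime_weighted_rewards({"valA": 100.0, "valB": 100.0},
                                  {"valA": 1.00, "valB": 0.90}))
    # {'valA': 100.0, 'valB': 90.0} -- missing 10% of blocks costs 10% of the epoch reward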

2 Likes

As discussed on discord, I feel like this will cause more harm than good.

Going to a 90% requirement, with a target of 1.5s blocks and a ~4-hour window for recovery, is a recipe for disaster.

The people running a node for a few minutes on McDonald’s wifi aren’t helping and should be purged.

That said, I’d wager to say that most validators aren’t getting jailed unless it’s an emergency/catastrophe.

Shit happens.

I like the idea of performance based epoch payouts, although I understand that will require more work.

If we do go through with this, there needs to be at least a 12-24 hour window for recovery to resync from a snapshot and catch up with a fast chain.

Whether that is a 50% uptime requirement or an extended signing window is up for debate.

I do agree with the 24 hour jailing though.

1 Like

I’m not sure that selective jailing scenario is realistic. Surely, if this existed, then any jailing mechanic could be exploited to selectively jail all the honest validators over a few blocks?

The 4-hour response time was based on the typical time for validators to get back up and running after upgrades, but as there are 3 responses here requesting a longer time, I’m going to adjust the window to 8 hours. This is still substantially shorter than the current 22-hour period and should yield some improvement.
As block times decrease, we should review this to ensure it remains at 8 hours. This fits in with @Defiantlabs’ approach of a phased implementation.

The approach they proposed included a slashing component, whereas this proposal was not going to include one. However, it feels like the solution to the jail-time concerns raised by @Chill_Validation is to use a slashing mechanism rather than to extend the downtime period by so much.

I propose the following compromise based on all the above comments:

Current Slashing Module Parameters
Signed Blocks Window: 30,000
Min Signed Per Window: 5.00%
Downtime Jail Duration: 1 min
Slash Fraction Doublesign: 5.00%
Slash Fraction Downtime: 0.00%

Proposal 1 - This proposal
Signed Blocks Window: 55,000
Min Signed Per Window: 80.00% - Resulting in just over 8 hours response time at current block speed
Downtime Jail Duration: 1 min
Slash Fraction Doublesign: 5.00%
Slash Fraction Downtime: 0.00%

Proposal 2 - One week after this proposal once the impact is seen
Signed Blocks Window: 55,000
Min Signed Per Window: 80.00%
Downtime Jail Duration: 30 min - Slight increase to reduce the potential impact on stakers if a slashed validator unjails and repeatedly triggers the slash.
Slash Fraction Doublesign: 5.00%
Slash Fraction Downtime: 0.003% - Approx. 1 day of rewards

Proposal 3 - Post-v26/block time decrease
Signed Blocks Window: 190,000 - To account for decreasing block times and the increased minimum
Min Signed Per Window: 90.00%
Downtime Jail Duration: 30 min
Slash Fraction Doublesign: 5.00%
Slash Fraction Downtime: 0.003%
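
For comparison, the approximate downtime before jailing under each staged proposal above, using the same formula as before; the 2.7 s and 1.5 s block times are assumptions carried over from earlier in the thread:

    # Approximate downtime before jailing under each staged proposal above.
    # The 2.7 s and 1.5 s block times are assumptions, not guaranteed values.
    def hours_to_jail(window, min_signed, block_time_s):
        return window * (1 - min_signed) * block_time_s / 3600

    print(round(hours_to_jail(55_000, 0.80, 2.7), 2))   # Proposals 1 and 2: ~8.25 h at ~2.7 s blocks
    print(round(hours_to_jail(190_000, 0.90, 1.5), 2))  # Proposal 3: ~7.92 h if block times reach ~1.5 s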

any jailing mechanic could be exploited to selectively jail all the honest validators over a few blocks?

Once 34% go offline, the chain will halt. Fast jailing and long jail times are a great attack vector, since every 2.33 hours the chain is back to 100% of the remaining validators. Any lower validators that get promoted have trivial vote power. Attacking high-vote-power validators first is the best method of attack.

We don’t want to support a scenario that enables this attack.

We suggest keeping downtime slashing at 0%. Slashing does not hurt validators much since most of the stake comes from delegators. Punishing delegators for undesirable validators mostly punishes people who are not directly responsible for the problem.

Removing or reducing rewards for an epoch is better.

Reduced rewards may work out the same mathematically as slashing, but for delegators it is much better to have reduced rewards than to have tokens slashed away. The goal is validator improvement, not punishing supporters of the chain.

Validators spending time making refund scripts does not improve network performance. Better to encourage focusing time on actual system improvement.

2 Likes

Let me start by just saying that I do not validate for Osmosis, but I do for other chains.

What situation are you looking to resolve here? Are there performance issues due to block times? Once this is in place, do you have any idea how valuable it will have been to go through this trouble? Meaning, is the proposed implementation going to improve the Osmosis network by 1%, 5%, 0.0001%?
Do you have statistical data or graphs that show how block times have negatively affected apps/users? If so, what about estimates of improvement, where, and how?

Is this change good, bad, or neutral for the end user/trader/staker?

My personal take is this:
We run distributed networks, most chains with 100 validators or more. That is more than enough redundancy to ensure that the network will run mostly fine in some bad situations.
So far in my experience, when a network is running poorly for whatever reason, the resolution has never involved jailing validators. Usually these resolutions involve code fixes or configuration changes.

My advice would be to not fix anything that isn’t broke. :smiley:

2 Likes

Yeah, this is starting to look like a solution looking for a problem.

99.9999999% of stakers won’t notice a 0.003% slash.

This is just going to end up with good validators writing refund scripts after a bad day, while bad validators running suboptimal equipment give no refunds and keep running bad equipment.

Who are these validators we’re trying to weed out? Are they currently affecting chain health? Do they have any significant VP?

I’d focus on developing performance-based epochs, tbh.

3 Likes

It would be best if even the validator’s commission and self-stake were hurt by this. But that gets a lot more complicated as well :sweat_smile:

@Golden_Ratio_Staking @thesilverfox
It is mainly upgrade-related, if I read the facts correctly.
During regular operation there is not much of a problem; it is more that some validators are very, very slow at upgrading their nodes, which is a (potential) issue. But going as far as a solution that jails validators ONLY after upgrades, with a different set of parameters, sounds kinda complex.

This is the nuts and bolts of it: Osmosis doesn’t have an issue in this area. If we were suffering greatly, I agree we’d need to make drastic changes, but we have a pretty stellar validator set and uptime overall.

Agreed on the 24-hour jailing aspect, but you have to provide a good window. Agreed on the solution-looking-for-a-problem aspect of this - can’t see enough reasons or a compelling purpose to vote for a change at this point.

They don’t have significant voting power; beehive is the worst performer and is at the bottom of the set right now.

The point stands, though, that the slashing parameters are practically never triggered now, and a validator could join with high stake and poor performance.

I’m going to walk back the actual slash amount and simply propose tightening to a 12-hour window (which was the original prompt for this, since some validators were down for a full day before jailing after v24) with an 80% uptime requirement.
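
As a back-of-envelope check (assuming the ~2.7 s block times quoted earlier, which are not guaranteed), here is the window size that a 12-hour response at 80% uptime implies:

    # Implied SignedBlocksWindow for a ~12-hour response time with an 80% uptime
    # requirement, assuming the ~2.7 s block times quoted earlier in the thread.
    missed_blocks = 12 * 3600 / 2.7      # ~16,000 blocks may be missed in 12 hours
    window = missed_blocks / (1 - 0.80)  # implies an 80,000-block SignedBlocksWindow
    print(round(window))                 # 80000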

We can look into performance-based payouts instead of slashing, but these parameters should be what we set as the bare minimum level of participation and performance.

In this game of assumptions, I assume that any new validator that finds a way to get significant stake will likely be a fund with self stake.

Simply don’t see a world where a chronically underperforming validator gets a big enough community of stake to cause harm…and if they do form a cult that big, they probably won’t leave over a tiny slash or a missed day of rewards.

I’m just not seeing the urgency or vision of this one. I’m sorry.

1 Like

I get that view. But I am looking at this ahead of what the discussion would likely be in that same scenario, and I think it would include comments like Defiantlabs’, which would require a stepped resolution.

If we can agree that 12 hours at 80% uptime is a fundamental criterion, then we can already make this jump from the current 22 hours at 5% uptime.

Validators will likely see no impact from this parameter change, but it simplifies the discussions of any further tightening of the uptime parameter if we do get one of those bad actors you mentioned.

@czarcas7ic kindly let me share this chart of the average validator performance which shows that while this proposal itself may have no impact, there also isn’t any reason not to tighten up the very loose parameters on the slashing module to account for outlier events.

I agree that setting this up as a basis for further tweaking gives us at least a reference point to work from. We will be voting yes on this but agree that we need to keep the discussion going.