Forum Discussion

SeanC
3 years ago

Auto Balanced Collector Groups need a rework

Currently, if a collector goes down, the devices it was monitoring will only fail over to other collectors in the group if doing so does not put those collectors over the Rebalance Threshold defined on the Auto Balanced Collector Group (ABCG).

This means you can't specify a Rebalance Threshold that lets instances rebalance evenly across your collectors AND lets instances fail over. It's one or the other, which defeats the whole purpose and takes the Auto out of Auto Balance. To run collectors in an n-1 highly available configuration, you have to set the Rebalance Threshold higher than the total instance count divided by (the number of collectors minus one). Then, after a failover, you have to:

1. Lower the Rebalance Threshold to a number close to the total instance count divided by the number of collectors.
2. Trigger a rebalance and wait.
3. Set the Rebalance Threshold back to the total instance count divided by (the number of collectors minus one).
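To put numbers on the two threshold values described above, here is a rough sketch. The function names and the figures (9000 instances, 4 collectors) are made up for illustration and are not LogicMonitor settings or API calls; it only shows the arithmetic under the behavior as I understand it.

```python
import math


def even_share(total_instances: int, collectors: int) -> int:
    """Per-collector load when instances are spread across all collectors."""
    return math.ceil(total_instances / collectors)


def n_minus_one_share(total_instances: int, collectors: int) -> int:
    """Per-collector load when one collector is down (n-1)."""
    if collectors < 2:
        raise ValueError("an n-1 configuration needs at least two collectors")
    return math.ceil(total_instances / (collectors - 1))


# Hypothetical group: 9000 instances on 4 collectors.
total, n = 9000, 4
print(even_share(total, n))         # 2250: value needed for an even rebalance
print(n_minus_one_share(total, n))  # 3000: value needed so failover fits
```

The gap between the two numbers (2250 vs. 3000 here) is why a single threshold can't serve both purposes if failover is capped by it.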

Madness!

I propose keeping the Rebalance Threshold field but making it actually behave like its name: when a collector's instance count is over the value, the collector tries to offload instances to other collectors in the group.

BUT, allow instances from a failed collector to offload to collectors past that threshold value so that failover continues to work.

To address the concern of collectors being pushed past their capacity, add an additional field called Max Allowed Instances. A collector would only refuse offloaded instances if accepting them would push its instance count past that value, and an alert would be triggered whenever that happens.

This will allow us to have HA configurations AND auto balancing of instances work at the same time, as well as alerting us to the fact that instances have not failed over when it happens so that steps can be taken to increase the sizing/number of collectors.
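As a rough sketch of the behavior I'm proposing (not how ABCG works today, and not any real LogicMonitor API), the two fields could interact like this. `Collector`, `should_shed`, `can_accept`, and `fail_over` are hypothetical names, and the greedy fill of the least-loaded collector first is just one simple way to place instances.

```python
from dataclasses import dataclass


@dataclass
class Collector:
    name: str
    instances: int
    up: bool = True


def should_shed(c: Collector, rebalance_threshold: int) -> bool:
    """Rebalance Threshold: a healthy collector over it tries to offload."""
    return c.up and c.instances > rebalance_threshold


def can_accept(c: Collector, incoming: int, max_allowed: int) -> bool:
    """Max Allowed Instances: the only limit on accepting failed-over load."""
    return c.up and c.instances + incoming <= max_allowed


def fail_over(failed: Collector, group: list, max_allowed: int) -> int:
    """Move instances off a failed collector, ignoring the Rebalance Threshold.

    Returns the number of instances that could not be placed; a non-zero
    result is the case that should raise an alert.
    """
    remaining = failed.instances
    targets = sorted(
        (c for c in group if c.up and c is not failed),
        key=lambda c: c.instances,
    )
    for target in targets:
        room = max(0, max_allowed - target.instances)
        moved = min(room, remaining)
        target.instances += moved
        remaining -= moved
    failed.instances = remaining
    return remaining


# Hypothetical group: Rebalance Threshold 3000, Max Allowed Instances 5000.
a = Collector("a", 3000, up=False)
b = Collector("b", 3000)
c = Collector("c", 3000)
stranded = fail_over(a, [a, b, c], max_allowed=5000)
print(stranded, b.instances, c.instances)  # 0 5000 4000
print(should_shed(b, 3000))                # True: b sheds once a is back up
```

With Max Allowed Instances at 4000 instead, the same failover would leave 1000 instances stranded, which is exactly the situation the alert is meant to surface.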

 

4 Replies

  • Anonymous

    I like this. Sort of a "rebalance if over X" and "don't fail over to me if it'll put me over Y". X defines when to shed load and Y determines when to refuse shed load.

  • @SeanC ABCG should not prevent failover even when the target collector is/would be above the rebalance threshold. If you're seeing this behavior I suggest reaching out to support to see if they can help you out.

  • On 9/23/2021 at 2:04 AM, Michael Rodrigues said:

    @SeanC ABCG should not prevent failover even when the target collector is/would be above the rebalance threshold. If you're seeing this behavior I suggest reaching out to support to see if they can help you out.

     

    We had an incident where instances didn't fail over as expected, so I raised a ticket.

    Support staff were the ones who advised me that collectors won't take instances over the threshold during a failover.

    I questioned the advice at the time, as it was contrary to the way I interpreted the documentation, but apparently the advice was verified with another engineer and is accurate.

    I guess I'll have to set up some tests and verify.

  • @SeanC I've spoken with the Collector development team about this. They've confirmed that my answer above is accurate.

    I'll speak to support about the confusion. Collector dev is also going to take a look at the ticket to see what went wrong in your portal with failover.

    I hope you didn't have to spend too much time doing your own verification on this. Sorry for any hassle.