CC-1059 Changed ES connector to use exponential backoff with jitter
With many ES connector tasks all hitting the ES backend, an overloaded backend causes all of the tasks to experience timeouts (possibly at nearly the same time) and thus retry. Prior to this change, all tasks used the same constant backoff time, so they all retried at about the same point in time, possibly overwhelming the ES backend again. This is known as a thundering herd, and when many attempts fail it takes a long time and many attempts to recover.

Exponential backoff gives the ES backend time to recover, but by itself it doesn't really reduce the thundering herd problem, since the tasks still retry in lockstep. To solve both problems, this PR adds exponential backoff with jitter, which randomizes the sleep time for each attempt. The new algorithm computes the normal maximum wait time for a particular retry attempt using exponential backoff, then chooses a random value between the `retry.backoff.ms` initial backoff value and that maximum. Since the exponential calculation breaks down after a large number of retry attempts, rather than adding a constraint on `max.retries` this change simply uses a practical (and arbitrary) absolute upper limit of 24 hours on the backoff time, and logs a warning if this upper limit would be exceeded.
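A minimal sketch of the approach described above (not the connector's actual code; the class and method names here are hypothetical, and `System.err` stands in for the connector's logger):

```java
import java.util.concurrent.ThreadLocalRandom;

public class RetryBackoff {
    // Practical (and arbitrary) absolute upper bound on the backoff: 24 hours.
    private static final long MAX_RETRY_TIME_MS = 24L * 60 * 60 * 1000;

    /**
     * Exponential backoff with jitter: a random wait between the initial
     * backoff (retry.backoff.ms) and the exponentially growing maximum
     * for this retry attempt.
     */
    public static long computeRandomRetryWaitTimeMs(int retryAttempt, long initialBackoffMs) {
        if (initialBackoffMs < 0) {
            return 0;
        }
        if (retryAttempt <= 0) {
            return initialBackoffMs;
        }
        long maxWaitMs = computeRetryWaitTimeMs(retryAttempt, initialBackoffMs);
        if (maxWaitMs <= initialBackoffMs) {
            return maxWaitMs;
        }
        // Jitter: pick uniformly between the initial backoff and the maximum
        // (nextLong's bound is exclusive, so add 1 to include the maximum).
        return ThreadLocalRandom.current().nextLong(initialBackoffMs, maxWaitMs + 1);
    }

    /**
     * Plain exponential backoff: initialBackoffMs * 2^retryAttempt, capped
     * at 24 hours because the doubling overflows after many attempts.
     */
    public static long computeRetryWaitTimeMs(int retryAttempt, long initialBackoffMs) {
        if (initialBackoffMs < 0) {
            return 0;
        }
        if (retryAttempt <= 0) {
            return initialBackoffMs;
        }
        if (retryAttempt > 32 || initialBackoffMs > (MAX_RETRY_TIME_MS >> retryAttempt)) {
            // The doubled value would overflow or exceed the 24h cap:
            // warn and fall back to the cap.
            System.err.printf("Backoff for retry attempt %d exceeds the 24h limit; using the limit%n",
                              retryAttempt);
            return MAX_RETRY_TIME_MS;
        }
        return initialBackoffMs << retryAttempt;
    }
}
```

For example, with `retry.backoff.ms=100`, the fifth retry would sleep a random duration between 100 ms and 3200 ms (100 * 2^5), so concurrent tasks spread their retries out instead of hammering the backend at the same instant.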