Resilience
As a Cloud Native traffic orchestrator, Easegress provides built-in resilience features. Resilience is the ability of your system to react to failure and still remain functional. It’s not about avoiding failure, but accepting failure and constructing your cloud-native services to respond to it. You want to return to a fully functioning state as quickly as possible.[1]
Basic: Load Balance
name: pipeline-reverse-proxy
kind: Pipeline
flow:
  - filter: proxy
filters:
  - name: proxy
    kind: Proxy
    pools:
    - servers:
      - url: http://127.0.0.1:9095
      - url: http://127.0.0.1:9096
      - url: http://127.0.0.1:9097
      loadBalance:
        policy: roundRobin
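To make the roundRobin policy concrete, the following Python sketch cycles through the three server URLs from the configuration above. It only illustrates the idea and is not Easegress's implementation; the class name RoundRobinBalancer and its method are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hypothetical helper: hands out servers in a fixed rotating order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = cycle(servers)

    def next_server(self):
        # Each call returns the next server, wrapping around at the end.
        return next(self._servers)


balancer = RoundRobinBalancer([
    "http://127.0.0.1:9095",
    "http://127.0.0.1:9096",
    "http://127.0.0.1:9097",
])
# The fourth pick wraps around to the first server again.
picks = [balancer.next_server() for _ in range(4)]
```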
More Livingness: Resilience of Service
CircuitBreaker
CircuitBreaker leverages a finite state machine to implement the processing logic. The state machine has three states: CLOSED, OPEN, and HALF_OPEN. When the state is CLOSED, requests pass through normally. The state transitions to OPEN if the request failure rate or slow request rate reaches a configured threshold, and requests are short-circuited in this state. After a configured duration, the state transitions from OPEN to HALF_OPEN, in which a limited number of requests are permitted to pass through while other requests are still short-circuited. The state then transitions to CLOSED or OPEN based on the results of the permitted requests.
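The transitions described above can be sketched as a small state machine. This is a simplified Python illustration under stated assumptions, not the Easegress implementation: the class CircuitBreakerSketch, its parameter names, and the injectable clock are all hypothetical, only failures (not slow calls) are tracked, and the breaker is assumed to open when the failure rate is greater than or equal to the threshold.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitBreakerSketch:
    def __init__(self, failure_rate_threshold=50, window_size=4,
                 wait_in_open=2.0, permitted_in_half_open=2,
                 clock=time.monotonic):
        self.failure_rate_threshold = failure_rate_threshold
        self.window_size = window_size
        self.wait_in_open = wait_in_open
        self.permitted_in_half_open = permitted_in_half_open
        self.clock = clock
        self._enter(State.CLOSED)

    def _enter(self, state):
        self.state = state
        self.results = []       # True marks a failed request
        self.admitted = 0       # requests let through while HALF_OPEN
        self.opened_at = self.clock() if state is State.OPEN else None

    def allow_request(self):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.wait_in_open:
                return False    # short-circuit
            self._enter(State.HALF_OPEN)
        if self.state is State.HALF_OPEN:
            if self.admitted >= self.permitted_in_half_open:
                return False    # only a limited number may pass
            self.admitted += 1
        return True

    def record(self, failed):
        if self.state is State.OPEN:
            return              # late result from before the breaker opened
        self.results.append(failed)
        closed = self.state is State.CLOSED
        needed = self.window_size if closed else self.permitted_in_half_open
        if len(self.results) < needed:
            return
        self.results = self.results[-needed:]
        rate = 100.0 * sum(self.results) / len(self.results)
        if rate >= self.failure_rate_threshold:
            self._enter(State.OPEN)
        elif not closed:
            self._enter(State.CLOSED)


# Drive the breaker with a fake clock so the example is deterministic.
now = [0.0]
breaker = CircuitBreakerSketch(clock=lambda: now[0])
for failed in [True, True, False, True]:
    breaker.allow_request()
    breaker.record(failed)
state_after_failures = breaker.state    # 3 of 4 failed, so the breaker opens
```

Injecting the clock keeps the sketch testable without sleeping; a real breaker would also need to be safe for concurrent use.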
When CLOSED, it uses a sliding window to store and aggregate the results of recent requests. The window can be either COUNT_BASED or TIME_BASED. The COUNT_BASED window aggregates the last N requests and the TIME_BASED window aggregates requests in the last N seconds, where N is the window size.
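The difference between the two window types can be sketched in a few lines of Python. The classes CountBasedWindow and TimeBasedWindow are hypothetical names chosen for illustration, timestamps are passed in explicitly to keep the example deterministic, and the actual Easegress data structures may differ.

```python
from collections import deque


class CountBasedWindow:
    """Aggregates the last `size` results."""

    def __init__(self, size):
        # A bounded deque drops the oldest result automatically.
        self._results = deque(maxlen=size)

    def record(self, failed):
        self._results.append(failed)

    def failure_rate(self):
        if not self._results:
            return 0.0
        return 100.0 * sum(self._results) / len(self._results)


class TimeBasedWindow:
    """Aggregates results recorded within the last `seconds` seconds."""

    def __init__(self, seconds):
        self._seconds = seconds
        self._results = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, failed, now):
        self._results.append((now, failed))

    def failure_rate(self, now):
        # Evict results that have fallen out of the window.
        while self._results and now - self._results[0][0] >= self._seconds:
            self._results.popleft()
        if not self._results:
            return 0.0
        failures = sum(failed for _, failed in self._results)
        return 100.0 * failures / len(self._results)


counted = CountBasedWindow(size=3)
for failed in [True, True, False, False]:
    counted.record(failed)       # only the last three results are kept

timed = TimeBasedWindow(seconds=10)
timed.record(True, now=0)
timed.record(False, now=8)
rate_at_9 = timed.failure_rate(now=9)    # both results still in the window
rate_at_12 = timed.failure_rate(now=12)  # the failure at t=0 has expired
```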
Below is an example configuration with a COUNT_BASED policy. GET requests to paths beginning with /books/ use this policy, which short-circuits requests if more than half of the last 100 requests failed with status code 500, 503, or 504.
name: pipeline-reverse-proxy
kind: Pipeline
flow:
  - filter: proxy
filters:
  - name: proxy
    kind: Proxy
    pools:
    - servers:
      - url: http://127.0.0.1:9095
      - url: http://127.0.0.1:9096
      - url: http://127.0.0.1:9097
      loadBalance:
        policy: roundRobin
      circuitBreakerPolicy: countBased
      failureCodes: [500, 503, 504]
resilience:
  - name: countBased
    kind: CircuitBreaker
    slidingWindowType: COUNT_BASED
    failureRateThreshold: 50
    slidingWindowSize: 100
We can also use a TIME_BASED policy, which short-circuits requests if more than 60% of the requests within the last 200 seconds failed.
resilience:
  - name: time-based-policy
    kind: CircuitBreaker
    slidingWindowType: TIME_BASED
    failureRateThreshold: 60
    slidingWindowSize: 200
In addition to failures, we can also short-circuit slow requests. The configuration below treats requests that take more than 30 seconds as slow requests and short-circuits requests if 60% of recent requests are slow.
resilience:
  - name: countBased
    kind: CircuitBreaker
    slowCallRateThreshold: 60
    slowCallDurationThreshold: 30s
For a policy, if the first request fails, the failure rate could be 100% because there’s only one request. This is not the desired behavior in most cases. We can avoid it by specifying minimumNumberOfCalls.
resilience:
  - name: countBased
    kind: CircuitBreaker
    minimumNumberOfCalls: 10
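The effect of minimumNumberOfCalls can be shown with a short Python sketch. The function should_open is hypothetical, and it assumes the breaker opens when the failure rate is greater than or equal to the threshold; the point is only that the rate is ignored until enough calls have been recorded.

```python
def should_open(failures, total, failure_rate_threshold=50,
                minimum_number_of_calls=10):
    """Decide whether the recorded results justify opening the breaker."""
    if total < minimum_number_of_calls:
        # Too few samples: a single failure must not trip the breaker.
        return False
    return 100.0 * failures / total >= failure_rate_threshold


first_request_failed = should_open(failures=1, total=1)
without_minimum = should_open(failures=1, total=1, minimum_number_of_calls=1)
enough_samples = should_open(failures=6, total=10)
```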
We can also configure the wait duration in the open state and the max wait duration in the half-open state:
resilience:
  - name: countBased
    kind: CircuitBreaker
    waitDurationInOpenState: 2m
    maxWaitDurationInHalfOpenState: 1m
In the half-open state, we can limit the number of permitted requests:
resilience:
  - name: countBased
    kind: CircuitBreaker
    permittedNumberOfCallsInHalfOpenState: 10
For the full YAML, see the CircuitBreaker example under References, and please refer to CircuitBreaker Policy for more information.
RateLimiter
NOTE: When there are multiple instances of Easegress, the configuration is applied to every instance equally. For example, if the TPS of RateLimiter is configured as 100 in a 3-instance cluster, the total TPS will be 300.
The configuration below limits the request rate for requests to /admin and requests that match the regular expression ^/pets/\d+$.
name: pipeline-reverse-proxy
kind: Pipeline
flow:
  - filter: rate-limiter
  - filter: proxy
filters:
  - name: rate-limiter
    kind: RateLimiter
    policies:
    - name: policy-example
      timeoutDuration: 100ms
      limitRefreshPeriod: 10ms
      limitForPeriod: 50
    defaultPolicyRef: policy-example
    urls:
    - methods: [GET, POST, PUT, DELETE]
      url:
        exact: /admin
        regex: ^/pets/\d+$
      policyRef: policy-example
  - name: proxy
    kind: Proxy
For the full YAML, see the RateLimiter example under References.
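To show how the three policy settings interact, here is a simplified Python model. It assumes that limitForPeriod permits become available in every limitRefreshPeriod and that a request may wait up to timeoutDuration for a permit before being rejected. The class PeriodRateLimiter is hypothetical and takes timestamps in milliseconds as arguments; it is a sketch of the idea, not the Easegress implementation.

```python
class PeriodRateLimiter:
    """Hypothetical model of limitForPeriod / limitRefreshPeriod / timeoutDuration."""

    def __init__(self, limit_for_period, limit_refresh_period_ms,
                 timeout_duration_ms):
        self.limit = limit_for_period
        self.period_ms = limit_refresh_period_ms
        self.timeout_ms = timeout_duration_ms
        self.cycle = 0                      # index of the current refresh period
        self.available = limit_for_period   # negative: permits reserved ahead

    def acquire(self, now_ms):
        """Return the wait in ms before the request may pass, or None if rejected."""
        cycle = now_ms // self.period_ms
        if cycle > self.cycle:
            # Each elapsed period hands out a fresh batch, capped at the limit.
            refill = (cycle - self.cycle) * self.limit
            self.available = min(self.limit, self.available + refill)
            self.cycle = cycle
        if self.available > 0:
            self.available -= 1
            return 0
        # No permit left: find the future period that still has a free permit.
        periods_ahead = (-self.available) // self.limit + 1
        wait_ms = (self.cycle + periods_ahead) * self.period_ms - now_ms
        if wait_ms > self.timeout_ms:
            return None                     # cannot get a permit in time
        self.available -= 1                 # reserve a permit in that period
        return wait_ms


# Small numbers keep the behavior easy to follow: 2 permits per 10 ms,
# and a request gives up if it would have to wait longer than 15 ms.
limiter = PeriodRateLimiter(limit_for_period=2, limit_refresh_period_ms=10,
                            timeout_duration_ms=15)
waits = [limiter.acquire(now_ms=0) for _ in range(5)]
```

In this run the first two requests pass immediately, the next two wait for the following period, and the fifth is rejected because its wait would exceed the timeout.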
Retry
If we want to retry a failed request, for example, on HTTP status codes 500, 503, and 504, we can create a Retry policy with the configuration below. It makes at most 3 attempts on failure.
name: pipeline-reverse-proxy
kind: Pipeline
flow:
  - filter: proxy
filters:
  - name: proxy
    kind: Proxy
    pools:
    - servers:
      - url: http://127.0.0.1:9095
      - url: http://127.0.0.1:9096
      - url: http://127.0.0.1:9097
      loadBalance:
        policy: roundRobin
      retryPolicy: retry3Times
      failureCodes: [500, 503, 504]
resilience:
  - name: retry3Times
    kind: Retry
    maxAttempts: 3
    waitDuration: 500ms
By default, the wait duration between two attempts is waitDuration, but this can be changed by specifying backOffPolicy and randomizationFactor.
resilience:
  - name: retry3Times
    kind: Retry
    backOffPolicy: Exponential
    randomizationFactor: 0.5
For the full YAML, see the Retry example under References, and please refer to Retry Policy for more information.
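The interaction of waitDuration, backOffPolicy, and randomizationFactor can be illustrated with a short Python sketch. Several details here are assumptions rather than documented behavior: that maxAttempts counts the initial attempt, that the exponential policy multiplies the base wait by a fixed factor after each failed attempt (the multiplier of 1.5 is a placeholder), and that the randomization factor spreads the actual wait uniformly around the base. The function wait_durations is hypothetical.

```python
import random


def wait_durations(base_ms, max_attempts, multiplier=1.5,
                   randomization_factor=0.0, rng=random):
    """Return the wait before each retry (max_attempts - 1 waits in total)."""
    waits = []
    current = float(base_ms)
    for _ in range(max_attempts - 1):
        # Spread the wait uniformly within +/- randomization_factor of the base.
        delta = current * randomization_factor
        waits.append(rng.uniform(current - delta, current + delta))
        # Assumed exponential growth of the base wait between attempts.
        current *= multiplier
    return waits


fixed = wait_durations(base_ms=500, max_attempts=3)
jittered = wait_durations(base_ms=500, max_attempts=3,
                          randomization_factor=0.5, rng=random.Random(42))
```

Randomizing the wait helps keep many clients from retrying at the same instant after a shared failure.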
TimeLimiter
TimeLimiter limits the time of requests: a request is canceled if it cannot get a response within the configured duration. As this resilience type only requires configuring a timeout duration, it is implemented directly on filters like Proxy.
name: pipeline-reverse-proxy
kind: Pipeline
flow:
  - filter: proxy
filters:
  - name: proxy
    kind: Proxy
    pools:
    - servers:
      - url: http://127.0.0.1:9095
      - url: http://127.0.0.1:9096
      - url: http://127.0.0.1:9097
      loadBalance:
        policy: roundRobin
      timeout: 500ms
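The time-limiting behavior can be sketched with Python's asyncio, where asyncio.wait_for cancels the awaited call once the timeout elapses. The function call_with_timeout and the two simulated backends are hypothetical stand-ins for a proxied request; this shows the general technique, not how the Proxy filter is implemented.

```python
import asyncio


async def call_with_timeout(make_call, timeout_seconds):
    """Await the call, cancelling it if no response arrives in time."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return "timeout"


async def slow_backend():
    await asyncio.sleep(0.2)    # simulated slow upstream
    return "ok"


async def fast_backend():
    await asyncio.sleep(0.01)   # simulated healthy upstream
    return "ok"


slow_result = asyncio.run(call_with_timeout(slow_backend, 0.05))
fast_result = asyncio.run(call_with_timeout(fast_backend, 0.5))
```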
References
CircuitBreaker
name: pipeline-reverse-proxy
kind: Pipeline
flow:
  - filter: proxy
filters:
  - name: proxy
    kind: Proxy
    pools:
    - servers:
      - url: http://127.0.0.1:9095
      - url: http://127.0.0.1:9096
      - url: http://127.0.0.1:9097
      loadBalance:
        policy: roundRobin
      circuitBreakerPolicy: countBasedPolicy
      failureCodes: [500, 503, 504]
resilience:
  - name: countBasedPolicy
    kind: CircuitBreaker
    slidingWindowType: COUNT_BASED
    failureRateThreshold: 50
    slidingWindowSize: 100
    slowCallRateThreshold: 60
    slowCallDurationThreshold: 30s
    minimumNumberOfCalls: 10
    waitDurationInOpenState: 2m
    maxWaitDurationInHalfOpenState: 1m
    permittedNumberOfCallsInHalfOpenState: 10
  - name: timeBasedPolicy
    kind: CircuitBreaker
    slidingWindowType: TIME_BASED
    failureRateThreshold: 60
    slidingWindowSize: 200
RateLimiter
name: pipeline-reverse-proxy
kind: Pipeline
flow:
  - filter: rate-limiter
  - filter: proxy
filters:
  - name: rate-limiter
    kind: RateLimiter
    policies:
    - name: policy-example
      timeoutDuration: 100ms
      limitRefreshPeriod: 10ms
      limitForPeriod: 50
    defaultPolicyRef: policy-example
    urls:
    - methods: [GET, POST, PUT, DELETE]
      url:
        exact: /admin
        regex: ^/pets/\d+$
      policyRef: policy-example
  - name: proxy
    kind: Proxy
    pools:
    - servers:
      - url: http://127.0.0.1:9095
      - url: http://127.0.0.1:9096
      - url: http://127.0.0.1:9097
      loadBalance:
        policy: roundRobin
Retry
name: pipeline-reverse-proxy
kind: Pipeline
flow:
  - filter: proxy
filters:
  - name: proxy
    kind: Proxy
    pools:
    - servers:
      - url: http://127.0.0.1:9095
      - url: http://127.0.0.1:9096
      - url: http://127.0.0.1:9097
      loadBalance:
        policy: roundRobin
      retryPolicy: retry3Times
      failureCodes: [500, 503, 504]
resilience:
  - name: retry3Times
    kind: Retry
    backOffPolicy: Exponential
    randomizationFactor: 0.5
    maxAttempts: 3
    waitDuration: 500ms