Docker/Kubernetes workshop
Fault injection is a testing method where failures are intentionally introduced into a system to test its resiliency and measure recovery times in various scenarios.
Failures are inevitable, and for critical systems it is important to investigate any method which could help reduce the impact of a failure.
However, some systems are so critical that any downtime to test resiliency is out of the question. A service mesh allows fault injection testing while minimising the impact on users.
Istio lets you inject faults at the application layer. You can inject two types of faults, both configured using a virtual service:
Delays: Timing failures that mimic increased network latency or an overloaded upstream service.
Aborts: Crash failures that mimic failures in upstream services. Aborts usually manifest in the form of HTTP error codes or TCP connection failures.
For example, this virtual service introduces a 5 second delay for 1 out of every 1000 requests (0.1%) to the ratings service:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        percentage:
          value: 0.1
        fixedDelay: 5s
    route:
    - destination:
        host: ratings
        subset: v1
Apply application version routing with the following commands:
kubectl apply -f exercise/virtual-service-all-v1.yaml
kubectl apply -f exercise/virtual-service-reviews-test-v2.yaml
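For orientation, the second file presumably contains a rule along the following lines. This is a sketch modelled on the upstream Istio Bookinfo samples, not the verified content of exercise/virtual-service-reviews-test-v2.yaml. The `end-user` header name and the v1/v2 subsets (which need a matching DestinationRule) are assumptions.

```yaml
# Sketch only: assumed shape of the per-user routing rule for reviews.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  # Requests carrying the header "end-user: jason" go to reviews:v2.
  # The header name is an assumption based on the Bookinfo sample.
  - match:
    - headers:
        end-user:
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2
  # All other requests fall through to reviews:v1.
  - route:
    - destination:
        host: reviews
        subset: v1
```

Rules in the http list are evaluated in order, so the catch-all route has to come last.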
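With this configuration, the request flow looks like:
productpage → reviews:v2 → ratings (only for user ‘jason’)
productpage → reviews:v1 (for everyone else)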
To test the Istio Bookinfo application microservices for resiliency, inject a 7 second delay between reviews:v2 and ratings microservices for the user ‘jason’.
fault:
  delay:
    percentage:
      value: 100.0
    fixedDelay: 7s
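The fragment above only shows the fault itself. A complete rule also has to restrict the fault to the test user and keep a normal route for everyone else. The following is a minimal sketch of what exercise/ratings-test-delay.yaml might look like, assuming the user is identified by an `end-user` header and that a v1 subset is defined for ratings; neither detail is confirmed by this workshop's files.

```yaml
# Sketch only: delay fault scoped to a single test user.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  # Rule 1: requests from jason (assumed "end-user" header) are delayed.
  - match:
    - headers:
        end-user:
          exact: jason
    fault:
      delay:
        percentage:
          value: 100.0   # delay every matching request
        fixedDelay: 7s
    route:
    - destination:
        host: ratings
        subset: v1
  # Rule 2: everyone else reaches ratings without any fault.
  - route:
    - destination:
        host: ratings
        subset: v1
```

Scoping the fault with a match clause is what keeps the experiment invisible to other users.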
Apply the rule, then confirm that it was created:
kubectl apply -f exercise/ratings-test-delay.yaml
kubectl get virtualservice ratings -o yaml
Open the Bookinfo application in your browser and log in as user ‘jason’. The Reviews section displays an error message:
Error fetching product reviews!
Sorry, product reviews are currently unavailable for this book.
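View the web page response times, for example in the Network tab of your browser's developer tools. The page returns after about 6 seconds.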
You found a bug. There are hard-coded timeouts in the microservices that have caused the reviews service to fail.
As expected, the 7 second delay by itself does not break the reviews service, because the timeout between the reviews and ratings services is hard-coded at 10 seconds. However, there is also a hard-coded timeout between the productpage and reviews services: 3 seconds plus 1 retry, for 6 seconds in total. As a result, the productpage call to reviews times out before the delayed response arrives and throws an error after 6 seconds.
Bugs like this can occur in typical enterprise applications where different teams develop different microservices independently. Fault injection can help you identify such anomalies without impacting end users.
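You would normally fix the problem by either increasing the productpage to reviews timeout or decreasing the reviews to ratings timeout, then redeploying the fixed microservice and confirming that the product page loads without errors.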
However, there is already a fix running in version 3 of the reviews service. The reviews:v3 service reduces the reviews to ratings timeout from 10 seconds to 2.5 seconds so that it is compatible with (less than) the timeout of the downstream productpage requests.
You can migrate all traffic to reviews:v3:
kubectl apply -f exercise/traffic-shifting-v3.yaml
Then change the delay rule to any value below 2.5 seconds, for example 2 seconds, and confirm that the end-to-end flow continues without errors.
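As a rough illustration of the traffic migration, a rule that sends all reviews traffic to v3 could look like the sketch below. It is based on the upstream Istio samples and is an assumption about, not a copy of, exercise/traffic-shifting-v3.yaml. The v3 subset must be defined in a DestinationRule for reviews.

```yaml
# Sketch only: route 100% of reviews traffic to the fixed version.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v3   # assumed subset name, defined in a DestinationRule
```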
Another way to test microservice resiliency is to introduce an HTTP abort fault. In this task, you will introduce an HTTP abort to the ratings microservice for the test user ‘jason’.
In this case, you expect the page to load immediately and display an error message:
Ratings service is currently unavailable
fault:
  abort:
    percentage:
      value: 100.0
    httpStatus: 500
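Placed in a complete rule, the abort fault could look like the sketch below. As with the delay rule, the `end-user` header match and the v1 subset are assumptions drawn from the upstream Bookinfo samples rather than from exercise/ratings-test-abort.yaml itself.

```yaml
# Sketch only: abort fault scoped to a single test user.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  # Requests from jason (assumed "end-user" header) fail with HTTP 500
  # before they reach the ratings service.
  - match:
    - headers:
        end-user:
          exact: jason
    fault:
      abort:
        percentage:
          value: 100.0   # abort every matching request
        httpStatus: 500
    route:
    - destination:
        host: ratings
        subset: v1
  # Everyone else reaches ratings normally.
  - route:
    - destination:
        host: ratings
        subset: v1
```

Because the sidecar proxy returns the error itself, the response is immediate, which is why the page is expected to load without any added latency.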
Apply the rule, then confirm that it was created:
kubectl apply -f exercise/ratings-test-abort.yaml
kubectl get virtualservice ratings -o yaml
Remove the application routing rules:
kubectl delete -f exercise/virtual-service-all-v1.yaml