06-03 Fault Injection

Fault injection is a testing method where failures are intentionally introduced into a system to test its resiliency and measure recovery times in various scenarios.

Failures are inevitable, and for critical systems it is important to investigate any method which could help reduce the impact of a failure.

However, some systems are so critical that any downtime to test resiliency is out of the question. A service mesh allows fault injection testing while minimising the impact on users.

Istio lets you inject faults at the application layer. You can inject two types of faults, both configured using a virtual service:

Delays: Timing failures that mimic increased network latency or an overloaded upstream service.
Aborts: Crash failures that mimic failures in upstream services. Aborts usually manifest in the form of HTTP error codes or TCP connection failures.

For example, this virtual service introduces a 5 second delay for 1 out of every 1000 requests (0.1%) to the ratings service.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        percentage:
          value: 0.1
        fixedDelay: 5s
    route:
    - destination:
        host: ratings
        subset: v1
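
An abort fault follows the same shape. As a minimal sketch, the virtual service below would return an HTTP 400 for 1 out of every 1000 requests to the ratings service. The status code and percentage are illustrative values chosen for this example and are not part of the exercises.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      abort:
        percentage:
          value: 0.1      # illustrative: 0.1% of requests
        httpStatus: 400   # illustrative status code returned to the caller
    route:
    - destination:
        host: ratings
        subset: v1
```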

Setup

Apply application version routing with the following commands:

kubectl apply -f exercise/virtual-service-all-v1.yaml
kubectl apply -f exercise/virtual-service-reviews-test-v2.yaml

With this configuration, the request flow looks like:

productpage → reviews:v2 → ratings (for user jason)
productpage → reviews:v1 (for everyone else)
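
A per-user split like this is usually expressed as a header match. The sketch below shows what the rule in exercise/virtual-service-reviews-test-v2.yaml might contain, assuming (as in the upstream Bookinfo sample) that productpage forwards the logged-in user name in an `end-user` request header. The actual file used in this exercise may differ.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  # Requests carrying end-user: jason go to reviews:v2
  - match:
    - headers:
        end-user:          # assumed header name, set by productpage
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2
  # Everything else falls through to reviews:v1
  - route:
    - destination:
        host: reviews
        subset: v1
```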

Exercise 1

Injecting an HTTP delay fault

To test the Istio Bookinfo application microservices for resiliency, inject a 7 second delay between the reviews:v2 and ratings microservices for the user ‘jason’.

Configure the rule

Create a fault injection rule to delay traffic coming from the test user jason.

fault:
  delay:
    percentage:
      value: 100.0
    fixedDelay: 7s
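
The fragment above only shows the fault stanza. As a sketch of how it might sit inside a complete rule, the virtual service below applies the delay only to requests from jason and routes all other traffic normally. It assumes the user is identified by an `end-user` header, as in the upstream Bookinfo sample; the contents of exercise/ratings-test-delay.yaml may differ.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  # Delay every request from the test user by 7 seconds
  - match:
    - headers:
        end-user:          # assumed header name
          exact: jason
    fault:
      delay:
        percentage:
          value: 100.0
        fixedDelay: 7s
    route:
    - destination:
        host: ratings
        subset: v1
  # All other users reach ratings without any injected fault
  - route:
    - destination:
        host: ratings
        subset: v1
```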

Apply the rule:

kubectl apply -f exercise/ratings-test-delay.yaml

Confirm the rule was created:

kubectl get virtualservice ratings -o yaml

Testing the delay configuration

Open the Bookinfo application in your browser.
On the /productpage web page, log in as user jason. You would expect the page to load without errors in about 7 seconds. However, the Reviews section displays an error message:
```
Error fetching product reviews!
Sorry, product reviews are currently unavailable for this book.
```
View the web page response times:
1. Open the Developer Tools menu in your web browser.
2. Open the Network tab.
3. Reload the web page. You will see that the page actually loads in about 6 seconds.

What happened?

You found a bug. There are hard-coded timeouts in the microservices that have caused the reviews service to fail.

As expected, the 7 second delay introduced doesn’t affect the reviews service, because the timeout between the reviews and ratings services is hard-coded at 10 seconds. However, there is also a hard-coded timeout between the productpage and reviews services, coded as 3 seconds + 1 retry for 6 seconds total. As a result, the productpage call to reviews times out prematurely and throws an error after 6 seconds.

Bugs like this can occur in typical enterprise applications where different teams develop different microservices independently. Fault injection can help you identify such anomalies without impacting end users.

Fixing the bug

You would normally fix the problem by:

1. Either increasing the productpage to reviews service timeout or decreasing the reviews to ratings timeout.
2. Stopping and restarting the fixed microservice.
3. Confirming that the /productpage web page returns its response without any errors.

However, there is already a fix running in version 3 of the reviews service. The reviews:v3 service reduces the reviews to ratings timeout from 10 seconds to 2.5 seconds so that it is compatible with (less than) the timeout of the downstream productpage requests.

You can migrate all traffic to reviews:v3:

kubectl apply -f exercise/traffic-shifting-v3.yaml

Then change the delay rule to any amount less than 2.5 seconds and confirm that the end-to-end flow continues without any errors.
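
For instance, assuming the rule keeps the fault stanza from Exercise 1, only the fixedDelay value needs to change. The 2s used here is one arbitrary choice below the 2.5 second timeout:

```yaml
fault:
  delay:
    percentage:
      value: 100.0
    fixedDelay: 2s   # any value under 2.5s should stay within the reviews:v3 timeout
```

Re-apply the edited file with kubectl apply -f, as before, and reload /productpage as user jason.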

Exercise 2

Injecting an HTTP abort fault

Another way to test microservice resiliency is to introduce an HTTP abort fault. In this task, you will introduce an HTTP abort to the ratings microservice for the test user ‘jason’.

In this case, you expect the page to load immediately and display the error message:

Ratings service is currently unavailable

Configure the rule

Create a fault injection rule to send an HTTP abort for user jason.

fault:
  abort:
    percentage:
      value: 100.0
    httpStatus: 500
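
As with the delay rule, the fragment above is only the fault stanza. A complete rule might look like the sketch below, which aborts every ratings request from jason with an HTTP 500 and leaves other users untouched. The `end-user` header match is an assumption based on the upstream Bookinfo sample; the contents of exercise/ratings-test-abort.yaml may differ.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  # Abort every request from the test user with HTTP 500
  - match:
    - headers:
        end-user:          # assumed header name
          exact: jason
    fault:
      abort:
        percentage:
          value: 100.0
        httpStatus: 500
    route:
    - destination:
        host: ratings
        subset: v1
  # All other users reach ratings without any injected fault
  - route:
    - destination:
        host: ratings
        subset: v1
```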

Apply the rule:

kubectl apply -f exercise/ratings-test-abort.yaml

Confirm the rule was created:

kubectl get virtualservice ratings -o yaml

Testing the abort configuration

Open the Bookinfo application in your browser.
On the /productpage web page, log in as user jason. If the rule propagated successfully to all pods, the page loads immediately and the error message appears.
If you log out from user jason, /productpage goes back to calling reviews:v1 (which never calls ratings), as it does for everybody but jason. Therefore no error message appears.

Clean up

Remove the application routing rules:

kubectl delete -f exercise/virtual-service-all-v1.yaml