How to appropriately configure readiness/liveness probes to avoid lost requests?

Daein Park
4 min read · Nov 2, 2020

I’d like to show you a simple demonstration of a pod rolling update through two patterns. In the first pattern the probes are based on a middleware health check, and in the second they are based on the application service health check. The two behave quite differently, which makes the comparison interesting.

By comparing the two patterns, you can learn why we configure the probes based on the service health check. The walkthrough below uses a DeploymentConfig (ReplicationController), but the same applies to a Deployment (ReplicaSet).

Let’s test

First of all, create a test project for this demonstration.

$ oc new-project test-rolling-update

Create a DeploymentConfig for a test pod

The following command creates a DeploymentConfig that deploys and manages a Pod whose middleware is initialized first; the application service only finishes initializing about 10 seconds later. In other words, as soon as the pod starts, “http://:8080/middleware_health/index.html” responds with “MIDDLE WARE OK”, and after about 10 seconds “http://:8080/service_health/index.html” responds with “SERVICE OK: V1.0”. Before those 10 seconds have passed, a request to “http://:8080/service_health/index.html” does not return the message.

$ oc run test-pod --image=registry.access.redhat.com/rhel7 \
-- bash -c \
'mkdir -p /tmp/test/{service_health,middleware_health}; cd /tmp/test; echo "MIDDLE WARE OK" > middleware_health/index.html; nohup sleep 10 && echo "SERVICE OK: V1.0" > service_health/index.html & python -m SimpleHTTPServer 8080'
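
You can optionally confirm this timing from inside the cluster before going further. A quick check (a sketch, assuming oc run applied the usual run=test-pod label, which the Service below also relies on, and that curl is available in the image, as the probes later assume):

$ POD=$(oc get pod -l run=test-pod -o jsonpath='{.items[0].metadata.name}')
$ oc exec "$POD" -- curl -sf http://localhost:8080/middleware_health/index.html   # answers immediately with "MIDDLE WARE OK"
$ oc exec "$POD" -- curl -sf http://localhost:8080/service_health/index.html      # fails (curl -f exits non-zero) until ~10s after the pod starts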

I tested through the following Service’s ClusterIP and port.

$ oc create -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  labels:
    run: test-pod
  name: test-pod-svc
spec:
  ports:
  - name: 8080-8080
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    run: test-pod
  type: ClusterIP
EOF
$ oc get svc
NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
test-pod-svc   ClusterIP   172.30.225.73   <none>        8080/TCP   3m
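
The readiness probe is what decides whether a pod is listed in this Service’s endpoints, so it also decides where requests are routed during the rolling update. If you want to observe that directly, you can watch the endpoints change (a quick extra check, not part of the original demo):

$ oc get endpoints test-pod-svc -w
# the pod IP is added when its readiness probe passes and removed when it fails or the pod terminates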

Monitor the status and request results of the running test pod

Run this command to monitor the request responses and the status of each endpoint, i.e. the middleware and the application service.

$ while :; do echo $(date '+%H:%M:%S') - $(curl --connect-timeout 1 -s http://172.30.225.73:8080/middleware_health/index.html): $(curl --connect-timeout 1 -s http://172.30.225.73:8080/service_health/index.html) ; sleep 1; done

Concurrently, run this command in another terminal to monitor the pod status transitions as well.

$ while :; do echo $(date '+%H:%M:%S') ---; oc get pod; sleep 1; done

Pattern #1, configuring the probes based on the middleware health

This pattern shows that a middleware health check is not good enough to keep the service stable during a rolling update.

$ oc edit dc/test-pod
:
    livenessProbe:
      exec:
        command:
        - curl
        - -f
        - http://localhost:8080/middleware_health/index.html
      failureThreshold: 3
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 15
    readinessProbe:
      exec:
        command:
        - curl
        - -f
        - http://localhost:8080/middleware_health/index.html
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 15
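
If you prefer the CLI to editing the manifest, roughly the same probes can be applied with oc set probe (a sketch, assuming your oc client provides the set probe subcommand; the flags mirror the fields above). Note that on a DeploymentConfig with the default config-change trigger, this also starts a new rollout.

$ oc set probe dc/test-pod --readiness \
    --failure-threshold=3 --period-seconds=10 --success-threshold=1 --timeout-seconds=15 \
    -- curl -f http://localhost:8080/middleware_health/index.html
$ oc set probe dc/test-pod --liveness --initial-delay-seconds=60 \
    --failure-threshold=3 --period-seconds=10 --success-threshold=1 --timeout-seconds=15 \
    -- curl -f http://localhost:8080/middleware_health/index.html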

Change “SERVICE OK: V1.0” to “SERVICE OK: V2.0” in the pod command to trigger a rolling update of the DeploymentConfig.

$ oc edit dc/test-pod
... make changes for rolling update ...

You can see some failed requests in the window where the new pod is already “Running” and Ready while the old pod is “Terminating”, as follows.

13:56:04 ---
NAME                READY   STATUS        RESTARTS   AGE
test-pod-6-gpjcp    1/1     Running       0          10m
test-pod-7-deploy   1/1     Running       0          12s
test-pod-7-jtv44    0/1     Running       0          9s
13:56:05 ---
NAME                READY   STATUS        RESTARTS   AGE
test-pod-6-gpjcp    1/1     Terminating   0          10m
test-pod-7-deploy   1/1     Running       0          14s
test-pod-7-jtv44    1/1     Running       0          11s
13:56:07 ---
NAME                READY   STATUS        RESTARTS   AGE
test-pod-6-gpjcp    1/1     Terminating   0          10m
test-pod-7-deploy   1/1     Running       0          15s
test-pod-7-jtv44    1/1     Running       0          12s

These are the request results during the rolling update above. As you can see, some requests were lost: the new pod was marked Ready as soon as its middleware responded, so traffic was switched to it (and the old pod was terminated) before the application service had finished initializing.

13:56:05 - MIDDLE WARE OK: SERVICE OK: V1.0
13:56:06 - MIDDLE WARE OK: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><html> <title>Directory listing for /service_health/</title> <body> <h2>Directory listing for /service_health/</h2> <hr> <ul> </ul> <hr> </body> </html>
13:56:07 - MIDDLE WARE OK: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><html> <title>Directory listing for /service_health/</title> <body> <h2>Directory listing for /service_health/</h2> <hr> <ul> </ul> <hr> </body> </html>
13:56:08 - MIDDLE WARE OK: SERVICE OK: V2.0

Pattern #2, configuring the probes based on the application service health

This time we configure the probes based on the application service health check, which shows how to suppress the lost requests during the rolling update.

$ oc edit dc/test-pod
:
    livenessProbe:
      exec:
        command:
        - curl
        - -f
        - http://localhost:8080/service_health/index.html
      failureThreshold: 3
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 15
    readinessProbe:
      exec:
        command:
        - curl
        - -f
        - http://localhost:8080/service_health/index.html
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 15

Change “SERVICE OK: V2.0” to “SERVICE OK: V3.0” to trigger another rolling update of the DeploymentConfig.

$ oc edit dc/test-pod

This time no requests are lost while the new pod becomes “Running” and the old pod becomes “Terminating”, as follows.

14:00:25 ---
NAME                READY   STATUS        RESTARTS   AGE
test-pod-8-2lpwn    1/1     Running       0          2m
test-pod-9-deploy   1/1     Running       0          18s
test-pod-9-jkbnw    1/1     Running       0          14s
14:00:26 ---
NAME                READY   STATUS        RESTARTS   AGE
test-pod-8-2lpwn    1/1     Terminating   0          2m
test-pod-9-deploy   1/1     Running       0          19s
test-pod-9-jkbnw    1/1     Running       0          15s
14:00:27 ---
NAME                READY   STATUS        RESTARTS   AGE
test-pod-8-2lpwn    1/1     Terminating   0          2m
test-pod-9-deploy   1/1     Running       0          21s
test-pod-9-jkbnw    1/1     Running       0          17s

There are no lost requests:

14:00:24 - MIDDLE WARE OK: SERVICE OK: V2.0
14:00:25 - MIDDLE WARE OK: SERVICE OK: V2.0
14:00:26 - MIDDLE WARE OK: SERVICE OK: V3.0
14:00:27 - MIDDLE WARE OK: SERVICE OK: V3.0

As you can see, the middleware being healthy is not the same as the application running on it being healthy. Configure the probes in your deployment manifest based on the application service health check to get a stable rolling update.
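
As mentioned at the beginning, the same applies to a plain Kubernetes Deployment (ReplicaSet). Here is a minimal sketch of pattern #2 as a Deployment manifest, assuming the same test image, command, and labels used above:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      run: test-pod
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        run: test-pod
    spec:
      containers:
      - name: test-pod
        image: registry.access.redhat.com/rhel7
        command:
        - bash
        - -c
        - |
          mkdir -p /tmp/test/{service_health,middleware_health}
          cd /tmp/test
          echo "MIDDLE WARE OK" > middleware_health/index.html
          nohup sleep 10 && echo "SERVICE OK: V1.0" > service_health/index.html &
          python -m SimpleHTTPServer 8080
        # Both probes check the application service endpoint, not just the middleware
        readinessProbe:
          exec:
            command: ["curl", "-f", "http://localhost:8080/service_health/index.html"]
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 15
        livenessProbe:
          exec:
            command: ["curl", "-f", "http://localhost:8080/service_health/index.html"]
          failureThreshold: 3
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 15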

Thank you for reading.
