How to configure readiness/liveness probes appropriately to avoid lost requests

I’d like to show you a simple demonstration of a pod rolling update using two patterns. In the first pattern the probes are based on a middleware health check, and in the second they are based on a service health check. The results are quite different, so I hope you find the comparison interesting.

By comparing the two patterns, you can see why the probes should be based on the service health check. The following process diagram is about DeploymentConfig (ReplicationController), but it applies to Deployment (ReplicaSet) as well.

Let’s test

First of all, create a test project for this demonstration.

$ oc new-project test-rolling-update

Create a DeploymentConfig for a test pod

The following command creates a DeploymentConfig that deploys and manages a pod in which the middleware is initialized first and the application service is initialized 10 seconds later. In other words, as soon as the pod has started, it responds with “MIDDLE WARE OK” at “http://:8080/middleware_health/index.html”. After 10 seconds, it also returns “SERVICE OK: V1.0” at “http://:8080/service_health/index.html”. Before those 10 seconds have passed, a request to “http://:8080/service_health/index.html” does not return the service message.

$ oc run test-pod --image=registry.access.redhat.com/rhel7 \
-- bash -c \
'mkdir -p /tmp/test/{service_health,middleware_health}; cd /tmp/test; echo "MIDDLE WARE OK" > middleware_health/index.html; nohup sleep 10 && echo "SERVICE OK: V1.0" > service_health/index.html & python -m SimpleHTTPServer 8080'
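To make the startup order concrete without a cluster, the sketch below mimics the same sequence on a local directory. The function names `init_pod` and `probe`, and the use of a file check in place of `curl -f`, are assumptions for illustration only; they are not part of the original command.

```shell
# Hypothetical local stand-in for the test pod's startup sequence.
# init_pod ROOT DELAY: the middleware page exists immediately,
# the service page only appears after DELAY seconds.
init_pod() {
  root=$1
  delay=$2
  mkdir -p "$root/service_health" "$root/middleware_health"
  echo "MIDDLE WARE OK" > "$root/middleware_health/index.html"
  ( sleep "$delay" && echo "SERVICE OK: V1.0" > "$root/service_health/index.html" ) >/dev/null 2>&1 &
}

# probe ROOT NAME: succeeds only if NAME/index.html exists.
# A file check stands in for `curl -f` against the HTTP server.
probe() {
  [ -f "$1/$2/index.html" ]
}
```

After `init_pod /tmp/demo 10`, `probe /tmp/demo middleware_health` succeeds at once, while `probe /tmp/demo service_health` keeps failing for about 10 seconds. That gap is what the two probe patterns treat differently.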

I tested through the cluster IP and port of the following Service.

$ oc create -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  labels:
    run: test-pod
  name: test-pod-svc
spec:
  ports:
  - name: 8080-8080
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    run: test-pod
  type: ClusterIP
EOF
$ oc get svc
NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
test-pod-svc   ClusterIP   172.30.225.73   <none>        8080/TCP   3m
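
The cluster IP in the output above is specific to my environment. As a small sketch, the IP can be read with `oc get -o jsonpath` and turned into the two health check URLs. The helper names `health_url` and `svc_ip` are hypothetical.

```shell
# health_url IP NAME: build the health check URL used in this test.
# (Hypothetical helper; NAME is middleware_health or service_health.)
health_url() {
  echo "http://$1:8080/$2/index.html"
}

# Read the cluster IP of test-pod-svc instead of copying it by hand.
# Wrapped in a function so nothing contacts the cluster until it is called.
svc_ip() {
  oc get svc test-pod-svc -o jsonpath='{.spec.clusterIP}'
}

# Example (needs a logged-in oc session):
#   curl --connect-timeout 1 -s "$(health_url "$(svc_ip)" service_health)"
```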

Monitor the status and request results of the running test pod

Run this command to monitor the request responses and the status of each component, that is, the middleware and the application service.

$ while :; do echo $(date '+%H:%M:%S') - $(curl --connect-timeout 1 -s http://172.30.225.73:8080/middleware_health/index.html): $(curl --connect-timeout 1 -s http://172.30.225.73:8080/service_health/index.html) ; sleep 1; done

At the same time, run this command in another terminal to monitor the pod status transitions as well.

$ while :; do echo $(date '+%H:%M:%S') ---; oc get pod; sleep 1; done

Pattern #1, configuring the probes based on the middleware health

This pattern shows that a middleware health check is not sufficient to keep the service stable during a rolling update.

$ oc edit dc/test-pod
:
livenessProbe:
  exec:
    command:
    - curl
    - -f
    - http://localhost:8080/middleware_health/index.html
  failureThreshold: 3
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 15
readinessProbe:
  exec:
    command:
    - curl
    - -f
    - http://localhost:8080/middleware_health/index.html
  failureThreshold: 3
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 15
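
The same probe settings can probably also be applied without opening an editor, using `oc set probe`. The sketch below assumes the flags shown are available in your `oc` version. The function name `set_probes` and the `OC` override (handy for a dry run with `OC=echo`) are my own additions for illustration.

```shell
# set_probes NAME: point both probes of dc/test-pod at NAME/index.html.
# NAME is middleware_health or service_health.
# OC can be overridden, e.g. OC=echo, to print the commands instead of running them.
set_probes() {
  url="http://localhost:8080/$1/index.html"
  # Liveness probe: same thresholds as above, with a 60 second initial delay.
  "${OC:-oc}" set probe dc/test-pod --liveness \
    --initial-delay-seconds=60 --period-seconds=10 --timeout-seconds=15 \
    --failure-threshold=3 --success-threshold=1 \
    -- curl -f "$url"
  # Readiness probe: same thresholds, no initial delay.
  "${OC:-oc}" set probe dc/test-pod --readiness \
    --period-seconds=10 --timeout-seconds=15 \
    --failure-threshold=3 --success-threshold=1 \
    -- curl -f "$url"
}
```

Running `OC=echo set_probes middleware_health` only prints the two commands. Note that each real call changes the pod template, so it may start a new rollout if the DeploymentConfig has a config change trigger.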

Change “SERVICE OK: V1.0” to “SERVICE OK: V2.0” to trigger a rolling update of the DeploymentConfig.

$ oc edit dc/test-pod
... make changes for rolling update ...
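
One way to make that change without an editor is to rewrite the version string in the exported definition. This is only a sketch: the helper `bump_version` is hypothetical, and piping the result into `oc replace -f -` assumes nothing else modifies the DeploymentConfig in between.

```shell
# bump_version FROM TO: rewrite "SERVICE OK: FROM" to "SERVICE OK: TO" on stdin.
# Dots in FROM are escaped so that "V1.0" does not also match e.g. "V1x0".
bump_version() {
  from_re=$(printf '%s' "$1" | sed 's/\./\\./g')
  sed "s/SERVICE OK: $from_re/SERVICE OK: $2/"
}

# Example (not run here, needs a logged-in oc session):
#   oc get dc/test-pod -o yaml | bump_version V1.0 V2.0 | oc replace -f -
```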

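As shown below, some requests fail between the time the old pod becomes “Terminating” and the time the application service in the new pod is actually up, even though the new pod is already “Running” and ready (1/1).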

13:56:04 ---
NAME                READY   STATUS        RESTARTS   AGE
test-pod-6-gpjcp    1/1     Running       0          10m
test-pod-7-deploy   1/1     Running       0          12s
test-pod-7-jtv44    0/1     Running       0          9s
13:56:05 ---
NAME                READY   STATUS        RESTARTS   AGE
test-pod-6-gpjcp    1/1     Terminating   0          10m
test-pod-7-deploy   1/1     Running       0          14s
test-pod-7-jtv44    1/1     Running       0          11s
13:56:07 ---
NAME                READY   STATUS        RESTARTS   AGE
test-pod-6-gpjcp    1/1     Terminating   0          10m
test-pod-7-deploy   1/1     Running       0          15s
test-pod-7-jtv44    1/1     Running       0          12s

These are the request results during the rolling update above. As you can see, some requests were lost.

13:56:05 - MIDDLE WARE OK: SERVICE OK: V1.0
13:56:06 - MIDDLE WARE OK: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><html> <title>Directory listing for /service_health/</title> <body> <h2>Directory listing for /service_health/</h2> <hr> <ul> </ul> <hr> </body> </html>
13:56:07 - MIDDLE WARE OK: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><html> <title>Directory listing for /service_health/</title> <body> <h2>Directory listing for /service_health/</h2> <hr> <ul> </ul> <hr> </body> </html>
13:56:08 - MIDDLE WARE OK: SERVICE OK: V2.0
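
To put a number on the lost requests, the monitoring output can be filtered for lines that lack the service response. This is a minimal sketch that assumes the log format of the monitoring loop used in this article; `count_lost` is a hypothetical helper.

```shell
# count_lost: read monitoring lines on stdin and print how many
# of them do not contain a "SERVICE OK" response.
count_lost() {
  # grep exits non-zero when no line is selected (count 0), so ignore its status.
  grep -vc "SERVICE OK" || true
}

# Example: save the monitoring loop output to a file, then run
#   count_lost < requests.log
```

Fed the four result lines above, it prints 2, one for each response that came back without the service message.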

Pattern #2, configuring the probes based on the application service health

This pattern shows how to prevent lost requests during a rolling update, which is the solution to this problem.

$ oc edit dc/test-pod
:
livenessProbe:
  exec:
    command:
    - curl
    - -f
    - http://localhost:8080/service_health/index.html
  failureThreshold: 3
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 15
readinessProbe:
  exec:
    command:
    - curl
    - -f
    - http://localhost:8080/service_health/index.html
  failureThreshold: 3
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 15

Change “SERVICE OK: V2.0” to “SERVICE OK: V3.0” to trigger a rolling update of the DeploymentConfig.

$ oc edit dc/test-pod

As shown below, no requests are lost while the old pod becomes “Terminating” and the new pod takes over.

14:00:25 ---
NAME                READY   STATUS        RESTARTS   AGE
test-pod-8-2lpwn    1/1     Running       0          2m
test-pod-9-deploy   1/1     Running       0          18s
test-pod-9-jkbnw    1/1     Running       0          14s
14:00:26 ---
NAME                READY   STATUS        RESTARTS   AGE
test-pod-8-2lpwn    1/1     Terminating   0          2m
test-pod-9-deploy   1/1     Running       0          19s
test-pod-9-jkbnw    1/1     Running       0          15s
14:00:27 ---
NAME                READY   STATUS        RESTARTS   AGE
test-pod-8-2lpwn    1/1     Terminating   0          2m
test-pod-9-deploy   1/1     Running       0          21s
test-pod-9-jkbnw    1/1     Running       0          17s

No requests were lost.

14:00:24 - MIDDLE WARE OK: SERVICE OK: V2.0
14:00:25 - MIDDLE WARE OK: SERVICE OK: V2.0
14:00:26 - MIDDLE WARE OK: SERVICE OK: V3.0
14:00:27 - MIDDLE WARE OK: SERVICE OK: V3.0

As you can see, the health of the middleware is not the same as the health of the application running on it. You should configure appropriate probes in your deployment manifest to keep rolling updates stable.

Thank you for reading.

Daein Park
