
Willian Antunes

Monitoring K8S resources through its APIs

4 minute read

python, k8s, kind

Let's say you need to monitor when a certificate is about to expire in a Kubernetes cluster. With cert-manager, when you configure a ClusterIssuer, for example, you can set an email address to be notified before a certificate expires. Let's assume this does not work. What can you do? 🤔

One approach is to create a Python application that uses the Kubernetes API to monitor the certificate resource and take action based on the information it provides. For example, you can force the application to fail on purpose so your monitoring system can alert you about it.

Let's see a real example. Let's say you have the following certificate resource:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: develop-willianantunesi-com-br
  namespace: development
spec:
  commonName: '*.develop.willianantunesi.com.br'
  dnsNames:
  - '*.develop.willianantunesi.com.br'
  - develop.willianantunesi.com.br
  duration: 2160h0m0s
  issuerRef:
    kind: ClusterIssuer
    name: this-cluster-issuer-does-not-exist
  renewBefore: 480h0m0s
  secretName: develop-willianantunesi-com-br-tls
  subject:
    organizations:
    - willianantunes
status:
  conditions:
  - lastTransitionTime: "2024-02-06T13:11:28Z"
    message: Certificate is up to date and has not expired
    observedGeneration: 1
    reason: Ready
    status: "True"
    type: Ready
  notAfter: "2024-05-15T13:04:35Z"
  notBefore: "2024-02-15T13:04:36Z"
  renewalTime: "2024-04-25T13:04:35Z"
  revision: 2
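
Note the renewalTime field in the status block. Here it is exactly notAfter minus renewBefore, which a throwaway standard-library snippet can confirm (the dates below are just copied from the manifest above):

from datetime import datetime, timedelta

not_after = datetime.fromisoformat("2024-05-15T13:04:35+00:00")  # status.notAfter
renew_before = timedelta(hours=480)  # spec.renewBefore: 480h0m0s
print(not_after - renew_before)  # 2024-04-25 13:04:35+00:00 -> the renewalTime reported in the status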

renewalTime is the attribute cert-manager uses to know when to renew the certificate, and it is set automatically when the certificate is issued. Thus, a script can use the Kubernetes Python client to consult the certificate resource above and analyze the renewalTime attribute. If it is in the past, it means cert-manager has been trying to renew the certificate without success. That's when we can force the application to fail on purpose. Check out this example:

import logging
import sys

from datetime import datetime

import kubernetes.client

from kubernetes import config
from kubernetes.client.rest import ApiException

_logger = logging.getLogger(__name__)
_group = "cert-manager.io"
_version = "v1"


def _create_configuration():
    configuration = kubernetes.client.Configuration()
    # Accessing the API from within a Pod
    # https://kubernetes.io/docs/tasks/run-application/access-api-from-pod/#directly-accessing-the-rest-api
    config.load_incluster_config(configuration)
    return configuration


def check_certificates_and_inform_on_slack_if_applicable(certificates: list[str], now: datetime | None = None):
    # Compute "now" per call; a datetime.now() default argument would be evaluated only once, at import time
    now = now or datetime.now().astimezone()
    _logger.debug("Generating configuration")
    configuration = _create_configuration()

    should_exit = False
    with kubernetes.client.ApiClient(configuration) as client:
        api = kubernetes.client.CustomObjectsApi(client)
        for certificate in certificates:
            _logger.info("Checking certificate: %s", certificate)
            # Each entry follows the "namespace|name" format, e.g. "development|develop-willianantunesi-com-br"
            namespace, name = certificate.split("|")
            try:
                api_response = api.get_namespaced_custom_object(_group, _version, namespace, "certificates", name)
            except ApiException as e:
                _logger.error("Exception when calling CustomObjectsApi->get_namespaced_custom_object")
                raise e
            renewal_time = api_response["status"].get("renewalTime")
            if not renewal_time:
                _logger.error("Certificate %s does not have a renewalTime", certificate)
                should_exit = True
                continue
            # fromisoformat only accepts the trailing "Z" on Python 3.11+, so normalize it to an explicit offset
            renewal_time = datetime.fromisoformat(renewal_time.replace("Z", "+00:00"))
            if renewal_time <= now:
                _logger.error("Certificate %s is about to expire on %s", certificate, renewal_time)
                should_exit = True
        if should_exit:
            _logger.error("Some certificates are either about to expire or invalid. Please fix them ASAP")
            sys.exit(1)
    _logger.info("Work has been completed")

If should_exit is True, we log an error and exit the program with a non-zero status, which marks the Job as failed so your monitoring system can pick it up. To run this check from time to time, we can use a CronJob that runs every day at 7 AM:

apiVersion: batch/v1
kind: CronJob
metadata:
  namespace: development
  name: watchdog-k8s-cronjob
spec:
  schedule: "0 7 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: watchdog-k8s-sa
          containers:
            - name: watchdog-k8s
              image: watchdog_k8s-remote-interpreter
              imagePullPolicy: IfNotPresent
              envFrom:
                - configMapRef:
                    name: watchdog-k8s-configmap
          restartPolicy: Never

Did you notice that the attribute serviceAccountName is set? The script requires it to consult the Kubernetes API. Proper RBAC for watchdog-k8s-sa is also needed:

apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: development
  name: watchdog-k8s-sa

---

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: development
  name: watchdog-k8s-role
rules:
  - apiGroups:
      - "cert-manager.io"
    resources:
      - "certificates"
    verbs:
      - "get"

---

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: watchdog-k8s-role-binding
  namespace: development
subjects:
  - kind: ServiceAccount
    name: watchdog-k8s-sa
    namespace: development
roleRef:
  kind: Role
  name: watchdog-k8s-role
  apiGroup: rbac.authorization.k8s.io
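
Before relying on the CronJob, you can sanity-check the permissions by impersonating the service account with kubectl:

kubectl auth can-i get certificates.cert-manager.io \
  --as=system:serviceaccount:development:watchdog-k8s-sa \
  --namespace development

It should print yes once the Role and RoleBinding above are applied.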

Check out the whole project on GitHub if you have any further questions.

I hope this helps you. See you 😄!


Have you found any mistakes 👀? Feel free to submit a PR editing this blog entry 😄.