Module health verification endpoints and K8s probes

Health endpoint: Used by monitoring to determine the health state of a microservice. Its response also includes the liveness, readiness and ping states. It is not intended for the container management system but for privileged users (authorization required), such as operators or an operator's tools. For that reason the endpoints are exposed and provide more information than a simple healthy/unhealthy status.
Readiness endpoint: Polled by Kubernetes to check whether a microservice (pod) can accept traffic.
Liveness endpoint: Polled by Kubernetes to check whether a microservice's internal state is valid. Kubernetes restarts the container if the liveness endpoint fails the configured number of times.
Ping endpoint: Always returns "Healthy". It is intended to be called as a follow-up check by the health endpoint implementation of a service, and it confirms only that a service client is able to connect to the service.
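
The restart rule for the liveness endpoint can be illustrated with a small simulation. This is a simplified sketch, not the kubelet's implementation: the function name should_restart and the boolean probe history are hypothetical, and it assumes that "the configured number of times" means consecutive failures, which is how the Kubernetes failureThreshold probe setting behaves (its default is 3).

```python
from typing import Iterable


def should_restart(probe_results: Iterable[bool], failure_threshold: int = 3) -> bool:
    """Return True once `failure_threshold` consecutive liveness probes have failed.

    Each item in `probe_results` is True for a 200 answer and False for a 503
    (or a timeout). A single success resets the failure counter.
    """
    consecutive_failures = 0
    for healthy in probe_results:
        if healthy:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return True
    return False
```

With this model, three failures in a row trigger a restart, while failures interrupted by a success do not.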

Services

Both the process and equipment services expose the same health and K8s probe endpoints at their respective base URLs (e.g. https://server.com/mdm/process/health, https://server.com/mdm/health).

Health and probes

Health Endpoint
- /health
- down: 503
  - The health endpoint reports down and answers 503. The microservice is considered unhealthy when it has lost its connection to RabbitMQ, MACMA and Database.
  - The health endpoint shows details about connection issues with RabbitMQ, MACMA, Database, Portal and the Equipment API service (only for authenticated users).
- up: 200
Liveness Endpoint
- /health/live
- unhealthy: 503
  - unhealthy only during startup and shutdown
- healthy: 200
Readiness Endpoint
- /health/readiness
- unhealthy: 503
  - If Database, RabbitMQ and MACMA are down, the microservice does not accept requests
- healthy: 200
  - If the Portal and Equipment services are down, the microservice can still accept requests
Ping Endpoint
- /ping
- always healthy: 200
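
The status codes listed above can be summarized in a short sketch. It is illustrative only: the dependency labels and function names are hypothetical, and it assumes that a single required dependency (Database, RabbitMQ or MACMA) being down is enough to make the health and readiness endpoints answer 503.

```python
from typing import Mapping

# Assumption: these labels are illustrative, not identifiers used by MDM.
REQUIRED_DEPENDENCIES = ("database", "rabbitmq", "macma")
OPTIONAL_DEPENDENCIES = ("portal", "equipment_api")


def required_dependencies_up(dependency_up: Mapping[str, bool]) -> bool:
    """True when every required dependency is reachable.

    Assumption: one required dependency being down is enough to fail,
    and a dependency missing from the mapping is treated as down.
    """
    return all(dependency_up.get(name, False) for name in REQUIRED_DEPENDENCIES)


def health_status(dependency_up: Mapping[str, bool]) -> int:
    """/health: 200 when up, 503 when a required dependency is lost."""
    return 200 if required_dependencies_up(dependency_up) else 503


def readiness_status(dependency_up: Mapping[str, bool]) -> int:
    """/health/readiness: optional dependencies do not affect the result."""
    return 200 if required_dependencies_up(dependency_up) else 503


def liveness_status(starting_up: bool, shutting_down: bool) -> int:
    """/health/live: unhealthy only during startup and shutdown."""
    return 503 if (starting_up or shutting_down) else 200


def ping_status() -> int:
    """/ping: always healthy."""
    return 200
```

For example, with Portal and the Equipment API down but all required dependencies up, readiness_status still returns 200.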

Dependencies

The status of the probes (healthy/unhealthy) is based on health checks of the service dependencies.

RabbitMQ
- Use cases:
  Lost connection to RabbitMQ.
  Reasons:
  - RabbitMQ instance crashed, restarted, etc.
  - Network issue between microservice and RabbitMQ
- General Behavior
  - The lost connection to RabbitMQ is logged when the service tries to send an event.
  - We keep the Service alive and ready to accept requests.
  - The re-established connection to RabbitMQ is logged with the first event that is sent after RabbitMQ comes back online.
- Impact
  - MDM service will still be functional if the messaging 'enabled' flag is set to false in configuration.
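
The logging behavior described for RabbitMQ can be sketched as follows. This is a minimal illustration under stated assumptions, not the MDM implementation: the EventPublisher class, the send callable and the log messages are hypothetical, ConnectionError stands in for whatever error the messaging client raises, and the sketch assumes every failed send is logged.

```python
from typing import Callable


class EventPublisher:
    """Hypothetical publisher that logs connection loss and recovery on send."""

    def __init__(self, send: Callable[[str], None],
                 log: Callable[[str], None] = print) -> None:
        self._send = send              # assumed to raise ConnectionError when the broker is down
        self._log = log
        self._connection_lost = False  # remembers whether the last send failed

    def publish(self, event: str) -> bool:
        """Try to send one event; return True if it was sent."""
        try:
            self._send(event)
        except ConnectionError:
            # The loss is only noticed (and logged) when an event is sent.
            self._log("lost connection to RabbitMQ")
            self._connection_lost = True
            return False
        if self._connection_lost:
            # First successful event after the outage: log the recovery once.
            self._log("re-established connection to RabbitMQ")
            self._connection_lost = False
        return True
```

The service keeps running in both branches; a failed send is reported to the caller but does not stop request handling.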
Database
- Use cases:
  Lost connection to the Database.
  Reasons:
  - Database instance crashed, restarted, etc.
  - Network issue between microservice and Database
- General Behavior
  - The lost connection to Database will be logged on error.
  - The microservice keeps trying to reconnect indefinitely.
  - We keep the Service alive, but it won’t be ready to accept any requests.
- Impact
  - The service will no longer be functional
MACMA
- Use cases:
  Lost connection to MACMA.
  Reasons:
  - MACMA instance crashed, restarted, etc.
  - Network issue between microservice and MACMA
  - MACMA cannot accept traffic (readiness state is "Unhealthy")
- General Behavior
  - The lost connection to MACMA will be logged on error.
  - The microservice keeps trying to reconnect indefinitely.
  - We keep the Service alive, but it won’t be ready to accept any requests.
- Impact
  - The service will no longer be functional
Portal
- Use cases:
  Lost connection to Portal.
  Reasons:
  - Portal instance crashed, restarted, etc.
  - Network issue between microservice and Portal
  - Portal cannot accept traffic (readiness state is "Unhealthy")
- General Behavior
  - The lost connection to Portal will be logged on error.
  - The microservice keeps trying to reconnect indefinitely.
  - We keep the Service alive and ready to accept requests.
- Impact
  - The service will still be functional because Portal is an optional service.
Equipment API
- Use cases:
  Lost connection to Equipment API service.
  Reasons:
  - Equipment API service instance crashed, restarted, etc.
  - Network issue between microservice and Equipment API service
  - Equipment API service cannot accept traffic (readiness state is "Unhealthy")
- General Behavior
  - The lost connection to Equipment API service will be logged on error.
  - The microservice keeps trying to reconnect indefinitely.
  - We keep the Service alive and ready to accept requests.
- Impact
  - The process service will still be functional because the Equipment API is an optional service.

Nginx health endpoints

The Nginx gateway provides no built-in health endpoints, and no extra configuration is added.

Liveness: not needed, as nginx starts quickly and the container is restarted anyway if the process exits.
Readiness: not implemented. Theoretically there is a short time during startup when this would be needed. Because MDM uses a simple configuration, we consider this startup time very short and safe to ignore.

Infrastructure outages

If a required infrastructure component or module becomes unavailable, MDM immediately reflects the status in the health response. As a general rule, when a failed dependency becomes available again, MDM automatically restores the connection without any user intervention.

MACMA: MDM tries to connect on every incoming authenticated request. There is no continuous background connection to MACMA.
RabbitMQ: MDM reconnects to listen for incoming messages as soon as RabbitMQ becomes available, with a delay in the range of 10 to 30 seconds.
Database: MDM attempts a DB connection on every incoming service request and thus restores connectivity as soon as the database becomes available. The /health state has no influence on the DB connection attempts.
Portal: Once registered successfully, MDM does not re-attempt registration if Portal transitions between unhealthy and healthy states. The Portal registration flow is not related in any way to the health monitoring endpoints.
Internal MDM services: MDM services attempt to connect to each other whenever an incoming API request requires it.
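
The per-request connection attempt described for the database can be sketched as below. This is an illustration under stated assumptions rather than the MDM code: handle_request and open_connection are hypothetical names, ConnectionError stands in for the database driver's error, and answering a failed request with 503 is an assumption.

```python
from typing import Callable


def handle_request(open_connection: Callable[[], object]) -> int:
    """Attempt a database connection for this request and return a status code.

    No background reconnect loop is needed: because every request makes its
    own attempt, the first request after the database recovers succeeds.
    """
    try:
        open_connection()
    except ConnectionError:
        # Assumption: the request is rejected with 503 while the database is down.
        return 503
    # A real handler would now use the connection to serve the request.
    return 200
```

Because the attempt happens inside the request path, recovery does not depend on the state reported by /health.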