Failure handling guidelines
Common deployment errors
Connection refused / 401 Unauthorized / 404 Not Found when downloading the images during the deployment
Ingress is not deploying / the node is not ingress ready
This error usually happens when the node doesn’t have the correct label assigned. This issue is solved by running the following command:
kubectl label nodes $HOSTNAME ingress=ready
Forbidden! Configured service account doesn’t have access / Service account may have been revoked / User "system:serviceaccount:default:default" cannot get services in the namespace "[namespace_name]"
This error usually means that the cluster is missing a service account role. The issue can be resolved by running:
kubectl create clusterrolebinding default-serviceaccount-rb --clusterrole=cluster-admin --serviceaccount=default:default
Any error that indicates the disk is full
This error can be resolved by increasing the size of your /var partition or by moving the K3s files from /var/lib/rancher to another location that has enough space.
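Before resizing anything, it helps to confirm which filesystem is actually full and how much of it K3s is using. A quick sketch (the paths are the K3s defaults mentioned above; adjust them if you have already relocated the data):

```shell
# Show usage of the filesystem backing /var (look at the Use% column).
df -h /var

# Size of the K3s data directory in its default location.
du -sh /var/lib/rancher 2>/dev/null || echo "/var/lib/rancher not present on this host"
```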
Post-deployment errors
Connection refused
When encountering this error, you need to check two things:

- Firewall
  - To debug this issue faster, the firewall can be temporarily disabled. If the connection is restored, the problem comes from the firewall.
  - Check if the firewall is active and if all the necessary rules exist. On a Red Hat system, you can run the following commands:
    systemctl status firewalld
    firewall-cmd --list-all
- Ingress
  - Check if the ingress pod is up and has no errors by running the following commands:
    kubectl get pods -n ingress-nginx
    kubectl logs -n ingress-nginx {pod_name}
  - If the issue is with the ingress, most of the time a simple restart will work. This can be achieved by running the following command:
    kubectl rollout restart deployment -n ingress-nginx
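While checking the firewall, a quick TCP probe from the client side tells you whether anything is listening at all. A minimal sketch using bash's /dev/tcp (the host and port below are placeholders, not values from this guide):

```shell
# probe_port: prints "open" if a TCP connect to host:port succeeds,
# "closed" if the connection is refused, filtered, or times out immediately.
probe_port() {
  host=$1
  port=$2
  if (echo > "/dev/tcp/$host/$port") 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Placeholder target: point this at the ingress host and port you are debugging.
probe_port 127.0.0.1 443
```

If the port reports closed while the pod and service look healthy, suspect the firewall; if it is open but requests still fail, move on to the ingress checks.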
NGINX: 503 Service Unavailable
This error points to the NGINX ingress controller being unable to connect to backend services.
Check the following Kubernetes objects:

- the ingress definition corresponding to the module path raising the 503 error
- the state of the service specified in the ingress definition: check if there are any endpoints connected to the service
- if the endpoints are missing, check whether the pods specified in the service definition are in a not-ready state
- if the pods are not ready, check the pod logs to understand why the readiness probe is failing
If the configuration looks fine but the error is still present, ask the Kubernetes admin team for support, since it might be related to an internal k8s networking issue.
Use pod restart only as a last resort.
kubectl rollout restart deployment -n portal
kubectl rollout restart deployment -n iam
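The checklist above can be strung together into a small triage sketch. The namespace and service names below are assumptions (substitute the faulty module's values), and the script degrades gracefully when kubectl is not on the PATH:

```shell
# triage_503: run the 503 checklist for one module.
# $1 = namespace, $2 = service named in the ingress definition (placeholders).
triage_503() {
  ns=$1
  svc=$2
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found; run this from a machine with cluster access"
    return 0
  fi
  echo "--- ingress definitions in $ns ---"
  kubectl get ingress -n "$ns" || true
  echo "--- endpoints behind $svc (an empty list means no ready pods) ---"
  kubectl get endpoints "$svc" -n "$ns" || true
  echo "--- pod status (look for not-ready pods, then check their logs) ---"
  kubectl get pods -n "$ns" || true
}

# Placeholder namespace/service names for illustration.
triage_503 portal core-service
```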
Deployed modules don’t appear in the menu / The integration status in the main tenant shows the modules as registered, but they have no views
This error happens because either Portal didn’t load the views or the module didn’t start correctly.
Check if you can access the URL specified in the Views menu error.
If the URL is not reachable, restart Portal and then the module by running the following commands:
kubectl rollout restart deployment -n portal core-service-deployment
# restart other modules
kubectl rollout restart deployment -n <module_namespace> <module_ui_deployment>
Deployed modules don’t register in Portal
This error can happen for two reasons:
- The module is missing the Portal registration permissions. We can check the permissions from the primary tenant by accessing access management > modules > faulty_module > roles > portal. Make sure that either the Portal User or the Portal Registration role is assigned.
- The module doesn’t start correctly. In this case, the module’s status and logs should be checked by following these steps:
  - Check the module’s status by getting all the pods from the namespace:
    kubectl get pods -n <module_namespace>
  - If all the services are up and running, a restart of the module web UI may solve the issue:
    kubectl rollout restart deployment -n <module_namespace> <module_web_ui>
  - If a service is down or the restart did not solve the issue, you can look at the logs:
    kubectl logs -n <module_namespace> <faulty_pod>
    Four types of errors are common: database errors, SSL errors, permission errors, and RabbitMQ errors.
- Database errors: Wrong usernames/passwords or wrong connection strings usually cause database errors. Using a database client, try connecting from another environment with the same connection string you have in the config file. If the connection works, the issue is strictly between the cluster and the database server.
- SSL errors: If an SSL error is present, then the certificate, the key, or the CA is not valid. The certificate can be inspected with the following command:
  openssl x509 -in server.crt -text
  The validity of the key against the certificate can be checked by comparing moduli:
  openssl rsa -modulus -noout -in server.key | openssl sha256
  openssl x509 -modulus -noout -in server.crt | openssl sha256
  Both moduli should be identical. If they are not, the key and certificate don’t match.
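The two modulus commands can be wrapped into one self-contained check. For illustration, this sketch generates a throwaway self-signed pair first; in practice, point the checks at your real server.key and server.crt (note that the certificate's modulus is read with openssl x509, not openssl rsa):

```shell
# Generate a throwaway key + self-signed certificate for demonstration only.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=example" \
  -keyout server.key -out server.crt -days 1 2>/dev/null

# Inspect the certificate (subject and expiry date).
openssl x509 -in server.crt -noout -subject -enddate

# Hash the modulus of the key and of the certificate, then compare them.
key_mod=$(openssl rsa  -modulus -noout -in server.key | openssl sha256)
crt_mod=$(openssl x509 -modulus -noout -in server.crt | openssl sha256)
if [ "$key_mod" = "$crt_mod" ]; then
  echo "key and certificate match"
else
  echo "MISMATCH: key does not belong to this certificate"
fi
```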
- Permission errors: If the logs indicate an unauthorized error, the module is missing some required roles. Assign the essential roles that the module needs as per the application role assignment matrix. The roles can be assigned from the main tenant by accessing access management > modules > faulty_module > roles > missing_role.
- RabbitMQ errors: If one of the application module logs indicates a RabbitMQ error, check the RabbitMQ admin interface for the cluster and queue status. If you are using the embedded RabbitMQ server instance (deployed via the Nexeed IAS umbrella Helm chart), you can start a shell in one of the rabbitmq pods and check the status using the rabbitmqctl command:
  rabbitmqctl cluster_status
  or:
  rabbitmqctl status
  You can also check the rabbitmq pod logs for errors.