Failure handling guidelines
Common deployment errors
Connection refused / 401 Unauthorized / 404 Not Found when downloading the images during the deployment
Ingress is not deploying / the node is not ingress ready
This error usually happens when the node doesn’t have the correct label assigned. This issue is solved by running the following command:
kubectl label nodes $HOSTNAME ingress=ready
Forbidden! Configured service account doesn’t have access / Service account may have been revoked / User "system:serviceaccount:default:default" cannot get services in the namespace "[namespace_name]"
This error usually means that the cluster is missing a service account role. The issue can be resolved by running:
kubectl create clusterrolebinding default-serviceaccount-rb --clusterrole=cluster-admin --serviceaccount=default:default
Any error that indicates the disk is full
This error can be resolved by increasing the size of your /var partition or by moving the K3s files from /var/lib/rancher to another location that has enough space.
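Before resizing anything, it helps to confirm which filesystem is actually full and how much of it K3s is using. A quick sketch (the paths are the K3s defaults mentioned above; adjust them if you have already relocated the data):

```shell
# Show usage of the filesystem backing /var (look at the Use% column).
df -h /var

# Size of the K3s data directory in its default location.
du -sh /var/lib/rancher 2>/dev/null || echo "/var/lib/rancher not present on this host"
```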
Post-deployment errors
Connection refused
When encountering this error, you need to check two things:

- Firewall
  - To debug this issue faster, the firewall can be temporarily disabled. If the connection is restored, the problem comes from the firewall.
  - Check if the firewall is active and if all the necessary rules exist. On a Red Hat system, you can run the following commands:
    systemctl status firewalld
    firewall-cmd --list-all
- Ingress
  - Check if the ingress pod is up and has no errors by running the following commands:
    kubectl get pods -n ingress-nginx
    kubectl logs -n ingress-nginx {pod_name}
  - If the issue is with the ingress, most of the time a simple restart will work. This can be achieved by running the following command:
    kubectl rollout restart deployment -n ingress-nginx
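While checking the firewall, a quick TCP probe from the client side tells you whether anything is listening at all. A minimal sketch using bash's /dev/tcp (the host and port below are placeholders, not values from this guide):

```shell
# probe_port: prints "open" if a TCP connect to host:port succeeds,
# "closed" if the connection is refused, filtered, or times out immediately.
probe_port() {
  host=$1
  port=$2
  if (echo > "/dev/tcp/$host/$port") 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Placeholder target: point this at the ingress host and port you are debugging.
probe_port 127.0.0.1 443
```

If the port reports closed while the pod and service look healthy, suspect the firewall; if it is open but requests still fail, move on to the ingress checks.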
NGINX: 503 Service Unavailable
This error points to the NGINX ingress controller being unable to connect to backend services.
Check the following Kubernetes objects:

- the ingress definition corresponding to the module path raising the 503 error
- the state of the service specified in the ingress definition: check if there are any endpoints connected to the service
- if the endpoints are missing, check whether the pods specified in the service definition are in a not-ready state
- if the pods are not ready, check the pod logs to understand why the readiness probe is failing
If the configuration looks fine but the error is still present, ask the Kubernetes admin team for support, since it might be related to an internal k8s networking issue.
Use pod restart only as a last resort.
kubectl rollout restart deployment -n portal
kubectl rollout restart deployment -n iam
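The checklist above can be strung together into a small triage sketch. The namespace and service names below are assumptions (substitute the faulty module's values), and the script degrades gracefully when kubectl is not on the PATH:

```shell
# triage_503: run the 503 checklist for one module.
# $1 = namespace, $2 = service named in the ingress definition (placeholders).
triage_503() {
  ns=$1
  svc=$2
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found; run this from a machine with cluster access"
    return 0
  fi
  echo "--- ingress definitions in $ns ---"
  kubectl get ingress -n "$ns" || true
  echo "--- endpoints behind $svc (an empty list means no ready pods) ---"
  kubectl get endpoints "$svc" -n "$ns" || true
  echo "--- pod status (look for not-ready pods, then check their logs) ---"
  kubectl get pods -n "$ns" || true
}

# Placeholder namespace/service names for illustration.
triage_503 portal core-service
```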
Deployed modules don’t appear in the menu / The integration status in the main tenant shows the modules as registered, but they have no views
This error happens because either Portal didn’t load the views or the module didn’t start correctly.
Check if you can access the URL specified in the Views menu error.
If the URL is not reachable, restart Portal and then the module by running the following commands:
kubectl rollout restart deployment -n portal core-service-deployment
# restart other modules
kubectl rollout restart deployment -n <module_namespace> <module_ui_deployment>
Deployed modules don’t register in Portal
This error can happen for two reasons:
- The module is missing the Portal registration permissions. We can check the permissions from the primary tenant by accessing access management > modules > faulty_module > roles > portal. Make sure that either the Portal User or the Portal Registration role is assigned.
- The module doesn’t start correctly. In this case, the module’s status and logs should be checked by following these steps:
  - Check the module’s status by getting all the pods from the namespace:
    kubectl get pods -n <module_namespace>
  - If all the services are up and running, a restart of the module web UI may solve the issue:
    kubectl rollout restart deployment -n <module_namespace> <module_web_ui>
  - If a service is down or the restart did not solve the issue, you can look at the logs:
    kubectl logs -n <module_namespace> <faulty_pod>
    Four types of errors are common: database errors, SSL errors, permission errors, and RabbitMQ errors.
- Database errors: Wrong usernames/passwords or wrong connection strings usually cause database errors. Using a database client, try connecting from another environment with the same connection string you have in the config file. If the connection works, the issue is strictly between the cluster and the database server.
- SSL errors: If an SSL error is present, then the certificate, the key, or the CA is not valid. The certificate can be inspected with the following command:
  openssl x509 -in server.crt -text
  The validity of the key against the certificate can be checked by comparing moduli:
  openssl rsa -modulus -noout -in server.key | openssl sha256
  openssl x509 -modulus -noout -in server.crt | openssl sha256
  Both moduli should be identical. If they are not, the key and certificate don’t match.
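The two modulus commands can be wrapped into one self-contained check. For illustration, this sketch generates a throwaway self-signed pair first; in practice, point the checks at your real server.key and server.crt (note that the certificate's modulus is read with openssl x509, not openssl rsa):

```shell
# Generate a throwaway key + self-signed certificate for demonstration only.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=example" \
  -keyout server.key -out server.crt -days 1 2>/dev/null

# Inspect the certificate (subject and expiry date).
openssl x509 -in server.crt -noout -subject -enddate

# Hash the modulus of the key and of the certificate, then compare them.
key_mod=$(openssl rsa  -modulus -noout -in server.key | openssl sha256)
crt_mod=$(openssl x509 -modulus -noout -in server.crt | openssl sha256)
if [ "$key_mod" = "$crt_mod" ]; then
  echo "key and certificate match"
else
  echo "MISMATCH: key does not belong to this certificate"
fi
```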
- Permission errors: If the logs indicate an unauthorized error, the module is missing some required roles. Assign the essential roles that the module needs as per the application role assignment matrix. The roles can be assigned from the main tenant by accessing access management > modules > faulty_module > roles > missing_role.
- RabbitMQ errors: If one of the application module logs indicates a RabbitMQ error, check the RabbitMQ admin interface for the cluster and queue status. If you are using the embedded RabbitMQ server instance (deployed via the Nexeed IAS umbrella Helm chart), you can start a shell in one of the rabbitmq pods and check the status using the rabbitmqctl command:
  rabbitmqctl cluster_status
  or:
  rabbitmqctl status
  You can also check the rabbitmq pod logs for errors.