Required monitoring
Condition Monitoring supports OpenTelemetry for tracing and monitoring. For configuration details, see Chapter "11.5. OpenTelemetry Integration" in the central NEXEED Industrial Application System Operations Manual.
Logs
Alert rules may use JSONPath to address values in structured logs. A leading '$' indicates a JSONPath expression.
| Message | Logger | Alert rule | Context | Symptoms | Solution |
|---|---|---|---|---|---|
| Service crashed | LIFE-CYCLE | (($stackTraces[*].failureType = 'liquibase.exception.LockException') > 0) in one minute | Database migration | Pod does not start up / pod restarts | Check whether the Kubernetes deployment listed in the 'LOCKEDBY' column of table 'CM_CORE_DATABASECHANGELOGLOCK' or 'RM_DATABASECHANGELOGLOCK' is running. If it is not running, delete the entry from the database. If it is running, wait for the migration to complete. |
| Health status is [DOWN]. <further details> | LIFE-CYCLE | (($status = 'UNHEALTHY') > 0) in one minute | System availability | Loss of functionality, loss of data, system unavailability | Inspect the further details in the message. They contain a detailed report of the problem. |
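As an illustration of how the first alert rule reads, the following minimal sketch evaluates it against a list of structured log entries. It assumes that each entry is a dictionary with a `stackTraces` array and an ISO 8601 `timestamp` field, and that "in one minute" means a one-minute look-back window. The `timestamp` field name, the window handling, and the function names are assumptions made for this example and do not describe the actual alerting backend.

```python
from datetime import datetime, timedelta

LOCK_EXCEPTION = "liquibase.exception.LockException"


def failure_types(entry):
    """Collect the values addressed by $stackTraces[*].failureType."""
    return [trace.get("failureType") for trace in entry.get("stackTraces", [])]


def lock_alert(entries, now, window=timedelta(minutes=1)):
    """Return True if at least one entry in the window reports a LockException."""
    matches = 0
    for entry in entries:
        # "timestamp" is a hypothetical field name for this sketch.
        ts = datetime.fromisoformat(entry["timestamp"])
        in_window = now - window <= ts <= now
        if in_window and LOCK_EXCEPTION in failure_types(entry):
            matches += 1
    return matches > 0


logs = [
    {
        "timestamp": "2024-01-01T12:00:30",
        "logger": "LIFE-CYCLE",
        "message": "Service crashed",
        "stackTraces": [{"failureType": LOCK_EXCEPTION}],
    },
    {
        "timestamp": "2024-01-01T11:50:00",
        "logger": "LIFE-CYCLE",
        "message": "Service crashed",
        "stackTraces": [{"failureType": "java.lang.IllegalStateException"}],
    },
]

print(lock_alert(logs, datetime(2024, 1, 1, 12, 1, 0)))
```

The second rule, `$status = 'UNHEALTHY'`, could be checked the same way by comparing the top-level `status` field instead of iterating over `stackTraces`.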
OpenTelemetry
For the OpenTelemetry configuration, see the general IAS Operations Manual. The OpenTelemetry integration is described there under the logging and monitoring concepts.
Monitoring RabbitMQ queues
Below are the recommended thresholds for monitoring RabbitMQ queues used by Condition Monitoring services.

RabbitMQ Queue Monitoring Thresholds

| Queue | MaxLength | Alert Threshold (Upper Limit) |
|---|---|---|
| q.cm.core.opp.v09.machineEquipment.measurementTimeSeries.v09 | 1000 | 750 |
| q.cm.core.opp.v09.machineEquipment.machine.v09 | 1000 | 750 |
| q.cm.core.ppmp.v2.message.measurement | 1000 | 750 |
| q.cm.core.ruleResult.positive | 1000 | 750 |
| q.cm.rs.ppmp.machineMsg.enriched | 1000 | 750 |
| q.cm.rs.ppmp.measurement.enriched | 1000 | 750 |
| q.cm.sfe.machineRuleExecutionMsg.received | 50000 | 35000 |
| q.cm.sfe.measurementRuleExecutionMsg.received | 50000 | 35000 |
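The thresholds above can be applied in a simple check, sketched below. The threshold values are taken from the table. How the current queue depths are collected is left open, so the `depths` dictionary and the function name are hypothetical, and treating a depth equal to the threshold as an alert is an assumption.

```python
# Alert thresholds (upper limits) from the table above.
QUEUE_THRESHOLDS = {
    "q.cm.core.opp.v09.machineEquipment.measurementTimeSeries.v09": 750,
    "q.cm.core.opp.v09.machineEquipment.machine.v09": 750,
    "q.cm.core.ppmp.v2.message.measurement": 750,
    "q.cm.core.ruleResult.positive": 750,
    "q.cm.rs.ppmp.machineMsg.enriched": 750,
    "q.cm.rs.ppmp.measurement.enriched": 750,
    "q.cm.sfe.machineRuleExecutionMsg.received": 35000,
    "q.cm.sfe.measurementRuleExecutionMsg.received": 35000,
}


def queues_over_threshold(depths):
    """Return the monitored queues whose depth has reached the alert threshold.

    `depths` maps queue name to current message count (hypothetical input).
    Queues that are not listed in QUEUE_THRESHOLDS are ignored.
    """
    return sorted(
        name
        for name, depth in depths.items()
        if name in QUEUE_THRESHOLDS and depth >= QUEUE_THRESHOLDS[name]
    )


sample_depths = {
    "q.cm.core.ruleResult.positive": 800,
    "q.cm.sfe.machineRuleExecutionMsg.received": 12000,
    "q.some.other.queue": 999999,
}
print(queues_over_threshold(sample_depths))
```

Keeping the alert threshold below MaxLength leaves time to react before the queue reaches its configured length limit.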
Horizontal pod scaling guidance
Currently, automatic pod scaling is not enabled for Condition Monitoring services. It is strongly recommended to monitor the CPU and memory usage of each pod:
- Trigger an alert if the CPU or memory usage of a pod exceeds 80%.
- When such an alert is triggered, evaluate scaling the affected service horizontally by increasing the number of pods to ensure continued performance and reliability.
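The 80% rule can be sketched as follows. This example assumes that usage is measured relative to the configured resource limit of each pod, which the text above does not specify. The input structure, field names, and pod names are hypothetical.

```python
USAGE_ALERT_RATIO = 0.80  # 80% threshold from the recommendation above


def pods_needing_attention(pods, threshold=USAGE_ALERT_RATIO):
    """Return names of pods whose CPU or memory usage exceeds the threshold.

    Each pod is a dict with usage and limit values in the same unit
    (for example millicores for CPU and MiB for memory).
    """
    flagged = []
    for pod in pods:
        cpu_ratio = pod["cpu_used"] / pod["cpu_limit"]
        mem_ratio = pod["mem_used"] / pod["mem_limit"]
        if cpu_ratio > threshold or mem_ratio > threshold:
            flagged.append(pod["name"])
    return flagged


sample_pods = [
    {"name": "condition-monitoring-core-0", "cpu_used": 850, "cpu_limit": 1000,
     "mem_used": 512, "mem_limit": 1024},
    {"name": "rule-service-app-0", "cpu_used": 400, "cpu_limit": 1000,
     "mem_used": 600, "mem_limit": 1024},
]
print(pods_needing_attention(sample_pods))
```

A flagged pod is a prompt to evaluate adding instances of that service, not an automatic scaling decision.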
Based on reference customer data (scale factor 1), load tests were performed to determine appropriate scaling for each service. The following tables provide guidance for scaling services according to system load and scale factor.
| Measurement msg/sec (Queue: q.cm.core.opp.v09.machineEquipment.measurementTimeSeries.v09) | Machine msg/sec (Queue: q.cm.core.opp.v09.machineEquipment.machine.v09) | All Devices | Active Devices Sending Measurement | All Rules (active measurement rule, active machine rule) | Measurement Rules Considered for Rule Execution | Machine Rules Considered for Rule Execution | Scale Factor |
|---|---|---|---|---|---|---|---|
| 1500 | 10 | 870 | 433 | 1830 (209, 1009) | 133 | 5 | 1 |
| 3000 | 20 | 1740 | 866 | 3660 (418, 2018) | 266 | 10 | 2 |
| 4500 | 30 | 2610 | 1299 | 5490 (627, 3027) | 399 | 15 | 3 |
| 6000 | 40 | 3480 | 1732 | 7320 (836, 4036) | 532 | 20 | 4 |
| 7500 | 50 | 4350 | 2165 | 9150 (1045, 5045) | 665 | 25 | 5 |
| 9000 | 60 | 5220 | 2568 | 10980 (1254, 6054) | 798 | 30 | 6 |
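In the load profile above, the message rates grow linearly with the scale factor: 1500 measurement messages and 10 machine messages per second per unit. Under the assumption that this linear relationship also holds between the tested points, a scale factor for an observed load could be estimated as in the following sketch. The function name and the rounding-up policy are assumptions for illustration, and scale factors above 6 are outside the tested range.

```python
import math

# Message rates per second at scale factor 1, from the table above.
BASE_MEASUREMENT_RATE = 1500
BASE_MACHINE_RATE = 10
MAX_TESTED_SCALE_FACTOR = 6


def estimate_scale_factor(measurement_rate, machine_rate):
    """Estimate the scale factor for observed message rates (msg/sec).

    Takes the more demanding of the two rates and rounds up to the next
    whole scale factor. Values above MAX_TESTED_SCALE_FACTOR are not
    covered by the load tests and should be treated with caution.
    """
    factor = max(
        measurement_rate / BASE_MEASUREMENT_RATE,
        machine_rate / BASE_MACHINE_RATE,
    )
    return max(1, math.ceil(factor))


print(estimate_scale_factor(3200, 18))
```

For example, 3200 measurement messages per second lie between the rows for scale factors 2 and 3, so the sketch rounds up to 3.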
The following tables show the recommended number of service instances for each scale factor, based on the CPU and memory requests and limits of the services. Adjust these values as needed based on observed system performance and monitoring data.
| Service Name | Service Instances |
|---|---|
| condition-monitoring-core | 2 |
| rule-service-app | 2 |
| rule-function-executor | 2 |
| rule-value-provider | 2 |
| rule-value-aggregator | 2 |
| rule-result-aggregator | 2 |

| Service Name | Service Instances |
|---|---|
| condition-monitoring-core | 3 |
| rule-service-app | 2 |
| rule-function-executor | 2 |
| rule-value-provider | 2 |
| rule-value-aggregator | 2 |
| rule-result-aggregator | 3 |

| Service Name | Service Instances |
|---|---|
| condition-monitoring-core | 4 |
| rule-service-app | 2 |
| rule-function-executor | 2 |
| rule-value-provider | 2 |
| rule-value-aggregator | 2 |
| rule-result-aggregator | 4 |

| Service Name | Service Instances |
|---|---|
| condition-monitoring-core | 5 |
| rule-service-app | 2 |
| rule-function-executor | 2 |
| rule-value-provider | 2 |
| rule-value-aggregator | 2 |
| rule-result-aggregator | 5 |

| Service Name | Service Instances |
|---|---|
| condition-monitoring-core | 6 |
| rule-service-app | 3 |
| rule-function-executor | 2 |
| rule-value-provider | 3 |
| rule-value-aggregator | 2 |
| rule-result-aggregator | 6 |