Required monitoring :: Nexeed Learning Portal

Logs

Alert rules may use 'JSONPath' to address values in structured logs. A JSONPath expression is indicated by a leading '$'.

Table 1. Required log monitoring
Message	Logger	Alert rule	Context	Symptoms	Solution
Service crashed	LIFE-CYCLE	(($stackTraces[*].failureType = 'liquibase.exception.LockException') > 0) in one minute	Database migration	Pod does not start up / pod restarts	Check whether the Kubernetes deployment listed in the 'LOCKEDBY' column of table 'CM_CORE_DATABASECHANGELOGLOCK' or 'RM_DATABASECHANGELOGLOCK' is running. If it is running, wait for the migration to complete. If it is not, delete the entry from the database.
Health status is [DOWN]. <further details>	LIFE-CYCLE	(($status = 'UNHEALTHY') > 0) in one minute	System availability	Loss of functionality, loss of data, system unavailability	Inspect the further details of the message. They contain a detailed report of the problem.
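
The two alert rules in Table 1 can be read as "count the matching log entries in a one-minute window and alert if the count is greater than zero". The following Python sketch illustrates that reading only. It assumes that each log entry is already parsed into a dictionary and that the window has been cut elsewhere; the function names and the sample entries are hypothetical, and the actual alerting tool may evaluate JSONPath differently.

```python
from typing import Any

LOCK_EXCEPTION = "liquibase.exception.LockException"


def lock_failures(entry: dict[str, Any]) -> int:
    """Count stack traces in one entry that match $stackTraces[*].failureType."""
    traces = entry.get("stackTraces") or []
    return sum(1 for trace in traces if trace.get("failureType") == LOCK_EXCEPTION)


def fired_rules(window: list[dict[str, Any]]) -> set[str]:
    """Evaluate both Table 1 rules over the entries of a one-minute window."""
    fired: set[str] = set()
    if sum(lock_failures(entry) for entry in window) > 0:
        fired.add("Service crashed")
    if sum(1 for entry in window if entry.get("status") == "UNHEALTHY") > 0:
        fired.add("Health status is [DOWN]")
    return fired


# Hypothetical entries (assumption): real log records carry more fields.
sample_window = [
    {"status": "UNHEALTHY"},
    {"stackTraces": [{"failureType": LOCK_EXCEPTION}]},
]
print(sorted(fired_rules(sample_window)))
```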

OpenTelemetry

For the OpenTelemetry configuration, see the general IAS Operations Manual. The OpenTelemetry integration is described there under the logging monitoring concepts.

Monitoring RabbitMQ queues

Below are the recommended thresholds for monitoring RabbitMQ queues used by Condition Monitoring services.

RabbitMQ Queue Monitoring Thresholds
Queue	MaxLength	Alert Threshold (Upper Limit)
q.cm.core.opp.v09.machineEquipment.measurementTimeSeries.v09	1000	750
q.cm.core.opp.v09.machineEquipment.machine.v09	1000	750
q.cm.core.ppmp.v2.message.measurement	1000	750
q.cm.core.ruleResult.positive	1000	750
q.cm.rs.ppmp.machineMsg.enriched	1000	750
q.cm.rs.ppmp.measurement.enriched	1000	750
q.cm.sfe.machineRuleExecutionMsg.received	50000	35000
q.cm.sfe.measurementRuleExecutionMsg.received	50000	35000
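
As an illustration, the thresholds above can be checked against current queue depths as follows. This is a minimal Python sketch: the threshold values are copied from the table, while the function name, the `depths` input, and the choice to alert when a depth reaches or exceeds the threshold are assumptions. How the depths are obtained (for example from the RabbitMQ management interface) is left out.

```python
# Alert thresholds (upper limits) copied from the table above.
ALERT_THRESHOLDS = {
    "q.cm.core.opp.v09.machineEquipment.measurementTimeSeries.v09": 750,
    "q.cm.core.opp.v09.machineEquipment.machine.v09": 750,
    "q.cm.core.ppmp.v2.message.measurement": 750,
    "q.cm.core.ruleResult.positive": 750,
    "q.cm.rs.ppmp.machineMsg.enriched": 750,
    "q.cm.rs.ppmp.measurement.enriched": 750,
    "q.cm.sfe.machineRuleExecutionMsg.received": 35000,
    "q.cm.sfe.measurementRuleExecutionMsg.received": 35000,
}


def queues_over_threshold(depths: dict[str, int]) -> list[str]:
    """Return the monitored queues whose depth reaches or exceeds the threshold.

    `depths` maps queue name to current message count (hypothetical input).
    Queues that are not listed in the table are ignored.
    """
    return sorted(
        queue
        for queue, depth in depths.items()
        if queue in ALERT_THRESHOLDS and depth >= ALERT_THRESHOLDS[queue]
    )


# Hypothetical snapshot of queue depths.
snapshot = {
    "q.cm.core.ruleResult.positive": 800,
    "q.cm.sfe.machineRuleExecutionMsg.received": 1200,
}
print(queues_over_threshold(snapshot))
```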

Horizontal pod scaling guidance

Currently, automatic pod scaling is not enabled for Condition Monitoring services. It is strongly recommended to monitor the CPU and memory usage of each pod:

Trigger an alert if CPU or memory usage of a pod exceeds 80%.
When such alerts are triggered, evaluate scaling the affected service horizontally by increasing the number of pods to ensure continued performance and reliability.
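
The 80% rule can be sketched in Python as shown below. The sketch assumes that usage is measured relative to the configured resource limit of the pod, which the text above does not specify; the function name, the units, and the sample values are hypothetical.

```python
USAGE_THRESHOLD = 0.80  # 80%, as recommended above


def exceeds_threshold(used: float, limit: float, threshold: float = USAGE_THRESHOLD) -> bool:
    """Return True if `used` is more than `threshold` of `limit`.

    Assumption: usage is compared against the pod's resource limit.
    `used` and `limit` must share a unit (for example millicores or MiB).
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return used / limit > threshold


def pod_needs_attention(cpu_used: float, cpu_limit: float,
                        mem_used: float, mem_limit: float) -> bool:
    """Alert if either CPU or memory usage exceeds the threshold."""
    return exceeds_threshold(cpu_used, cpu_limit) or exceeds_threshold(mem_used, mem_limit)


# Hypothetical pod: 850 of 1000 millicores, 1024 of 2048 MiB.
print(pod_needs_attention(850, 1000, 1024, 2048))
```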

Based on reference customer data (scale factor 1), load tests were performed to determine appropriate scaling for each service. The following tables provide guidance for scaling services according to system load and scale factor.

Table 2. Message and Rule Amount by Scale Factor
Measurement msg/sec (Queue: q.cm.core.opp.v09.machineEquipment.measurementTimeSeries.v09)	Machine msg/sec (Queue: q.cm.core.opp.v09.machineEquipment.machine.v09)	All Devices	Active Devices Sending Measurement	All Rules (active measurement rule, active machine rule)	Measurement Rules Considered for Rule Execution	Machine Rules Considered for Rule Execution	Scale Factor
1500	10	870	433	1830 (209, 1009)	133	5	1
3000	20	1740	866	3660 (418, 2018)	266	10	2
4500	30	2610	1299	5490 (627, 3027)	399	15	3
6000	40	3480	1732	7320 (836, 4036)	532	20	4
7500	50	4350	2165	9150 (1045, 5045)	665	25	5
9000	60	5220	2598	10980 (1254, 6054)	798	30	6
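
Because the message rates in Table 2 grow linearly with the scale factor (1500 measurement messages and 10 machine messages per second at scale factor 1), an observed load can be mapped to a scale factor by rounding up. The following Python sketch shows one way to do this; the function name and the rule of taking the larger of the two results are assumptions, not part of the load-test procedure.

```python
import math

# Rates at scale factor 1, taken from Table 2.
BASE_MEASUREMENT_RATE = 1500  # measurement msg/sec
BASE_MACHINE_RATE = 10        # machine msg/sec
MAX_TESTED_SCALE_FACTOR = 6   # highest scale factor listed in Table 2


def estimate_scale_factor(measurement_rate: float, machine_rate: float) -> int:
    """Return the smallest scale factor whose rates cover the observed load.

    Assumption: the more demanding of the two rates decides the result.
    """
    factor = max(
        1,
        math.ceil(measurement_rate / BASE_MEASUREMENT_RATE),
        math.ceil(machine_rate / BASE_MACHINE_RATE),
    )
    if factor > MAX_TESTED_SCALE_FACTOR:
        raise ValueError("load is beyond the scale factors listed in Table 2")
    return factor


# Hypothetical load: 4000 measurement msg/sec and 25 machine msg/sec.
print(estimate_scale_factor(4000, 25))
```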

The following tables show the recommended number of service instances for each scale factor, based on the CPU and memory requests and limits of the services. Adjust these values as needed based on observed system performance and monitoring data.

Table 3. Service Scaling Reference (Scale Factor 1 and 2)
Service Name	Service Instances
condition-monitoring-core	2
rule-service-app	2
rule-function-executor	2
rule-value-provider	2
rule-value-aggregator	2
rule-result-aggregator	2

Table 4. Service Scaling Reference (Scale Factor 3)
Service Name	Service Instances
condition-monitoring-core	3
rule-service-app	2
rule-function-executor	2
rule-value-provider	2
rule-value-aggregator	2
rule-result-aggregator	3

Table 5. Service Scaling Reference (Scale Factor 4)
Service Name	Service Instances
condition-monitoring-core	4
rule-service-app	2
rule-function-executor	2
rule-value-provider	2
rule-value-aggregator	2
rule-result-aggregator	4

Table 6. Service Scaling Reference (Scale Factor 5)
Service Name	Service Instances
condition-monitoring-core	5
rule-service-app	2
rule-function-executor	2
rule-value-provider	2
rule-value-aggregator	2
rule-result-aggregator	5

Table 7. Service Scaling Reference (Scale Factor 6)
Service Name	Service Instances
condition-monitoring-core	6
rule-service-app	3
rule-function-executor	2
rule-value-provider	3
rule-value-aggregator	2
rule-result-aggregator	6
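
For scripted use, Tables 3 to 7 can be combined into a single lookup. The Python sketch below copies the instance counts from those tables; the data layout and the function name are hypothetical, and the values remain starting points to be adjusted against monitoring data.

```python
# Recommended instances per service; list index 0 is scale factor 1,
# index 5 is scale factor 6 (values copied from Tables 3 to 7).
RECOMMENDED_INSTANCES = {
    "condition-monitoring-core": [2, 2, 3, 4, 5, 6],
    "rule-service-app":          [2, 2, 2, 2, 2, 3],
    "rule-function-executor":    [2, 2, 2, 2, 2, 2],
    "rule-value-provider":       [2, 2, 2, 2, 2, 3],
    "rule-value-aggregator":     [2, 2, 2, 2, 2, 2],
    "rule-result-aggregator":    [2, 2, 3, 4, 5, 6],
}


def recommended_instances(scale_factor: int) -> dict[str, int]:
    """Return the recommended instance count per service for a scale factor."""
    if not 1 <= scale_factor <= 6:
        raise ValueError("the tables only cover scale factors 1 to 6")
    return {
        service: counts[scale_factor - 1]
        for service, counts in RECOMMENDED_INSTANCES.items()
    }


for service, count in recommended_instances(4).items():
    print(f"{service}: {count}")
```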