The concept and conventions

SOT is a distributed system consisting of multiple self-contained modules that function together. The concept focuses on the centralized operation of many systems. One key to controlling and observing such a system is to treat specific logs as a kind of API that every operated system has to implement. It follows that everything which enables support, operations, and monitoring must be logged in a homogeneous and standardized way across all SOT components.

All components of the system send metrics, traces, and logs to a log aggregator (for example, the ELK stack) allowing us to:

provide information as one source for monitoring the behavior and health from an inner application perspective
record failures
debug the application if necessary, e.g., while reproducing reported failures
track usage statistics of APIs
record security-relevant events

General logging conventions

The following are the guidelines implemented by each application module regarding logging:

No request-based logging per default
Request-based logging is avoided. It can be enabled only temporarily and only for specific components. Anything else is logged in an aggregated way.
Technology-independent log pattern
All applications log in the same manner, independent of used programming languages or frameworks.
Configurable log level
The log level is treated as a configuration. The log level can be configured externally.
Environment independent logging
The log is independent of the runtime environment, so it does not matter whether the application runs on Azure, on-premise, or elsewhere.
Log analysis tool independent logging
Supporting a new log aggregator does not trigger a release cascade of all applications. In other words, logging is independent of the log analysis tool.
Single line messages
Logs are written as single-line formatted JSON - no line breaks.
Log to stdout
Logs are written to stdout for collection.
Log in English
Log messages presented to the end user are written in English.
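
The conventions above (single-line JSON, output on stdout, externally configured log level) can be illustrated with a minimal Python sketch. It is not the SOT implementation: the environment variable name `LOG_LEVEL` and the helper names are assumptions made for this example only.

```python
import json
import logging
import os
import sys


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object on a single line."""

    def format(self, record: logging.LogRecord) -> str:
        # Python names the level WARNING; the conventions use WARN.
        level = "WARN" if record.levelname == "WARNING" else record.levelname
        payload = {
            "loglevel": level,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # json.dumps escapes embedded line breaks, so the output stays on one line.
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes single-line JSON to stdout."""
    # LOG_LEVEL is a hypothetical variable name; WARN mirrors the SOT default.
    # Python has no built-in TRACE level; it would have to be registered separately.
    level_name = os.environ.get("LOG_LEVEL", "WARN").upper()
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonLineFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(level_name)
    logger.propagate = False
    return logger
```

Because the level is read from the environment, it can be changed at deployment time without rebuilding the application.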

Log levels

Log levels are configurable and can be one of the following. The default log level for SOT modules is WARN unless it is defined differently.

TRACE - Anything at statement level such as start to calculate something, query something, loop x of y, etc… is logged at TRACE level.
DEBUG - Anything at the method or request processing level is logged at DEBUG level, e.g.,
- method entered or returned with …
- HTTP GET /some resource took 5 seconds with status 200, but also 4xy statuses
INFO - In the following cases, the information is logged at INFO level, e.g.,
- Process life cycle messages, such as
  - application started
  - application ready
  - shutdown application
- Administrative events, such as
  - (re)load or flush caches
  - (re)load configuration
  - database migrations
- handled exceptions, which do not require any additional investigations, such as
  - connection refused, try reconnect in X seconds
WARN - Uncommon behavior or situations, e.g.,
- unavailability of services
- security related events (potentially malicious)
- handled failures that temporarily prevent the application from working, as long as the application can recover automatically
ERROR - Anything which potentially leads to unexpected behavior or crashes, e.g.,
- unhandled failures
- implementation failures
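
As an illustration of the level scheme above, the following Python sketch maps the five level names to numeric levels and falls back to WARN when nothing is configured. Python's `logging` module has no TRACE level, so the numeric value 5 and the function name `resolve_level` are assumptions of this sketch.

```python
import logging
from typing import Optional

# Assumed numeric value: anything below DEBUG (10) would work.
TRACE = 5
logging.addLevelName(TRACE, "TRACE")

SOT_LEVELS = {
    "TRACE": TRACE,
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERROR": logging.ERROR,
}


def resolve_level(configured: Optional[str]) -> int:
    """Map a configured level name to a numeric level, defaulting to WARN."""
    if configured is None:
        return SOT_LEVELS["WARN"]
    name = configured.strip().upper()
    if name not in SOT_LEVELS:
        raise ValueError(f"unsupported log level: {configured!r}")
    return SOT_LEVELS[name]


logger = logging.getLogger("level-demo")
logger.setLevel(resolve_level("trace"))
# With TRACE configured, statement-level messages pass the logger's level check.
trace_enabled = logger.isEnabledFor(TRACE)
```

Rejecting unknown names instead of silently falling back makes a misconfigured level visible at startup.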

Configuration of default log levels during deployment

The log level for all SOT modules can be configured in a standardized way during the Helm deployment. The default log level is WARN. All modules have two logging parameters:

default - This parameter is used as a fallback for all logging configurations.
application - This parameter is used for the logs that are written by the SOT applications. These normally include only logs from the business logic and exclude logs from frameworks.

Some modules define further logging parameters (e.g., for different frameworks), which are documented in the module's operations manual. The log levels can be configured during the Helm deployment via a Helm override file. The log level parameters can be set per module or globally, which affects all SOT modules. Example:

global:
  logging:
    default: "ERROR"
    application: "INFO"
macma:
  local:
    logging:
      default: "INFO"
      application: "TRACE"
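
The effect of such an override file can be sketched in Python. The lookup order used here (module-local value first, then the global value, then WARN) is an assumption of this sketch, as are the function name and the module name `other`; the actual resolution happens in the Helm charts.

```python
from typing import Any, Mapping

DEFAULT_LEVEL = "WARN"


def effective_log_level(values: Mapping[str, Any], module: str, parameter: str) -> str:
    """Return the level of `parameter` ("default" or "application") for a module."""
    local = values.get(module, {}).get("local", {}).get("logging", {})
    global_ = values.get("global", {}).get("logging", {})
    # Assumed precedence: module-local value, then global value, then WARN.
    return local.get(parameter) or global_.get(parameter) or DEFAULT_LEVEL


# The override file from the example above, expressed as a dictionary.
values = {
    "global": {"logging": {"default": "ERROR", "application": "INFO"}},
    "macma": {"local": {"logging": {"default": "INFO", "application": "TRACE"}}},
}

macma_application = effective_log_level(values, "macma", "application")
other_default = effective_log_level(values, "other", "default")
```

Under these assumptions, `macma` logs its application code at TRACE, while a module without local settings inherits ERROR and INFO from the global section.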

Information of interest

This section describes the information that the system logs. General information is logged with any log event. Depending on the case, additional information enriches the general information. For example, in the case of request-based logs, a correlationId is logged. The following sections describe other information in detail.

General information

Attribute	Description	Information Origin	Required
`version`	The version of the application, as specified during the build.	Build time configuration.	Yes.
`timestamp`	The time when the log was emitted as ISO 8601 formatted timestamp.	System time.	Yes.
`thread`	The name of the thread that generated the log.	Process.	No.
`system`	The name of the system, e.g. `DEV`, `TEST`, `PROD`, etc…	Environment variable `NEXEED_GLOBAL_SYSTEM_NAME`.	Yes.
`stackTraces`	Stack traces of caught exceptions.	Source code.	No.
`product`	The name of the product to which the application belongs.	Build time / runtime configuration.	Yes.
`loglevel`	One of `TRACE`, `DEBUG`, `INFO`, `WARN`, `ERROR`.	Source code.	Yes.
`logger`	The name of the logger.	Source code.	Yes.
`instance`	The name or identifier of the instance.	Environment variable `NEXEED_GLOBAL_APPLICATION_INSTANCE_ID`.	Yes.
`environment`	The name of the environment, e.g. `AZURE`, `ONPREM@<CUSTOMER>`.	Environment variable `NEXEED_GLOBAL_ENVIRONMENT_NAME`.	Yes.
`application`	The name of the application, as specified during the build.	Build time configuration.	Yes.
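
A minimal Python sketch of how the required general attributes from the table could be assembled for one log event. The three environment variable names come from the table; the function name, the `"unknown"` fallback, and the sample values are assumptions for illustration.

```python
import json
import os
from datetime import datetime, timezone


def general_attributes(logger: str, loglevel: str, *, application: str,
                       version: str, product: str) -> dict:
    """Collect the required general attributes for one log event."""
    return {
        # ISO 8601 timestamp in UTC, taken from the system time.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "loglevel": loglevel,
        "logger": logger,
        "application": application,  # build time configuration
        "version": version,          # build time configuration
        "product": product,          # build time / runtime configuration
        # "unknown" is an assumed fallback; the table marks these values as required.
        "system": os.environ.get("NEXEED_GLOBAL_SYSTEM_NAME", "unknown"),
        "instance": os.environ.get("NEXEED_GLOBAL_APPLICATION_INSTANCE_ID", "unknown"),
        "environment": os.environ.get("NEXEED_GLOBAL_ENVIRONMENT_NAME", "unknown"),
    }


# Hypothetical sample values, serialized as a single JSON line.
event = general_attributes("com.example.Demo", "INFO", application="demo-app",
                           version="1.0.0", product="demo-product")
line = json.dumps(event)
```

The optional attributes `thread` and `stackTraces` are left out here and would be added only when they apply.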

Request / event-based information

Attribute	Description	Information Origin	Required
`correlationId`	The request related correlationId.	E.g. request header.	Yes, if available.
`trace_id`	`trace_id` as specified by W3CTraceContext specification and recommended by the OpenTelemetry Logging specification for JSON logs.	E.g. request header.	Yes, if available.
`span_id`	`span_id` as specified by W3CTraceContext specification and recommended by the OpenTelemetry Logging specification for JSON logs.	E.g. request header.	Yes, if available.
`tenantId`	The tenant in whose context the request was processed.	E.g. access token.	Yes, as soon as the information is available.
`userId`	The user in whose context the request was processed.	E.g. access token.	Yes, as soon as the information is available.

The goal is to support tracing as specified by the W3CTraceContext, OpenTelemetry/Tracing, and OpenTelemetry/Logging specifications. Since not all components are migrated simultaneously, the correlationId header is still supported. As soon as all components support the standard tracing approach, the support for the correlationId will be removed.
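
To illustrate the transition described above, the following Python sketch extracts `trace_id` and `span_id` from a W3C Trace Context `traceparent` header and keeps the correlationId as a fallback. It is simplified: it accepts only the four-field layout, and the header name `x-correlation-id` is hypothetical because the source does not name it. The ID taken from an incoming header is the caller's span; a service that starts its own span would log that span's ID instead.

```python
import re
from typing import Mapping, Optional, Tuple

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (trace_id, span_id) from a traceparent header, or None if invalid."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, _flags = match.groups()
    # Version "ff" and all-zero IDs are invalid according to the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id


def tracing_fields(headers: Mapping[str, str]) -> dict:
    """Build the request-based log attributes from request headers."""
    fields = {}
    parsed = parse_traceparent(headers.get("traceparent"))
    if parsed:
        fields["trace_id"], fields["span_id"] = parsed
    # Hypothetical header name for the legacy correlationId.
    if "x-correlation-id" in headers:
        fields["correlationId"] = headers["x-correlation-id"]
    return fields
```

Attributes are added only when the corresponding header is present, which matches the "Yes, if available" requirement.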

LIFE-CYCLE log

The following information is logged to know when an application started, stopped, and is ready.

Attribute	Description	Required	Default
`logger`	`LIFE-CYCLE`	Yes.	N/A
`loglevel`	`INFO`	Yes.	N/A
`status`	One of: `STARTED` `INITIALIZATION` `READY` `UNHEALTHY` `STOPPING` `CRASHING` `RESTARTING`	Yes.	N/A
`message`	An optional reason phrase.	No.	N/A
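
A LIFE-CYCLE event built from the table above might look as follows in Python. This is a sketch only: the function name is an assumption, and the general attributes that accompany every log event are omitted for brevity.

```python
import json
from typing import Optional

LIFE_CYCLE_STATUSES = {
    "STARTED", "INITIALIZATION", "READY", "UNHEALTHY",
    "STOPPING", "CRASHING", "RESTARTING",
}


def life_cycle_event(status: str, message: Optional[str] = None) -> str:
    """Return a single-line JSON LIFE-CYCLE event for the given status."""
    if status not in LIFE_CYCLE_STATUSES:
        raise ValueError(f"unknown life-cycle status: {status!r}")
    event = {"logger": "LIFE-CYCLE", "loglevel": "INFO", "status": status}
    # The reason phrase is optional and only included when provided.
    if message:
        event["message"] = message
    return json.dumps(event)


ready_line = life_cycle_event("READY", "application ready")
```

Validating the status against the fixed set keeps the events machine-readable for monitoring.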

SECURITY logs

Security-related events which are logged are, for example:

call to non existing endpoints
rejected calls, e.g., caused by failed input validation
unauthorized calls
signature verification failures
usage of expired tokens

Attribute	Description	Required	Default
`logger`	`SECURITY`	Yes.	N/A
`loglevel`	One of: `INFO` `WARN`	Yes.	N/A
`correlationId`		Yes, as soon as the information is available.	N/A
`trace_id`		Yes, as soon as the information is available.	N/A
`span_id`		Yes, as soon as the information is available.	N/A
`tenantId`		Yes, as soon as the information is available.	N/A
`userId`		Yes, as soon as the information is available.	N/A
`sourceIp`		Yes, as soon as the information is available.	N/A
`event`	What happened, e.g. `LOGIN_FAILED`, `LOGIN_SUCCESSFUL`, `TOKEN_EXPIRED`, `TOKEN_SIGNATURE_INVALID`, `TOKEN_OUTDATED`, `TOKEN_VALIDATION_FAILED`, `SENSITIVE_DATA_ACCESSED`, `INPUT_VALIDATION_FAILED`, `TLS_DISABLED`	Yes.	N/A
`message`	An optional reason phrase.	No.	N/A

Specific SECURITY events

Log level	Event	Information Origin	Condition
`WARN`	`TLS_DISABLED`	Environment Variable: `NEXEED_GLOBAL_DISABLE_TLS`.	If this variable is set to `true`, this must be logged.
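
The rule in the table can be sketched as a startup check in Python. The function name and the message text are assumptions, as is treating the value case-insensitively; the source only states that the value `true` must trigger the log.

```python
import json
import os
from typing import Mapping, Optional


def tls_disabled_event(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a SECURITY log line if TLS is disabled, otherwise None."""
    # Assumption: "true" is matched case-insensitively and surrounding spaces are ignored.
    if env.get("NEXEED_GLOBAL_DISABLE_TLS", "").strip().lower() != "true":
        return None
    return json.dumps({
        "logger": "SECURITY",
        "loglevel": "WARN",
        "event": "TLS_DISABLED",
        "message": "TLS is disabled via NEXEED_GLOBAL_DISABLE_TLS",
    })


# Passing the environment as a parameter keeps the check easy to test.
line = tls_disabled_event({"NEXEED_GLOBAL_DISABLE_TLS": "true"})
```

An application would typically run such a check once during startup and write the returned line to stdout.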

OpenTelemetry integration

The following section describes the configuration and usage of SOT in combination with OpenTelemetry (OTEL).

Configuration of default OpenTelemetry during deployment

The OpenTelemetry agents of all SOT modules that support OpenTelemetry can be configured in a standardized way during the Helm deployment.

The proper environment variables for OTEL agent are injected automatically in all pods via a configmap (otel-observability-configmap), a secret (otel-observability-secret) and a pod environment variable (OTEL_SERVICE_NAME).

One can disable the automatic injection by setting the following variable in the module local section:

<module>:
  local:
    observability:
      otelAutoInjectEnvParams: false

In the observability node of the global section you can define defaults for all modules supporting OpenTelemetry. You can enable or disable the exporter or the feature flags for tracing, metrics, or logging. The default for all settings is disabled. You can overwrite the global configuration by module-specific configuration in the local section of the module.

global:
  observability:
    otlpEnabled: true
    otlpTracingEnabled: true
    otlpUrl: "<yourUrl>"
    otlpHeaders: "<headerContainingYourAuthenticationSecrets>"
macma:
  local:
    observability:
      otlpTracingEnabled: false

The following settings can be configured under the observability key in both the global section of the umbrella chart and the local dictionary of a module. The local dictionary takes precedence over the global one:

Parameter	Required	Description
observability.otlpUrl	yes	The URL endpoint for the OpenTelemetry Protocol (OTLP) collector.
observability.otlpEnabled	no	Flag to enable or disable the OpenTelemetry Protocol (OTLP) exporter.
observability.otlpTracingEnabled	no	Flag to enable or disable tracing for OpenTelemetry Protocol (OTLP).
observability.otlpMetricEnabled	no	Flag to enable or disable metrics for OpenTelemetry Protocol (OTLP).
observability.otlpLoggingEnabled	no	Flag to enable or disable logging for OpenTelemetry Protocol (OTLP).
observability.otlpHeader	yes if otel collector/endpoint requires authentication	The header of the messages for OpenTelemetry Protocol (OTLP). Will be mapped to a secret as it might contain the Authorization Secrets.
observability.otelResourceAttributes	no	A comma-separated list of key-value pairs to define resource attributes, which describe the entity producing telemetry data. Example: 'deployment.environment=<myEnvironmentKey>'
observability.otlpProtocol	no	Protocol used for otel communication - one of grpc, http/protobuf or http/json
observability.tracesSampler	no	Sampler to be used for traces
observability.tracesSamplerArg	no	String value to be used as the sampler argument
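
The precedence rule (module-local settings override global ones, and flags default to disabled) can be illustrated with a small Python sketch. It only models the merge logic; the function name and the module name `other` are assumptions, and the real resolution is performed by the Helm charts.

```python
from typing import Any, Dict, Mapping

# The enable flags default to disabled; other keys have no default in this sketch.
FLAG_DEFAULTS = {
    "otlpEnabled": False,
    "otlpTracingEnabled": False,
    "otlpMetricEnabled": False,
    "otlpLoggingEnabled": False,
}


def effective_observability(values: Mapping[str, Any], module: str) -> Dict[str, Any]:
    """Merge defaults, global settings, and module-local settings in that order."""
    merged: Dict[str, Any] = dict(FLAG_DEFAULTS)
    merged.update(values.get("global", {}).get("observability", {}))
    # Local values are applied last, so they take precedence over global ones.
    merged.update(values.get(module, {}).get("local", {}).get("observability", {}))
    return merged


values = {
    "global": {"observability": {
        "otlpEnabled": True,
        "otlpTracingEnabled": True,
        "otlpUrl": "<yourUrl>",
    }},
    "macma": {"local": {"observability": {"otlpTracingEnabled": False}}},
}

macma_settings = effective_observability(values, "macma")
other_settings = effective_observability(values, "other")
```

With these values, tracing is switched off for `macma` only, while every other module inherits the global tracing flag and collector URL.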

For more information about the values that can be passed in these variables, consult the official OTEL agent documentation.