Failure handling
Some of the failure cases are taken from the official RabbitMQ site.
What Can Fail?
Messaging-based systems are distributed by definition and can fail in different, and sometimes subtle, ways.
Network connection problems and congestion are probably the most common class of failure. Not only can networks fail, but firewalls can also interrupt connections they consider idle, and network failures take time to detect.
In addition to connectivity failures, server and client applications can suffer hardware failures or software crashes at any time. Even when client applications keep running, logic errors can cause channel or connection errors that force the client to establish a new channel or connection and recover from the problem.
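The recovery step described above, establishing a new connection after a failure, can be sketched as a retry loop with exponential backoff. This is a minimal illustration, not the API of any particular client library: `connect_with_retry` and its parameters are hypothetical names, and `connect` is assumed to be a zero-argument callable that returns a connection object or raises `ConnectionError` (or another `OSError`).

```python
import random
import time


def connect_with_retry(connect, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call `connect` until it succeeds or `max_attempts` is exhausted.

    `connect` is a hypothetical zero-argument callable that returns a new
    connection object or raises ConnectionError (or another OSError).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            # Exponential backoff with jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

The jitter keeps many clients from reconnecting at the same instant after a broker restart. Some client libraries provide their own connection recovery, in which case a hand-written loop like this may be unnecessary.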
This list of failures is, of course, far from exhaustive. It does not cover more subtle failures such as omission failures (failure to respond in a predictable amount of time), performance degradation, or malicious or buggy applications that exhaust system resources. Those failures can be detected with monitoring, metrics and health checks.
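As a rough illustration of the simplest kind of health check, the sketch below tests whether a TCP connection to the broker can be opened within a time limit. The function name `broker_reachable` is hypothetical, and the default port 5672 assumes a broker listening on the standard AMQP port. A successful result only shows that the port accepts connections in time; it says nothing about whether the broker is otherwise healthy.

```python
import socket


def broker_reachable(host="localhost", port=5672, timeout=2.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection raises OSError on refusal, unreachable host or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The explicit timeout is what turns a slow or silent peer, an omission failure, into a detectable result instead of an indefinite wait.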