alive Server Design
Introduction
The alive record is not very useful by itself: it needs a server that collects the heartbeat UDP packets sent by the alive records on IOCs. The server has to process the messages and keep a database in order to make use of the information.
The typical intended configuration is many IOCs sending heartbeats to a single server. This makes the server a single point of failure. The record doesn’t support multiple packet recipients (although it could in theory), so if this is a problem, multiple records with different server targets (and different local TCP ports) can be placed on an IOC to provide server redundancy.
This document describes how the data can be used. It is based on how the author has designed a server, alived. See the Message Protocol page for the complete wire format of heartbeat and information request messages.
Heartbeat Processing
The first step is to make sure that the alive record is sending heartbeat UDP packets to the server (set by RHOST) at the expected UDP port (set by RPORT), and at the expected rate, which is determined by the HPRD period.
UDP packets are by their nature unreliable, with some getting dropped or delayed (so packets may arrive out of order). The packet handling has to allow for this.
The IP address of the sending IOC is not included in the heartbeat message itself, because there might be several active network interfaces, which makes it unclear which one will be used for sending. When the UDP packet is received, the network layer reports the IP address of the sender, and this is what should be used as the IOC IP address. The IP address alone can’t identify an IOC, as multiple IOCs can exist on one machine, which is why the IOC environment variable is used for identification.
Each IOC should have its own data structure, and these structures should be kept in a searchable container (binary tree, list, etc.) keyed by the IOC name.
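As a minimal sketch of taking the sender address from the network layer, the following uses Python's standard `socket` module. The helper names `open_heartbeat_socket` and `receive_datagram` and the buffer size are assumptions made for illustration; they are not part of alived.

```python
import socket


def open_heartbeat_socket(host: str, port: int) -> socket.socket:
    """Create an IPv4 UDP socket bound to the address the records send to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_datagram(sock: socket.socket, max_size: int = 2048):
    """Return (payload, sender_ip) for the next datagram.

    The sender address comes from recvfrom(), not from the message body.
    max_size is an assumed buffer size, not a protocol constant.
    """
    payload, sender = sock.recvfrom(max_size)
    return payload, sender[0]
```

The returned address is only an attribute of the IOC entry; the IOC name inside the message remains the identifier.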
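One possible shape for such a container, sketched in Python with a dictionary as the searchable construct. The class and field names (`IocRecord`, `IocDatabase`, `get_or_create`) are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class IocRecord:
    """Per-IOC state kept by the server (illustrative field names)."""
    name: str             # IOC name from the heartbeat; the lookup key
    ip_address: str       # sender address reported by the network layer
    incarnation: int = 0  # boot time as sent by the IOC
    heartbeat: int = 0    # last accepted heartbeat value


class IocDatabase:
    """Searchable collection of IOC records keyed by IOC name."""

    def __init__(self) -> None:
        self._records: Dict[str, IocRecord] = {}

    def get_or_create(self, name: str, ip_address: str) -> IocRecord:
        record = self._records.get(name)
        if record is None:
            record = IocRecord(name=name, ip_address=ip_address)
            self._records[name] = record
        else:
            # Keep the most recently seen sender address.
            record.ip_address = ip_address
        return record

    def __len__(self) -> int:
        return len(self._records)
```

Because the key is the name, two IOCs running on the same machine get separate entries even though they share an IP address.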
The following describes how the server should handle each field in the heartbeat message:
- Magic Number (HMAG) – Used to filter out unexpected UDP packets. A server typically allows only one number, but could handle multiple or accept all packets.
- Version of Protocol – Determines the format of subsequent fields. A server can support multiple versions or ignore all but the current one. The current value is 5.
- Incarnation – Should be recorded. Serves as both the boot time (EPICS time) and a unique session identifier. If this value changes, reinitialize the IOC’s data record as if it were new, since things may have changed between boots.
- Current Time – Should be recorded. The current EPICS time as measured by the IOC.
- Heartbeat Value (VAL) – Should be recorded. Increases by one each heartbeat. If a packet with a lower heartbeat value arrives, it should be ignored as it came out of order.
- Period – Should be recorded. The heartbeat period, used for determining failure.
- Flags – Bit flags that need to be acted on:
  - Bit 0 (Read): Set when ITRIG is set or when a record field is updated. The record wants the server to do a TCP callback to read its extra information. After a successful read, this will be cleared. If the server does not implement TCP callbacks, this bit can be ignored.
  - Bit 1 (Blocked): Set when ISUP is set. The server can’t make a callback to the alive record. This bit overrides bit 0. An IOC behind a firewall that does not allow TCP return traffic should have this permanently enabled to keep the server from endlessly trying to make a callback.
- Return Port (IPORT) – The TCP port for making callbacks. Should be recorded or passed to the callback routine. A value of 0 means the IOC could not create the callback port.
- User Message (MSG) – No set action. Should be recorded and/or acted on if used as a server flag. Multiple values or flags could be combined and might need to be split out.
- IOC Name – Should be recorded as the searchable key for the IOC data structure.
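The field handling above can be sketched as follows. This is an illustration under stated assumptions, not alived's implementation: the fields are assumed to be already decoded (the wire format is on the Message Protocol page), `EXPECTED_MAGIC` is a placeholder value that would have to match the records' HMAG, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

EXPECTED_MAGIC = 0x12345678  # placeholder; must match the HMAG in use
SUPPORTED_VERSION = 5        # current protocol version per this document


@dataclass
class Heartbeat:
    """Decoded heartbeat fields (decoding itself is not shown)."""
    magic: int
    version: int
    incarnation: int
    current_time: int
    value: int
    period: int
    flags: int
    return_port: int
    user_msg: int
    ioc_name: str


@dataclass
class IocState:
    last: Heartbeat       # last accepted heartbeat
    ip_address: str       # sender address from the network layer
    last_received: float  # server-local reception time


def process_heartbeat(db: Dict[str, IocState], hb: Heartbeat,
                      sender_ip: str, received_at: float) -> bool:
    """Update db from one heartbeat; return True if it was accepted."""
    if hb.magic != EXPECTED_MAGIC or hb.version != SUPPORTED_VERSION:
        return False  # unexpected packet or unsupported format
    state = db.get(hb.ioc_name)
    if state is None or state.last.incarnation != hb.incarnation:
        # Unknown IOC or a reboot: start from a fresh record.
        db[hb.ioc_name] = IocState(hb, sender_ip, received_at)
        return True
    if hb.value < state.last.value:
        return False  # arrived out of order; ignore
    state.last = hb
    state.ip_address = sender_ip
    state.last_received = received_at
    return True
```

This version accepts a single magic number and a single protocol version; a server that supports several would replace the equality checks with set membership.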
Failure Detection and Up/Down Times
When an IOC is turned off or crashes, the failure is not detected immediately. Detection depends on the heartbeat period and the number of missed heartbeats allowed. For an HPRD period of 15 seconds, declaring a failure after four missed heartbeats takes one minute. This is fairly conservative; if you are certain that the network between the IOCs and the server doesn’t drop many packets, the number of missed heartbeats can be reduced. The HPRD period could also be shortened, although that means more processing at the server.
The time value used to determine failure should be the time elapsed since the last accepted heartbeat was received (accepted, because heartbeats can arrive out of order), with the reception time measured locally by the server rather than taken from the IOC’s current time. Measuring server local time might seem redundant when the IOC sends its own local time, but comparing the two lets you sense packet delivery lag or any systematic difference in time (such as from time zone differences). Remember also that the EPICS time sent by IOCs is offset from Linux time by 631152000 seconds (20 years): the EPICS epoch is 1990-01-01, so the offset must be added to an EPICS time to get Linux time.
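A small sketch of the failure test and the epoch conversion described above. The function names and the default of four missed heartbeats are assumptions for illustration; only the 631152000 second offset comes from the text.

```python
EPICS_TO_UNIX_OFFSET = 631152000  # seconds between 1970-01-01 and 1990-01-01


def epics_to_unix(epics_seconds: float) -> float:
    """Convert an EPICS timestamp (seconds) to Unix time."""
    return epics_seconds + EPICS_TO_UNIX_OFFSET


def has_failed(last_received: float, period: float, now: float,
               missed_allowed: int = 4) -> bool:
    """Declare failure once `missed_allowed` heartbeat periods have passed.

    Both last_received and now are server-local times; the IOC's own
    clock is deliberately not used here.
    """
    return (now - last_received) >= period * missed_allowed


def clock_difference(ioc_current_time: float, received_at: float) -> float:
    """Server reception time minus IOC send time, both as Unix seconds.

    Includes delivery lag plus any systematic offset between the clocks.
    """
    return received_at - epics_to_unix(ioc_current_time)
```

With a 15 second period and the default threshold, `has_failed` becomes true 60 seconds after the last accepted heartbeat.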
A failure can be actively detected and acted on directly by the server, or the server can simply collect data and let polling clients determine the failures themselves (which allows for varying failure times and HPRD rates).
The up time for an active IOC is the time since the last heartbeat plus the difference between the IOC’s last current time and its incarnation time.
The down time for a failed IOC is the time since the last heartbeat.
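The two definitions can be written directly as code. This is a sketch with hypothetical function names; `now` and `last_received` are server-local times, while `ioc_current_time` and `incarnation` are the values from the last accepted heartbeat, so the difference between them does not depend on the epoch.

```python
def up_time(now: float, last_received: float,
            ioc_current_time: float, incarnation: float) -> float:
    """Seconds an active IOC has been up.

    Time since the last heartbeat, plus how long the IOC had been
    running when it sent that heartbeat.
    """
    return (now - last_received) + (ioc_current_time - incarnation)


def down_time(now: float, last_received: float) -> float:
    """Seconds a failed IOC has been down, counted from its last heartbeat."""
    return now - last_received
```

For example, an IOC whose last heartbeat arrived 10 seconds ago and reported a current time 3000 seconds after its incarnation has an up time of 3010 seconds.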
Callback Processing
The TCP callback is used to get static information from the alive record. The format of the response is described in the Information Request Message section of the protocol documentation. If the IOC was not able to create the callback port, the value of the Return Port will be 0, and a callback can’t be made.
The server can make a callback at any time, although it typically should do so when a new incarnation is seen in the heartbeat message or when the Read (Bit 0) flag is set. If the Blocked (Bit 1) flag is set, the callback will not work, as the record will not accept connections. A callback will also fail if the server tries it on an alive record that is sending heartbeats to a different server.
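The decision of whether to attempt a callback can be sketched as a small predicate. The names `FLAG_READ`, `FLAG_BLOCKED`, and `should_callback` are hypothetical; the bit positions and the meaning of a zero Return Port come from the heartbeat field descriptions.

```python
FLAG_READ = 0x1     # bit 0: record asks the server to read its information
FLAG_BLOCKED = 0x2  # bit 1: callbacks are suppressed; overrides bit 0


def should_callback(flags: int, return_port: int,
                    new_incarnation: bool) -> bool:
    """Decide whether a TCP callback is worth attempting."""
    if return_port == 0:
        return False  # the IOC could not create the callback port
    if flags & FLAG_BLOCKED:
        return False  # the record will not accept connections
    return new_incarnation or bool(flags & FLAG_READ)
```

The predicate only covers the typical triggers; a server is still free to attempt a callback at other times, as long as the port is nonzero and the Blocked flag is clear.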
The information returned is static in nature (and the server doesn’t really need it for running), so it should be recorded in a data structure, attached to the IOC data entry.