High Availability and Reliability


High Availability through Server Redundancy and Failover

One of the major requirements that InterMail fulfills is high availability, which means that the system must be available and functioning properly at all times.

Figure B: the recommended high availability and reliability features for each of the server types.

Server Redundancy
Server redundancy simply means that more than one server is made available to fulfill a function. The first and third axes of Figure A (MTAs and client access servers), as well as the directory, employ server redundancy. The system is configured with more machines than are needed to handle the expected traffic, so that if one machine fails, the remaining machines absorb its load until it is brought back online. Load is distributed among the machines by DNS using a round-robin mechanism. When a machine fails, it is removed from the DNS rotation, and it is reintroduced when it is brought back online.
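The rotation behavior described above can be illustrated with a small in-memory model. This is only a sketch: it does not talk to a real DNS server, and the class name RoundRobinPool, its methods, and the host names are hypothetical, not part of InterMail.

```python
class RoundRobinPool:
    """Toy model of a DNS round-robin rotation (not a real DNS server)."""

    def __init__(self, hosts):
        self._all = list(hosts)      # every configured machine
        self._active = list(hosts)   # machines currently in the rotation
        self._next = 0               # counter used to pick the next answer

    def mark_failed(self, host):
        # A failed machine is taken out of the rotation.
        if host in self._active:
            self._active.remove(host)

    def mark_recovered(self, host):
        # A repaired machine is put back into the rotation.
        if host in self._all and host not in self._active:
            self._active.append(host)

    def resolve(self):
        # Hand out the active machines one after another.
        if not self._active:
            raise RuntimeError("no machines left in the rotation")
        host = self._active[self._next % len(self._active)]
        self._next += 1
        return host


pool = RoundRobinPool(["mta1", "mta2", "mta3"])
first_round = [pool.resolve() for _ in range(3)]
pool.mark_failed("mta2")
during_outage = {pool.resolve() for _ in range(4)}
pool.mark_recovered("mta2")
after_repair = {pool.resolve() for _ in range(3)}
```

While mta2 is marked as failed, every lookup is answered with one of the two remaining machines, which is why the pool needs spare capacity beyond the expected traffic.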

Failover
Failover is employed for the MSS machines, which host data that must be accessible at all times. To ensure continuous access, "hot spare" machines are deployed. If an MSS machine fails, the hot spare assumes the network identity of the failed machine and accesses its disk array through the second port.

Failover is provided by third-party software and is not part of InterMail itself. There are two facets to supporting failover. The first is monitoring of the hardware so that failures can be detected. The second is a set of scripts that are executed if a failure occurs. These scripts switch over to a spare machine that is kept partially booted, which allows the spare to assume the IP address of the failed machine. The spare machine assumes the identity of the failed machine, connects to the disk array through a second port, and appears on the network in place of the MSS that failed.
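The two facets, monitoring and switch-over scripts, can be sketched as follows. This is a simplified illustration and not the third-party product's behavior: the FailoverMonitor class, the threshold of three consecutive missed checks, and the step names are all assumptions made for the example, and the "scripts" only record what they would do.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverMonitor:
    threshold: int = 3        # assumed: consecutive failed checks before failover
    misses: int = 0
    failed_over: bool = False
    actions: list = field(default_factory=list)

    def record_check(self, healthy: bool) -> None:
        """Facet 1: track health checks of the primary MSS machine."""
        if self.failed_over:
            return  # the spare has already taken over
        self.misses = 0 if healthy else self.misses + 1
        if self.misses >= self.threshold:
            self._run_failover_scripts()

    def _run_failover_scripts(self) -> None:
        """Facet 2: the switch-over steps (recorded here, not executed)."""
        self.actions += [
            "assume_ip_address_of_failed_mss",
            "attach_disk_array_via_second_port",
            "start_serving_as_mss",
        ]
        self.failed_over = True


monitor = FailoverMonitor(threshold=3)
for healthy in [False, False, True, False, False]:
    monitor.record_check(healthy)
still_primary = not monitor.failed_over   # a single good check reset the count
monitor.record_check(False)               # third consecutive miss
```

Requiring several consecutive misses is one common way to avoid switching over on a single dropped check; the right threshold depends on the monitoring software in use.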

Note: While Software.com recommends the use of a failover mechanism, it is not required by InterMail.

Reliability through Disk Mirroring, Journaling, and Online Backup

Messaging data must be recoverable in the event of a disaster such as a hardware crash. Although they are not required, Software.com highly recommends server disk mirroring, journaling, and online backups. The following is an overview of each:

Disk Mirroring
Disk mirroring protects data integrity; all critical disks are mirrored to ensure that data will always be accessible. In the event of a disk failure, the mirror is immediately placed into service, providing uninterrupted access to the data. "Hot spare" disks in the storage array are brought up to date with their mirrors. With this mechanism, single points of disk failure are eliminated.

Journaling
InterMail uses file-level and application-level journaling to provide rapid and graceful system recovery. A journaling file system enables quick recovery of the file system used for storing message bodies. Application-level journaling enables the recovery procedure to roll the file system and the database forward to the most recent consistent state, ensuring that no message transactions are lost.
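Rolling forward to the most recent consistent state can be illustrated with a generic journal replay. The journal format below (put and commit entries grouped by a transaction id) and the function name roll_forward are assumptions for the example; the source does not describe InterMail's actual journal layout.

```python
def roll_forward(snapshot, journal):
    """Replay committed journal entries on top of a saved state.

    Assumed entry shapes:
      {"op": "put", "txn": id, "key": k, "value": v}
      {"op": "commit", "txn": id}
    Writes from transactions that never committed are not applied,
    so the result is the last consistent state.
    """
    state = dict(snapshot)          # do not modify the caller's snapshot
    pending = {}                    # txn id -> writes waiting for a commit
    for entry in journal:
        if entry["op"] == "put":
            pending.setdefault(entry["txn"], []).append(
                (entry["key"], entry["value"])
            )
        elif entry["op"] == "commit":
            for key, value in pending.pop(entry["txn"], []):
                state[key] = value
    return state


snapshot = {"msg1": "stored"}
journal = [
    {"op": "put", "txn": 1, "key": "msg2", "value": "stored"},
    {"op": "commit", "txn": 1},
    {"op": "put", "txn": 2, "key": "msg3", "value": "stored"},  # crash before commit
]
recovered = roll_forward(snapshot, journal)
```

In this sketch the second transaction was interrupted before it committed, so only the first is replayed; every committed message transaction survives the crash.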

Online Backup and Recovery
To further insure against loss of data, online backups are performed while systems continue to fulfill their normal duties without interruption. Snapshots are taken on standard schedules for full and incremental backups, and this is the data that will be used in disaster recovery scenarios. (This level of data backup is applied to the message stores; the mail queue data on the MTAs are backed up by the use of disk mirrors.)
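As a rough illustration of how full and incremental snapshots combine during recovery, the sketch below selects the snapshots a restore would need: the most recent full backup plus every incremental taken after it. The record fields (name, kind, taken_at) and the function name restore_chain are hypothetical and do not reflect any particular backup tool.

```python
def restore_chain(backups):
    """Return the names of the snapshots to apply, oldest first."""
    ordered = sorted(backups, key=lambda b: b["taken_at"])
    full_positions = [i for i, b in enumerate(ordered) if b["kind"] == "full"]
    if not full_positions:
        raise ValueError("no full backup available to restore from")
    # Everything after the latest full backup is an incremental on top of it.
    return [b["name"] for b in ordered[full_positions[-1]:]]


backups = [
    {"name": "full-week1", "kind": "full", "taken_at": 1},
    {"name": "incr-a", "kind": "incremental", "taken_at": 2},
    {"name": "full-week2", "kind": "full", "taken_at": 3},
    {"name": "incr-b", "kind": "incremental", "taken_at": 4},
    {"name": "incr-c", "kind": "incremental", "taken_at": 5},
]
chain = restore_chain(backups)
```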

