Reliability & Serviceability


Reliability - Memory Error Recovery

The Sun4d architecture provides a high level of hardware protection against memory faults. An error-correcting code (ECC) is generated on each write to system memory and checked on each read of that location. Using the ECC value, the memory hardware automatically corrects single-bit errors and detects, but cannot correct, double-bit errors.
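The correct-one, detect-two behavior can be illustrated with a small extended Hamming code. The sketch below is a toy that protects only 4 data bits with 4 check bits; it is not the code used by the Sun4d memory hardware, whose word width and check-bit layout are not described here. The function names encode and decode are chosen for illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the whole word even parity
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of set positions is 0 for a valid word
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: exactly one bit flipped, at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any single bit of an encoded word makes decode return the status "corrected" together with the original value. Flipping any two bits makes it return "uncorrectable", which corresponds to the case the recovery code in the next paragraphs has to handle.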

Within base Solaris, detection of an uncorrectable memory error causes the termination of the thread that accessed the failing memory location. The A+Edition improves the handling of this case by attempting to keep the thread running. Upon detection of the uncorrectable error, the recovery code determines whether the page on which the error occurred has been modified since it was loaded. If the page has been modified, the copy in memory is the only current version of the data, so all threads that have the page mapped into their address space are terminated. This ensures that the corrupted data cannot be stored to disk by the system.

If the page has not been modified, an intact copy still exists on disk, so the mapping to the physical page is removed from the virtual memory maps of the threads that use it. The failing page is then tested to determine whether the error is permanent or transient. If the error is permanent, the page is placed on a "vary-off" list, indicating that it is not available for use by the system. If the error is transient, the page is returned to the free list. Upon the next access by a thread to one of the virtual memory addresses within this page, a new copy of the page contents is brought in from disk and placed on a new physical memory page. This new page is then mapped into the thread's virtual memory space.
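The recovery decision described above can be sketched as a small model. This is an illustration only, not A+Edition kernel code: the Page class, the recover function, and the retest_passes callback are hypothetical names, and the lists stand in for the kernel's free list and "vary-off" list.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    frame: int                      # physical page frame number (hypothetical)
    modified: bool                  # True if written since it was loaded
    mapped_threads: list = field(default_factory=list)


def recover(page, retest_passes, free_list, vary_off_list):
    """Handle an uncorrectable error on `page`; return the threads to terminate."""
    if page.modified:
        # No intact copy exists on disk, so every thread that maps the
        # page is terminated. What happens to the physical page in this
        # case is not described in the text and is not modeled here.
        victims = list(page.mapped_threads)
        page.mapped_threads.clear()
        return victims

    # Clean page: remove the mappings so that the next access faults in
    # a fresh copy from disk onto a different physical page.
    page.mapped_threads.clear()
    if retest_passes(page.frame):
        free_list.append(page.frame)        # transient error: reuse the page
    else:
        vary_off_list.append(page.frame)    # permanent error: retire the page
    return []
```

In this model no thread is terminated when the page is clean; the affected threads simply take a page fault on their next access and continue with the reloaded data.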

Serviceability - System Event Logger

Solaris currently provides some means of logging events and errors that have occurred on the system. For example, the syslogd facility collects a variety of textual messages generated by different system components, and kernel printf calls display a variety of messages on the system console. The A+Edition provides an enhanced system event logger that improves the logging of textual and binary data and facilitates centralized event logging within a network.

The event messages can contain binary as well as textual information. A standardized message data format is used to ensure that these messages can be monitored by any Solaris system using the event logger facility. The first release of the A+Edition contains a local version of this facility. The second release of the A+Edition extends this capability over a network.
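To illustrate how a single record can carry both textual and binary data in a form that any receiving system can parse, the sketch below packs an event into a fixed header followed by two variable-length fields. The layout is an assumption made for this example; the actual A+Edition message format is not specified here, and the names HEADER, pack_event, and unpack_event are hypothetical.

```python
import struct

# Hypothetical header: 32-bit timestamp, 16-bit text length, 16-bit binary
# length. "!" selects network byte order with no padding, so the record
# reads the same on any host.
HEADER = struct.Struct("!IHH")


def pack_event(timestamp, text, payload=b""):
    """Serialize one event as header + text bytes + binary payload."""
    text_bytes = text.encode("utf-8")
    return HEADER.pack(timestamp, len(text_bytes), len(payload)) + text_bytes + payload


def unpack_event(record):
    """Parse a record produced by pack_event into (timestamp, text, payload)."""
    timestamp, text_len, bin_len = HEADER.unpack_from(record)
    start = HEADER.size
    text = record[start:start + text_len].decode("utf-8")
    payload = record[start + text_len:start + text_len + bin_len]
    return timestamp, text, payload
```

Explicit length fields and a fixed byte order are one common way to keep such records self-describing, which matters once events are forwarded between machines for centralized logging.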

Logged System Events

The amount and organization of the logged data enable engineers to review hardware and software errors and events, reducing the time needed to resolve a problem and improving the serviceability of the system. System administrators can also consult the log and take preemptive corrective action when necessary.