Performance - Processor Cache Utilization


In the past few years, the speed at which a CPU can load and store data has greatly outpaced the ability of dynamic memory chips (DRAM) to supply and accept that data. To lessen the impact of this speed mismatch, most systems now include memory caches. A memory cache is a relatively small amount of very fast memory that sits between the CPU and the memory subsystem and duplicates information held in main memory. Since an access that hits in the cache completes much faster than one that must go to main memory, system performance improves as more accesses are satisfied from the cache; this is what is meant by high cache utilization.
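
The effect is easy to demonstrate. The minimal C sketch below (illustrative only; the array size, access count, and 16-int stride are arbitrary choices, not Sun4d parameters) performs the same number of loads in two patterns. When consecutive accesses fall within the same cache line the run completes much faster than when each load touches a new line.

    /*
     * Minimal sketch (illustrative values, not Sun4d parameters): the
     * same number of loads runs much faster when consecutive accesses
     * fall in the same cache line than when each load touches a new one.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N        (1u << 24)     /* 16M ints: far larger than any cache */
    #define ACCESSES (1u << 24)     /* identical load count in both runs */

    static double
    sweep(const volatile int *a, unsigned stride)
    {
        struct timespec t0, t1;
        long sum = 0;
        unsigned i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < ACCESSES; i++)
            sum += a[(i * stride) & (N - 1)];  /* mask keeps index in range */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        (void) sum;  /* the volatile loads above cannot be optimized away */
        return ((t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    }

    int
    main(void)
    {
        int *a = calloc(N, sizeof (*a));

        if (a == NULL)
            return (1);
        printf("stride  1 (line reuse):         %.3f s\n", sweep(a, 1));
        printf("stride 16 (new line each load): %.3f s\n", sweep(a, 16));
        free(a);
        return (0);
    }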

The Sun4d architecture provides both a primary and a secondary cache. The primary data and instruction caches reside on the SuperSPARC CPU itself, while the secondary cache is contained on the CPU board. Access to main memory takes place over the system XDBUS. Associated with the CPU and its caches are a Bus Watcher and a Cache Controller, which together keep the caches of all the CPUs in the system consistent. Although support for caching greatly complicates the hardware design of the system, efficient use of the cache by the operating system greatly improves overall system performance.

A simple block diagram of the CPU/memory subsystem for the Sun4d architecture shows the "distance" between the CPU and the DRAM making up main memory. Maximizing the fraction of accesses satisfied by the primary (on-chip) or secondary caches not only speeds each operation but also reduces the traffic on the XDBUS. Reducing the number or scope of shared data objects likewise lessens the number of cache invalidates that occur on the bus, as the sketch below illustrates.
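
Because invalidations operate on whole cache lines, even logically unrelated data can generate bus traffic when two objects happen to share a line, a problem commonly called false sharing. The following sketch shows one conventional remedy, padding hot fields out to line boundaries; the 64-byte line size and the field names are illustrative assumptions, not Sun4d values.

    /*
     * Illustrative sketch of false sharing: two counters updated by
     * different CPUs.  Packed together they occupy one cache line, so
     * every store by one CPU invalidates the line in the other CPU's
     * cache even though the data is not logically shared.
     */
    struct counters_packed {
        unsigned long   cpu0_count;
        unsigned long   cpu1_count;     /* same line as cpu0_count */
    };

    /*
     * Padding each counter to a line boundary (64 bytes assumed here;
     * the real line size is machine-dependent) removes the false
     * sharing: stores by one CPU no longer invalidate the other's copy.
     */
    #define CACHE_LINE  64

    struct counters_padded {
        unsigned long   cpu0_count;
        char            pad0[CACHE_LINE - sizeof (unsigned long)];
        unsigned long   cpu1_count;
        char            pad1[CACHE_LINE - sizeof (unsigned long)];
    };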

A number of techniques can be used to improve the system's cache utilization. For example, as a thread runs on a particular CPU, the main memory areas it touches are duplicated in that CPU's cache. At some point the thread goes to sleep. When the scheduler later awakens it, it can arrange for the thread to run on the same CPU, reusing whatever "state" remains in that CPU's cache from the previous run. This technique, known as processor affinity, is becoming common in multiprocessor systems. Affinity cannot be applied too aggressively, however, since the workload must still be balanced across the CPUs.
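
The dispatcher's affinity heuristic is internal to the kernel, but Solaris exposes an explicit, user-level form of the same idea through processor_bind(2). A minimal sketch follows; the processor number is an arbitrary illustration and error handling is reduced to a message.

    /*
     * Minimal sketch: hard-bind the calling LWP to one processor so its
     * cache state stays warm across sleeps.  This is the blunt,
     * user-level analogue of the dispatcher's affinity heuristic; the
     * processor id 0 is an illustrative choice.
     */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>

    int
    main(void)
    {
        processorid_t cpu = 0;          /* illustrative: processor 0 */
        processorid_t obind;            /* previous binding, if any */

        if (processor_bind(P_LWPID, P_MYID, cpu, &obind) != 0) {
            perror("processor_bind");
            return (1);
        }

        /* ... cache-sensitive work now always executes on 'cpu' ... */

        (void) processor_bind(P_LWPID, P_MYID, PBIND_NONE, NULL);
        return (0);
    }

Note that hard binding trades away the load balancing that the dispatcher would otherwise provide, which is exactly the tension described above.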

Shared data objects within the kernel are a significant impediment to high cache utilization. Every time a thread writes to a shared kernel data object, any copies of that object residing in other CPUs' caches must be invalidated to preserve coherency. As the number of CPUs in the system increases, this problem becomes much more significant. Extensive performance testing has identified a number of objects where this is a problem, and algorithmic changes have been made to Solaris to reduce the frequency of these invalidations and thereby improve cache utilization.
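
One representative change of this kind, sketched below as a generic illustration rather than the actual Solaris code, is to replicate a frequently written counter per CPU: the common write path then touches only the local CPU's cache line, and only the occasional reader pays the cost of summing across CPUs.

    /*
     * Generic illustration (not the actual Solaris code): a statistics
     * counter replicated per CPU.  Each copy is padded to its own cache
     * line (64 bytes assumed here) so the hot increment path never
     * invalidates another CPU's cached copy; readers sum all copies.
     */
    #define NCPU        20      /* illustrative configuration maximum */
    #define CACHE_LINE  64      /* illustrative line size */

    struct percpu_stat {
        unsigned long   count;
        char            pad[CACHE_LINE - sizeof (unsigned long)];
    };

    static struct percpu_stat stats[NCPU];

    /* Hot path: touches only the local CPU's line. */
    static void
    stat_incr(int mycpu)
    {
        stats[mycpu].count++;
    }

    /* Cold path: walk all CPUs to compute the total. */
    static unsigned long
    stat_read(void)
    {
        unsigned long total = 0;
        int i;

        for (i = 0; i < NCPU; i++)
            total += stats[i].count;
        return (total);
    }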