Electronics - Tronic 4 U: 2009

Thursday, June 18, 2009

Basic principle of cache design

Locality of References

An implication of locality is that we can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past.

Temporal locality:

states that recently accessed items are likely to be accessed in the near future.

Spatial locality:

says that items whose addresses are near one another tend to be referenced close together in time.

What is cache memory?

Cache memory is random access memory (RAM) that a computer microprocessor can access more quickly than it can access regular RAM (Main memory). As the microprocessor processes data, it looks first in the cache memory and if it finds the data there (from a previous reading of data), it does not have to do the more time-consuming reading of data from larger memory.

Cache memory is sometimes described in levels of closeness and accessibility to the microprocessor. An L1 cache is on the same chip as the microprocessor. (For example, the PowerPC 601 processor has a 32 kilobyte level-1 cache built into its chip.) L2 is usually a separate static RAM (SRAM) chip. The main RAM (Main memory) is usually a dynamic RAM (DRAM) chip.
Cache main memory structure

Figure: Data mapping between cache and main memory

Tag Data block ( k words)

Figure: Cache arrangement

0 Word
1
2
3

2n-1

Figure: Main memory arrangement

Cache size

The size of the disk's cache is important to its overall impact in improving the performance of the system, for the same reason that adding system memory will improve system performance. It is told that cache should be small enough to overall average cost per bit is close to main memory and large enough so that average access time close to that of cache alone.

Mapping Function

The replacement policy decides where in the cache a copy of a particular entry of main memory will go. In order to make room for the new entry on a cache miss, the cache generally has to evict one of the existing entries. The fundamental problem with any replacement policy is that it must predict which existing cache entry is least likely to be used in the future. Predicting the future is difficult, especially for hardware caches that use simple rules amenable to implementation in circuitry, so there are a variety of replacement policies to choose from and no perfect way to decide among them. One popular replacement policy, LRU, replaces the least recently used entry.

Direct mapping

If each entry in main memory can go in just one place in the cache.

Fully Associative mapping
If the replacement policy is free to choose any entry in the cache to hold the copy. This mapping function is basically a hypothetical one, and very expensive to implemented in the real environment as it demands lots of hardwire complexity (parallel searching must be implemented).

Set Associative mapping

Associativity is a trade-off. If there are ten places the replacement policy can put a new cache entry, then when the cache is checked for a hit, all ten places must be searched. Checking more places takes more power, area, and potentially time. On the other hand, caches with more associativity suffer fewer misses (see conflict misses, below), so that the CPU spends less time servicing those misses. The rule of thumb is that doubling the associativity, from direct mapped to 2-way, or from 2-way to 4-way, has about the same effect on hit rate as doubling the cache size. Associativity increases beyond 4-way have much less effect on the hit rate, and are generally done for other reasons (see virtual aliasing, below).

Figure: Miss rate against cache size

Write policy

When data is written to the cache, it must at some point be written to main memory as well. The timing of this write is controlled by what is known as the write policy. In a write-through cache, every write to the cache causes a write to main memory. Alternatively, in a write-back or copy-back cache, writes are not immediately mirrored to memory. Instead, the cache tracks which locations have been written over (these locations are marked dirty). The data in these locations is written back to main memory when that data is evicted from the cache. For this reason, a miss in a write-back cache will often require two memory accesses to service: one to first write the dirty location to memory and then another to read the new location from memory.

Multi-level caches

Cache hierarchy in a modern processor

Modern processors have multiple interacting caches on chip.

Specialized caches

Pipelined CPUs access memory from multiple points in the pipeline: instruction fetch, virtual-to-physical address translation, and data fetch (see classic RISC pipeline). The natural design is to use different physical caches for each of these points, so that no one physical resource has to be scheduled to service two points in the pipeline. Thus the pipeline naturally ends up with at least three separate caches (instruction, TLB, and data), each specialized to its particular role.

Pipelines with separate instruction and data caches, now predominant, are said to have a Harvard architecture. Originally, this phrase referred to machines with separate instruction and data memories, which proved not at all popular. Most modern CPUs have a single-memory von Neumann architecture.

Victim cache

A victim cache is a cache used to hold blocks evicted from a CPU cache due to a conflict or capacity miss. The victim cache lies between the main cache and its refill path, and only holds blocks that were evicted from that cache on a miss. This technique is used to reduce the penalty incurred by a cache on a miss.

The original victim cache on the HP PA7200 was a small, fully-associative cache. Later processors, such as the AMD K7 and K8, used the very large secondary cache as a victim cache, to avoid duplicate storage of the contents of the large primary cache.

Trace cache

One of the more extreme examples of cache specialization is the trace cache found in the Intel Pentium 4 microprocessors. A trace cache is a mechanism for increasing the instruction fetch bandwidth and decreasing power consumption (in the case of the Pentium 4) by storing traces of instructions that have already been fetched and decoded.

A trace cache stores instructions either after they have been decoded, or as they are retired. Generally, instructions are added to trace caches in groups representing either individual basic blocks or dynamic instruction traces. A basic block consists of a group of non-branch instructions ending with a branch. A dynamic trace ("trace path") contains only instructions whose results are actually used, and eliminates instructions following taken branches (since they are not executed); a dynamic trace can be a concatenation of multiple basic blocks. This allows the instruction fetch unit of a processor to fetch several basic blocks, without having to worry about branches in the execution flow.

Trace lines are stored in the trace cache based on the program counter of the first instruction in the trace and a set of branch predictions. This allows for storing different trace paths that start on the same address, each representing different branch outcomes. In the instruction fetch stage of a pipeline, the current program counter along with a set of branch predictions is checked in the trace cache for a hit. If there is a hit, a trace line is supplied to fetch which does not have to go to a regular cache or to memory for these instructions. The trace cache continues to feed the fetch unit until the trace line ends or until there is a misprediction in the pipeline. If there is a miss, a new trace starts to be built.

Trace caches are also used in processors like the Intel Pentium 4 to store already decoded micro-operations, or translations of complex x86 instructions, so that the next time an instruction is needed, it does not have to be decoded again.

Multi-level caches

Another issue is the fundamental tradeoff between cache latency and hit rate. Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of cache, with small fast caches backed up by larger slower caches.

Multi-level caches generally operate by checking the smallest Level 1 (L1) cache first; if it hits, the processor proceeds at high speed. If the smaller cache misses, the next larger cache (L2) is checked, and so on, before external memory is checked.
As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize as many as three levels of on-chip cache. For example, in 2003, Itanium 2 began shipping with a 6 MiB unified level 3 (L3) cache on-chip. The IBM Power 4 series has a 256 MiB L3 cache off chip, shared among several processors. The new AMD Phenom series of chips carries a 2MB on die L3 cache.

Exclusive versus inclusive

Multi-level caches introduce new design decisions. For instance, in some processors, all data in the L1 cache must also be somewhere in the L2 cache. These caches are called strictly inclusive. Other processors (like the AMD Athlon) have exclusive caches — data is guaranteed to be in at most one of the L1 and L2 caches, never in both. Still other processors (like the Intel Pentium II, III, and 4), do not require that data in the L1 cache also reside in the L2 cache, although it may often do so. There is no universally accepted name for this intermediate policy, although the term mainly inclusive has been used.

The advantage of exclusive caches is that they store more data. This advantage is larger when the exclusive L1 cache is comparable to the L2 cache, and diminishes if the L2 cache is many times larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting cache line in the L2 is exchanged with a line in the L1. This exchange is quite a bit more work than just copying a line from L2 to L1, which is what an inclusive cache does.

One advantage of strictly inclusive caches is that when external devices or other processors in a multiprocessor system wish to remove a cache line from the processor, they need only have the processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1 cache must be checked as well. As a drawback, there is a correlation between the associativities of L1 and L2 caches: if the L2 cache does not have at least as many ways as all L1 caches together, the effective associativity of the L1 caches is restricted.

Another advantage of inclusive caches is that the larger cache can use larger cache lines, which reduces the size of the secondary cache tags. (Exclusive caches require both caches to have the same size cache lines, so that cache lines can be swapped on a L1 miss, L2 hit). If the secondary cache is an order of magnitude larger than the primary, and the cache data is an order of magnitude larger than the cache tags, this tag area saved can be comparable to the incremental area needed to store the L1 cache data in the L2.

As mentioned above, larger computers sometimes have another cache between the L2 cache and main memory called an L3 cache. This cache can be implemented on a separate chip from the CPU, and, as of 2004, may range in size from 2 to 256 megabytes. The benefits of an off chip L3 cache depend on the application's access patterns. High-end x86 workstations and servers are now available with an L3 cache option implemented on the microprocessor die, increasing the speed and reducing the cost substantially. For example, Intel's Xeon MP product code-named "Tulsa" features 16 MiB of on-die L3 cache, shared between two processor cores.

Finally, at the other end of the memory hierarchy, the CPU register file itself can be considered the smallest, fastest cache in the system, with the special characteristic that it is scheduled in software—typically by a compiler, as it allocates registers to hold values retrieved from main memory. (See especially loop nest optimization.) Register files sometimes also have hierarchy: The Cray-1 (circa 1976) had 8 address "A" and 8 scalar data "S" registers that were generally usable. There was also a set of 64 address "B" and 64 scalar data "T" registers that took longer to access, but were faster than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a data cache. (The Cray-1 did, however, have an instruction cache.)

Pseudo-associative cache

A true set-associative cache tests all the possible ways simultaneously, using something like a content addressable memory. A pseudo-associative cache tests each possible way one at a time. A hash-rehash cache is one kind of pseudo-associative cache.
In the common case of finding a hit in the first way tested, a pseudo-associative cache is as fast as a direct-mapped cache. But it has a much lower conflict miss rate than a direct-mapped cache, closer to the miss rate of a fully associative cache

Electronics - Tronic 4 U

Thursday, June 18, 2009

Basic principle of cache design

What is cache memory?

Cache size

Mapping Function

Write policy

Multi-level caches

Cache hierarchy in a modern processor

Victim cache

Trace cache

Multi-level caches

Exclusive versus inclusive

Pseudo-associative cache

Counter

Visitors

Followers

Blog Archive

About Me