Wednesday, December 10, 2014

All about caches

A cache is a component that transparently stores data so that future requests for that data can be served faster. The data stored in a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere. If requested data is contained in the cache (a cache hit), the request can be served by simply reading the cache, which is comparatively fast. Otherwise (a cache miss), the data has to be recomputed or fetched from its original storage location, which is comparatively slow. Hence, the greater the number of requests that can be served from the cache, the faster the overall system performance becomes.

To be cost-efficient and to enable efficient use of data, caches are relatively small. Nevertheless, caches have proven themselves in many areas of computing because access patterns in typical computer applications have locality of reference. References exhibit temporal locality if data that has recently been requested is requested again. References exhibit spatial locality if data is requested that is physically stored close to data that has already been requested.

Principles of locality:
Locality of reference is the phenomenon of the same value, or related storage locations, being frequently accessed. There are two basic types of reference locality – temporal and spatial locality. Temporal locality refers to the reuse of specific data and/or resources within a relatively small time duration. Spatial locality refers to the use of data elements within relatively close storage locations.
Temporal locality
If at one point in time a particular memory location is referenced, then it is likely that the same location will be referenced again in the near future.
Spatial locality
If a particular memory location is referenced at a particular time, then it is likely that nearby memory locations will be referenced in the near future. 
Reasons for locality:
• Predictability
• Structure of programs – we are used to writing code in such a fashion
• Linearity of data structures – e.g. arrays (see the traversal sketch after this list)
Who moves the data at each level:
•  Programmers are responsible for moving data between disk and memory through file I/O.
•  Hardware is responsible for moving data between memory and caches.
•  Optimizing compilers are responsible for generating code that, when executed, will cause the hardware to use caches and registers efficiently.
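As a small illustration of spatial locality (the array name and sizes below are made up for this sketch, not taken from any particular program), summing a 2-D C array row by row walks through memory sequentially and benefits from cached blocks, while summing it column by column jumps across rows and loses that benefit:

/* Sketch: the same sum computed with cache-friendly (row-major) and
 * cache-unfriendly (column-major) traversal of a C array, which is
 * stored in memory row by row.  Names and sizes are illustrative only. */
#include <stdio.h>

#define ROWS 1024
#define COLS 1024
static double a[ROWS][COLS];

double sum_row_major(void) {          /* good spatial locality      */
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];             /* consecutive addresses      */
    return s;
}

double sum_col_major(void) {          /* poor spatial locality      */
    double s = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += a[i][j];             /* stride of COLS doubles     */
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}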

Memory Hierarchy:
A "memory hierarchy" in computer storage distinguishes each level in the "hierarchy" by response time. Since response time, complexity, and capacity are related, the levels may also be distinguished by the controlling technology.



Operation:
Hardware implements a cache as a block of memory for temporary storage of data that is likely to be used again. A cache is made up of a pool of entries. Each entry holds a datum (a piece of data) – a copy of the same datum in some backing store. Each entry also has a tag, which specifies the identity of the datum in the backing store of which the entry is a copy.
 
When the cache client (a CPU, web browser, or operating system) needs to access a datum presumed to exist in the backing store, it first checks the cache. If an entry can be found with a tag matching that of the desired datum, the datum in the entry is used instead. This situation is known as a cache hit. The alternative situation, when the cache is consulted and found not to contain a datum with the desired tag, is known as a cache miss. The previously uncached datum fetched from the backing store during miss handling is usually copied into the cache, ready for the next access. The percentage of accesses that result in cache hits is known as the hit rate or hit ratio of the cache.
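As a software analogy of this hit/miss flow (the direct-mapped table, the function names and the squaring "backing store" below are my own illustration, not any particular system's API), a small tag-checked table can sit in front of an expensive computation:

/* Sketch of the hit/miss flow for a software cache in front of an
 * expensive function.  The direct-mapped table and names are
 * illustrative only. */
#include <stdbool.h>
#include <stdio.h>

#define ENTRIES 256

struct entry {
    bool     valid;   /* has this slot been filled?             */
    unsigned tag;     /* identity of the cached datum (the key) */
    long     value;   /* the cached datum itself                */
};

static struct entry cache[ENTRIES];

/* stand-in for the "backing store": slow recomputation */
static long slow_compute(unsigned key) { return (long)key * key; }

long lookup(unsigned key) {
    struct entry *e = &cache[key % ENTRIES];
    if (e->valid && e->tag == key)      /* cache hit                    */
        return e->value;
    long v = slow_compute(key);         /* cache miss: fetch/recompute  */
    e->valid = true;                    /* install for the next access  */
    e->tag   = key;
    e->value = v;
    return v;
}

int main(void) {
    printf("%ld %ld\n", lookup(42), lookup(42));  /* miss, then hit */
    return 0;
}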

Cache metrics:
Hit rate = 1 – miss rate
Hit rate = #references that hit / #references
Miss rate = #references that missed / #references
Average access time = (hit rate * hit time) + (miss rate * miss time)
Average miss penalty = average access time of the next level down
e.g. the average miss penalty of L1 = the average access time of L2
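A quick worked example of the average access time formula (the 95% hit rate, 1-cycle hit time and 100-cycle miss time are assumed numbers chosen for the sketch, not measurements):

/* Sketch: average access time from the formulas above, using assumed
 * numbers (95% hit rate, 1-cycle hit time, 100-cycle miss time). */
#include <stdio.h>

int main(void) {
    double hit_rate  = 0.95;
    double hit_time  = 1.0;    /* cycles                           */
    double miss_time = 100.0;  /* cycles: hit time + miss penalty  */
    double miss_rate = 1.0 - hit_rate;

    double avg = hit_rate * hit_time + miss_rate * miss_time;
    printf("average access time = %.2f cycles\n", avg);  /* 5.95 */
    return 0;
}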
Structure of cache:
The structure of a cache line is shown below.

flag bits | tag | data block
An instruction cache requires only one flag bit per cache row entry: a valid bit. The valid bit indicates whether or not a cache block has been loaded with valid data. A data cache typically requires two flag bits per cache line – a valid bit and a dirty bit. Having a dirty bit set indicates that the associated cache line has been changed since it was read from main memory ("dirty"), meaning that the processor has written data to that line and the new value has not propagated all the way to main memory.
The data block is the actual block fetched from the backing storage. The tag is part of the full address of the data in the backing storage. The size of a cache is the amount of main memory data it can hold, which is:
number of bytes in each data block * number of data blocks in that cache level
The memory address is split into tag, index and byte, as shown below.
tag | index | byte

For example, with 16-byte blocks and 256 sets, the last hex digit of the address specifies the byte within a block (0 to F) – 16 bytes.
The second and third hex digits from the right specify the index, from 00 to FF – 256 index values.
The remaining most significant bits form the tag, which is matched against the tag stored in the cache line (taken from the physical address) on read or write operations.
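A minimal sketch of that split in C, assuming the same example geometry (16-byte blocks, so 4 offset bits, and 256 sets, so 8 index bits) and a 32-bit address – the constants are illustrative, not from a real machine:

/* Sketch: splitting a 32-bit address into tag, index and byte offset
 * for the example above (16-byte blocks -> 4 offset bits, 256 sets ->
 * 8 index bits; the remaining 20 bits are the tag). */
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 4                    /* 16 bytes per block */
#define INDEX_BITS  8                    /* 256 sets           */

int main(void) {
    uint32_t addr   = 0xDEADBEEF;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag=0x%05X index=0x%02X offset=0x%X\n", tag, index, offset);
    return 0;
}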
Design dimensions:
1. Block size: this is the read/write granularity to the next level down.
Increasing the block size:
a. Exploits spatial locality by prefetching data that is likely to be used
b. Can also hurt: cache fills take longer, and larger blocks reduce the amount of storage left for other blocks

2. Transfer size: how much data is brought in per block; the data brought in might not be a full block.

Associativity:
The associativity decides where you can place a block fetched from backing storage.
1. Fully associative: Block can be placed anywhere.
Disadvantage:
a. Have to check every single block for a match (using the tag)
b. Consumes power and takes time
Advantage: better hit rate
2. Direct mapped: a block can go to only one place, which is determined by its block ID.
Disadvantage:
a. Conflicts: with 8 sets, blocks whose IDs are 0, 8, 16, 24, 32, … all map to the same location (index 0); this gives a higher miss rate
Advantage: faster access time (than fully associative), smaller tag size (and hence less space and less power per access)
3. Set associative: instead of forcing a block into only one location, the hardware is allowed to put it in any of n available locations within the set selected by the index (ID); this is called n-way set associative.

4. Multiple direct mapped caches: we can have multiple direct mapped caches and choose data from a particular cache based on the tag (address). The hardware decides which direct mapped cache the current block should be put into.
Advantage: more flexibility and a better hit rate than direct mapped
Disadvantage: longer access time and more power dissipation than direct mapped
It sits in between direct mapped and fully associative.
Size of cache = block size * associativity * number of sets (see the sketch below)
Bigger is better for hit rates; bigger is worse for power consumption and access times.
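A minimal sketch of an n-way set-associative lookup, tying together the size formula and the tag/index split. The geometry (32 KB cache, 64-byte blocks, 4 ways, giving 128 sets) is an assumed example, not any particular CPU:

/* Sketch: an n-way set-associative lookup.  The geometry is an assumed
 * example: 32 KB cache, 64-byte blocks, 4 ways -> 128 sets. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE  (32 * 1024)
#define BLOCK_SIZE  64
#define WAYS        4
#define SETS        (CACHE_SIZE / (BLOCK_SIZE * WAYS))   /* 128 sets */

struct line { bool valid; uint32_t tag; };
static struct line cache[SETS][WAYS];

bool lookup(uint32_t addr) {
    uint32_t block = addr / BLOCK_SIZE;       /* strip the byte offset  */
    uint32_t set   = block % SETS;            /* index picks the set    */
    uint32_t tag   = block / SETS;            /* rest of the address    */
    for (int w = 0; w < WAYS; w++)            /* check every way in set */
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;                      /* hit                    */
    return false;                             /* miss                   */
}

int main(void) {
    printf("size = %d bytes, sets = %d\n", SETS * WAYS * BLOCK_SIZE, SETS);
    printf("hit? %d\n", lookup(0x1234));      /* cold cache: miss       */
    return 0;
}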

Replacement policies: (when bringing a block in, what to throw out?)
During a cache miss, the CPU usually ejects some other entry in order to make room for the previously uncached datum. The heuristic used to select the entry to eject is known as the replacement policy. 
For several years, Belady’s MIN was assumed to be the gold standard. However, it requires knowing the future access stream, so it cannot be realized in a hardware processor cache; it serves only as an "oracle" yardstick to compare real policies against.
Belady’s MIN: when deciding which data to cache, look into the future. If that data is not going to be used for a long time, DO NOT cache it. Instead, transfer it directly to the processor.
Since "looking into the future" is not possible in practice, a practical replacement policy is "least recently used" (LRU), which replaces the least recently used entry (a sketch follows below).
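A minimal sketch of LRU replacement within one set, using a global access counter as the timestamp. The 4-way set and the counter-based ageing are my own simplification; real hardware usually uses cheaper approximations such as pseudo-LRU:

/* Sketch: least-recently-used replacement within one set, using a
 * simple access counter as the timestamp.  Illustrative only. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WAYS 4

struct line { bool valid; uint32_t tag; unsigned last_used; };
static struct line set[WAYS];
static unsigned now;                               /* global access counter */

void access_block(uint32_t tag) {
    now++;
    for (int w = 0; w < WAYS; w++)                 /* 1. look for a hit     */
        if (set[w].valid && set[w].tag == tag) {
            set[w].last_used = now;                /* refresh its age       */
            return;
        }
    int victim = 0;                                /* 2. miss: pick victim  */
    for (int w = 1; w < WAYS; w++) {
        if (!set[victim].valid) break;             /* free slot found       */
        if (!set[w].valid || set[w].last_used < set[victim].last_used)
            victim = w;                            /* older => evict        */
    }
    set[victim].valid = true;                      /* fill the LRU way      */
    set[victim].tag = tag;
    set[victim].last_used = now;
}

int main(void) {
    for (uint32_t t = 0; t < 6; t++) access_block(t); /* tags 0 and 1 evicted */
    access_block(5);                                  /* hit: refreshes tag 5 */
    for (int w = 0; w < WAYS; w++)
        printf("way %d: tag %u\n", w, set[w].tag);    /* 4, 5, 2, 3           */
    return 0;
}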

Inclusion:
I’ve got the cache and I’ve got the backing store. I make a copy of the block from the backing store into the cache, and a copy of it still remains in the backing store. This is called inclusion.
Inclusive caches:
If block is copied from L2 cache to L1 cache, its copy also remains in L2.
Exclusive caches:
Block is moved from L2 to L1, deleting the original copy in L2.
Non-inclusive caches:
Data might be in both caches.

Cache problems:
1. Content management heuristics: what to cache?
2. Logical organization: where to put it?
3. Consistency management heuristics: how to maintain it? – keep a cache consistent with itself, other caches and backing storage.
4. Due to virtual memory, two different virtual addresses could refer to the same memory location. All updates to that memory location should happen in program order, so the processor should ensure that only one copy of that physical memory location is present in the cache at any given time.
Solution to point 4 - Split caches:
Use separate instruction and data caches. This gives twice the bandwidth, and two single-port caches are easier to implement than one dual-port cache. The instruction cache can be made read only. The instruction cache is used during the fetch stage and the data cache during the memory stage.

Write policies:
1. Write through: every write to the cache also goes “through” to memory => writes become very slow
Solution: use write buffers (which are logically part of memory but physically part of the cache). Data to be written to memory is dumped into the write buffer, and the buffer drains it back to main memory some cycles later.
Two issues with the write buffer: what if the buffer is full? – stall and wait until it is not full.
What if a read (or write) gets a cache miss? –
If it is a plain write buffer, drain it first.
If it is a write cache, it has tags and we can read data from it => search it and find the data.
THE WRITE BUFFER CONVERTS A LATENCY PROBLEM INTO A BANDWIDTH PROBLEM – we can’t make memory faster, as that is costly and memory has latency limits. Instead we use write buffers and improve the average latency.
2. Write back: when data in the cache is modified, do not write it back to main memory immediately. This is called a “lazy write”. => no problem even if main memory is slow, and it reduces unnecessary writes to main memory.

Problem: data in the cache and in main memory are not consistent (inconsistency) – how to solve it?
Dirty bit – a Boolean flag that is set to 1 when the data in the cache has been modified.
When does the data get written to main memory? – when a read miss occurs and some block has to be evicted to make space for the new incoming block. But now every read miss potentially takes longer than before – first the evicted block has to be written to main memory, then the new block has to be brought into the cache.
Even on a write miss, the processor has to check whether the block being replaced is dirty. If it is dirty, that block has to be written to main memory (or to the cache level below) before the fresh copy is brought in. The dirty bit of the fresh copy is then set to 1 once it is written.
In general: write back – better performance; write through – simpler in terms of the programming model.
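A minimal sketch of the write-back path for a single cache line, showing the dirty bit and the lazy write-back on eviction. The single-line "cache", the block size and the memory stubs are illustrative assumptions, not a real design:

/* Sketch: the write-back policy for one direct-mapped cache line. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 64

struct cache_line {
    bool     valid, dirty;              /* dirty = modified since fill   */
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
};

static struct cache_line cl;            /* a single line, for brevity    */

/* stand-ins for the next level down (main memory) */
static void mem_write_block(uint32_t tag, const uint8_t *d) { (void)tag; (void)d; }
static void mem_read_block(uint32_t tag, uint8_t *d) { (void)tag; memset(d, 0, BLOCK_SIZE); }

void store_byte(uint32_t tag, unsigned offset, uint8_t value) {
    if (!(cl.valid && cl.tag == tag)) {          /* write miss               */
        if (cl.valid && cl.dirty)                /* victim is dirty:         */
            mem_write_block(cl.tag, cl.data);    /* lazy write-back first    */
        mem_read_block(tag, cl.data);            /* then fetch a fresh copy  */
        cl.valid = true;
        cl.tag   = tag;
        cl.dirty = false;
    }
    cl.data[offset] = value;   /* the write stays in the cache...           */
    cl.dirty = true;           /* ...and only the dirty bit records it      */
}

int main(void) {
    store_byte(7, 3, 0xAB);    /* miss, fill, write, dirty = 1              */
    store_byte(9, 0, 0xCD);    /* miss on a dirty line: write back line 7   */
    printf("tag=%u dirty=%d\n", cl.tag, cl.dirty);
    return 0;
}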

Cache miss types:
A cache miss refers to a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency (the delay between the miss and actual availability of data). There are three kinds of cache misses: instruction read miss, data read miss, and data write miss.
Instruction read misses generally cause the largest delay, because the processor has to stall until the instruction is available before proceeding.
Data read misses cause a smaller delay, because until that data is available the processor can execute instructions that do not depend on the requested data.
Data write misses generally cause the shortest delay, because the writes can be queued in the write buffer and the processor only needs to wait when the buffer is full.

Misses can also be classified into four types (the 4 C’s):
Compulsory miss: (also called cold start misses) occurs on initial startup, when no values are available in the cache. Can be reduced by prefetching data and increasing block size.
Capacity miss: occurs because of small cache size. Increase the cache size to solve this.
Conflict miss: occurs even if the cache has big size but lacks associativity. Increase associativity to solve this.
Coherence miss: caused by the coherence mechanism, e.g. a block invalidated because another processor wrote to it.

Addressing schemes:
There are 4 schemes:
1. Virtually indexed virtually tagged: both the index and tag are obtained from the virtual address.
Advantage: very fast lookup time
Disadvantage: susceptible to virtual address aliasing
2. Physically indexed physically tagged: both the index and the tag are obtained from the physical address. Used in L2 caches and beyond (L1 caches need to be so fast that usually only VIPT is suitable for them).
Advantage: no aliasing problems
Disadvantage: longer latency

3. Virtually indexed physically tagged: here the index used is from the virtual address. By the time the cache has been searched and a matching cache line found, the address translation is also done. The tag stored in the cache line is then compared with the physical tag, and only then is it decided whether it is a cache hit or a miss.
Advantage over PIPT: lower latency, as the virtual index is used to search the cache while the address translation is taking place. However, we have to wait until the physical tag is obtained after translation to decide whether it is a hit or not.
Advantage over VIVT: there are no aliasing issues and coherence is easier to maintain (see the sizing sketch after this list).
4. Physically indexed virtually tagged: this combination is not useful; if the physical address is already in hand for indexing, there is no reason to tag with the virtual address.
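One common reason VIPT designs avoid aliasing is that all index + offset bits are kept within the page offset, so the "virtual" index bits are identical to the physical ones. A rough sketch of that sizing rule of thumb – the 4 KB page size, 64-byte blocks and 8 ways are assumed example values, not a statement about any specific processor:

/* Sketch: a VIPT sizing rule of thumb.  If all index + offset bits fall
 * inside the page offset, the virtual index equals the physical index,
 * so aliasing cannot occur.  All values below are assumed examples. */
#include <stdio.h>

int main(void) {
    int page_size  = 4096;   /* 4 KB pages (assumed)           */
    int block_size = 64;     /* bytes per cache line (assumed) */
    int ways       = 8;      /* associativity (assumed)        */

    /* Largest cache whose set index never uses bits above the page
     * offset: each way then covers exactly one page of data. */
    int max_sets = page_size / block_size;          /* 64 sets */
    int max_size = max_sets * block_size * ways;    /* 32 KB   */

    printf("alias-free VIPT cache size <= %d KB\n", max_size / 1024);
    return 0;
}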

Cache coherence and consistency:
Memory coherence: memory behaves “rationally”
Memory consistency: programming contract of hardware. Allows programmer to define behavior
Cache coherence: hardware implementation. Guarantees memory coherence
Cache consistency: the principle – cache systems behave “rationally”
Contents of memory simply don’t change until you overwrite or erase them. When clients in a system maintain caches of a common memory resource, problems may arise with inconsistent data. This is particularly true of CPUs in a multiprocessing system. For example, if one client has a copy of a memory block from a previous read and another client changes that memory block, the first client could be left with an invalid (stale) copy of that memory without any notification of the change. Cache coherence is intended to manage such conflicts and maintain consistency between the caches and memory.


Timing or ordering of writes is about when the rest of the system sees the results.

Consistency models:
The data consistency model specifies a contract between programmer and system, wherein the system guarantees that if the programmer follows the rules, memory will be consistent and the results of memory operations will be predictable.

1. Strict consistency: the value read by a processor is always the most recently written value. Makes perfect sense for a uniprocessor, but creates a problem in multiprocessors.
Suppose we have two processors, A and B. If A writes a value and B reads the same location, the soonest the new value can be made available to B is determined by the communication latency (the light cone originating at A touches B at some point in time, which is the earliest B can see the new data).
Thus, strict consistency is never used for multiprocessors.

2. Sequential consistency: this is what everybody does.
a. Memory operations are performed in program order => if the program has a load before a store, the load actually happens before the store.
b. All CPUs see the same global ordering.
c. The global ordering (interleaving) can be anything.
Implication: if machine A sees a particular order, then everyone in the system sees the same order.
d. All I/O must go through the memory system (the coherence system).
e. You can’t buffer writes.
f. If you have write A, read B, read C, read D, you can’t buffer write A and start performing the reads. If you do, the rest of the world sees B C D A, but the actual global order is supposed to be A B C D.
g. Reordering operations breaks sequential consistency.

3. Processor consistency: also called total store order (TSO).
a. The only ordering strictly enforced is among writes – stores from one processor are seen in order.
b. Reads may be performed out of order with respect to earlier writes; in particular, a load can bypass an older store sitting in the write buffer (see the litmus test sketch below).
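The classic "store buffering" litmus test makes the difference concrete. Under sequential consistency the outcome r1 == 0 && r2 == 0 is impossible; under TSO it is allowed, because each store can sit in its write buffer while the following load executes. The sketch below assumes POSIX threads and C11 relaxed atomics (so the program is well defined while still letting the hardware reorder); it is an illustration, not a rigorous test harness:

/* Sketch: the "store buffering" litmus test.  SC forbids r1==0 && r2==0;
 * TSO / processor consistency allows it.  Build with: cc -pthread sb.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y, r1, r2;

static void *thread_a(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);        /* write A */
    atomic_store_explicit(&r1,
        atomic_load_explicit(&y, memory_order_relaxed),
        memory_order_relaxed);                                  /* read B  */
    return NULL;
}

static void *thread_b(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);        /* write B */
    atomic_store_explicit(&r2,
        atomic_load_explicit(&x, memory_order_relaxed),
        memory_order_relaxed);                                  /* read A  */
    return NULL;
}

int main(void) {
    int relaxed_outcomes = 0;
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0); atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (atomic_load(&r1) == 0 && atomic_load(&r2) == 0)
            relaxed_outcomes++;           /* forbidden under SC */
    }
    printf("r1==0 && r2==0 seen %d times\n", relaxed_outcomes);
    return 0;
}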

Cache coherence:

SI: Shared/Invalid:
This only works with write-through caches, because with write-through the backing store always holds data that is consistent with all the caches, so any block can be evicted safely.
A data block can be in one of two states: Shared (cannot write) or Invalid (not present).
On read: get a shared copy from memory or any other cache
On write:
1. Write update: every time you write, you also update all the other caches that hold this block with the same value.
2. Write invalidate: every time you write, you ask all other caches having this block to evict (invalidate) their copies; they get a fresh copy from the backing store the next time they need the block.

MSI: Modified Shared Invalid: supports write back caches as well.
Each block within any cache can be in one of the three states:
Modified: the block in this cache has been written and is inconsistent with the backing storage and other caches.
Shared: the block in this cache is a read only copy and can be evicted safely.
Invalid: the block is not available in the cache and should be present in the backing store or any other cache.
These coherency states are maintained through communication between the caches and the backing store. The caches have different responsibilities when blocks are read or written, or when they learn of other caches issuing reads or writes for a block.
When a read request arrives at a cache for a block in the "M" or "S" states, the cache supplies the data. If the block is not in the cache (in the "I" state), it must verify that the line is not in the "M" state in any other cache. Different caching architectures handle this differently. For example, bus architectures often perform snooping, where the read request is broadcast to all of the caches. Other architectures include cache directories which have agents (directories) that know which caches last had copies of a particular cache block. If another cache has the block in the "M" state, it must write back the data to the backing store and go to the "S" or "I" states. Once any "M" line is written back, the cache obtains the block from either the backing store, or another cache with the data in the "S" state. The cache can then supply the data to the requestor. After supplying the data, the cache block is in the "S" state.
When a write request arrives at a cache for a block in the "M" state, the cache modifies the data locally. If the block is in the "S" state, the cache must notify any other caches that might contain the block in the "S" state that they must evict the block. This notification may be via bus snooping or a directory, as described above. Then the data may be locally modified. If the block is in the "I" state, the cache must notify any other caches that might contain the block in the "S" or "M" states that they must evict the block. If the block is in another cache in the "M" state, that cache must either write the data to the backing store or supply it to the requesting cache. If at this point the cache does not yet have the block locally, the block is read from the backing store before being modified in the cache. After the data is modified, the cache block is in the "M" state.
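A minimal sketch of the MSI transitions from the point of view of a single cache line. The event names (CPU_READ, CPU_WRITE, BUS_READ, BUS_INVALIDATE) are my own simplification of the protocol described above, and the required actions (write-backs, invalidation messages) are only hinted at in the comments:

/* Sketch: MSI state transitions for one cache line, seen from that
 * line's own cache.  Simplified; actions are noted in comments only. */
#include <stdio.h>

enum msi { INVALID, SHARED, MODIFIED };

enum event {
    CPU_READ,       /* local processor reads the block        */
    CPU_WRITE,      /* local processor writes the block       */
    BUS_READ,       /* another cache wants to read the block  */
    BUS_INVALIDATE  /* another cache wants to write the block */
};

static const char *name[] = { "Invalid", "Shared", "Modified" };

enum msi next_state(enum msi s, enum event e) {
    switch (e) {
    case CPU_READ:
        return (s == INVALID) ? SHARED : s;      /* fetch a shared copy     */
    case CPU_WRITE:
        return MODIFIED;                         /* after invalidating others */
    case BUS_READ:
        return (s == MODIFIED) ? SHARED : s;     /* write back, then share  */
    case BUS_INVALIDATE:
        return INVALID;                          /* evict our copy          */
    }
    return s;
}

int main(void) {
    enum msi s = INVALID;
    s = next_state(s, CPU_READ);    /* Invalid  -> Shared   */
    s = next_state(s, CPU_WRITE);   /* Shared   -> Modified */
    s = next_state(s, BUS_READ);    /* Modified -> Shared   */
    printf("final state: %s\n", name[s]);
    return 0;
}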

MESI: Modified Exclusive Shared Invalid:
Every cache line is marked with one of the four following states (coded in two additional bits):
Modified
The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.
Exclusive
The cache line is present only in the current cache (in the whole system, it is present only in this cache), but it is clean; it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.
Shared
Indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches the main memory. The line may be discarded (changed to the Invalid state) at any time.
Invalid
Indicates that this cache line is invalid (unused).


DISCLAIMER: The above article is based on my notes from one of my graduate courses, and a major portion of it is from Wikipedia.

Cheers,
Vijay.