A cache is a component that
transparently stores data so that future requests for that data can be served
faster. The data that is stored within a cache might be values that have been
computed earlier or duplicates of original values that are stored elsewhere. If
requested data is contained in the cache (cache hit), this request can be
served by simply reading the cache, which is comparatively faster. Otherwise
(cache miss), the data has to be recomputed or fetched from its original
storage location, which is comparatively slower. Hence, the greater the number
of requests that can be served from the cache, the faster the overall system
performance becomes. To be cost efficient and to enable an efficient use of
data, caches are relatively small. Nevertheless, caches have proven themselves
in many areas of computing because access patterns in typical computer
applications have locality of reference. References exhibit temporal locality
if data is requested again that has been recently requested already. References
exhibit spatial locality if data is requested that is physically stored close
to data that has been requested already.
Principles of locality:
A phenomenon describing the
same value, or related storage locations, being frequently accessed. There are
two basic types of reference locality – temporal and spatial locality. Temporal
locality refers to the reuse of specific data, and/or resources, within a
relatively small time duration. Spatial locality refers to the use of data
elements within relatively close storage locations.
Temporal locality
If at one point in time a particular
memory location is referenced, then it is likely that the same location will be
referenced again in the near future.
Spatial locality
If a particular memory location is
referenced at a particular time, then it is likely that nearby memory locations
will be referenced in the near future.
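To make this concrete, here is a small Python sketch (added for illustration; the array and names are made up). A row-by-row scan of a 2D array exhibits spatial locality, a column-by-column scan does not, and reusing the same element repeatedly exhibits temporal locality. Python's interpreter overhead hides the actual hardware effect; the point is only the order in which the data is touched.

```python
N = 256
matrix = [[i * N + j for j in range(N)] for i in range(N)]

# Spatial locality: consecutive elements of a row would sit next to each other
# in a C-style array, so a row-by-row scan reuses each fetched cache block.
row_major_sum = sum(matrix[i][j] for i in range(N) for j in range(N))

# Poor spatial locality: walking down a column touches a far-away row on
# every access, so each access lands in a different cache block.
column_major_sum = sum(matrix[i][j] for j in range(N) for i in range(N))

# Temporal locality: the same value is reused many times within a short
# window, so after the first access it would be served from the cache.
pivot = matrix[0][0]
reuse_sum = sum(pivot for _ in range(N))

print(row_major_sum == column_major_sum, reuse_sum == pivot * N)
```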
Reasons for locality:
• Predictability
• Structure of program – we are
used to writing code in such a fashion
• Linearity of data structures
– e.g. arrays
• Programmers are responsible
for moving data between disk and memory through file I/O.
• Hardware is responsible for
moving data between memory and caches.
• Optimizing compilers are
responsible for generating code that, when executed, will cause the hardware to
use caches and registers efficiently.
Memory Hierarchy:
A "memory hierarchy" in
computer storage distinguishes each level in the "hierarchy" by
response time. Since response time, complexity, and capacity are related, the
levels may also be distinguished by the controlling technology.
Operation:
Hardware implements cache as a block
of memory for temporary storage of data likely to be used again. A cache is
made up of a pool of entries. Each entry has a datum (piece of data) – a copy
of the same datum in some backing store. Each entry also has a tag, which
specifies the identity of the datum in the backing store of which the entry is
a copy.
When the cache client (a CPU, web browser, operating system) needs to access a
datum presumed to exist in the backing store, it first checks the cache. If an
entry can be found with a tag matching that of the desired datum, the datum in
the entry is used instead. This situation is known as a cache hit. The
alternative situation, when the cache is consulted and found not to contain a
datum with the desired tag, has become known as a cache miss. The previously
uncached datum fetched from the backing store during miss handling is usually
copied into the cache, ready for the next access. The percentage of
accesses that result in cache hits is known as the hit rate or hit ratio of the
cache.
Cache metrics:
Hit rate = 1 – miss rate
Hit rate = #references that hit / #references
Miss rate = #references that missed / #references
Average access time = (hit rate * hit time) + (miss rate * miss time)
Average miss penalty = average access time of the next level down
e.g. the average miss penalty of L1 = the average access time of L2
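As a small worked example (the latencies and miss rates below are made-up numbers, not from the notes), plugging values into these formulas for a two-level hierarchy, where the L1 miss penalty is the average access time of L2:

```python
# Hypothetical latencies in cycles and miss rates, purely for illustration.
l1_hit_time, l1_miss_rate = 1, 0.05
l2_hit_time, l2_miss_rate, memory_time = 10, 0.20, 100

# Average access time = (hit rate * hit time) + (miss rate * miss time)
l2_avg = (1 - l2_miss_rate) * l2_hit_time + l2_miss_rate * memory_time  # 28.0 cycles

# Average miss penalty of L1 = average access time of L2 (the next level down)
l1_avg = (1 - l1_miss_rate) * l1_hit_time + l1_miss_rate * l2_avg       # 2.35 cycles

print(l2_avg, l1_avg)
```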
Structure of cache:
The structure of a cache line (tag, data block and flag bits) is described below.
An instruction cache requires only one flag bit per cache row entry: a valid
bit. The valid bit indicates whether or not a cache block has been loaded with
valid data. A data cache typically requires two flag bits per cache line – a
valid bit and a dirty bit. Having a dirty bit set indicates that the associated
cache line has been changed since it was read from main memory
("dirty"), meaning that the processor has written data to that line
and the new value has not propagated all the way to main memory.
The data block is the actual block fetched from the backing storage. The tag is part of the full address of the data in the backing storage. The size of a cache is the amount of main memory data it can hold, which is:
Number of bytes in each data block * number of data blocks per cache level
The memory address is split into a tag, an index and a byte offset, as in the example below.
With 16-byte blocks, the last hex digit of the address specifies the byte within a block (0 to F).
The second and third hex digits from the right specify the index, from 00 to FF – 256 index values.
The remaining most significant digits form the tag, which is compared against the stored tag on a read or write to decide whether it is a hit.
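A minimal sketch of that split (the block size and set count are taken from the example above; the function name is mine):

```python
BLOCK_SIZE = 16   # bytes per block     -> 4 offset bits (one hex digit)
NUM_SETS = 256    # index values 00..FF -> 8 index bits (two hex digits)

OFFSET_BITS = BLOCK_SIZE.bit_length() - 1   # 4
INDEX_BITS = NUM_SETS.bit_length() - 1      # 8

def split_address(addr):
    """Split an address into (tag, index, byte offset)."""
    offset = addr & (BLOCK_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# 0x12345 -> tag 0x12, index 0x34, byte 0x5 within the block
print([hex(field) for field in split_address(0x12345)])
```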
Design dimensions:
1. Block size: this is the read/write granularity to the next level down.
Increasing the block size:
a. Exploits spatial locality by prefetching data that is likely to be used
b. Can also hurt: cache fills take longer, and larger blocks leave less room for other blocks
2. Transfer size: how much data is brought in per block; the data brought in might not be a full block.
Associativity:
The associativity decides where you
can place a block fetched from backing storage.
1. Fully associative: Block can be
placed anywhere.
Disadvantage:
a. Have to check every single block
for a match (using the tag)
b. Consumes power and takes time
Advantage: better hit rate
2. Direct mapped: block can go to
only one place, which is determined by block ID
Disadvantage:
a. Conflicts: with, say, 8 entries, blocks 0, 8, 16, 24, 32, … all map to the same location (index 0) and keep evicting each other; thus a higher miss rate
Advantage: faster access time (than
fully associative), smaller tag size (and hence less space and less power per
access)
3. Set associative: instead of forcing a block into only one location, the hardware may place it in any of n available locations within the set selected by its index (ID); this is called n-way set associative.
4. Multiple direct mapped caches: equivalently, we can have multiple direct mapped caches and select the data from a particular one based on the tag (address). The hardware decides which direct mapped cache the current block should be placed in.
Advantage: more flexibility and
better hit rate than direct mapped
Disadvantage: longer access time and
more power dissipation than direct mapped
Set associative organization is IN BETWEEN direct mapped and fully associative.
Size of cache = block size * associativity * number of sets
Bigger is better for hit rates; bigger is worse for power consumption and access times.
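The placement rule can be sketched in a few lines of Python (an illustration, not from the notes): the block ID selects one of S sets, and the block may sit in any of that set's N ways. Direct mapped is the N = 1 case and fully associative is the S = 1 case.

```python
def lookup(cache, block_id, num_sets):
    """Return cached data for block_id, or None on a miss.
    `cache` is a list of num_sets sets, each a list of (tag, data) ways."""
    set_index = block_id % num_sets      # the only set this block may live in
    tag = block_id // num_sets           # identifies the block within that set
    for stored_tag, data in cache[set_index]:
        if stored_tag == tag:            # tags are compared only within one set
            return data
    return None

# A 2-way set-associative cache with 4 sets (8 blocks of storage in total).
cache = [[] for _ in range(4)]
cache[1].append((0, "block 1"))          # block 1 -> set 1, tag 0
cache[1].append((2, "block 9"))          # block 9 -> set 1, tag 2 (second way absorbs the conflict)

print(lookup(cache, 9, num_sets=4))      # hit: 'block 9'
print(lookup(cache, 5, num_sets=4))      # miss: maps to set 1, but no matching tag
```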
Replacement policies: (when bringing a block in, what to throw out?)
During a cache miss, the CPU usually
ejects some other entry in order to make room for the previously uncached
datum. The heuristic used to select the entry to eject is known as the
replacement policy.
Belady’s min is usually treated as the gold standard: an oracle policy that looks into the future when deciding what to keep in the cache. If a block is not going to be used again for a long time, it is not cached at all; it is transferred directly to the processor instead, and otherwise the block whose next use lies furthest in the future is the one evicted. Since a real cache cannot look into the future, this oracle only serves as a reference point, and it is not even guaranteed to be the best choice for hardware processor caches once different misses have different costs. The practical replacement policy in common use is "least recently used" (LRU), which replaces the least recently used entry on the assumption that the recent past predicts the near future.
Inclusion:
I’ve got the cache; I’ve got the backing store. I make a copy of the block from the backing store to the cache, and a copy of it still remains in the backing store. This is called inclusion.
Inclusive caches:
If block is copied from L2 cache to
L1 cache, its copy also remains in L2.
Exclusive cache:
Block is moved from L2 to L1,
deleting the original copy in L2.
Non-inclusive caches:
Data might be in both caches.
Cache problems:
1. Content management heuristics:
what to cache?
2. Logical organization: where to
put it?
3. Consistency management
heuristics: how to maintain it? – keep a cache consistent with itself, other
caches and backing storage.
4. Due to virtual memory, two different virtual addresses could refer to the same memory location. All updates to that memory location should happen in program order, so the processor should ensure that only one copy of that physical memory location is present in the cache at any given time.
Split caches:
Use separate instruction and data caches. This gives twice the bandwidth, and 2 single-port caches are easier to implement than 1 two-port cache. The instruction cache can be made read only. The instruction cache is used during the fetch stage and the data cache during the memory stage.
Write policies:
1. Write through: every write in the cache goes “through” to memory => writes become very slow
Solution: use write buffers (logically part of memory, physically part of the cache). Data to be written to memory is dumped into the write buffer, and the buffer periodically drains it back to main memory.
Two issues with the write buffer: What if the buffer is full? – stall and wait until it is not full.
What if a read (or write) gets a cache miss? –
If it is a plain write buffer, drain it first.
If it is a write cache, its entries carry tags, so it can be searched and the data read directly from it.
THE WRITE BUFFER CONVERTS A LATENCY PROBLEM INTO A BANDWIDTH PROBLEM – we can’t make memory itself faster, as that is costly and has latency limits; instead we use write buffers and improve the average latency.
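A toy sketch of that idea (the buffer size and the drain-on-miss policy here are assumptions for illustration): writes are queued instead of waiting on memory, the processor stalls only when the buffer is full, and a read miss drains the buffer first so it cannot observe stale memory.

```python
from collections import deque

class WriteBuffer:
    """Toy write buffer between a write-through cache and main memory."""

    def __init__(self, memory, capacity=4):
        self.memory = memory
        self.capacity = capacity
        self.pending = deque()              # (address, value) writes not yet in memory

    def write(self, addr, value):
        if len(self.pending) == self.capacity:
            self.drain()                    # buffer full: stall until it empties
        self.pending.append((addr, value))  # otherwise the processor does not wait on memory

    def drain(self):
        while self.pending:                 # push the buffered writes out to main memory
            addr, value = self.pending.popleft()
            self.memory[addr] = value

    def read_miss(self, addr):
        self.drain()                        # make memory up to date before servicing the miss
        return self.memory.get(addr)

memory = {}
buf = WriteBuffer(memory)
buf.write(0x10, "A")
buf.write(0x14, "B")
print(buf.read_miss(0x10))                  # 'A' (buffer drained before the miss was serviced)
```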
2. Write back: when data in the cache is modified, do not write it back to main memory immediately. This is called “lazy write” => there is no problem even if main memory is slow, and it reduces unnecessary writes to main memory.
Problem: the data in the cache and in main memory are not consistent (inconsistency) – how do we track it?
With a dirty bit – a Boolean flag that is set to 1 when the data in the cache has been modified.
When does the data get written to main memory? When a read miss occurs and some block has to be evicted to make room for the new incoming block. But now every read miss potentially takes more time than before – first the evicted block has to be written to main memory, and only then can the new block be brought into the cache.
Even on a write miss, the processor has to check whether the block being replaced is dirty. If it is, that data has to be written to main memory (or to the next cache level below) before the fresh copy is placed, and the dirty bit of the fresh copy is then set to 1.
In general: write back – better performance. Write through – simpler in terms
of programming model.
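The dirty-bit behaviour can be sketched as follows (a direct-mapped cache is assumed purely to keep the example short): writes touch only the cache and set the dirty bit, and the write to main memory happens lazily, when the dirty block is evicted on a later miss.

```python
class WriteBackCache:
    """Direct-mapped write-back cache: one [tag, data, dirty] entry per set."""

    def __init__(self, memory, num_sets=4):
        self.memory = memory
        self.num_sets = num_sets
        self.lines = {}                        # set index -> [tag, data, dirty]

    def _evict_and_fill(self, index, tag):
        line = self.lines.get(index)
        if line is not None and line[2]:       # victim is dirty: the lazy write happens now
            self.memory[(line[0], index)] = line[1]
        data = self.memory.get((tag, index))   # fetch the requested block
        self.lines[index] = [tag, data, False]
        return self.lines[index]

    def read(self, addr):
        tag, index = addr // self.num_sets, addr % self.num_sets
        line = self.lines.get(index)
        if line is None or line[0] != tag:     # miss (may trigger a write-back first)
            line = self._evict_and_fill(index, tag)
        return line[1]

    def write(self, addr, value):
        tag, index = addr // self.num_sets, addr % self.num_sets
        line = self.lines.get(index)
        if line is None or line[0] != tag:
            line = self._evict_and_fill(index, tag)
        line[1], line[2] = value, True         # modify only the cache; mark it dirty

memory = {}
cache = WriteBackCache(memory)
cache.write(1, "X")                            # main memory is NOT updated yet
cache.read(5)                                  # maps to the same set: dirty block 1 is written back
print(memory)                                  # {(0, 1): 'X'}
```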
Cache miss types: (called the 4 c’s)
A cache miss refers to a failed
attempt to read or write a piece of data in the cache, which results in a main
memory access with much longer latency (the delay between the miss and actual
availability of data). There are three kinds of cache misses: instruction read
miss, data read miss, and data write miss.
Instruction read misses cost the most time, because the processor has to wait for the instruction to be available before it can proceed.
Data read misses come next: while the data is being fetched, the processor can still execute instructions that do not depend on the requested data.
Data write misses cost the least time, because the writes can be queued in the write buffer and the processor only has to wait when the buffer is full.
Misses can be classified into 4 types:
Compulsory miss:
(also called a cold start miss) occurs on the first access to a block, such as at initial startup when no values are in the cache yet. Can be reduced by prefetching data and increasing the block size.
Capacity miss:
occurs because of small cache size. Increase the cache size to solve this.
Conflict miss:
occurs even when the cache is big enough but lacks associativity, so blocks that map to the same set keep evicting each other. Increase associativity to solve this.
Coherence:
caused by the coherence protocol, e.g. when a write by another processor invalidates a block that this cache would otherwise still hold.
Addressing schemes:
There are 4 schemes:
1. Virtually indexed virtually tagged:
both the index and tag are obtained from the virtual address.
Advantage: very fast lookup time
Disadvantage: susceptible to virtual
address aliasing
2. Physically indexed physically tagged: both the index and the tag are obtained from the physical address. Used in L2 caches and beyond (L1 caches have to be so fast that usually only VIPT is suitable for them).
Advantage: no aliasing problems
Disadvantage: longer latency
3. Virtually indexed physically tagged: the index comes from the virtual address. By the time the cache has been indexed and the candidate cache line found, the address translation is also done. The tag stored in the cache line is then compared with the physical tag to decide whether it is a hit or a miss.
Advantage over PIPT: lower latency, as the virtual index is used to search the cache at the same time the address translation is taking place. However, we still have to wait for the physical tag from the translation before declaring a hit or a miss.
Advantage over VIVT: there are no aliasing issues and coherence is easier to maintain.
4. Physically indexed virtually tagged: this combination is not useful; if the physical address is already in hand for indexing, there is no reason to tag with the virtual address.
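A common rule of thumb, added here as background rather than taken from the notes: a VIPT cache behaves exactly like a PIPT cache (no aliasing at all) when the index and block-offset bits fall entirely inside the page offset, i.e. when cache size / associativity is no larger than the page size. The sketch below checks that condition.

```python
import math

def vipt_aliasing_free(cache_size, associativity, block_size, page_size=4096):
    """True if a VIPT cache's virtual index is guaranteed to equal its
    physical index, i.e. index + offset bits all lie inside the page offset."""
    sets = cache_size // (associativity * block_size)
    index_plus_offset_bits = int(math.log2(sets)) + int(math.log2(block_size))
    return index_plus_offset_bits <= int(math.log2(page_size))

# A hypothetical 32 KiB, 8-way, 64 B/line L1: 64 sets -> 6 + 6 = 12 bits,
# which fit inside a 4 KiB page offset, so virtual and physical index agree.
print(vipt_aliasing_free(32 * 1024, 8, 64))   # True
print(vipt_aliasing_free(64 * 1024, 4, 64))   # False: index bits spill past the page offset
```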
Cache coherence and consistency:
Memory coherence: memory behaves
“rationally”
Memory consistency: the programming contract of the hardware; it defines the behavior the programmer can rely on
Cache coherence: hardware
implementation. Guarantees memory coherence
Cache consistency: the principle –
cache systems behave “rationally”
The contents of memory simply don’t change until you overwrite or erase them. When clients in a system maintain caches of a common memory resource, problems may arise with inconsistent data. This is particularly true of CPUs in a multiprocessing system. For example, if one client has a copy of a memory block from a previous read and another client then changes that memory block, the first client could be left with an invalid cached copy of memory without any notification of the change. Cache coherence is intended to manage such conflicts and maintain consistency between cache and memory.
The timing or ordering of writes answers the question "when does the rest of the system see the results?".
Consistency models:
The data consistency model specifies
a contract between programmer and system, wherein the system guarantees that if
the programmer follows the rules, memory will be consistent and the results of
memory operations will be predictable.
1. Strict consistency: the value read by a processor is always the most
recently written value. Makes perfect sense for a uniprocessor, but creates a
problem in multiprocessors.
Consider two processors A and B: if A writes a value and B reads the same location, the soonest the new value can be available at B is determined by the communication latency (the light cone originating at A touches B at some point, which is the earliest the data can be visible to B).
Thus, strict consistency is never
used for multiprocessors.
2. Sequential consistency: this is what everybody does.
a. Memory operations are performed
in program order. =>if you write load before store, load actually
happens before store.
b. All CPU’s see the same global
ordering.
c. The global ordering
(interleaving) can be anything.
Implication: if machine A sees a particular order, then everyone in the system sees the same order.
d. All I/O must go through the
memory system (the coherence system)
e. You can’t buffer writes.
f. If you have write A, read B, read
C, read D, then you can’t buffer write A and start performing the reads. This
is because, when you do that, the world sees BCDA. But the actual global order
is supposed to be ABCD.
g. Reordering operations screws
things up.
3. Processor consistency: also called total store order.
a. The only ordering is of writes
b. Reads can occur in any order
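To make the difference concrete, here is the classic "store buffering" litmus test (this example is added, not from the notes). Under sequential consistency at least one of r1, r2 must end up 1; total store order also allows (0, 0), because each CPU's store can sit in its write buffer while the following load goes ahead. Running this under CPython will essentially never show the relaxed outcome; it is only meant to make the ordering argument concrete.

```python
import threading

x = y = 0          # shared memory locations
r1 = r2 = None     # per-CPU "registers" holding the results

def cpu_a():
    global x, r1
    x = 1          # store to x
    r1 = y         # then load y

def cpu_b():
    global y, r2
    y = 1          # store to y
    r2 = x         # then load x

ta = threading.Thread(target=cpu_a)
tb = threading.Thread(target=cpu_b)
ta.start(); tb.start(); ta.join(); tb.join()

# Sequential consistency: the four operations interleave in program order,
# so (r1, r2) can be (0, 1), (1, 0) or (1, 1) but never (0, 0).
# TSO: each store may wait in its CPU's write buffer while the following
# load goes ahead, so (0, 0) is also a legal outcome.
print(r1, r2)
```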
Cache coherence:
SI: Shared/Invalid: works only with write-through caches, because with write-through the backing store always holds up-to-date data, so a cached copy is consistent with the other caches and can be evicted safely.
A data block can be in one of two states: shared (cannot be written) or invalid (not present).
On read: get a shared copy from
memory or any other cache
On write:
1. Write update: every time you write to memory, you also update all the other caches with the same value.
2. Write invalidate: every time you write to memory, you ask all other caches holding this block to evict their copies; they must later fetch a fresh copy from the backing store.
MSI: Modified Shared Invalid: supports write back caches as well.
Each block within any cache can be
in one of the three states:
Modified:
the block in the cache has be written and is inconsistent with backing storage
and other caches.
Shared:
the block in this cache is a read only copy and can be evicted safely.
Invalid:
the block is not available in the cache and should be present in the backing
store or any other cache.
These coherency states are
maintained through communication between the caches and the backing store. The
caches have different responsibilities when blocks are read or written, or when
they learn of other caches issuing reads or writes for a block.
When a read request arrives at a
cache for a block in the "M" or "S" states, the cache
supplies the data. If the block is not in the cache (in the "I"
state), it must verify that the line is not in the "M" state in any
other cache. Different caching architectures handle this differently. For
example, bus architectures often perform snooping, where the read request is
broadcast to all of the caches. Other architectures include cache directories
which have agents (directories) that know which caches last had copies of a
particular cache block. If another cache has the block in the "M"
state, it must write back the data to the backing store and go to the
"S" or "I" states. Once any "M" line is written
back, the cache obtains the block from either the backing store, or another cache
with the data in the "S" state. The cache can then supply the data to
the requestor. After supplying the data, the cache block is in the
"S" state.
When a write request arrives at a
cache for a block in the "M" state, the cache modifies the data
locally. If the block is in the "S" state, the cache must notify any
other caches that might contain the block in the "S" state that they
must evict the block. This notification may be via bus snooping or a directory,
as described above. Then the data may be locally modified. If the block is in
the "I" state, the cache must notify any other caches that might
contain the block in the "S" or "M" states that they must
evict the block. If the block is in another cache in the "M" state,
that cache must either write the data to the backing store or supply it to the
requesting cache. If at this point the cache does not yet have the block
locally, the block is read from the backing store before being modified in the
cache. After the data is modified, the cache block is in the "M"
state.
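Here is a compact sketch of those MSI transitions (a bus-snooping organisation is assumed, memory traffic and directories are omitted, and the class names are mine): each cache keeps one of the three states per block, reads pull other caches' Modified copies down to Shared, and writes invalidate all other copies.

```python
M, S, I = "Modified", "Shared", "Invalid"

class MSICache:
    def __init__(self, name, bus):
        self.name, self.state = name, {}       # block -> M/S/I
        self.bus = bus
        bus.caches.append(self)

    def read(self, block):
        if self.state.get(block, I) == I:      # read miss: a Modified copy elsewhere must be written back
            self.bus.snoop_read(block, requester=self)
            self.state[block] = S
        return self.state[block]

    def write(self, block):
        if self.state.get(block, I) != M:      # need exclusive ownership before writing
            self.bus.snoop_invalidate(block, requester=self)
            self.state[block] = M
        return self.state[block]

class Bus:
    def __init__(self):
        self.caches = []

    def snoop_read(self, block, requester):
        for c in self.caches:
            if c is not requester and c.state.get(block, I) == M:
                c.state[block] = S             # write back to memory, keep a shared copy

    def snoop_invalidate(self, block, requester):
        for c in self.caches:
            if c is not requester and c.state.get(block, I) != I:
                c.state[block] = I             # evict (writing back first if it was Modified)

bus = Bus()
a, b = MSICache("A", bus), MSICache("B", bus)
a.write("X")                                   # A: Modified
b.read("X")                                    # A's copy drops to Shared; B: Shared
print(a.state["X"], b.state["X"])              # Shared Shared
b.write("X")                                   # B invalidates A's copy; B: Modified
print(a.state["X"], b.state["X"])              # Invalid Modified
```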
MESI: Modified Exclusive Shared Invalid:
Every cache line is marked with one
of the four following states (coded in two additional bits):
Modified
The cache line is present only in
the current cache, and is dirty; it has been modified from the value in main
memory. The cache is required to write the data back to main memory at some
time in the future, before permitting any other read of the (no longer valid)
main memory state. The write-back changes the line to the Exclusive state.
Exclusive
The cache line is present only in the current cache (in the whole system, it is present only in this cache), but is clean; it matches main memory. It may be changed to the Shared
state at any time, in response to a read request. Alternatively, it may be
changed to the Modified state when writing to it.
Shared
Indicates that this cache line may
be stored in other caches of the machine and is "clean"; it matches
the main memory. The line may be discarded (changed to the Invalid state) at
any time.
Invalid
Indicates that this cache line is
invalid (unused).
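For reference, one common textbook summary of the MESI transitions as a lookup table (added here; a snooping bus with BusRd/BusRdX requests is assumed, and exact details vary between implementations):

```python
# (current state, event) -> next state
# Events: local "read hit" / "write hit", local misses (this cache issues a
# BusRd or BusRdX), and snooped requests from other caches.
MESI = {
    ("M", "read hit"): "M",   ("M", "write hit"): "M",
    ("M", "snoop BusRd"): "S",      # supply/write back the data, keep a clean copy
    ("M", "snoop BusRdX"): "I",     # write back, then invalidate
    ("E", "read hit"): "E",
    ("E", "write hit"): "M",        # silent upgrade, no bus transaction needed
    ("E", "snoop BusRd"): "S",
    ("E", "snoop BusRdX"): "I",
    ("S", "read hit"): "S",
    ("S", "write hit"): "M",        # must broadcast an invalidate first
    ("S", "snoop BusRd"): "S",
    ("S", "snoop BusRdX"): "I",
    ("I", "read miss, no other copy"): "E",
    ("I", "read miss, shared elsewhere"): "S",
    ("I", "write miss/upgrade"): "M",
}

print(MESI[("E", "write hit")])     # 'M': Exclusive lines can be written without bus traffic
```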
DISCLAIMER: The above article is based on my notes from one of my graduate courses, and a major portion of it is from Wikipedia.
Cheers,
Vijay.