ECE366 Fall 2013 Computer Organization II University of Illinois at Chicago

ECE366 Lab 4: Cache Simulation

For this lab assignment, you will write a configurable cache simulator (in C, Java, or whatever programming language you prefer). Your cache simulator will read an address trace (a chronological list of memory addresses referenced), simulate the cache, generate cache hit and miss data, and calculate the execution time for the executing program. The address trace has been generated by a simulator executing some real programs. Your cache simulator is not the end product of this lab, but a tool you will use to complete it. In this lab, you will experiment with various cache configurations and draw conclusions about the optimal cache organization for this set of programs.

The simulator:

The simulator takes the trace file as its input and generates a report as the output. Each trace contains the memory accesses of millions of instructions, and your simulator should be able to process all of them. The report should include:

A. Number of accesses, number of misses, and miss rate for all accesses (loads and stores)
B. Number of accesses, number of misses, and miss rate for loads only
C. Instruction count and execution time for the program, in cycles and seconds
D. Total memory access time and average memory access time (cycles per access)
E. Average CPI

To calculate execution time, assume that:

1- Each instruction takes one cycle to execute.
2- A load takes one cycle plus the memory access time.

Your program should support, at least, the following six tunable parameters:

1- Cache size
2- Cache associativity
3- Cache-line size
4- Miss penalty
5- Cycle time
6- Write policy: either write-allocate or write-around

If you are doing the extra credit parts, you will also need the following parameter:

1. Replacement policy: either LRU or Randomized

The cache specifications:

The baseline cache configuration will be a 16-byte line size, direct-mapped, 16 KB cache size, write-through and write-allocate. Assume a default clock rate of 1 GHz. Memory access time for a load hit is 0 cycles (the same cycle as the load being executed) and for a load miss is 40 cycles (unless specified otherwise). The default replacement policy (for set-associative caches) is LRU.

The address trace:

An address trace is simply a list of addresses produced by a program running on a processor. These are the addresses resulting from load and store instructions in the code as it is executed. Some address traces would include both instruction fetch addresses and data (load and store) addresses, but you will be simulating only a data cache, so these traces contain only data addresses. These traces were generated by a simulator of a RISC processor running five programs, gcc, gzip, mcf, swim, and twolf, from the SPEC 2000 benchmarks. All lines of the address trace are of the format:

l/s ADDRESS IC

where l denotes a load and s a store, ADDRESS is an 8-character hexadecimal number, and IC is the number of instructions executed between the previous memory access and this one (not including the load or store instruction itself). There is a single space between each field. The instruction count information will be used to calculate execution time (or at least cycle count).
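For reference, one way to parse a record of this format in Java (the TraceRecord name and its fields are illustrative, not part of the assignment):

    // Hypothetical record type for one "l/s ADDRESS IC" trace line.
    class TraceRecord {
        boolean isLoad;     // true for 'l', false for 's'
        long address;       // 32-bit address, parsed from 8 hex characters
        int instrBetween;   // IC: instructions since the previous access

        static TraceRecord parse(String line) {
            String[] f = line.trim().split(" ");
            TraceRecord r = new TraceRecord();
            r.isLoad = f[0].equals("l");
            r.address = Long.parseLong(f[1], 16); // hex, e.g. 7fffed80
            r.instrBetween = Integer.parseInt(f[2]);
            return r;
        }
    }

Reading the trace line by line (for example, with a BufferedReader) keeps memory use flat even though each trace covers millions of instructions.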
A sample address trace starts out like this:

l 7fffed80 6
l 10010000 0
l 10010003 2
s 10010030 3
l 1001000E 5
l 10010064 3
s 10010034 3

You should assume no accesses address multiple cache lines (e.g., assume all accesses are for 32 bits or less). In the trace shown, the first 24 instructions should take 144 cycles, assuming 3 cache misses and 2 cache hits for the 5 accesses and a 40-cycle miss penalty (assuming the baseline configuration): 24 cycles of instruction execution plus 3 x 40 = 120 penalty cycles.

Cache configuration experiment:

You will re-evaluate the default parameters one at a time, in the following order. In each configuration, choose the best value for the parameter, then use that value for all subsequent analyses. Because the workload is five programs (trace files), you will run five simulations for each configuration you evaluate, and then combine the results, using average execution time as your evaluation metric.

A. Look at cache line (block) sizes of 16, 32, and 64 bytes. Assume that it takes two extra cycles to load 32 bytes into the cache, and 6 extra cycles to load 64 bytes (i.e., raise the miss penalty accordingly). Choose the best cache line size (in terms of miss penalty) and proceed.
B. Look at 16 KB, 32 KB, and 128 KB cache sizes. Larger caches take longer to access, so assume that the 32 KB cache requires a 5% longer cycle time, and the 128 KB cache 15% longer. Choose the best configuration and proceed to the next step.
C. Look at cache associativity of direct-mapped, 2-way set-associative, and 8-way set-associative. Assume that 2-way associativity adds 5% to the cycle time, and 8-way adds 10%. Choose the best configuration and proceed.
D. Compare the write-allocate policy with a write-around policy (see notes 1, 2, and 3).
E. (extra credit) Compare the LRU replacement policy with the Randomized policy.

Lab 4A demo: In the first week of the lab, you will demo the first two sets of configurations and find the optimal cache line size and cache size. In both cases, the cache is a direct-mapped cache.

Lab 4B demo: In the second week of this lab, you will demo the effect of cache associativity on execution time.

Lab 4C demo: In the last week of this lab, you will demo the effect of write policy on execution time. If you also implement the replacement policy, you will show your work in this demo.

Submit a lab report with the following:

(a) Show your results and design decisions for each of the four configuration parameters above. Provide data/results in meaningful charts or figures.
(b) Is cache miss rate a good indicator of performance? In what cases did the option with the lowest miss rate not have the lowest execution time? Why?
(c) Were results uniform across the five programs? In what cases did different programs give different conclusions? Speculate as to why that may have been true.
(d) What was the speedup of your final design over the default?
(e) Summarize the data structure and algorithm you used in the simulator.
(f) Turn in your source code file with sample output.
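Steps A through C all change how an address maps onto the cache. A minimal Java sketch of the usual tag/index/offset split, assuming the baseline parameters (class and variable names here are illustrative, not part of the assignment):

    // Illustrative address decomposition for a configurable cache.
    // cacheSize, blockSize, and associativity are the tunable parameters.
    class AddressSplit {
        public static void main(String[] args) {
            int cacheSize = 16 * 1024;  // 16 KB baseline
            int blockSize = 16;         // 16-byte lines
            int associativity = 1;      // direct-mapped

            int numSets = cacheSize / (blockSize * associativity);
            int offsetBits = Integer.numberOfTrailingZeros(blockSize);
            int indexBits = Integer.numberOfTrailingZeros(numSets);

            long address = 0x7fffed80L; // first address in the sample trace
            long index = (address >> offsetBits) & (numSets - 1);
            long tag = address >> (offsetBits + indexBits);
            System.out.printf("set index = %d, tag = %x%n", index, tag);
        }
    }

The shift-and-mask form only works when the block size and set count are powers of two, which holds for every configuration in this lab.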
Notes & Hints:

1. Details about write-allocate caches: A write-allocate cache, on a write miss, will make a spot for the line in the cache and load it from memory, and the line becomes available to read "miss penalty" cycles later. If the processor accesses it (with a load) before that, it stalls until the data is ready. So for a write-allocate cache, you can get a store miss at time T, followed by a load hit at time T+7. However, the load still must stall until the data becomes available at time T + miss_penalty (for bookkeeping, it is still a load hit, but for timing, you must record the stall), so the processor will see a miss penalty on this cache hit of (miss_penalty - 7) cycles. Thus, you never stall for a store, but you may stall on a later load to the same cache line (that is, we're assuming a large store buffer). You need not stall (more than the original miss penalty) on a miss that evicts a line in transit (still waiting for the store to complete); we'll assume that the first transfer can complete, and the original store can complete, while the new data is being brought from memory.

2. A store instruction can hit or miss in the cache (and thus change the contents of the cache for a write-allocate cache), but does not incur a miss penalty, since no data is returned to the CPU.

3. In the write-around policy, the processor still does not stall for stores, and a store miss does not change the contents of the cache.

4. Think about how to intelligently debug and test your program. Running immediately on the entire input gives you little insight into whether it is working (unless it is way off).

5. Speed matters. These simulations should take a couple of minutes (actually, less) on an unloaded machine. If a run is taking much more than that, do yourself a favor and think about what you are doing inefficiently.
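Notes 1 through 3 can be captured by tracking, for each cache line, the cycle at which its data becomes readable. Below is a hedged sketch in Java of one possible bookkeeping scheme; the class, field, and method names are invented for illustration, and nothing here is the required implementation:

    // A sketch (not the required implementation) of store handling
    // under write-allocate vs. write-around, per notes 1-3.
    class WritePolicySketch {
        boolean writeAllocate;   // true: write-allocate; false: write-around
        long[][] cacheTag;       // [set][way] tag store
        int[][] validBit;        // [set][way] 1 if line holds valid data
        long[][] timeWhenReady;  // [set][way] cycle when line data is readable

        void handleStoreMiss(int set, int way, long tag, long now, int missPenalty) {
            if (writeAllocate) {
                // Allocate the line; it becomes readable missPenalty cycles
                // from now (note 1). The CPU never stalls for a store (note 2).
                cacheTag[set][way] = tag;
                validBit[set][way] = 1;
                timeWhenReady[set][way] = now + missPenalty;
            }
            // Write-around: a store miss leaves the cache unchanged and
            // never stalls (note 3), so there is nothing to do here.
        }

        // On a later load hit to the same line, the stall from note 1 is:
        long loadStall(int set, int way, long now) {
            return Math.max(0, timeWhenReady[set][way] - now);
        }
    }

With this bookkeeping, a store miss at time T followed by a load hit at T+7 stalls the load for exactly (miss_penalty - 7) cycles, matching note 1.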
Cache Simulation
Kai Zhao
ECE 366 Lab 04
2013 Dec 1

[Figure: "Runtime Average of All 5 Trace Files"; average runtime (s) for every configuration, grouped by cache size (16KB/32KB/128KB), block size (16B/32B/64B), associativity (1/2/4/8-way), write policy (write-allocate/write-around), and replacement policy (LRU/Randomized).]

[Figure: "Cycle Count Average of All 5 Trace Files"; average cycle count for the same configuration groupings.]

16 KB is Fastest Cache Size
[Figure: "Runtime vs Cache Size"; average runtime (s) of all configurations with each cache size.]

16 Byte is Fastest Block Size
[Figure: "Runtime vs Block Size"; average runtime (s) of all configurations with each block size (16 B, 32 B, 64 B).]

Direct-Mapped (1-Way) Cache is Fastest
[Figure: "Runtime vs Associativity"; average runtime (s) of all configurations with each associativity (1-way, 2-way, 8-way).]

Write Allocate is Slightly Faster
[Figure: "Runtime vs Write Policy"; average runtime (s) of all configurations with each write policy (write allocate, write-around).]
• Assumption: data can be read from the write buffer. For write-around with loads after a store, instead of two memory access penalties (one to write to memory, another to read from memory), there is only one memory access penalty, to load the remainder of the block.
• Assumption summary: no penalty and no FIFO for writing to memory.

LRU is Infinitesimally Faster
[Figure: "Runtime vs Replacement Policy"; average runtime (s) of all configurations with each replacement policy (LRU, Randomized).]
Part A
Show your results and design decisions for each of the four configuration parameters above.
• In order of importance:
  – 16 KB is the fastest cache size
  – Direct-mapped cache is the fastest associativity
  – 16 Byte is the fastest block size
  – Write allocate is faster than write-around
  – LRU is faster than Randomized replacement

[Figure: "Miss Rate Average of All 5 Trace Files"; average miss rate for every configuration, grouped by cache size, block size, associativity, write policy, and replacement policy.]
Part B
Is cache miss rate a good indicator of performance? In what cases did the option with the lowest miss rate not have the lowest execution time? Why?
• Cache miss rate is not a good indicator of performance.
  – Although 64 Byte block sizes have the lowest miss rate, they have the longest execution time.
  – The extra miss-penalty cycles are too expensive relative to the reduction in miss rate: with the lab's penalties, a 64 B miss costs 46 cycles versus 40 for 16 B, so the 64 B miss rate must fall below 40/46 of the 16 B miss rate just to break even.

[Figure: "Runtime of gcc.trace"; runtime (s) for every configuration, grouped by cache size, block size, associativity, write policy, and replacement policy.]

Part C
Were results uniform across the five programs? In what cases did different programs give different conclusions?
• Most programs agree that the default configuration is fastest:
  – 16 KB cache size, 16 B block size, 1-way associativity, write allocate, and LRU
• For gcc.trace, there is another configuration that is 1.65% faster than the default configuration.
  – Due to a much lower miss rate compared to the default

Part D
What was the speedup of your final design over the default?
• 0%, because my final design is the default configuration.

Part E: Data Structures
Summarize the data structure and algorithm you used in the simulator.
• Configuration inputs: traceFile, cacheSize, blockSize, associativity, writePolicy, and replacementPolicy
• 2D arrays of integers: the first dimension is the number of indices (sets), the second dimension is the number of ways:
  – int[][] cache; // stores the tag of every block
  – int[][] validBit; // initialized to 0's; used to determine whether the data in the cache is valid
  – int[][] timeWhenReady; // used for write allocate to determine when the line will be ready if there is a load hit after a store miss
  – int[][] lastUsedTime; // updated on each reference to implement the LRU policy
• Counters for simulation output: clockCycleNumber, instructionCount, numberOfLoadHits, numberOfLoadMisses, numberOfStoreHits, numberOfStoreMisses
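Taken literally, the state described on this slide could be declared as follows (a sketch only; the set count shown assumes the baseline 16 KB direct-mapped cache with 16-byte lines):

    // Sketch of the simulator state described above (names from the slide;
    // allocation sizes shown for the baseline 16 KB, 16 B-line, 1-way cache).
    class CacheSimState {
        int ways = 1;
        int numSets = 16 * 1024 / (16 * ways);  // cacheSize / (blockSize * ways)

        int[][] cache = new int[numSets][ways];          // tag of every block
        int[][] validBit = new int[numSets][ways];       // 0 = empty, 1 = valid
        int[][] timeWhenReady = new int[numSets][ways];  // write-allocate timing
        int[][] lastUsedTime = new int[numSets][ways];   // for LRU

        // Output counters.
        long clockCycleNumber, instructionCount;
        long numberOfLoadHits, numberOfLoadMisses;
        long numberOfStoreHits, numberOfStoreMisses;
    }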
Part E: Algorithms
Summarize the data structure and algorithm you used in the simulator.
• Search for a cache hit:
  – Find the index, then traverse each way, checking whether validBit > 0 and the tag matches.
• Stall if a load follows a store miss:
  – if clockCycleNumber + numberOfInstructionsInBetween < timeWhenReady:
        clockCycleNumber = timeWhenReady
    else:
        clockCycleNumber += numberOfInstructionsInBetween + 1
• Write allocate:
  – Update the cache and timeWhenReady on store misses.
• LRU:
  – Loop through each way to check whether any validBit == 0. If not, replace the way with the lowest lastUsedTime.
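For concreteness, the hit search and LRU victim selection described above might look like this in Java (a sketch reusing the field names from Part E; illustrative, not the exact submitted code):

    // Sketch of the lookup and LRU replacement steps described above.
    class LruSketch {
        int[][] cache, validBit, lastUsedTime;

        // Returns the way holding 'tag' in set 'set', or -1 on a miss.
        int findWay(int set, int tag) {
            for (int w = 0; w < cache[set].length; w++) {
                if (validBit[set][w] > 0 && cache[set][w] == tag) return w;
            }
            return -1;
        }

        // Chooses a victim way: an invalid way if one exists,
        // otherwise the least recently used way.
        int chooseVictim(int set) {
            int victim = 0;
            for (int w = 0; w < cache[set].length; w++) {
                if (validBit[set][w] == 0) return w;  // free slot
                if (lastUsedTime[set][w] < lastUsedTime[set][victim]) victim = w;
            }
            return victim;
        }
    }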