Spring 2004 CS 152 Final Project
8-Stage Deep-Pipelined MIPS Processor

Members:
Otto Chiu (cs152-ae)
Charles Choi (cs152-bm)
Teddy Lee (cs152-ac)
Man-Kit Leung (cs152-al)
Bruce Wang (cs152-am)

Table of Contents
0. Abstract
1. Division of Labor
2. Detailed Strategy
   2.1 Datapath
   2.2 Stage Summary
   2.3 Stage Details
   2.4 Forwarding Paths
   2.5 Branch Instructions
   2.6 Branch Predictor
   2.7 Write-Back Cache
   2.8 Victim Cache & Write Buffer
   2.9 Jump-Target Predictor
   2.10 (Un)Signed Multiply/Divide
3. Testing
   3.1 General Testing
   3.2 Victim Cache Test
   3.3 Branch Predictor
   3.4 (Un)Signed Multiply/Divide
4. Results
5. Conclusion
6. Appendix I (Notebooks)
7. Appendix II (Schematics)
8. Appendix III (Verilog Files)
9. Appendix IV (Test Files)
10. Appendix V (Timing)

0. Abstract
The goal of our final project was to improve the performance of the 5-stage pipelined processor from the previous labs. To this end, we converted our processor into an 8-stage deep-pipelined one (22 pt.). Since an increase in the number of branch delay slots is an intrinsic drawback of adding more pipeline stages, we added a branch predictor to cut down the number of stalled cycles in most cases (8 pt.). The same problem appears with the jr instruction, so we installed a jump-target predictor (8 pt.) to abate the stalls associated with that instruction. In addition, we implemented a write-back cache (7 pt.) and added a victim cache (4 pt.) to minimize first-level cache miss penalties. Finally, we added to our multiplier/divider the ability to handle signed numbers (4 pt.). In total, our project implemented 53 pt. out of the 37.5 pt. required for our group. We implemented and verified our design thoroughly and measured a significant improvement in performance. The final clock speed of our processor is 27 MHz.

1. Division of Labor
The project specification allowed us to split up the work by major component: branch predictor, write-back cache, victim cache, jump-target predictor, and signed multiplier/divider. The entire group was involved in changing and verifying the existing design. Here is a more detailed division of labor:
Otto: branch predictor, signed mult/div
Charles: branch predictor, memory controller
Teddy: datapath, mult/div, testbenches
Man-Kit: write-back cache, victim cache
Bruce: write-back cache, jump-target predictor

2. Detailed Strategy
2.1 Datapath
We noticed in lab 5 that our critical path involved the memory components, so we knew we would need to concentrate on splitting the memory paths. The timing analyzer for lab 5 showed that a memory lookup alone took more than 10 ns. To achieve the 28 ns cycle-time requirement, we began by splitting up each memory stage, because the timing analysis tool told us that those stages, together with their forwarding paths, took significantly more time than the other stages. The idea was to progressively split up the stages with long critical paths. Since a lot of work is involved in splitting a stage, we also cut critical paths by shifting components across stages whenever possible. Doing so potentially introduces extra cycles of delay, but this is partially remedied by the higher clock speed.
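Mechanically, splitting a stage means inserting another bank of pipeline registers with the same stall and flush plumbing as the existing stage boundaries. Below is a minimal sketch of such a pipeline register; the module and signal names are ours for illustration and are not taken from our actual Verilog sources.

// Hypothetical pipeline register between two stages: holds its
// contents when stalled, and clears to a bubble when flushed
// (e.g., on a branch misprediction). The MIPS NOP encoding is
// all zeros, so flushing an IR register to zero inserts a NOP.
module pipe_reg #(parameter WIDTH = 32) (
    input                  clk,
    input                  stall,   // hold current contents
    input                  flush,   // insert a bubble
    input  [WIDTH-1:0]     d,
    output reg [WIDTH-1:0] q
);
    always @(posedge clk) begin
        if (flush)
            q <= {WIDTH{1'b0}};
        else if (!stall)
            q <= d;
    end
endmodule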
We split our pipeline into the following stages:
IF PR ID EX MR MW WB FW
Figure 1: Pipeline Stages

2.2 Stage Summary
IF: instruction fetch
PR: predict and register
  - branch prediction
  - jump-target prediction
  - register file read/write
ID: decode
  - mult/div
EX: execute
  - resolve branches
MR: memory read
MW: memory write
WB: write back
FW: forward
  - forward the WB value to the ID stage

We initially split our memory access into a tag-lookup stage and a data-lookup stage. However, we soon moved the data lookup in parallel with the tag lookup and registered the data in the MW stage, because we found that the critical path ran through the data lookup plus the forwarding logic. By moving the data lookup one stage earlier, we separated it from the forwarding logic, and the routing time was shortened as well.

2.3 Stage Details
IF: The instruction is fetched in this stage. We look up the tag file and the instruction cache simultaneously, then check the tag to determine whether to output the instruction or stall. Our original design did only the tag lookup in this stage, but we realized that we could look up the instruction at the same time.
PR: The register file is located here. It was moved out of the ID stage because of the amount of forwarding logic in the ID stage. We also do our branch and jump-target prediction in this stage.
ID: We decode the instruction in this stage. Most forwarded values are forwarded to this stage. Our multiplier and divider are also located here.
EX: Arithmetic operations other than multiplication and division are performed in this stage. We also resolve the branch and jump predictions made earlier.
MR: As in the IF stage, we can do a tag lookup and a cache read in the same cycle. But since we need to know whether there has been a cache hit before we can write to the cache, we cannot perform stores in this stage.
MW: We perform stores to the cache in this stage. When a store is followed by a load to the same address, the store and the load can happen in the same cycle. Since the result of writing and reading the same RAM line in the same cycle is undefined, we could not let the cache handle this case by itself. But when a store is followed by a load to the same address, the loaded value must be the value just stored, so we simply latch the store value and output it on the load.
WB: A typical write-back stage that writes results back to the register file. We also have forwarding paths from here to ID and EX.
FW: This stage only forwards values to the ID stage. We added it because we wanted to forward to the ID stage instead of the PR stage; the details are explained below.

2.4 Forwarding Paths
The forwarding paths we had in lab 5 stayed in our design. Because we added 3 stages to the pipeline, we had to introduce more forwarding paths. Splitting the data cache access into the MR and MW stages meant adding a forwarding path from MR to ID, as well as a path from MR to EX. We added a forwarding path from the FW stage to ID to handle the case where there are 5 instructions between the write and the read: since our register read is in the 2nd stage and our write back is in the 7th stage, such a read falls in the same cycle as the write back and may not see the new value. Rather than lengthen the PR or WB stage, we added the FW stage to forward the value back to the ID stage (a sketch of the kind of forwarding logic involved follows).
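Each forwarding path amounts to comparing the destination register of an older in-flight instruction against the register being read, with the youngest producer winning. A minimal sketch for one operand consumed in ID, using the MR, WB, and FW sources named above; the signal names are illustrative, not our actual datapath signals.

// Hypothetical forwarding mux for one ID-stage operand.
// Priority order: youngest producer first (MR, then WB, then FW).
// Register $0 never forwards, since it always reads as zero.
module fwd_mux (
    input  [4:0]  rs,                           // register being read
    input         mr_regwr, wb_regwr, fw_regwr, // stage writes a register?
    input  [4:0]  mr_rw, wb_rw, fw_rw,          // destination registers
    input  [31:0] mr_val, wb_val, fw_val,       // stage result values
    input  [31:0] rf_val,                       // value read in PR stage
    output [31:0] operand
);
    assign operand =
        (mr_regwr && mr_rw != 5'd0 && mr_rw == rs) ? mr_val :
        (wb_regwr && wb_rw != 5'd0 && wb_rw == rs) ? wb_val :
        (fw_regwr && fw_rw != 5'd0 && fw_rw == rs) ? fw_val :
        rf_val;
endmodule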
Since the register file is now in the PR stage, the ID stage contains only the controller and the forwarding muxes, so forwarding to the ID stage seemed most appropriate.

2.5 Branch Instructions
Instead of resolving branches in the ID stage to get exactly one branch delay slot, we moved the branch comparators into the EX stage. This cut the levels of logic in the ID stage, but it introduced 2 extra delay slots on top of the one specified by the ISA. This problem led us to add a branch predictor, whose implementation we discuss below.

2.6 Branch Predictor
In this project we experimented with 5 different branch prediction schemes:
1. GAs
2. Gshare
3. Gshare/Bimodal
4. Always true
5. Always false
We give the performance of each predictor on some of our test benches later, but first we describe how we implemented each. Since always-true and always-false predictors are trivial to implement, we only show their statistics. We used the simple 4-state FSM in Figure 2 to predict whether or not to take a branch; the transition edges are labeled with the actual resolved branch outcome.
Figure 2: Branch Prediction FSM
Figure 3: GAs

2.6.1 GAs: In this prediction scheme, we use a global branch history register (GBHR) to pick from a set of local predictors (see Figure 3). We experimented with several combinations of global-history and local-table sizes and found that increasing the GBHR size does not always translate into better accuracy, because with a longer history the predictor takes more time to learn the global pattern for each branch.
Figure 4: Gshare

2.6.2 Gshare: To reduce the aliasing problem of the GAs predictor, the gshare predictor XORs the upper bits of the indexing PC with the GBHR to produce the branch history table (BHT) index. This scheme performed slightly better than GAs for us; a 7-bit GBHR with a 512-entry BHT performed best in our tests.
Figure 5: Combined Bimodal/Gshare

2.6.3 Combined Bimodal/Gshare: This predictor is made up of two separate predictors, bimodal and gshare, plus an extra table that records which of the two has been predicting better at a particular branch. Based on that information, it returns the prediction of the better-performing predictor. This special table (labeled Counts in Figure 5) holds an FSM similar to that of Figure 2. The combined predictor can be made just as fast as the bimodal or gshare predictors alone, by reading the 3 tables simultaneously and selecting the desired output combinationally based on Counts. To our surprise, this scheme did not outperform the gshare predictor. We conclude that the outcome depends on the application and may not hold in general.

2.7 Write-Back Cache
We already had this working in lab 5. Basically, we added a dirty bit to each cache set to determine whether a line about to be overwritten needs to be written back to memory.

2.8 Victim Cache & Write Buffer
The victim cache serves as a small backup store that reduces the penalty of conflict misses. It doubles as a write buffer, so each victim cache line carries a dirty bit in addition to the valid bit. Our victim cache contains 4 lines of 8 words each. We implemented the victim cache, its tag file, and the LRU file associated with it using registers (a sketch of the lookup follows).
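Because the victim cache is fully associative, a lookup compares the request against all 4 tags at once. A minimal sketch of that match logic, assuming the 27-bit tag+index and 4 valid bits described above; the module and port names are illustrative.

// Hypothetical fully associative lookup for a 4-entry victim cache.
// Each entry holds a valid bit and a 27-bit tag+index; the remaining
// 5 address bits select a byte within the 8-word (32-byte) line.
module vc_lookup (
    input  [26:0] addr_tag,              // tag+index of the request
    input  [3:0]  valid,
    input  [26:0] tag0, tag1, tag2, tag3,
    output [3:0]  hit_onehot,            // which entry hit, if any
    output        vc_hit
);
    assign hit_onehot[0] = valid[0] && (tag0 == addr_tag);
    assign hit_onehot[1] = valid[1] && (tag1 == addr_tag);
    assign hit_onehot[2] = valid[2] && (tag2 == addr_tag);
    assign hit_onehot[3] = valid[3] && (tag3 == addr_tag);
    assign vc_hit = |hit_onehot;
endmodule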
The victim cache costs us 4 entries x (8 words + a 27-bit tag), plus the LRU file. Since the victim cache is fully associative, we need an LRU file to keep track of which line to replace; the replacement policy is LRU. The LRU module is composed of 4 registers connected in series. On a read or write of a particular entry, the write-enable signals of the accessed LRU block and of all blocks below it are asserted. For example, if the top LRU block in Figure 6 is used, write enables 0 through 3 are all asserted; if the second block is updated, write enables 1 through 3 are asserted.
Figure 6: 4-Entry LRU File for the Victim Cache

On a victim cache hit, we have to swap the data in the victim cache with the data being kicked out of the data cache. We check for a victim cache hit in our first data cache access stage (DT, i.e., MR) and then swap the data in the second data access stage (DF, i.e., MW). This way we do not have to stall for a memory access on a victim cache hit.

2.9 Jump-Target Predictor
Since most jr instructions in programs are returns through $ra, we decided to simply always use the address in $ra as the predicted target. To do this without stalling, we added an extra output port to the register file that outputs the value of $ra. The accuracy of this predictor is shown in the Results section.

2.10 (Un)Signed Multiply/Divide
We built on our multiply/divide unit from lab 4. Instead of re-implementing our multiplication algorithm, which computes the x0, x1, x2, and x3 multiples in 1 cycle, we simply added the 2-bit Booth algorithm to handle signed multiplication. Adding signed divide required much less work: we only needed to initialize the divisor and dividend, reuse the existing divider, and fix the signs of the quotient and remainder at the very end. The MIPS convention for signed division is described in the COD text.
In our original design, in which multu took 1 cycle to compute x0, x1, x2, and x3, the coprocessor became part of the critical path whenever one of its operands used a forwarded value. To fix this, we added registers to hold the multiplicand and multiplier, which effectively ends the critical path at the input of the unit. The tradeoff is that the entire mult/div coprocessor takes an extra cycle to complete, because it does not start until the EX stage.

3. Testing
3.1 General testing
Most of the modules used in this project are inherited from lab 5; the only new modules are the victim cache and the branch predictor. The other big change to our processor is the deeper pipeline, which we expected to have many more data hazards than our 5-stage pipeline. Our methodology was to build a 7-stage pipeline first, test it thoroughly with all the test programs from labs 4 and 5, and only split the pipeline further for a better clock speed once it was fully bug-free. This sped up the testing of the later splits, because most of the changes to the 5-stage design (cache control, forwarding, hazard detection) were already done in the 7-stage version. A sketch of the self-checking testbench style we used throughout appears below.
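The pattern is the one described in Section 3.4: drive random operands, compare against a behavioral reference, and keep an error count. The device-under-test hookup below is a stand-in so the sketch runs on its own; our real testbenches instantiate the mult/div coprocessor's actual interface.

// A minimal self-checking random testbench sketch (illustrative).
module test_multu;
    reg         clk;
    reg  [31:0] a, b;
    reg  [63:0] expected;
    wire [63:0] result;
    integer     i, errors;

    initial clk = 0;
    always #5 clk = ~clk;

    // Stand-in for the unit under test; the real DUT would be
    // instantiated here and driven through its own ports.
    assign result = a * b;

    initial begin
        errors = 0;
        for (i = 0; i < 2048; i = i + 1) begin
            a = $random;
            b = $random;
            expected = a * b;   // behavioral reference for multu
            @(posedge clk);
            #1;
            if (result !== expected) begin
                errors = errors + 1;
                $display("mismatch: %h * %h -> %h (expected %h)",
                         a, b, result, expected);
            end
        end
        $display("%0d errors in 2048 trials", errors);
        $finish;
    end
endmodule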
3.2 Victim Cache test
This is a very stressful load/store test. It contains sequences of lw/sw that force the victim cache to continuously swap data with the data cache, and it then force-feeds the victim cache with sw instructions to fill it with dirty lines, so that the memory arbiter has to write to the SDRAM to free up the victim cache. With this test we discovered a problem with the dual-port BlockRAM generated by CoreGen: if we drive the read address and the write address at the same time, the BlockRAM's data output goes to don't-care (xxxx) afterwards. We had to build our own LRU file to avoid this, because a parallel read and write is needed when we read and write the same cache line in the same clock cycle (a write followed by a read in the pipeline causes exactly that).

3.3 Branch Predictor
We simply used the quick sort program given to us after the lab 5 checkoff. Quick sort has a much more irregular branch pattern compared to our other tests, in which most branches are taken.

3.4 (Un)Signed Multiply/Divide
Because we implemented the 2-cycle fast multu, we needed to test those cases thoroughly. To make sure each of the 4 instructions works correctly, we wrote 4 separate testbenches that feed random values to the coprocessor 2048 times. The testbenches have a debugging option that shows the result of each operation, as well as an error count that reports the number of errors over the whole run.

4. Results
Our processor was the only one able to run the original race program, OGR (the Golomb ruler program), correctly at 27 MHz with correct results. Although we lost to Group One in the Corner race (2:34 to 2:51 in real time), we won in terms of CPI:
Group One: CPI_1 x instruction count / 36 MHz = 154 s
Our group: CPI_2 x instruction count / 27 MHz = 171 s
Therefore CPI_1 / CPI_2 = (154 x 36) / (171 x 27) ≈ 1.2.
Our critical path runs from the data tag file to the data cache control, back to the instruction cache control, and then back to the instruction cache. This is the special level-0 boot logic we built in lab 5 to avoid compulsory instruction misses during program execution.

4.1 Write-Back & Victim Cache
The performance gain from the write-back and victim caches was already clear in lab 5, where we had implemented our write-back cache and a semi-victim cache (semi-victim because in lab 5 we only kicked dirty cache lines out to the victim cache, i.e., the write buffer, and the write buffer brought a line back into the data cache on a write buffer hit). Our lab 5 cycle count was much lower than that of the two groups without write-back and victim caches; the speedup was a factor of roughly 2 to 3 (another group's cycle count for quick_sort was 0x4d55 vs. our 0x1d40).

4.2 Branch Predictor Accuracy
Test     Always true   Gshare   Gshare/Bimodal
cycle    75.7%         81.1%    81.1%
base     98.6%         98.6%    98.6%
hammer   99.0%         99.0%    99.0%
q_sort   51.4%         65.1%    64.9%
extra    97.6%         97.0%    96.8%

An accurate branch predictor is very important to our processor. Without one, the pipeline would stall for 2 cycles in addition to the delay slot on every branch. With prediction, a correct prediction costs no stalls, while a misprediction costs 2 cycles of stalls, excluding the delay slot. Take quick sort as an example: with an always-true predictor the cycle count is 0x1b33, whereas our gshare predictor required only 0x1a34. We chose the gshare predictor over the combined gshare/bimodal one because gshare was more accurate in the cases above (a sketch of the gshare lookup follows).
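A minimal sketch of a gshare predictor with the sizes quoted above: a 7-bit GBHR XORed into a 9-bit PC index for a 512-entry table of 2-bit saturating counters (the FSM of Figure 2). How the shorter history aligns with the PC bits, and the port names, are illustrative choices, not our exact implementation.

// Hypothetical gshare predictor sketch.
module gshare (
    input        clk,
    input  [8:0] pc_idx,        // e.g., PC[10:2]
    output       predict_taken,
    // update interface, driven when the branch resolves in EX
    input        update,
    input  [8:0] upd_idx,       // index the prediction was made with
    input        taken
);
    reg [6:0] gbhr;
    reg [1:0] bht [0:511];      // 00,01 = not taken; 10,11 = taken

    wire [8:0] idx = pc_idx ^ {2'b00, gbhr};
    assign predict_taken = bht[idx][1];

    integer j;
    initial begin
        gbhr = 0;
        for (j = 0; j < 512; j = j + 1) bht[j] = 2'b01; // weakly NT
    end

    always @(posedge clk) begin
        if (update) begin
            gbhr <= {gbhr[5:0], taken};
            // 2-bit saturating counter update (the Figure 2 FSM)
            if (taken && bht[upd_idx] != 2'b11)
                bht[upd_idx] <= bht[upd_idx] + 2'b01;
            else if (!taken && bht[upd_idx] != 2'b00)
                bht[upd_idx] <= bht[upd_idx] - 2'b01;
        end
    end
endmodule

The predicted index is carried down the pipeline with the branch so that the same entry can be updated when the branch resolves in EX.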
The two predictors also produce similar results overall, but the combined predictor uses significantly more resources than gshare alone, which settled the choice. The high-level implementation illustrations are taken from: Scott McFarling, "Combining Branch Predictors," WRL Technical Note TN-36, June 1993.

4.3 Deep Pipelining
A well-made deep pipeline shortens the critical path so that the time spent in each stage is nearly the same. The CPI increases, but the clock rate increases by a greater factor, giving a net speedup. With the exception of the FW stage, we were able to spread the time evenly across the stages. On quicksort we gained a speedup of about 2.31:
Lab 5: 5328 cycles / 9 MHz = 592 us
Final: 6911 cycles / 27 MHz = 256 us
Speedup = 592 / 256 ≈ 2.31

5. Conclusion
We underestimated the delay of fetching data from one block RAM and storing it into another. In our original design, we did the tag check in the first of the 2 memory access stages and used the hit/miss signal to select the requested data. This approach gave us a RAM-to-RAM critical path of 36 ns (27.7 MHz), far from our desired clock frequency of 35 MHz. With more time, we could have removed the present critical path by saving instructions only to the data cache and flushing them out to DRAM before exiting level-0 boot; we estimate this could bring our clock speed above 35 MHz. We anticipate the next critical path would then be in the execute stage, which is shorter than 27 ns. The lesson we learned from this project is to start with a simple design; implementing more features does not necessarily translate into higher performance. Personally, we all learned to work together as a group and had great fun.

6. Appendix I (Notebooks)
Here are the links to each group member's notebook, along with the total number of hours each person spent on the project:
Otto Chiu, 140 hours
Charles Choi, 140 hours
Teddy Lee, 147 hours
Man-Kit Leung, 143 hours
Bruce Wang, 150 hours
We spent a total of 720 hours as a group.

7. Appendix II (Schematics)
We used Visio as our schematic editor. Here are the links to the files as well as their screenshots.
Victim Cache and Memory System (for crossing the clock boundary): cache_design.vsd
[Schematic screenshot: victim cache entries, LRU write-enable chain, memory arbiter, and SDRAM interface; signal-level detail omitted.]
8-Stage Deep Pipeline Datapath: datapath.vsd
[Schematic screenshot: full 8-stage datapath with hazard detection, forwarding unit, branch predictor, jump-target logic, and instruction/data cache controllers; signal-level detail omitted.]
Data Cache (also crossing the clock boundary): dc_detail.vsd
[Schematic screenshot: data cache with tag files, LRU & dirty file, and write buffer interface; signal-level detail omitted.]
8. Appendix III (Verilog Files)
alu.v ascii_romv1.1.v bht.v bin2HexLED.v bootrom.v bp.v BP_Resolver.v bts32.v cache.v cache_ctrl.v controller.v counter.v datapath_v.v data_mem.v data_tagfile.v debouncer.v def.v dinselect.v D_cache.v VC_Control.v edge_detector.v extend.v forward.v fpga_top2.v hdetect.v HiLoReg.v inst_mem.v inst_tagfile.v I_cache.v I_cache_ctrl.v j.v lab4group02blackbox.v lru_file.v memio.v memory_arbiter.v memory_control.v monitor.v mt48lc16m16a2.v multAdder.v multControl.v multDatapath.v mx2.v mx3.v mx4.v mx5.v ram128_32bit.v ram128_32bit_dual.v reg32.v regfile.v shifter.v slt.v VC_LRU.v tftp_mem.v top_level.v util.v VC.v VC_Block.v

9. Appendix IV (Test Files)
testCache.v testDiv.v testDivu.v testFM.v testMult.v testMultu.v base_v1.0.s beq.s cycle_v1.1.s extra_v1.0.s hammer_v1.0.s quick_sort_v1.0.s test2.s (tests the instr. cache) test3.s (tests cache LRU) testSigned.s testUnsigned.s unfair_testbench.s d_cache_tb.v test_BP.v

10. Appendix V (Timing)
================================================================================
Timing constraint: Default period analysis for net "processor_clock"
7782172 items analyzed, 0 timing errors detected. (0 setup errors, 0 hold errors)
Minimum period is 36.799ns.
--------------------------------------------------------------------------------
Delay:               36.799ns (data path - clock path skew + uncertainty)
Source:              DP/S1_dff/Q[7] (FF)
Destination:         CACHE_UNIT/dc/dtf/B5.A (RAM)
Data Path Delay:     36.762ns (Levels of Logic = 12)
Clock Path Skew:     -0.037ns
Source Clock:        processor_clock rising
Destination Clock:   processor_clock rising
Clock Uncertainty:   0.000ns

Data Path: DP/S1_dff/Q[7] to CACHE_UNIT/dc/dtf/B5.A
Delay type        Delay(ns)  Logical Resource(s)
----------------------------  -------------------
Tcko              0.992      DP/S1_dff/Q[7]
net (fanout=50)   2.878      S1[7]
Topcyg            1.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_99
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_91
net (fanout=1)    0.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_91/O
Tbyp              0.149      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_82
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_109
net (fanout=1)    0.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_109/O
Tbyp              0.149      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_100
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_55
net (fanout=1)    0.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_55/O
Tbyp              0.149      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_46
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_73
net (fanout=1)    0.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_73/O
Tbyp              0.149      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_64
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_19
net (fanout=1)    0.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_19/O
Tbyp              0.149      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_10
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_37
net (fanout=1)    0.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_37/O
Tbyp              0.149      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_28
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_118
net (fanout=1)    3.877      CACHE_UNIT/dc/M_VCC/I_118_n_1
Tilo              0.468      CACHE_UNIT/dc/M_VCC/VC_Hit1
net (fanout=20)   2.357      CACHE_UNIT/dc/VC_Data_Sel[1]
Tilo              0.468      CACHE_UNIT/dc/M_VCC/VC_Hit
net (fanout=26)   5.002      CACHE_UNIT.VC_Hit
Tilo              0.468      CACHE_UNIT/dcc/request_0_i_0_or6
net (fanout=5)    1.153      CACHE_UNIT/dcc/N_428
Tilo              0.468      CACHE_UNIT/dcc/G_286
net (fanout=11)   4.903      CACHE_UNIT/N_497
Tilo              0.468      CACHE_UNIT/dc/un1_reg_en
net (fanout=2)    8.948      CACHE_UNIT/dc/un1_reg_en_n
Tbeck             2.418      CACHE_UNIT/dc/dtf/B5.A
----------------------------  -------------------
Total             36.762ns   (7.644ns logic, 29.118ns route)
                             (20.8% logic, 79.2% route)
--------------------------------------------------------------------------------
Delay:               36.786ns (data path - clock path skew + uncertainty)
Source:              DP/S1_dff/Q[7] (FF)
Destination:         CACHE_UNIT/dc/dtf/B9.A (RAM)
Data Path Delay:     36.745ns (Levels of Logic = 12)
Clock Path Skew:     -0.041ns
Source Clock:        processor_clock rising
Destination Clock:   processor_clock rising
Clock Uncertainty:   0.000ns

Data Path: DP/S1_dff/Q[7] to CACHE_UNIT/dc/dtf/B9.A
Delay type        Delay(ns)  Logical Resource(s)
----------------------------  -------------------
Tcko              0.992      DP/S1_dff/Q[7]
net (fanout=50)   2.878      S1[7]
Topcyg            1.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_99
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_91
net (fanout=1)    0.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_91/O
Tbyp              0.149      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_82
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_109
net (fanout=1)    0.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_109/O
Tbyp              0.149      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_100
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_55
net (fanout=1)    0.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_55/O
Tbyp              0.149      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_46
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_73
net (fanout=1)    0.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_73/O
Tbyp              0.149      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_64
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_19
net (fanout=1)    0.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_19/O
Tbyp              0.149      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_10
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_37
net (fanout=1)    0.000      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_37/O
Tbyp              0.149      CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_28
                             CACHE_UNIT/dc/M_VCC/un2_VC_Hit1_0.I_118
net (fanout=1)    3.877      CACHE_UNIT/dc/M_VCC/I_118_n_1
Tilo              0.468      CACHE_UNIT/dc/M_VCC/VC_Hit1
net (fanout=20)   2.357      CACHE_UNIT/dc/VC_Data_Sel[1]
Tilo              0.468      CACHE_UNIT/dc/M_VCC/VC_Hit
net (fanout=26)   5.002      CACHE_UNIT.VC_Hit
Tilo              0.468      CACHE_UNIT/dcc/request_0_i_0_or6
net (fanout=5)    1.153      CACHE_UNIT/dcc/N_428
Tilo              0.468      CACHE_UNIT/dcc/G_286
net (fanout=11)   4.903      CACHE_UNIT/N_497
Tilo              0.468      CACHE_UNIT/dc/un1_reg_en
net (fanout=2)    8.931      CACHE_UNIT/dc/un1_reg_en_n
Tbeck             2.418      CACHE_UNIT/dc/dtf/B9.A
----------------------------  -------------------
Total             36.745ns   (7.644ns logic, 29.101ns route)
                             (20.8% logic, 79.2% route)
--------------------------------------------------------------------------------
Delay:               36.707ns (data path - clock path skew + uncertainty)
Source:              CACHE_UNIT/dc/dtf/B5.B (RAM)
Destination:         CACHE_UNIT/dc/dtf/B9.A (RAM)
Data Path Delay:     36.694ns (Levels of Logic = 8)
Clock Path Skew:     -0.013ns
Source Clock:        processor_clock rising
Destination Clock:   processor_clock rising
Clock Uncertainty:   0.000ns

Data Path: CACHE_UNIT/dc/dtf/B5.B to CACHE_UNIT/dc/dtf/B9.A
Delay type        Delay(ns)  Logical Resource(s)
----------------------------  -------------------
Tbcko             3.414      CACHE_UNIT/dc/dtf/B5.B
net (fanout=2)    4.973      CACHE_UNIT/dc/tf_dout1[2]
Topcyg            1.000      CACHE_UNIT/dc/un3_DT1Hit_0.I_90
                             CACHE_UNIT/dc/un3_DT1Hit_0.I_82
net (fanout=1)    0.000      CACHE_UNIT/dc/un3_DT1Hit_0.I_82/O
Tbyp              0.149      CACHE_UNIT/dc/un3_DT1Hit_0.I_46
                             CACHE_UNIT/dc/un3_DT1Hit_0.I_73
net (fanout=1)    0.000      CACHE_UNIT/dc/un3_DT1Hit_0.I_73/O
Tbyp              0.149      CACHE_UNIT/dc/un3_DT1Hit_0.I_64
                             CACHE_UNIT/dc/un3_DT1Hit_0.I_55
net (fanout=1)    0.000      CACHE_UNIT/dc/un3_DT1Hit_0.I_55/O
Tbyp              0.149      CACHE_UNIT/dc/un3_DT1Hit_0.I_10
                             CACHE_UNIT/dc/un3_DT1Hit_0.I_37
net (fanout=1)    0.000      CACHE_UNIT/dc/un3_DT1Hit_0.I_37/O
Tbyp              0.149      CACHE_UNIT/dc/un3_DT1Hit_0.I_28
                             CACHE_UNIT/dc/un3_DT1Hit_0.I_19
net (fanout=3)    4.934      CACHE_UNIT/I_19_n_1
Tilo              0.468      CACHE_UNIT/dcc/DC_miss_0_and2_i_0_or6
net (fanout=17)   0.419      CACHE_UNIT/dcc/N_409
Tilo              0.468      CACHE_UNIT/dcc/DC_miss_0_and2_i_0
net (fanout=44)   4.771      CACHE_UNIT.reg_en
Tilo              0.468      CACHE_UNIT/dc/tf_addra[2]
net (fanout=37)   14.142     CACHE_UNIT/dc/tf_addra[2]
Tback             1.041      CACHE_UNIT/dc/dtf/B9.A
----------------------------  -------------------
Total             36.694ns   (7.455ns logic, 29.239ns route)
                             (20.3% logic, 79.7% route)
--------------------------------------------------------------------------------
================================================================================