LECTURE 17: Dynamic Instruction-Level Parallelism -- Superscalar Processors --

ΗΜΥ 312 -- COMPUTER ARCHITECTURE, Winter Semester
Instructor: Maria K. Michael, Assistant Professor, ΗΜΜΥ (mmichael@ucy.ac.cy)
[Adapted from Computer Architecture, Computer Organization and Design, Patterson & Hennessy, 2005 and Superscalar Microprocessor Design, Johnson, 1992]
ΗΜΥ312 L17 Superscalar Processors, Dynamic ILP

Review: Baseline Superscalar MIPS Processor Model
[Figure: baseline superscalar MIPS pipeline. Fetch (PC, BHT, BTB, I$); Decode & Dispatch; Register Update Unit (RUU, managed as a queue between RUU_Head and RUU_Tail); Integer and FP RegFiles with NI and LI counters; RUU Issue & Execute to the functional units (FPALU, two IALUs, IMULT, LSQ); Load/Store Queue and D$; Result Bus; Writeback; Commit.]

Additional Fields in the Register File
- Each register in the general-purpose RegFile has two associated n-bit counters (typically n = 3):
  - NI (Number of Instances): the number of instances of the register as a destination register in the RUU
  - LI (Latest Instance): the number of the latest instance
- When an instruction with destination register address R_i is dispatched to the RUU, NI_i and LI_i are both incremented by 1
  - Dispatch is blocked if a destination register's NI is 2^n - 1, so at most 2^n - 1 instances of a register can be present in the RUU at any one time
- When an instruction commits (updates the R_i value), NI_i is decremented by 1
  - When NI_i = 0 the register is free (no instruction in the RUU is going to write to that register) and LI_i is cleared

Register Update Unit (RUU)
- A hardware data structure that resolves data dependencies by tracking each instruction's operands and execution requirements, and that commits completed instructions in program order
- An entry in the RUU has the following fields:
  speculative:      Yes/No, Spec Instr Addr
  src operand 1:    Ready, Tag, Content
  src operand 2:    Ready, Tag, Content
  destination:      Tag, Content (Tag = RegFile addr and LI)
  issued:           Yes/No
  functional unit:  Unit Number
  executed:         Yes/No
  PC:               Address
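The NI/LI bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration and not the hardware design: the class and method names are hypothetical, and letting LI wrap modulo 2^n is an assumption that follows from it being an n-bit counter.

```python
from dataclasses import dataclass
from typing import List, Tuple

N_BITS = 3                       # counter width n (the slides say n is usually 3)
MAX_INSTANCES = 2**N_BITS - 1    # dispatch blocks once NI reaches 2^n - 1


@dataclass
class RegCounters:
    """NI/LI counters attached to one general-purpose register."""
    ni: int = 0  # in-flight instances of this register as a destination
    li: int = 0  # number of the latest instance


class RegFileCounters:
    """Hypothetical model of the per-register NI/LI counters."""

    def __init__(self, num_regs: int = 32) -> None:
        self.regs: List[RegCounters] = [RegCounters() for _ in range(num_regs)]

    def can_dispatch(self, rd: int) -> bool:
        # Dispatch is blocked when NI has reached 2^n - 1.
        return self.regs[rd].ni < MAX_INSTANCES

    def dispatch(self, rd: int) -> Tuple[int, int]:
        """Allocate a new instance of rd and return its tag (RegFile addr, LI)."""
        r = self.regs[rd]
        if r.ni >= MAX_INSTANCES:
            raise RuntimeError("dispatch blocked: NI is at 2^n - 1")
        r.ni += 1
        r.li = (r.li + 1) % (2**N_BITS)  # assumption: n-bit counter wraps
        return (rd, r.li)

    def commit(self, rd: int) -> None:
        """An instruction writing rd commits: NI drops, LI clears when free."""
        r = self.regs[rd]
        if r.ni == 0:
            raise RuntimeError("commit without a pending instance")
        r.ni -= 1
        if r.ni == 0:
            r.li = 0  # register is free again
```

With n = 3, seven back-to-back writers of the same register each receive a distinct tag, and an eighth is held at dispatch until one of them commits.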

Instruction Flow Overview
- Fetch (in program order): fetch multiple instructions in parallel from the I$
- Decode & Dispatch (in program order): in parallel, decode the instrs just fetched and schedule them for execution by dispatching them to the RUU
  - Loads and stores are dispatched as two (micro)instrs: one to compute the effective addr and one to do the memory operation
- Issue & Execute (out of program order): as soon as the RUU has an instruction's source operands and the FU is available, the instruction is issued to the FU for execution
- Writeback (out of program order): on completion, the FU puts the result on the Result Bus, allowing the RUU and the LSQ to be updated; the instruction completes
- Commit (in program order): when appropriate, commit the instruction's result data to the machine state locations (i.e., update D$ and RegFile)

Organizing the RUU as a Queue
- If the RUU is managed as a queue and completed instructions are committed from RUU_Head, instructions are committed (aka retired) in the order in which they were dispatched by the Decode & Dispatch logic (in program order)
  - Stores to state locations (RegFile and D$) are buffered (in the RUU and LSQ) until commit time
  - Supports precise interrupts (the only state locations updated are those written by instructions before the interrupting instr)
- The LI counter allows multiple instances of a specific destination register to exist in the RUU at the same time, via register renaming
  - Solves write-before-write hazards if results from the RUU are returned to the RegFile in program order
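The queue discipline above (dispatch at the tail, out-of-order writeback, in-order commit from RUU_Head) can be illustrated with a small Python sketch. The names `RUUQueue`, `Entry`, and `seq` are hypothetical, and the default commit width of 4 is an assumption made only for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional


@dataclass
class Entry:
    seq: int                      # dispatch order (program order)
    dest: Optional[int]           # destination register, or None
    value: Optional[int] = None   # buffered result, written at writeback
    executed: bool = False


class RUUQueue:
    """Toy RUU managed as a queue: left end is RUU_Head, right end is RUU_Tail."""

    def __init__(self, size: int = 16) -> None:
        self.size = size
        self.q: Deque[Entry] = deque()
        self.regfile: Dict[int, int] = {}   # architectural state

    def dispatch(self, entry: Entry) -> bool:
        if len(self.q) >= self.size:
            return False                    # RUU full: dispatch stalls
        self.q.append(entry)
        return True

    def writeback(self, seq: int, value: int) -> None:
        # Results may arrive in any order; they are only buffered here.
        for e in self.q:
            if e.seq == seq:
                e.value, e.executed = value, True
                return

    def commit(self, width: int = 4) -> List[int]:
        """Retire executed entries from the head, in program order."""
        retired: List[int] = []
        while self.q and self.q[0].executed and len(retired) < width:
            e = self.q.popleft()
            if e.dest is not None:
                self.regfile[e.dest] = e.value   # state changes only here
            retired.append(e.seq)
        return retired
```

Because the RegFile is written only at commit and only from the head, two in-flight writers of the same register always update it in program order, whichever one finished executing first.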

Basic Tasks of the RUU
- Every task is performed in parallel with the others, every cycle
  1. Accepts new instructions from the Decode & Dispatch logic
  2. Monitors the Result Bus to resolve true dependencies and to write result data back into the RUU
  3. Determines which instructions are ready to execute, reserves the Result Bus, and issues the instruction to the appropriate functional unit
  4. Determines whether an instruction is ready to commit (i.e., change the machine state) and commits it accordingly

SS Pipeline Organization
  FETCH (in order): fetch multiple instructions
  DECODE & DISPATCH (in order): decode instructions; detect dependences in operands and destination registers
  ISSUE & EXECUTE (out of order): issue an instruction IFF a functional unit is available, its source operands are available, and the resulting operand output register has been reserved
  WRITE BACK RESULT (out of order): copy Result Bus data to matching buffer locations (src1, src2) and forward data to any instruction waiting on the result
  COMMIT (in order): write dst Contents to RegFile if no conflicts
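The issue condition above (operands ready, functional unit available, Result Bus reserved) can be sketched as a selection loop. This is an illustrative sketch only: the oldest-first scan order, the `bus_slots` parameter, and modelling one unit per unit name are assumptions that are not stated on the slides.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class RUUEntry:
    name: str            # label used only for this example
    unit: str            # functional unit the instruction needs
    src1_ready: bool
    src2_ready: bool
    issued: bool = False


def select_for_issue(ruu: List[RUUEntry], free_units: Set[str],
                     bus_slots: int) -> List[str]:
    """Pick entries to issue this cycle, scanning from RUU_Head to RUU_Tail.

    An entry issues only if both source operands are ready, its functional
    unit is free, and a Result Bus slot can be reserved for its result.
    """
    available = set(free_units)   # copy so the caller's set is not modified
    issued: List[str] = []
    for e in ruu:                 # list order = program order (oldest first)
        if bus_slots == 0:
            break                 # no Result Bus slot left to reserve
        if e.issued or not (e.src1_ready and e.src2_ready):
            continue
        if e.unit in available:
            available.remove(e.unit)
            bus_slots -= 1        # reserve the Result Bus for this result
            e.issued = True
            issued.append(e.name)
    return issued
```

With a single Result Bus slot, only the oldest ready instruction issues in a given cycle, even if several functional units are idle.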

Note: Content Addressable Memories (CAMs)
- Memories that are addressed by their content. Typical applications include RUU source tag field comparison logic, cache tags, and translation lookaside buffers
  - Memory hardware that compares the Search Data to the Match Field entries for each word in the CAM in parallel
  - On a match, the Data Field for that entry is output to Match Data on a read, or Match Data is written into the Data Field on a write, and the Hit bit is set. If no match occurs, the Hit bit is reset
  - CAMs can be designed to accommodate multiple hits
[Figure: CAM block diagram with Search Data, Match Field, Hit, Data Field, and Match Data]

MicroOperations of Load and Store
- Recall that loads and stores are dispatched to the RUU as two microinstructions: one computes the effective address and one performs the memory operation
  Load:   lw R1,2(R2)  becomes  addi R,R2,2
                                lw R1,R
  Store:  sw R1,6(R2)  becomes  addi R,R2,6
                                sw R1,R
- At the same time, an entry in the LSQ is reserved
  - Each LSQ entry consists of a Tag field (RegFile addr and LI) and a Content field. The LI counter allows multiple instances of stores (writes) to the same memory address
  - When a load completes (the D$ returns the data on the Result Bus) or a store commits (in program order), the corresponding LSQ entry is freed
  - Instruction dispatch is blocked unless there is 1 free entry in the LSQ and 2 free entries in the RUU
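The split of a load or store into two microinstructions, together with the dispatch check on free LSQ and RUU entries, can be sketched as follows. The function names and the textual instruction format are assumptions made for the example; the temporary register name `R` simply mirrors the slide.

```python
import re
from typing import List

# Matches "lw rt,offset(base)" or "sw rt,offset(base)" as written on the slide.
MEM_RE = re.compile(r"^(lw|sw)\s+(\w+)\s*,\s*(-?\d+)\((\w+)\)$")


def crack(instr: str, temp: str = "R") -> List[str]:
    """Split a load/store into an address micro-op and a memory micro-op.

    Any other instruction is returned unchanged as a single micro-op.
    """
    text = instr.strip()
    m = MEM_RE.match(text)
    if m is None:
        return [text]
    op, rt, offset, base = m.groups()
    return [
        f"addi {temp},{base},{offset}",  # compute the effective address
        f"{op} {rt},{temp}",             # perform the memory operation
    ]


def can_dispatch_mem(free_lsq: int, free_ruu: int) -> bool:
    """A load/store needs 1 free LSQ entry and 2 free RUU entries."""
    return free_lsq >= 1 and free_ruu >= 2
```

For example, `crack("lw R1,2(R2)")` yields the two micro-ops shown on the slide, while a register-to-register instruction passes through as one micro-op.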

Load & Store RUU & LSQ Operation
  Load:   addi R,R2,2
          lw R1,R
  Store:  addi R,R2,6
          sw R1,R
[Figure: snapshot of the NI and LI counters, the LSQ (LSQ_Head, LSQ_Tail, with register number and LI tags), the RUU (RUU_Head, RUU_Tail, with register number and LI tags for each operand), the D$, and the Result Bus while the load and store above are in flight]

NOTE: Tomasulo's algorithm
- The RUU implementation is a variant of Tomasulo's algorithm, which is described in your textbook (Section 3.2)

Instruction Fetch Sequencing
- Instruction run: the number of instructions (run length) fetched between taken branches
  - The instruction fetcher operates most efficiently when processing long runs; unfortunately, runs are usually quite short
- The average run length is about six instructions
[Figure: timeline (cycles) of a 4-way instr fetcher fetching runs S1 to S5 and T1 to T4, with branch delay between runs; 9 instructions in 8 cycles]
- The resulting instruction bandwidth is only 1.125 instructions per cycle

Fetch Inefficiencies
- The fetcher cannot supply the instruction decoder with enough instruction bandwidth to fully exploit ILP, because:
  - The decoder waits for the branch outcome
    - This can (usually) be improved with dynamic branch prediction
  - Fetching instructions from different positions in memory keeps the decoder from working at full capacity, even when the decoder is processing valid instructions
    - The fetcher can align fetched instructions to avoid wasted decoder slots
    - If supported by dynamic branch prediction, the fetcher can also merge instructions from different runs
  - Aligning instructions with the program flow is possible only if the fetcher has enough bandwidth (i.e., the fetch rate is faster than the decode rate)
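The effect of short, misaligned runs on fetch bandwidth can be approximated with a small model. This is a rough sketch under stated assumptions: the fetcher reads one aligned block of `width` instructions per cycle, and each taken branch costs a fixed `branch_delay`. The example runs are hypothetical, chosen so that the totals match the 9 instructions in 8 cycles quoted above; they are not read off the figure.

```python
from math import ceil
from typing import List, Tuple


def run_fetch_cycles(start_offset: int, run_length: int, width: int = 4) -> int:
    """Cycles to fetch one run with an aligned, width-wide fetcher.

    start_offset is the position of the run's first instruction inside its
    aligned fetch block (0 .. width-1); slots before it are wasted.
    """
    return ceil((start_offset + run_length) / width)


def fetch_throughput(runs: List[Tuple[int, int]], branch_delay: int,
                     width: int = 4) -> float:
    """Instructions per cycle over a sequence of (start_offset, length) runs."""
    instructions = sum(length for _, length in runs)
    cycles = sum(run_fetch_cycles(off, length, width) + branch_delay
                 for off, length in runs)
    return instructions / cycles


# Hypothetical runs: 2 + 3 + 4 = 9 instructions, 8 cycles in total.
example_runs = [(3, 2), (1, 3), (2, 4)]
ipc = fetch_throughput(example_runs, branch_delay=1)
```

In this model the 4-way fetcher delivers 9/8 = 1.125 instructions per cycle, a little over a quarter of its peak of 4, which is why alignment, merging, and branch prediction matter.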

Speedups of Fetch Alternatives
[Figure: speedup of the 2-base, 4-base, 2-pred, 4-pred, 2-max, and 4-max fetch alternatives, with Low, HM, and High bars. From Johnson, 1992]
  Base: no prediction and no alignment
  Pred: dynamic branch prediction
- A 4-way instr fetcher outperforms a 2-way instr fetcher
  - Twice the potential instruction fetch bandwidth
  - Twice the decoder hardware needed to keep pace (e.g., in decoders and in ports and buses)

4-Way Decoder Implementation
- A 4-way instruction fetcher delivers high bandwidth. But what is the cost?
  - 12 dependency checks between the 4 decoded instructions
[Figure: four decoded instructions, each with fields op, rs, rt, rd, and the comparisons between them]
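One way to arrive at the figure of 12 checks is to compare both source fields (rs, rt) of each instruction against the destination (rd) of every earlier instruction in the same decode group. That reading is an assumption about how the slide counts; under it, the number of comparators grows quadratically with decoder width, as the sketch below shows (function names are hypothetical).

```python
from typing import List, Tuple

Instr = Tuple[int, int, int]   # (rs, rt, rd) register numbers


def dependency_checks(width: int, sources_per_instr: int = 2) -> int:
    """Comparisons needed within one decode group of the given width."""
    earlier_pairs = width * (width - 1) // 2
    return sources_per_instr * earlier_pairs


def find_dependences(group: List[Instr]) -> Tuple[List[Tuple[int, int, int]], int]:
    """Return (dependences, comparisons) for one decode group.

    Each dependence is (producer_index, consumer_index, register).
    """
    deps: List[Tuple[int, int, int]] = []
    comparisons = 0
    for j, (rs, rt, _) in enumerate(group):
        for i in range(j):                 # every earlier instruction
            rd_earlier = group[i][2]
            for src in (rs, rt):
                comparisons += 1
                if src == rd_earlier:
                    deps.append((i, j, src))
    return deps, comparisons
```

Going from a 2-way to a 4-way decoder raises the count from 2 to 12 comparisons, so the hardware cost grows faster than the fetch bandwidth.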

Reducing 4-Way Decoder Hardware
- We can limit the number of RegFile ports because
  - Not all decoded instructions access two registers
  - Not all decoded instructions are valid (because of misalignment)
  - Some decoded instructions have dependences on one or more simultaneously decoded instructions
[Figure: fraction of total decoded parcels vs. number of RegFile ports used, for ccom and troff. From Johnson, 1992]
- The register demand implies that a design with 8 ports would be wasteful
- A design with 4 ports reduces performance by < 2%

SS Branch Prediction
- Recall from Computer Organization that branch prediction in a scalar pipeline requires
  - A mechanism to predict the branch outcome: the BHT (branch history table) in the fetch stage
  - A way to fetch two instructions: the instruction following the branch (I$) and the branch target instruction (BTB (branch target buffer))
  - A control mechanism that prevents the instructions following the branch from changing the machine state until the branch outcome has been computed
    - Let them complete (in-order commit) on a correct prediction
    - Flush and restart the pipeline on a misprediction
- In a SS machine, it is possible to have many such instructions, with several predicted branches active in the pipeline
  - Flag instructions following branches as speculative until the branch outcome is known
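A branch history table can be sketched as an array of small saturating counters indexed by the branch address. The slides do not say how the BHT is built, so the 2-bit counter scheme, the table size, the indexing by word address, and the initial counter value below are all assumptions chosen for illustration.

```python
class BranchHistoryTable:
    """Hypothetical BHT made of 2-bit saturating counters.

    Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
    """

    def __init__(self, entries: int = 1024) -> None:
        self.entries = entries
        self.counters = [1] * entries      # start at "weakly not taken"

    def _index(self, pc: int) -> int:
        # Drop the two byte-offset bits of a word-aligned instruction address.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(bht: BranchHistoryTable, pc: int, outcomes) -> int:
    """Predict, then train, on each outcome; return the number of misses."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses
```

For a loop branch that is taken nine times and then falls through, this predictor misses twice on the first pass through the loop (warm-up and exit) and once on each later pass (the exit only).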

Implementing Branches
- A SS processor could have more than one branch per fetch set and could have several uncompleted branches pending at any time
[Figure: branch handling across the pipeline: I$, Fetch (BHT/BTB), Decode (Predict), Dispatch, Branch (check predict)]
- Must access BHT/BTB for all branch instrs in the fetch set during fetch to reduce branch delay (i.e., need a 4 read-port BHT/BTB for a 4-instr fetcher)
  - Pass BHT information to the decode stage
  - After decode, choose between the I$ set and the BTB sets to determine the next fetch set

Decoding & Dispatching Branches
- While multiple branches could be dispatched per cycle, imposing a decoder limit of one branch per fetch set incurs only a slight performance decrease (about 2%), since typically only one branch per cycle can be executed (there is usually only one branch FU)
- Having minimum branch delay is more important than decoding multiple branches per cycle
[Figure: speedup of 4-sinpred vs. 4-mulpred, with Low, HM, and High bars. From Johnson, 1992]

Speculative Instructions
- Speculation: the processor (or compiler) guesses the outcome of an instruction (e.g., branches, loads) so as to enable execution of other instructions that depend on the speculated instruction
  - One of the most important methods for finding more ILP in SS and VLIW processors
- Producing correct results requires result checking, recovery, and restart hardware mechanisms
  - Checking mechanisms to see if the prediction was correct
  - Recovery mechanisms to cancel the effects of instructions that were issued under false assumptions (e.g., branch misprediction)
  - Restart mechanisms to reestablish the correct instruction sequence
    - For branches, the correct program counter restart value is known when the branch outcome is determined

Speculation Support
- For dependent speculative instrs, a speculative flag is set to Yes until the outcome of the driving instr (i.e., the branch) is determined
- RUU entry fields:
  speculative:      Yes/No, Spec Instr Addr
  src operand 1:    Ready, Tag, Content
  src operand 2:    Ready, Tag, Content
  destination:      Tag, Content
  issued:           Yes/No
  functional unit:  Unit Number
  executed:         Yes/No
  PC:               Address

Branch Execution
- If the branch was not mispredicted, then the branch and its trailing instructions can commit when at RUU_Head
- If the branch was mispredicted, then all subsequent instrs must be discarded (even though subsequent branches may have been correctly predicted)
- When there is an exception, all of the RUU entries are discarded in a single cycle and instruction stream fetching restarts on the next cycle. Thus, the RUU provides an easy way to discard instructions coming after a mispredicted branch

Result Buses Utilization
- Our SS model has only one Result Bus to carry results generated by the FUs to the RUU and LSQ
  - Even at the high levels of performance, the utilization of the Result Bus (i.e., the fraction of capacity actually used) is only about 7%
- If an FU's request for the Result Bus is not granted, instruction issue to that FU is stalled until the bus request can be granted (i.e., the FU remains busy)
[Figure: average number of results waiting for the Result Bus vs. number of Result Buses (1 to 4), for ccom and troff. From Johnson, 1992]
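The branch recovery rule described under Branch Execution (keep everything on a correct prediction, discard every younger entry on a misprediction) can be sketched on a list that stands in for the RUU. The `seq` numbering, the `spec_on` field, and the function name are hypothetical; restoring the NI/LI counters for the discarded entries is deliberately left out of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    seq: int                        # dispatch order (program order)
    spec_on: Optional[int] = None   # seq of the unresolved branch it follows


def resolve_branch(ruu: List[Entry], branch_seq: int,
                   mispredicted: bool) -> List[Entry]:
    """Return the RUU contents after the branch with seq branch_seq resolves.

    ruu is ordered from RUU_Head (oldest) to RUU_Tail (youngest).
    """
    idx = next(i for i, e in enumerate(ruu) if e.seq == branch_seq)
    if mispredicted:
        # Discard every entry younger than the branch, including entries
        # that follow later, correctly predicted branches.
        return ruu[: idx + 1]
    for e in ruu[idx + 1:]:
        if e.spec_on == branch_seq:
            e.spec_on = None        # no longer speculative on this branch
    return ruu
```

Because the RUU is kept in program order, recovery amounts to truncating the queue after the branch; no search for individual wrong-path instructions is needed.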

Arbitrating For Result Buses
- Reducing the impact of bus contention by adding a second Result Bus improves performance (by almost 19%)
- But adding a third Result Bus yields only a very small improvement in performance (less than 3%)
- And since there are usually fewer Result Buses than FUs, the FUs must continue to arbitrate for use of the existing Result Buses
  - The arbiter decides not only which FU is granted use of the Result Buses, but also which of the two buses is to be used
    - Prioritizing old requests over new helps prevent starvation
[Figure: speedup vs. number of Result Buses (1 to 4), with Low, HM, and High curves. From Johnson, 1992]

Result Forwarding
- Result forwarding supplies operands directly to the waiting instrs in the RUU to resolve true dependencies that could not be resolved during decode
  - The cost of forwarding is the comparison logic in the RUU to compare the Result Bus Tag to the source operand Tags
  - Need a set of comparators for each Result Bus
- About 2/3 of all results are forwarded to one waiting operand, and about 1/6 are forwarded to more than one
[Figure: fraction of results vs. number of RUU entries receiving the result as an input operand (1 to 6+), for ccom and troff. From Johnson, 1992]
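Result forwarding amounts to broadcasting a (Tag, value) pair on the Result Bus and letting every waiting source operand with a matching Tag capture the value. The sketch below models that comparison in software; in hardware all comparisons happen in parallel, CAM-style. The type and function names are hypothetical, and tags are modelled as (RegFile addr, LI) pairs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Tag = Tuple[int, int]   # (RegFile addr, LI)


@dataclass
class Operand:
    ready: bool
    tag: Optional[Tag] = None       # tag of the result being waited for
    content: Optional[int] = None   # operand value once available


@dataclass
class RUUEntry:
    src1: Operand
    src2: Operand


def forward(ruu: List[RUUEntry], tag: Tag, value: int) -> int:
    """Broadcast one result; return how many RUU entries received it."""
    receivers = 0
    for entry in ruu:
        hit = False
        for op in (entry.src1, entry.src2):
            if not op.ready and op.tag == tag:   # tag comparator
                op.content = value
                op.ready = True
                hit = True
        if hit:
            receivers += 1
    return receivers
```

The return value corresponds to the quantity on the x-axis of the forwarding figure: the number of RUU entries that take a given result as an input operand.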

Performance Advantages
- Hardware complexity arises from four major hardware features; each is shown with its performance advantage (from Johnson, 1992)
  - Out-of-order issue (OOI): 52%
  - Register renaming: 36%
  - Branch prediction: 3%
  - 4-way instruction fetch and decode: 18%

ILP in a Perfect OOI-OOC Processor
- The perfect processor has
  - An infinite number of rename registers, which eliminates all storage hazards (i.e., write-before-write and write-before-read)
  - No (fetch, decode, dispatch, issue, FU, buses, ports) limit on the number of instrs that can begin execution simultaneously, as long as read-before-write true data hazards are not present
  - Perfect branch and jump (including jump register) prediction
  - Loads can be moved before stores (as long as the addresses are not identical) with memory address analysis
  - All FUs have a 1 cycle latency
  - Perfect caches with 1 cycle latency
[Figure: IPC on the perfect processor. gcc 55, espresso 63, li 18, fpppp 75, doduc 119, tomcatv 150. From H&P, 2003]

Effect of Instruction Window Size on ILP
- Instruction window: the set of instructions that are examined simultaneously for execution
[Figure: IPC for gcc, espresso, li, fpppp, doduc, and tomcatv with window sizes Infinite, 2K, 512, 128, 32, 8, and 4. From H&P, 2003]

Effect of Realistic Branch Prediction on ILP
- On a processor with an instruction window size of 2K and maximum 64-way issue capability
[Figure: IPC for gcc, espresso, li, fpppp, doduc, and tomcatv with branch predictors Perfect, Tournament, Standard 2-bit, Static, and None. From H&P, 2003]

Effect of Finite Rename Registers
- On a processor with an instruction window size of 2K, maximum 64-way issue capability, and a tournament branch predictor with 8K entries
[Figure: IPC for gcc, espresso, li, fpppp, doduc, and tomcatv with Infinite, 256, 128, 64, 32, and no rename registers. From H&P, 2003]

A SS Example
- Intel Pentium 4 (IA-32 ISA)
  - Decodes the IA-32 instructions into microoperations
  - Does register renaming with a RUU-like structure
  - Has a 20 stage pipeline
[Figure: pipeline stages T$ access (Bpredict), µop queue, RUU allocation, FU queues, Instr dispatch, RegFile access, Execution, RUU queue, Commit; # cycles per stage group: 5, 4, 5, 2, 1, 3]
  - 7 FUs: 2 integer ALUs, 1 FP ALU, 1 FP move, load, store, complex
  - Up to 126 instructions in flight, including 48 loads and 24 stores
  - 4K entry branch predictor

HOME STUDY
- For the 3rd unit of the course (ILP):
  - Chapter 6 of Patterson & Hennessy, 3rd edition (the ΗΜΥ212 textbook): review of pipelining, and 6.9 for advanced pipelining, speculation, and static/dynamic multiple-issue
  - Chapters 2-3 of your textbook
  - Appendices A (pipelining) and G (VLIW) of your textbook
- For practice with ILP: the exercises at the end of Chapter 2 of your textbook (pp. 142-149)
- Leave out Tomasulo's algorithm (for those interested in going further, there is ΗΜΥ49!)