ΔΙΑΛΕΞΗ 17: Δυναμικός Παραλληλισμός Εντολών -- Superscalar Επεξεργαστές --

ΗΜΥ 312 -- ΑΡΧΙΤΕΚΤΟΝΙΚΗ ΗΛΕΚΤΡΟΝΙΚΩΝ ΥΠΟΛΟΓΙΣΤΩΝ ΔΙΑΛΕΞΗ 17: Δυναμικός Παραλληλισμός Εντολών -- Superscalar Επεξεργαστές -- Διδάσκων: Χάρης Θεοχαρίδης Αναπληρωτης Καθηγητής, ΗΜΜΥ (ttheocharides@ucy.ac.cy) [Προσαρμογή από Computer Architecture, Computer Organization and Design, Patterson & Hennessy, 2005 και Superscalar Microprocessor Design, Johnson, 1992 ]

Review: Baseline Superscalar MIPS Processor Model Fetch Decode & Dispatch Issue & Execute Writeback Commit P C BHT BTB I$ N I RUU_Head N L I I L FP I Integer RegFile RegFile 3 4 Load/Store Queue 1 2 FPALU IALU LSQ D$ RUU_Tail RUU 5 IALU 6 IMULT Register Update Unit (managed as a queue) Result Bus ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.2

Ροή Εντολών (Instruction Flow Overview) q Fetch (in program order): Παράλληλη μεταφορά πολλαπλών εντολών απο το I$ q Decode & Dispatch (in program order): Παράλληλη αποκωδικοποίηση των νέων εντολών και scheduling των εντολών για εκτέλεση à dispatch στο RUU Loads/Stores dispatched ως 2 μικρο-εντολές (microinstructions) η πρώτη υπολογίζει την διεύθυνση και η επόμενη εκτελεί την πράξη q Issue & Execute (out of program order): Μόλις το RUU πάρει τα δεδομένα της εντολής (source operands) και το FU είναι διαθέσιμο, οι εντολές εκδίδονται στο FU για εκτέλεση q Writeback (out of program order): Τελειώνοντας το FU βάζει το αποτέλεσμα στο Result Bus επιτρέποντας στο RUU και στο LSQ να ενημερωθούν (updated) η εντολή ολοκληρώνεται. q Commit (in program order): Ανανέωση της κατάστασης του επεξεργαστή (processor state) -- κάνουμε commit το αποτέλεσμα της εντολής στα D$ και RegFiles. ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.3

Τι αλλάζει σε επιμέρους αρχιτεκτονικά στοιχεία; q Αρχείο(α) καταχωρητών εποπρόσθετα πεδία q Register Update Unit RUU q Οργάνωση του pipeline q Load/Store microperations q Υλοποίηση παράλληλης αποκωδικοποίησης q Brach prediction speculation and support q ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.4

Οι καταχωρητές Επιπρόσθετα πεδία qο κάθε καταχωρητής (RegFile) έχει δύο συζευγμένα (associated) n-bit counters (συνήθως n=3) NI (Number of Instances) the number of instances of a register as a destination register in the RUU LI (Latest Instance) the number of the latest instance qόταν μια εντολή με destination register address R i εκδίδεται για εκτέλεση στο RUU, το NI i και το LI i αυξάνονται κατά 1 Dispatch is blocked if a destination register s NI is 2 n -1, so only up to 2 n 1 instances of a register can be present in the RUU at any one time qόταν μια εντολή γράφεται (committed -- updates the R i value) το NI i μειώνεται κατά 1 When NI i = 0 the register is free (there is no instruction in the RUU that is going to write to that register) and LI i is cleared ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.5

Register Update Unit (RUU) q Μια δομή δεδομένων (λίστα) υλοποιημένη σε hardware: - περιέχει όλες τις ενεργές εντολές - κρατά σημαντικές πληροφορίες για τα δεδομένα μιας εντολής - βοηθά στην επίλυση των data dependencies - κρατά σημαντικές πληροφορίες για την κατάσταση μιας εντολής - επιτρέπει το γράψιμο (commit) των συμπληρωμένων εντολών με τη σωστή σειρά qan entry in the RUU speculative src operand 1 src operand 2 destination issued functional unit executed PC Yes/No Spec Instr Addr ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.6 Ready Tag Content Ready Tag Content Tag Tag = RegFile addr LI Content Yes/No Unit Number Yes/No Address

Οργάνωση του RUU σε Queue q Αν χειριστούμε το RUU σαν queue, και γράφουμε τις συμπληρωμένες εντολές από το RUU Head, οι εντολές γράφονται (committed - aka retired) με την σειρά στην οποία εκδόθηκαν από το Decode & Dispatch logic (in program order) Stores to state locations (RegFile and D$) are buffered (in the RUU and LSQ) until commit time Supports precise interrupts (the only state locations updated are those written by instructions before the interrupting instruction) q Το counter (LI) αφήνει πολλαπλές υπάρξεις ενός συγκεκριμένου destination register στο RUU την ίδια ώρα, μέσω register renaming Solves write before write hazards if results from the RUU are returned to the RegFile in program order ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.7

Βασικές εργασίες του RUU qκάθε εργασία γίνεται παράλληλα με τις άλλες κάθε κύκλο 1. Δέχεται καινούριες εντολές από το Decode & Dispatch logic 2. Επιβλέπει το Result Bus για επίλυση των true dependencies και για write back των αποτελεσμάτων (result data) στο RUU 3. Αποφασίζει ποιες εντολές είναι έτοιμες για εκτέλεση, κρατάει (reserves) το Result Bus, και εκδίδει την εντολή στο αντίστοιχο εκτελεστικό σύστημα 4. Αποφασίζει αν μια εντολή είναι έτοιμη για εγγραφή (commit --i.e., change the machine state) και εγγράφει την εντολή αναλόγως. ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.8

SS Pipeline Οργάνωση του Pipeline FETCH DECODE & DISPATCH ISSUE & EXECUTE WRITE BACK RESULT COMMIT Fetch multiple instructions Decode instructions Detect Dependences in operands and destination registers Issue Instruction IFF functional unit available, if source operands available and if resulting operand output register has been reserved. copy Result Bus data to matching buffer locations (src1, src2) and forward data in case of an instruction waiting on result write dst Contents to RegFile if no conflicts In Order In Order Out of Order In Order ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.9

Σημ: Content Addressable Memories (CAMs) qmemories that are addressed by their content. Typical applications include RUU source tag field comparison logic, cache tags, and translation lookaside buffers Memory hardware that compares the Search Data to the Match Field entries for each word in the CAM in parallel! On a match the Data Field for that entry is output to Match Data on read or Match Data is written into the Data Field on write and the Hit bit is set. If no match occurs, the Hit bit is reset. CAMs can be designed to accommodate multiple hits. Search Data Match Field Hit Data Field Match Data ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.10

MicroOperations of Load and Store qας θυμηθούμε ότι τα loads και τα stores αποστέλλονται στο RUU σαν δυο μικροεντολές (microinstructions) μια υπολογίζει την ενεργή διεύθυνση (effective address) και μια υπολογίζει την προσπέλαση της μνήμης (memory operation) Load lw R1,2(R2) becomes addi R0,R2,2 lw R1,R0 Store sw R1,6(R2) becomes addi R0,R2,6 sw R1,R0 qτην ίδια στιγμή μία θέση στο LSQ παρακρατείται Κάθε entry στο LSQ αποτελείται από το Tag field (RegFile addr LI) και το Content field. Το LI counter επιτρέπει πολλαπλές υπάρξεις store (write) στο ίδιο memory address Όταν ένα load ολοκληρωθεί (the D$ returns the data on the Result Bus) ή ένα store κάνει commit (in program order) το αντίστοιχο entry στο LSQ ελευθερώνεται. Instruction dispatch γίνεται blocked αν δεν υπάρχει 1 ελεύθερo entry στο LSQ και δυο 2 στο RUU. ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.11

Load & Store RUU & LSQ Operation Load addi R0,R2,2 lw R1,R0 Store addi R0,R2,6 sw R1,R0 NI LI 2 2 0 1 1 0 0 2 0 0 LSQ_Head LSQ_Tail # LI 1 1 1 0 2 2 D$ 2 9 # LI # LI # LI RUU_Tail RUU_Head 1 2 1 2 0 1 y 1 0 1 1 1 1 2 1 6 0 2 y 1 1 0 2 0 2 RUU LSQ Result Bus ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.12

ΣΗΜΕΙΩΣΗ Tomasulo s algorithm qη εφαρμογή του RUU (reservation station) είναι μια παραλλαγή του αλγόριθμου του Tomasulo, που αναγράφεται στο βιβλίο σας (Section 3.2)! ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.13

Σειρά Μεταφοράς Εντολών (Fetching) qροή Εντολών αριθμός εντολών (run length) που μεταφέρονται μεταξύ διακλαδώσεων που έχουν παρθεί. (taken branches) Instruction fetcher operates most efficiently when processing long runs unfortunately runs are usually quite short qη μέση ροή εντολών είναι γύρω στις έξι εντολές. Time (cycles) 4-way Instr Fetcher S4 S1 S2 S3 S5 Branch delay qτο εύρος των εντολών είναι μόνο 1.125 εντολές ανά κύκλο. 9 instructions in 8 cycles T4 T1 T2 T3 ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.14

Αναποτελεσματικότητα στην Μεταφορά qτο σύστημα μεταφοράς, δεν μπορεί να προσφέρει στον αποκωδικοποιητή εντολών μεγάλο εύρος εντολών για πλήρη αξιοποίηση του ILP, γιατί Ο αποκωδικοποιητής περιμένει το αποτέλεσμα της διακλάδωσης - Μπορεί (συνήθως) να βελτιστοποιηθεί με δυναμική πρόβλεψη διακλάδωσης (dynamic branch prediction) Η μεταφορά εντολών από διαφορετικές θέσεις στην μνήμη περιορίζει τον αποκωδικοποιητή από το να δουλεύει με πλήρη χωρητικότητα, ακόμη και αν ο αποκωδικοποιητής επεξεργάζεται βάσιμες εντολές - The fetcher can align fetched instructions to avoid wasted decoder slots - If supported by dynamic branch prediction, the fetcher can also merge instructions from different runs Η εναρμόνιση των εντολών με την ροή του προγράμματος μπορεί να γίνει μόνο αν το σύστημα μεταφοράς έχει αρκετό εύρος (i.e., the fetch rate is faster than the decode rate) ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.15

Speedups of Fetch Alternatives 4 From Johnson, 1992 Low 3 HM High Speedup 2 1 Base: no prediction and no alignment 0 Pred: dynamic branch prediction 2- base 4- base 2- pred 4- pred 2-max 4-max qa 4-way instr fetcher out performs a 2-way instr fetcher Διπλάσιο πιθανό εύρος μεταφοράς εντολών Διπλάσιο υλικό αποκωδικοποιητή για να διατηρεί την ταχύτητα (e.g., in decoders and in ports and buses) ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.16

4-Way Decoder Implementation (Decoding) qένα 4-way σύστημα μεταφοράς εντολών, έρχεται με μεγάλο εύρος. Ποιο είναι το κόστος όμως; 12 dependency checks between the 4 decode instruction op rs rt rd op rs rt rd op rs rt rd op rs rt rd ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.17

Reducing 4-Way Decoder Hardware qμπορούμε να περιορίσουμε τον αριθμό των RegFile ports αφού Not all decoded instructions access two registers Not all decoded instruction are valid (because of misalignment) Some decoded instructions have dependences on one or more simultaneously decoded instructions From Johnson, 1992 qη ζήτηση καταχωρητών εξυπακούει ότι σχεδίαση με 8 πόρτες θα είναι σπατάλη. qσχεδίαση με 4 πόρτες μειώνει την απόδοση κατά < 2% ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.18 Fraction of total decoded parcels 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 ccom troff 0 1 2 3 4 5 6 7 8 # of Ports Used

SS Branch Prediction qυπενθύμιση από το Computer Organization ότι για branch prediction σε scalar pipeline χρειαζόμαστε Μηχανισμό πρόβλεψης του branch outcome: το BHT (branch history table) στο στάδιο fetch Μέθοδο για fetch δυο εντολών την εντολή που ακολουθεί το branch (I$) και το branch target instruction (BTB (branch target buffer)) Μηχανισμό ελέγχου για τις εντολές που ακολουθούσαν το branch για μη αλλαγή του machine state μέχρι ακριβή υπολογισμό του branch - Τις αφήνουμε να ολοκληρωθούν (in order commit) στη σωστή πρόβλεψη - Flushed και επανεκκίνηση του pipeline σε περίπτωση λάθους qμε ένα SS machine, είναι δυνατό να έχουμε πολλές τέτοιες εντολές με predicted branches active (ενεργά) στο pipeline Flag instructions following branches as speculative until the branch outcome is known ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.19

Implementing Branches qa SS processor could have more than one branch per fetch set and could have several uncompleted branches pending I$ at any time Fetch (BHT/BTB) Branch (check predict) Decode (Predict) Dispatch qmust access BHT/BTB for all branch instr s in the fetch set during fetch to reduced branch delay (i.e., need a 4 read-port BHT/BTB for 4-instr fetcher) Pass BHT information to decode stage After decode, choose between I$ set and BTB sets to determine the next fetch set ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.20

Decoding & Dispatching Branches qwhile multiple branches could be dispatched per cycle, it incurs only a slight performance decrease (about 2%) from imposing a decoder limit of one branch per fetch set since typically only one branch per cycle can be executed (usually only have one branch FU) qhaving minimum branch delay is more important that decoding multiple branches per cycle Speedup 3 2.5 2 1.5 1 0.5 From Johnson, 1992 Low HM High 0 4-sinpred 4-mulpred ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.21

Speculative Instructions qspeculation The processor (or compiler) guesses the outcome of an instruction (e.g., branches, loads) so as to enable execution of other instructions that depend on the speculated instruction One of the most important methods for finding more ILP in SS and VLIW processors qproducing correct results requires result checking, recovery and restart hardware mechanisms Checking mechanisms to see if the prediction was correct Recovery mechanisms to cancel the effects of instructions that were issued under false assumptions (e.g., branch misprediction) Restart mechanisms to reestablish the correct instruction sequence - For branches the correct program counter restart value is known when the branch outcome is determined ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.22

Speculation Support qfor dependent speculative instr s, a speculative flag is set to Yes until the outcome of the driving instr (i.e., the branch) is determined. speculative src operand 1 src operand 2 destination issued functional unit executed PC Yes/No Spec Instr Addr Ready Tag Content Ready Tag Content Tag Content Yes/No Unit Number Yes/No Address ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.23

Branch Execution qif the branch was not mispredicted, then the branch and its trailing instructions can commit when at RUU_Head qif the branch was mispredicted, then all subsequent instr s must be discarded (even though subsequent branches may have been correctly predicted) qwhen there is an exception, all of the RUU entries are discarded in a single cycle and instruction stream fetching restarts on the next cycle. Thus, the RUU provides an easy way to discard instructions coming after a mispredicted branch. ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.24

Remember: Single Result Bus Fetch Decode & Dispatch Issue & Execute Writeback Commit P C BHT BTB I$ N I RUU_Head N L I I L FP I Integer RegFile RegFile 3 4 Load/Store Queue 1 2 FPALU IALU LSQ D$ RUU_Tail RUU 5 IALU 6 IMULT Register Update Unit (managed as a queue) Result Bus ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.25

Result Buses Utilization qour SS model has only one Result Bus to carry results generated by the FU s to the RUU and LSQ Even at the high levels of performance, the utilization of the Result Bus is only about 70% (i.e., the fraction of capacity actually used) q If a FU request for the Result Bus is not granted, instruction issue to that FU is stalled until the bus request can be granted (i.e., the FU remains busy ) Avg # of results waiting for the Result Bus 1 0 From Johnson, 1992 ccom troff 1 2 3 4 # of Result Buses ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.26

Arbitrating For Result Buses q Reducing the impact of bus contention by adding a second Result Bus improves performance (by almost 19%) q But adding a third Results Bus yields only a very small improvement in performance (less than 3%) qand since there are usually fewer Result Buses than FUs, the FUs must continue to arbitrate for use of the existing Result Buses Speedup The arbiter not only decides which FU is granted use of the Result Buses, but also which of the two buses is to be used - Prioritizing old requests over new helps prevent starvation 3 2 1 0 From Johnson, 1992 1 2 3 4 # of Result Buses Low HM High ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.27

Result Forwarding qresult forwarding supplies operands directly to the waiting instr s in the RUU to resolve true dependencies that could not be resolved during decode The cost of forwarding is the comparison logic in the RUU to compare the Result Bus Tag to the source operand Tags Need a set of comparators for each Result Bus qabout 2/3 rd of all results are forwarded to one waiting operand, and about 1/6 th are forwarded to more than one ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.28 Fraction of Results 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 From Johnson, 1992 ccom troff 0 1 2 3 4 5 6+ # of RUU Entries Receiving Result as Input Operand

Performance Advantages qhardware complexity arises from four major hardware features Out-of-order issue Register renaming Branch prediction 4-way instruction fetch and decode OOI Register Renaming Branch Prediction From Johnson, 1992 4-way Fetch & Decode 52% 36% 30% 18% Relative increase in performance by adding one feature on top of others ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.29

ILP in a Perfect OOI-OOC Processor qthe perfect processor has An infinite number of rename registers that eliminates all storage hazards (i.e., write-before-write and write-before-read) No (fetch, decode, dispatch, issue, FU, buses, ports) limit on the number of instr s that can begin execution simultaneously as long as read-before-write true data hazards are not present Perfect branch and jump (including jump register) prediction Loads can be moved before stores (as long as the addresses are not identical) with memory address analysis All FU s have a 1 cycle latency Perfect caches with 1 cycle latency IPC 160 120 80 40 0 gcc 55 espresso From H&P, 2003 63 li 18 fpppp 75 doduc 119 tomcatv 150 ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.30

Effect of Instruction Window Size on ILP qinstruction window the set of instructions that are examined simultaneously for execution IPC 160 120 80 From H&P, 2003 gcc espresso li fpppp doduc tomcatv 40 0 Infinite 2K 512 128 32 8 4 ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.31

Effect of Realistic Branch Prediction on ILP qon a processor with an instruction window size of 2K and maximum 64-way issue capability IPC 60 40 20 From H&P, 2003 gcc espresso li fpppp doduc tomcatv ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.32 0 Perfect Tournament Standard 2-bit Static None

Effect of Finite Rename Registers qon a processor with an instruction window size of 2K, maximum 64-way issue capability, and a tournament branch predictor with 8K entries IPC 60 40 20 From H&P, 2003 gcc espresso li fpppp doduc tomcatv 0 Infinite 256 128 64 32 None ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.33

A SS Example qintel Pentium 4 (IA-32 ISA) Decodes the IA-32 instructions into microoperations Does register renaming with a RUU-like structure Has a 20 stage pipeline T$ access (Bpredict) µop queue RUU allocation FU queues Instr dispatch RegFile access Execution RUU queue Commit # cycles 5 4 5 2 1 3 7 FUs: 2 integer ALUs, 1 FP ALU, 1FP move, load, store, complex Up to 126 instructions in flight, including 48 loads and 24 stores 4K entry branch predictor ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.34

ΚΑΤ ΟΙΚΟΝ ΜΕΛΕΤΗ q Για την 3 η ενότητα του μαθήματος (ILP): Κεφάλαιο 6 Patterson&Hennessy, 3 η έκδοση (από το βιβλίο του ΗΜΥ212) επανάληψη pipelining και 6.9 για advanced pipelining, speculation, static/dynamic multiple-issue Κεφάλαια 2-3 του βιβλίου σας Παραρτήματα A(pipelining) και G (VLIW) του βιβλίου σας q Για εξάσκηση σε ILP: ασκήσεις στο τέλος του κεφαλαίου 2 του βιβλίου σας (σελ. 142-149) q Αφήστε έξω τον αλγόριθμο του Tomasulo (για όσους ενδιαφέρονται για τη συνέχεια υπάρχει το ΗΜΥ409!) ΗΜΥ312 Δ17 SuperScalar Επεξεργαστές Dynamic ILP.35