Κεφάλαιο 4 Εκτίμηση και Κατανόηση Απόδοσης

Κεφάλαιο 4 Εκτίμηση και Κατανόηση Απόδοσης (Assessing and Understanding Performance) 1

Απόδοση H/Y (Computer Performance) Measure, Report, and Summarize Understand major factors that determine performance Make intelligent choices, application matters! See through the marketing hype Key to understanding underlying organizational motivation Why is some hardware better than others for different programs? What factors of system performance are hardware related? (e.g., Do we need a new machine, or a new operating system?) How does the machine's instruction set affect performance? 2

Πιο από τα πιο κάτω αεροπλάνα έχει την καλύτερη απόδοση; Airplane Passengers Range Speed Throughput Model Capacity (miles) (m.p.h.) (passengers x m.p.h.) Boeing 737-100 101 630 598 60,398 Boeing 777 375 4630 610 228,750 Boeing 747 470 4150 610 286,700 BAC/Sud Concorde 132 4000 1350 178,200 Douglas DC-8-50 146 8720 544 79,424 How much faster is the Concorde compared to the 747? How much bigger is the 747 than the Douglas DC-8? Many factors that can define performance time to transport 1 passenger time to transport n passengers (throughput) 3

Απόδοση Η/Υ Response Time (Χρόνος Απόκρισης) or Latency (Χρόνος Αναμονής) or Execution Time (Χρόνος Εκτέλεσης) The total time required for the computer to complete a task (includes memory, I/O, O/S overhead, CPU time, etc) How long does it take for my job to run? How long does it take to execute a job? How long must I wait for the database query? Throughput (Δυναμικότητα) Total amount of work done in a given time How many jobs can the machine run at once? What is the average execution rate? How much work is getting done? 4

Απόδοση Η/Υ (συν.) Response Time (Χρόνος Απόκρισης) or Latency (Χρόνος Αναμονής) or Execution Time (Χρόνος Εκτέλεσης) The total time required for the computer to complete a task (includes memory, I/O, O/S overhead, CPU time, etc) Throughput (Δυναμικότητα) Total amount of work done in a given time If we upgrade a machine with a new processor what do we increase? If we add a new machine to the lab what do we increase? 5

Χρόνος Εκτέλεσης (Execution Time) Execution or Elapsed Time (Διαρρεύσας Χρόνος) counts everything (disk and memory accesses, I/O, etc.) a useful number, but often not good for comparison purposes CPU time (Χρόνος Μικροεπεξεργαστή) doesn't count I/O or time spent running other programs, actual time the CPU spends computing for a program can be broken up into system CPU time (=O/S time spent in a program), and user CPU time (=CPU time spent in a program) Our focus: user CPU time time spent executing the lines of code that are "in" our program 6

Ορισμός Απόδοσης For some program running on machine X and machine Y, Performance X = 1 / Execution time X Performance x > Performance y Execution time y > Execution time x X is n times faster than Y (Performance Ratio) Performance X / Performance Y = n Execution time y / Execution time x = n Problem: machine A runs a program in 20 seconds machine B runs the same program in 25 seconds How much faster is A than B? 7

Κύκλοι Ρολογιού Instead of reporting execution time in seconds, we often use cycles (h/w designer s point of view) CPU execution time = CPU clock cycles x clock cycle time Time = seconds program clock cycles = program seconds clock cycle Clock ticks indicate when to start activities (one abstraction): cycle time (period) = time between ticks = seconds per cycle clock rate (frequency) = cycles per second = 1/ cycle time (1 Hz. = 1 cycle/sec) A 4 Ghz. clock has a cycle period of: 1 4 10 9 seconds = 1 4 10 9 time 10 12 = 250 picoseconds (ps) 8

Βελτίωση Απόδοσης seconds program cycles = program seconds cycle So, to improve performance (everything else being equal) you can either (increase or decrease?) the # of required cycles for a program, or the clock cycle time or, said another way, the clock rate. 9

Παράδειγμα 1 Our favorite program runs in 10 seconds on computer A, which has a 4 GHz. clock. We are trying to help a computer designer build a new machine B, that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?" Don't Panic, can easily work this out from basic principles 10

Πόσοι κύκλοι ρολογιού χρειάζονται για ένα πρόγραμμα; Could assume that number of cycles equals number of instructions: 1st instruction 2nd instruction 3rd instruction 4th 5th 6th... time This assumption is incorrect, different instructions take different amounts of time on different machines. Why? hint: remember that these are machine instructions, not lines of C code 11

Διαφορετικός αριθμός κύκλων ρολογιού για διαφορετικές εντολές μηχανής 1 2 3 4 5 1 2 1 1 3 # cycles per instruction time Multiplication takes more time than addition Floating point operations take longer than integer ones Accessing memory takes more time than accessing registers Important point: changing the cycle time often changes the number of cycles required for various instructions 12

CPI (Κύκλοι ρολογιού εντολής) Clock cycles per instruction (CPI) = the average number of clock cycles per instruction for a program or a program fragment. Thus, CPU clock cycles = # Instructions in program x CPI Substituting this in CPU execution time = CPU clock cycles x clock cycle time We have: CPU execution time = # Instructions in program x CPI x clock cycle time Observe that CPI is an average of all instructions used in a program, not easily calculated without simulation CPI is a good measure to compare different implementations of same ISA. Why? 13

Παράδειγμα 2 (για CPI ) Suppose we have two implementations of the same instruction set architecture (ISA). For some program, Machine A has a clock cycle time of 250 ps and a CPI of 2.0 Machine B has a clock cycle time of 500 ps and a CPI of 1.2 Which machine is faster for this program, and by how much? If two machines have the same ISA which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical? 14

Τώρα που καταλαβαίνουμε περισσότερα A given program will require some number of instructions (machine instructions) some number of cycles some number of seconds We have a vocabulary that relates these quantities: cycle time (seconds per cycle) clock rate (cycles per second) CPI (average cycles per instruction) a floating point intensive application might have a higher CPI MIPS (millions of instructions per second) this would be higher for a program using simple instructions 15

Απόδοση Performance is determined by execution time Do any of the other variables equal performance? # of cycles to execute program? # of instructions in program? # of cycles per second (frequency)? average # of cycles per instruction (CPI)? average # of instructions per second (MIPS)? Common pitfall: thinking one of the variables is indicative of performance when it really isn t. 16

Παράδειγμα 3 (για # εντολών) A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively). The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. Which sequence will be faster? How much? What is the CPI for each sequence? n Consider that: CPU clock cycles = Σ (CPI i x C i ), i = 1 n=number of instruction classes, C i = # of instructions of class i 17

Αξιολόγηση Απόδοσης How do we evaluate performance between 2 (or more) computers? Workload (Φόρτος Εργασίας) A set of programs run on a computer that is either the actual collection of applications run by a user or is constructed from real programs to approximate such a mix (usually, a set of benchmark programs). Benchmarks (Πρότυπα Προγράμματα Αξιολόγησης) Specifically chosen to measure performance. Nowadays, they are real applications that the user will use more often (for example, scientific applications for engineers, compiler and text editing for s/w engineers) 18

Benchmarks (Πρότυπα Προγράμματα) Performance best determined by running a real application Small benchmarks nice for architects and designers (especially when no compiler is still available) easy to standardize can be abused! SPEC (System Performance Evaluation Cooperative) companies have agreed on a set of real program and inputs effort started in 1989, focusing on workstations and servers using CPU-intensive benchmarks nowadays include CPU performance, graphics, highperformance computing, object-oriented computing, Java applications, client-server models, mail systems, file systems, and Web servers valuable indicator of performance (and compiler technology) can still be abused (see next slide) 19

Τα παιγνίδια των πολυεθνικών με τα Benchmark An embarrassed Intel Corp. acknowledged Friday that a bug in a software program known as a compiler had led the company to overstate the speed of its microprocessor chips on an industry benchmark by 10 percent. However, industry analysts said the coding error was a sad commentary on a common industry practice of cheating on standardized performance tests The error was pointed out to Intel two days ago by a competitor, Motorola came in a test known as SPECint92 Intel acknowledged that it had optimized its compiler to improve its test scores. The company had also said that it did not like the practice but felt to compelled to make the optimizations because its competitors were doing the same thing At the heart of Intel s problem is the practice of tuning compiler programs to recognize certain computing problems in the test and then substituting special handwritten pieces of code Saturday, January 6, 1996 New York Times 20

Αναφορά Απόδοσης Performance report Drafted after performance measurements are obtained from a selected set of suitable benchmarks Reproducibility (Aναπαραγωγικότητα) Include in the report everything another experiment would need to reproduce the reported results (O/S version, compiler, input, computer configuration, etc) Choice of input is very important. Large input is necessary to evaluate the memory system. Large workload is also necessary. 21

Σύγκριση και Σύνοψη Απόδοσης Deciding how to summarize the performance of a group of benchmarks is critical How should a summary be computed? Consider the following example: Program Computer A execution time Computer B execution time 1 1 sec 10 secs 2 1000 secs 100 secs Total 1001 secs 110 secs Based on the definition of faster, both of the following hold: A is 10 times faster than B for program 1 B is 10 times faster than A for program 2 What s the total/big picture? 22

ΣυνολικόςΧρόνοςΕκτέλεσης A consistent summary measure Thus, to summarize relative performance we can use the total execution time of the 2 programs: Performance B = Execution Time A = 1001 = 9.1 Performance A Execution Time B 110 Thus, B is 9.1 times faster than A The average of the execution times that is directly proportional to total execution time is the arithmetic mean (AM): n AM = Σ Execution Time i 1 n i=1 Weighted AM: assign weighting factor to each program to reflect the frequency of the program in the workload. 23

Τι νομίζετε; Suppose you are choosing between 4 different desktops: one Apple Macintosh and three PC compatible (Pentium 4, AMD, and Pentium 5). Which of the following is True? 1. The fastest computer will be the one with the highest clock rate. 2. Since all PCs use the same Intel compatible ISA (same # of instructions), the fastest one will be the one with the highest clock rate. 3. The AMD processor may have different CPIs. Still, you can tell which of the 2 Pentium PCs is faster by looking at the clock rate. 4. Only by looking at the results of benchmarks for tasks similar to your workload you can get an accurate picture of performance. 24

Τι νομίζετε; Assume the following: Program Computer A execution time Computer B execution time 1 2 secs 4 secs 2 5 secs 2 secs Which of the following is True? 1. A is faster than B for program 1. 2. A is faster than B for program 2. 3. A is faster than B for a workload with equal numbers of executions of program 1 and 2. 4. A is faster than B for a workload with twice as many executions of program 1 as of program 2. 25

SPEC 89 (Τα πρώτα πρότυπα) Compiler enhancements and performance 800 700 600 SPEC performance ratio 500 400 300 200 100 0 gcc espresso spice doduc nasa7 li eqntott matrix300 fpppp tomcatv Benchmark Compiler Enhanced compiler 26

SPEC CPU2000 12 integer and 14 floating-point programs different summaries for each group wall clock is reported but since the majority of time spend is in CPU, CPU performance is measured SPEC ratio: normalize execution time by dividing by the execution time on a Sun Ultra 5_10 (300 MHz) by the execution time of the computer examined 27

SPEC CPU2000 for Intel Pentium III and Pentium 4 1400 1200 1000 800 on Dell Precision Pentium 4 CFP2000 Pentium 4 CINT2000 600 400 Pentium III CINT2000 200 Pentium III CFP2000 0 500 1000 1500 2000 2500 3000 3500 Clock rate in MHz Performance scales almost linearly with clock rate Any other observations? 28

SPEC CPU2000 for Intel Pentium III and Pentium 4 1400 1200 1000 800 on Dell Precision Pentium 4 CFP2000 Pentium 4 CINT2000 600 400 Pentium III CINT2000 200 Pentium III CFP2000 0 500 1000 1500 2000 2500 3000 3500 Clock rate in MHz Look at the average value of CFT2000 and CINT2000 divided by clock rate (MHz): CINT2000/Clock rate Pentium III = 0.47 Pentium 4 = 0.36 CFP2000/Clock rate Pentium III = 0.34 Pentium 4 = 0.39 29

SPECweb99 Focuses on throughput, measuring the maximum number of connections a web-server can support Multiprocessor systems are often used Web server s/w is part of the system measured SPECweb99 performance depends on many system characteristics (number of disk drives, number of CPUs, number of networks, clock rate) Example: Pentium III Xeon w/ 7 disk drives, 8 CPUs, 8 networks, at 0.7GHz is much better than Pentium Xeon 4 w/ 5 disk drives, 2 CPUs, 4 networks, at 3.06GHz 30

Απόδοση, Ισχύς και Αποδοτικότητα Ενέργειας Power is becoming a key limitation to processor performance Battery power improves slowly Processor MUST be design to operate efficiently and conserve power and be able to switch between different clock rates Assume 3 operation modes: Maximum power: maximum clock rate best performance Laptop mode: adaptive clock rate Minimum power: minimum clock rate best power efficiency Energy efficiency: performance divided by average power consumption while running benchmarks 31

Απόδοση, Ισχύς και Αποδοτικότητα Ενέργειας Can a machine with a slower clock rate have better performance? 32

Δοκιμάστε το πιο κάτω Phone a major computer retailer and tell them you are having trouble deciding between two different computers, specifically you are confused about the processors strengths and weaknesses (e.g., Pentium 4 at 2Ghz vs. Celeron M at 1.4 Ghz ) What kind of response are you likely to get? What kind of response could you give a friend with the same question? 33

Θυμηθείτε Performance is specific to a particular program/s Total execution time is a consistent summary of performance For a given architecture performance increases come from: increases in clock rate (without adverse CPI affects) improvements in processor organization that lower CPI compiler enhancements that lower CPI and/or instruction count Algorithm/Language choices that affect instruction count Many Pitfalls: Expecting improvement in one aspect of a machine s performance to affect the total performance (Amdahl s Law) Using a subset of the performance equation as a performance metric (MIPS measure) 34

ΟνόμοςτουAmdahl Execution Time Execution Time Execution Time Affected After Improvement = + Unaffected Amount of Improvement Example: "Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?" How about making it 5 times faster? Principle: Make the common case fast 35

Μέτρο MIPS MIPS = Million Instructions Per Second Easy to understand, can be used to specify performance under certain assumptions It can be inaccurate in many occasions since it does not give the whole picture! Describes execution rate but does not look into the instruction capabilities (CPI) Cannot be used to compare computers with different ISAs. Why? 36

Παράδειγμα μέτρου MIPS Two different compilers are being tested for a 4 GHz. machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software. The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. Which sequence will be faster according to MIPS? Which sequence will be faster according to execution time? 37