ΕΠΛ372 Parallel Processing. Introduction: Technology Trends and the Shift to Parallelism. Γιάννος Σαζεϊδης, Spring Semester 2014. READING: www.elsevierdirect.com/samplechapters/9780123838728/hennessy_chapter%201.pdf (Sections 1.1, 1.2, 1.4, 1.5, 1.6, 1.8, 1.9, 1.10); http://www.youtube.com/watch?v=cnn_ttxabua (How a CPU is made); http://www.cs.utexas.edu/~lin/cs380p/free_lunch.pdf (The free lunch is over). Slides based on Elsevier material. Copyright 2012, Elsevier Inc. All rights reserved.
Sequential Execution: executing the instructions one at a time. Drawing from tutorial at LLNL.
Parallel Execution: multiple instructions at a time (as many as there are CPUs). Performance increases by up to P× (P = number of CPUs). Drawing from tutorial at LLNL.
Parallel Execution. + Finish a problem faster and better. + Easier to build parallel machines than faster ones. + Save money. + Solve bigger problems. + Concurrency (internet, Google, Facebook): many services handle independent requests. - Not always obvious how to parallelize. - Dependences/communication limit parallelism. - System size determines how much communication it can afford. - Writing a parallel program can be tricky.
372 Introduction. The course aims to educate and to answer questions about: Why is parallelism more critical today? Parallel machines/computers. How to write parallel algorithms/programs. Parallel languages. Appreciating the difficulties.
Computer Technology (Introduction). Performance improvements come from: semiconductor technology (feature size per Moore's Law, clock speed); computer architecture (ILP, caching, microarchitecture); system software: high-level languages (HLL) such as C, C++, Java; compilers such as gcc, g++; operating systems (OS) such as Unix, Linux, iOS, Windows.
Single Processor Performance: the move to multiprocessors (Introduction). Plot of RISC processor performance over time.
Bandwidth and Latency (Trends in Technology). Bandwidth or throughput: total work done in a given time; 10,000-25,000× improvement for processors, 300-1200× for memory and disks. Latency or response time: time between start and completion of an event; 30-80× improvement for processors, 6-8× for memory and disks.
Bandwidth and Latency (Trends in Technology). Log-log plot of bandwidth and latency milestones. Historic trend: bandwidth grows faster than latency improves.
Trends in Technology. Integrated circuit technology: transistor density 35%/year; die size 10-20%/year; overall integration 40-55%/year. DRAM capacity: 25-40%/year (slowing). Flash capacity: 50-60%/year, 15-20× cheaper/bit than DRAM. Magnetic disk technology: 40%/year, 15-25× cheaper/bit than Flash, 300-500× cheaper/bit than DRAM.
Reason 1: Non-uniform Scaling of Transistors and Wires (Trends in Technology). Feature size: the minimum size of a transistor or wire in the x or y dimension; from 10 microns in 1971 to 0.014 microns (14 nm) in 2014. Transistor performance scales linearly; wire delay does not improve with feature size! Integration density scales quadratically. Non-uniform scaling favors localized activity and communication: fast compute elements, slow communication.
Reason 2: Cannot Increase Power Density Cheaply (Trends in Power and Energy). Power: the rate of energy consumption (watts, W). Energy: the work needed to complete a task (joules, J). Problem: get power in, get power out, otherwise the chip overheats. The doubling of transistors per new technology node is not matched by a better-than-half reduction in energy per transistor. Thermal Design Power (TDP): characterizes sustained power consumption; used as the target for the power supply and cooling system, with cost implications; lower than peak power, higher than average power consumption; 130 W for servers/desktops vs. a few watts for mobiles/tablets. Data-center racks cannot air-cool more than ~20 kW/rack.
Power Wall (Trends in Power and Energy). The Intel 80386 consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W. That heat must be dissipated from a 1.5 cm × 1.5 cm chip, which is about the limit of what can be cooled by air. The power wall limits single-thread performance.
Dynamic Energy and Power (Trends in Power and Energy). Dynamic energy: consumed when a transistor switches from 0 -> 1 or 1 -> 0; E = ½ × Capacitive load × Voltage²; the total depends on the number of transistors switching. Dynamic power = energy/time; power in a cycle: P = ½ × Capacitive load × Voltage² × Frequency switched. Reducing the clock rate reduces power, not energy; energy per task is often the reported metric.
Reducing Power (Trends in Power and Energy). Techniques for reducing power: do nothing well (for data centers but also for mobile devices); dynamic voltage-frequency scaling; low-power states for DRAM and disks; power gating (turning off cores); parallelism: get the same performance with less power.
Reducing Power with Parallelism (Trends in Power and Energy). Assumptions: the program is parallelizable, and F ~ V (simplified example). We know P = C × V² × F. Case A: one processor running at F burns C × V² × F. Case B: two processors running at F/2 finish at the same time as A but burn 2 × (C × (V/2)² × F/2) = C × V² × F / 4 (25%!), at the cost of 2× the real estate.
Static Power (Trends in Power and Energy). Consumed continuously by processors and DRAM, even when no useful work is being done, due to leakage. Static power consumption = static current × voltage; it scales with the number of transistors. To reduce it: power gating, lower voltage.
Current Trends in Architecture (Introduction). For several years: more ILP (how? see 370). We cannot continue to leverage instruction-level parallelism (ILP): single-processor performance improvement ended in 2003, due to technology and program properties. Other types of parallelism are pursued instead: data-level parallelism (DLP) and task-level parallelism (TLP). These require parallel architectures and explicit restructuring of the applications (e.g. parallel programming).
Classes of Computers Personal Mobile Device (PMD) e.g. smart phones, tablet computers Emphasis on energy efficiency and real-time Desktop Computing Emphasis on price-performance Servers Emphasis on availability, scalability, throughput Clusters/Warehouse Scale Computers Used for Software as a Service (SaaS) also as IaaS, PaaS Emphasis on availability and price-performance Supercomputers, emphasis: floating-point performance and fast internal networks Embedded Emphasis: price Classes of Computers
Current Parallelism in Computer Classes (processors/cores). Personal Mobile Device (PMD): 1, <10; e.g. smart phones, tablet computers; emphasis on energy efficiency and real-time. Desktop computing: 1-2, 10-100; emphasis on price-performance. Servers: 1-10s, 10-1K; emphasis on availability, scalability, throughput. Clusters/warehouse-scale computers: 1K to 100K (×10); used for Software as a Service (SaaS), also IaaS, PaaS; emphasis on availability and price-performance. Supercomputers: 1K to 100K (×10); emphasis on floating-point performance and fast internal networks. Embedded computers: can vary; emphasis on price.
Parallelism (Classes of Computers). Classes of parallelism in applications: Data-Level Parallelism (DLP), Task-Level Parallelism (TLP). Classes of architectural parallelism: Instruction-Level Parallelism (ILP); vector architectures / Graphics Processing Units (GPUs); Thread-Level Parallelism; Request-Level Parallelism (throughput).
Flynn's Taxonomy (1960s) (Classes of Computers). Single instruction stream, single data stream (SISD). Single instruction stream, multiple data streams (SIMD): vector architectures, multimedia extensions, graphics processor units. Multiple instruction streams, single data stream (MISD): no commercial implementation. Multiple instruction streams, multiple data streams (MIMD): tightly-coupled MIMD (multi- and many-cores), loosely-coupled MIMD (clusters, warehouses).
Parallel Processing is not New (Classes of Computers), March 1961:
Parallel Processing is not New (Classes of Computers). Multiprocessors and multicomputers have been around for decades, built from specialized or commodity parts. Parallel programming for certain domains is a very well-developed art. There are many parallel programming languages and frameworks, and numerous efforts at automatic (compiler) parallelization and vectorization.
What is New (Classes of Computers). The lack of other alternatives for moving forward in the more general-purpose market segments: the power wall (non-ideal scaling of technology parameters: wires vs. transistors, voltage vs. transistor area, leakage vs. transistor area). The embrace of multi-/many-cores across industry motivates both the hardware and software industries to develop effective means of exploiting parallel resources. The emergence of large-scale RLP (more throughput). No silver bullet for all applications: specialization and heterogeneity (small cores, big cores, GPUs, accelerators, etc.).
The Course. An introduction to the basic concepts of parallel processing: computers, algorithmic thinking, and programming. Teaching method: lectures Δ-Π 9:00-10:30, room ΧΩΔ01 004; tutorial Τ 11:00-12:00, room ΘΕΕ01 103. Prerequisites: 121, 131, 221, 222. Bibliography: selected articles/videos; Principles of Parallel Programming, Calvin Lin & Lawrence Snyder; Computer Architecture: A Quantitative Approach, 5th ed. Assessment: assignments/project/presentation 35% (assignments ~20%, project ~15%), midterm exam 20%, final exam 45%.
Why Take This Course. Parallel computers are everywhere: multi-cores/GPUs in phones, notebooks, desktops, game consoles (e.g. 8 cores in the AMD Bulldozer, 4 cores in the iPad and in phones); servers, clusters, data centers, cloud computing, supercomputers; and they will become even more widespread. How do we write parallel programs? We parallelize a program to get faster execution: for(i=0;i<n;++i) => forall(i in 0 to n-1 step 1). What limits/affects performance? How do we improve performance? Scalable and portable.
Syllabus. Basics/introduction. Parallel computers: multicores, GPUs, data centers, supercomputers. Measuring/comparing performance. Parallel abstraction: basic parallel control and data structures. Scalable algorithmic techniques. Parallel programming languages: threads (p-threads, OpenMP), message passing (MPI), CUDA. Current state and modern trends: transactions, MapReduce.