s vs. Planes

[Figure: NAND flash functional block diagram. Power pins: Vcc, Vss, Vccq, Vssq. Data pins DQ[7:0] are used in both asynchronous and synchronous modes; DQS is synchronous-only. Control logic inputs: CE#, CLE, ALE, WE#, RE#, WP# (asynchronous) or CE#, CLE, ALE, CLK, W/R#, WP# (synchronous). Blocks: I/O control; address, status, and command registers; control logic; row and column decode; NAND flash array (2 planes); data register; cache register; R/B# ready/busy output.]
Notes: 1. N/A: this signal is tri-stated when the asynchronous interface is active.
s vs. Planes

Figure 8: Array Organization: 32Gb and 64Gb Devices

[Figure: two planes, each with a (4096 + 218)-byte cache register and data register, feeding I/O0-I/O7]

1 page = (4K + 218) bytes
1 block = (4K + 218) bytes x 128 pages = (512K + 27K) bytes
1 plane = (512K + 27K) bytes x 4096 blocks = 17,256Mb
1 die = 17,256Mb x 2 planes = 34,512Mb

4096 blocks per plane, 8192 blocks per die: one plane holds the even-numbered blocks (0, 2, 4, 6, ..., 8188, 8190), the other the odd-numbered blocks (1, 3, 5, 7, ..., 8189, 8191).
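The capacity arithmetic on this slide can be checked directly; a minimal sketch (variable names are mine, not from the datasheet):

```python
# Verify the 32Gb-per-die geometry arithmetic from the slide.
PAGE_DATA = 4096           # bytes of data per page
PAGE_SPARE = 218           # spare bytes per page
PAGES_PER_BLOCK = 128
BLOCKS_PER_PLANE = 4096
PLANES_PER_DIE = 2

page_bytes = PAGE_DATA + PAGE_SPARE                # 4314 bytes
block_bytes = page_bytes * PAGES_PER_BLOCK         # (512K + 27K) bytes
plane_bits = block_bytes * BLOCKS_PER_PLANE * 8    # bits per plane
die_bits = plane_bits * PLANES_PER_DIE

print(plane_bits // 2**20, "Mb per plane")   # 17256 Mb (spare included)
print(die_bits // 2**20, "Mb per die")       # 34512 Mb, i.e. ~32 Gb of data plus spare
```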
DISK & FLASH
Disk

[Figure: disk drive internals: case, spindle & motor, magnet structure of the voice coil motor, load/unload ramp, actuator, flex cable]
Flash SSD

[Figure: SSD internals: flash memory arrays, circuit board, ATA interface connector]
Disk Issues

• Keeping ahead of Flash in price-per-GB is difficult (and expensive)
• Dealing with timing in a polar-coordinate system is non-trivial: the OS schedules disk requests to optimize both linear & rotational latencies; ideally, the OS should not have to become involved at that level
• Tolerating long-latency operations creates fun problems; e.g., a block fill is not atomic, so the buffer must be reserved for the duration, and Belady's MIN was designed for disks & thus does not consider the incoming block in its analysis
• Internal cache & prefetch mechanisms are slightly behind the times
Flash SSD Issues

• Flash does not allow in-place update of data (must block-erase first); the implication is a significant amount of garbage collection & storage management
• Asymmetric read [1x] & program [10x] times (plus erase time [100x])
• Proprietary firmware (heavily IP-oriented, not public, little published)
• Lack of models: timing/performance & power, notably
  – The Flash Translation Layer is a black box (both good & bad)
  – Ditto for garbage collection heuristics, wear leveling, ECC, etc.
• Result: (potentially?) poorly researched; e.g., which heuristics? how to best organize concurrency? etc.
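The first bullet is the crux: because a page cannot be overwritten in place, every logical write is remapped to a fresh physical page, and the old copy becomes garbage to collect later. A toy page-mapped FTL illustrating just that remapping (class and field names are hypothetical; real FTLs are the proprietary black boxes this slide describes):

```python
class TinyFTL:
    """Toy page-mapped FTL: writes go out of place; old copies become garbage."""
    def __init__(self, num_pages):
        self.map = {}                   # logical page number -> physical page number
        self.free = list(range(num_pages))
        self.invalid = set()            # physical pages holding stale data

    def write(self, lpn):
        if lpn in self.map:
            self.invalid.add(self.map[lpn])   # old copy is now garbage
        ppn = self.free.pop(0)                # program a fresh (erased) page
        self.map[lpn] = ppn
        return ppn

ftl = TinyFTL(num_pages=8)
ftl.write(0)
ftl.write(0)                     # rewriting logical page 0 cannot reuse its old page
print(len(ftl.invalid))          # 1 stale page now awaits garbage collection
```

Garbage collection (not shown) would later pick a block full of invalid pages, copy out any still-valid ones, and block-erase it to replenish the free list.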
SanDisk SSD Ultra ATA 2.5 Block Diagram
Flash SSD Organization & Operation

[Figure: host connects via IDE/ATA/SCSI to a Host I/F Layer; SRAM, MPU, ECC, and a data buffer feed a Flash Translation Layer; a Flash Controller and NAND I/F Layer drive multiple x8 flash arrays over x16 channels]

• Numerous Flash arrays
• Arrays controlled externally (controller relatively simple, but can stripe or interleave requests)
• Ganging is device-specific
• FTL manages mapping (VM), ECC, scheduling, wear leveling, data movement
• Host interface emulates an HDD
Flash SSD Organization & Operation

[Figure: the same SSD organization (Host I/F Layer, Flash Translation Layer, NAND I/F Layer), annotated with NAND parameters: 32 GB total storage; 2 KB page; 128 KB block; 25 μs page read; 200 μs page program; 3 ms block erase]

[Figure: 1 Gb NAND flash device: control logic (CE#, W#, R#), I/O control, address/status/command registers, row and column decode, flash array with data and cache registers; 1 page = 2 KB; 1 block = 64 pages; 1024 blocks per device (1 Gb)]
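A quick sanity check of the device geometry; note the 256-device count is my inference from the 32 GB total, not stated on the slide:

```python
# 1 Gb NAND device geometry from the slide.
page_bytes = 2 * 1024            # 2 KB page (data area)
pages_per_block = 64
blocks_per_device = 1024

device_bytes = page_bytes * pages_per_block * blocks_per_device
device_bits = device_bytes * 8
print(device_bits // 2**30, "Gb per device")   # 1 Gb, as labeled

total_bytes = 32 * 2**30         # 32 GB total SSD storage
print(total_bytes // device_bytes, "devices")  # 256 x 1 Gb devices (inference)
```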
Flash SSD Timing

Read 8 KB (4 pages): Cmd + Addr (5 cycles, 0.2 μs), then Rd0–Rd3 and DO0–DO3 (2048 cycles, 81.92 μs per page) on I/O[7:0]. Reading a page from the memory array into the data register takes 25 μs; the transfer from data register to cache register takes 3 μs. The subsequent page is accessed while data is read out from the cache register.

Write 8 KB (4 pages): Cmd + Addr (5 cycles, 0.2 μs), then DI0–DI3 (2048 cycles, 81.92 μs per page) and Pr0–Pr3 on I/O[7:0]. The transfer from cache register to data register takes 3 μs; programming a page takes 200 μs. A page is programmed while data for the subsequent page is written into the cache register.
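Assuming the pipelining described above (array reads and page programs overlap the bus bursts, so only the first read and the last program are exposed), the end-to-end 8 KB transfer times work out as follows. This is a back-of-the-envelope model built from the slide's numbers, not a figure given on the slide:

```python
CMD = 0.2        # us: 5 command/address cycles
BURST = 81.92    # us: 2048 cycles to move one 2 KB page over the 8-bit bus
READ = 25.0      # us: array -> data register
XFER = 3.0       # us: data register <-> cache register
PROG = 200.0     # us: program one page from the data register
PAGES = 4        # 8 KB = 4 x 2 KB pages

# Read: the first page access is exposed; each later (25 + 3) us access
# hides under the 81.92 us burst out of the cache register.
read_total = CMD + READ + XFER + PAGES * BURST

# Write: bursts into the cache register come first; programming a page
# overlaps the next burst, so only the final transfer + program is exposed.
write_total = CMD + PAGES * BURST + XFER + PROG

print(round(read_total, 2), "us read,", round(write_total, 2), "us write")
```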
Some Performance Studies

[Figure: three controller organizations: (a) single channel; (b) dedicated channel for each bank; (c) multiple shared channels]

[Chart: read response time (0–4 msec) vs. organization, across configurations ranging from 1 x 8 x 25 up to 4 x 32 x 100]
I/O Access Optimization

• Access time increases with the level of banking on a single channel
• Increase the cache register size:
  – 8 KB write with a 2 KB register: 4 I/O accesses. Each burst is Cmd + Addr (5 cycles, 0.2 μs) + DIn (2048 cycles, 81.92 μs); after the 3 μs cache-to-data transfer, a page is programmed (200 μs) while data for the subsequent page is transferred
  – 8 KB write with an 8 KB register: a single I/O access. Cmd + Addr (5 cycles, 0.2 μs) + DI0–DI3 (8192 cycles, 327.68 μs), then the pages are programmed
• Reduce the number of I/O access requests
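A back-of-the-envelope comparison of the two register sizes, assuming a 0.04 μs bus cycle (so that 2048 cycles = 81.92 μs, matching the slide):

```python
CYCLE = 0.04       # us per I/O bus cycle (assumed: 2048 cycles = 81.92 us)
CMD_CYCLES = 5     # command/address cycles per bus access
PAGE_CYCLES = 2048 # cycles to transfer one 2 KB page

def bus_time(accesses, data_cycles_per_access):
    """Total bus occupancy for a given number of separate bus accesses."""
    return accesses * (CMD_CYCLES + data_cycles_per_access) * CYCLE

small_reg = bus_time(4, PAGE_CYCLES)       # 2 KB register: four separate accesses
large_reg = bus_time(1, 4 * PAGE_CYCLES)   # 8 KB register: one 8192-cycle burst
print(round(small_reg, 2), "us vs", round(large_reg, 2), "us")
```

The raw bus occupancy barely changes; the win the slide points at is cutting four separate bus requests (and the associated arbitration on a shared channel) down to one.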
I/O Access Optimization

• Implement different bus-access policies for reads and writes
• Writes do not need I/O access as frequently as reads: for a write 8 KB (4 pages), after each DIn burst (2048 cycles, 81.92 μs) the 200 μs page program proceeds off the bus while data for the subsequent page is transferred; for a read 8 KB (4 pages), each DOn burst depends on the next page access (25 μs array read + 3 μs register transfer)
• Reads: hold the I/O bus between data bursts
Memory Systems
Bruce Jacob, David Wang
University of Crete / UNIVERSITY OF MARYLAND