Why is sequential access still faster than random access on an SSD (as the author claims)? I understand why that holds for mechanical magnetic media, but I don't understand why it is still a thing for SSDs. I remember reading about LSM-trees a long time ago and thinking that, with the advent of cheap SSDs, that kind of optimization wouldn't be as relevant.
Is the advantage of sequential reading a product of the interface and protocol used to speak to the SSD device? Or is it actually a product of the physical media?
EDITED: (I had written "sequential slower than random"; I meant to write the opposite.)
Regardless of the physical medium (SSD vs. spinning disk), data is read in pages, much the same way data from main memory is read into the CPU caches in chunks (called cache lines). Reading from either source byte by byte is extremely slow. Intuitively, then, reading adjacent bytes on a disk will be faster: no matter how fast reading a page is, reading one page is always going to be faster than reading two. SSDs make random access faster compared to a spinning disk (lower latency/seek time), but they can't break the laws of physics and make two seeks cost the same as one.
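To put rough numbers on the page-granularity point above, here's a back-of-envelope sketch in Python. The 4 KiB page size and the record counts are assumptions for illustration, not measurements from any particular drive:

```python
import math

PAGE = 4096  # assumed flash page size in bytes

def pages_touched_sequential(total_bytes: int) -> int:
    """Contiguous data spans the minimum possible number of pages."""
    return math.ceil(total_bytes / PAGE)

def pages_touched_random(n_records: int) -> int:
    """Small scattered records each land on (at least) their own page."""
    return n_records

# 10,000 records of 100 bytes each:
records, size = 10_000, 100
seq = pages_touched_sequential(records * size)  # 245 pages
rnd = pages_touched_random(records)             # 10,000 pages
print(f"sequential: {seq} pages, random: {rnd} pages")
```

Same logical data, roughly 40x more pages transferred when it's scattered, which is the "reading one page is faster than reading two" point taken to its limit.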
Just to clarify, I think you meant to ask why random access is still slower than sequential access on an SSD.
Writes smaller than the SSD erase size are more expensive per byte because of write amplification. This is not nearly as big a factor as for spinning disks, but it does seem to matter even on enterprise flash, if you benchmark properly (truly random, with enough data that it cannot be buffered by clever controllers). And as my sibling comment mentioned, the write amplification wastes flash lifetime, too!
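A quick sketch of the arithmetic behind that write-amplification claim. The page and erase-block sizes here are assumed typical values, not from any specific SSD's datasheet:

```python
FLASH_PAGE = 4 * 1024      # assumed program (page) size in bytes
ERASE_BLOCK = 256 * 1024   # assumed erase-block size in bytes

def waf_program(user_bytes: int) -> float:
    """A sub-page write still programs a whole flash page."""
    return max(user_bytes, FLASH_PAGE) / user_bytes

def waf_gc_worst_case(user_bytes: int) -> float:
    """Worst case when garbage collection must relocate a whole erase block."""
    return max(user_bytes, ERASE_BLOCK) / user_bytes

print(waf_program(128))        # 32.0: 128 B from the host, 4 KiB to flash
print(waf_gc_worst_case(128))  # 2048.0: if GC ends up moving the whole block
```

So a stream of tiny random writes can cost the flash orders of magnitude more program/erase work (and lifetime) than the host-visible byte count suggests.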
On enterprise flash, I've found that random reads top out at the same bandwidth as sequential reads, but it takes a LOT of parallelism to get there, so sequential reading, if possible, is still nice (let the controller do the parallelism for you!)
There are two effects at play. If you're reading a multiple of a flash page at a time, then you can get random bandwidth to match sequential. Under the hood, decent SSDs actually round robin sequential reads across all of the underlying hardware, so they can run as fast as whatever their bottleneck resource is. The bottleneck could be error correction, the SATA link, or the channels (copper traces) between the NAND and the controller, for example. Random reads with enough parallelism give the SSD enough scheduling freedom to have the same throughput as sequential.
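A rough Little's-law sketch of what "enough parallelism" means here: the queue depth needed so random page reads can saturate the link. All the numbers (link bandwidth, per-read latency) are assumptions for illustration, not measurements:

```python
def queue_depth_needed(bandwidth_bytes_s: float,
                       read_latency_s: float,
                       page_bytes: int) -> float:
    # Little's law: throughput = in-flight requests / latency,
    # so queue depth = bandwidth * latency / bytes-per-request.
    return bandwidth_bytes_s * read_latency_s / page_bytes

# e.g. a 3 GB/s NVMe link and 80 microseconds per random 4 KiB read:
qd = queue_depth_needed(3e9, 80e-6, 4096)
print(round(qd))  # 59 concurrent reads just to keep the link busy
```

A single-threaded reader issuing one random read at a time would see a small fraction of that bandwidth, while a sequential stream lets the controller keep all its channels busy on its own.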
The second effect comes from the fact that application data is rarely exactly the same size as a flash page. If you group all the related data on one page, then you don't waste time transferring bits you won't look at. B-Trees sort of do this, but they leave holes in pages so they can support efficient inserts (a good SSD or database will compress the holes, but that doesn't help random reads much, since it's still going to transfer entire flash pages to the SSD controller).
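To illustrate the "holes in pages" point, here's a small sketch comparing pages transferred for densely packed data versus partly full B-tree leaves. The 70% fill factor and record size are assumptions, not measurements:

```python
import math

PAGE = 4096   # assumed page size in bytes
RECORD = 100  # assumed record size in bytes

def pages_for(records: int, fill_factor: float) -> int:
    """Pages transferred to read `records` records at a given fill factor."""
    records_per_page = int(PAGE * fill_factor) // RECORD
    return math.ceil(records / records_per_page)

packed = pages_for(10_000, 1.0)  # densely packed, e.g. an LSM sorted run
btree = pages_for(10_000, 0.7)   # leaves ~70% full after random inserts
print(packed, btree)             # 250 vs 358: ~43% more bytes over the wire
```

The holes make inserts cheap, but every read of a half-empty page transfers padding along with the data, and compression at rest doesn't change what the flash channels have to move per read.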
LSM-Trees push things a bit further, since new data ends up being grouped on higher levels of the tree. When I was benchmarking bLSM's predecessor, I worked out the expected throughput on the back of an envelope while I was waiting for results. It did much better than expected, since the benchmark heavily favored recently created data! YMMV, of course.
SSDs write at the page level. If you are flipping bits in a page, the drive has to read the page, update the bits you asked it to update, and then write that page somewhere else (for wear-leveling purposes).
The cost of a write that doesn't completely replace a page is therefore a per-page cost, not a per-byte cost, so very small writes get terrible performance. By extension, this also leads to write amplification from a wear perspective.
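A toy model of that read-modify-write cycle, just to make the per-page (not per-byte) cost concrete. The latencies are assumed ballpark figures for NAND, not from a datasheet:

```python
PAGE = 4096                   # assumed flash page size in bytes
READ_US, PROGRAM_US = 50, 500 # assumed per-page latencies in microseconds

def small_write_cost_us(nbytes: int) -> int:
    """Cost is per page: read the old page, merge the change, program a new one.
    Note that nbytes never appears in the result; that's the point."""
    assert 0 < nbytes <= PAGE
    return READ_US + PROGRAM_US

print(small_write_cost_us(4))     # 550 us to flip 4 bytes
print(small_write_cost_us(4096))  # 550 us for a full page: same cost
```

Flipping 4 bytes and rewriting the whole 4 KiB page cost the same, which is why batching small updates into page-sized (or larger) sequential writes pays off.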