In June we saw an update to the NVMe standard. The update defines a software interface to assist in reading and writing to drives in a way that matches how SSDs and NAND flash actually work.

Instead of emulating the traditional block device model that SSDs inherited from hard drives and earlier storage technologies, the new NVMe Zoned Namespaces optional feature allows SSDs to implement a different storage abstraction over flash memory. This is quite similar to the extensions SAS and SATA have added to accommodate Shingled Magnetic Recording (SMR) hard drives, with a few extras for SSDs. ‘Zoned’ SSDs with this new feature can offer better performance than regular SSDs, with less overprovisioning and less DRAM. The downside is that applications and operating systems have to be updated to support zoned storage, but that work is well underway.

The NVMe Zoned Namespaces (ZNS) specification has been ratified and published as a Technical Proposal. It builds on top of the current NVMe 1.4a specification, in preparation for NVMe 2.0. The upcoming NVMe 2.0 specification will incorporate all the approved Technical Proposals, but also reorganize that same functionality into multiple smaller component documents: a base specification, plus one for each command set (block, zoned, key-value, and potentially more in the future), and separate specifications for each transport protocol (PCIe, RDMA, TCP). The standardization of Zoned Namespaces clears the way for broader commercialization and adoption of this technology, which so far has been held back by vendor-specific zoned storage interfaces and very limited hardware choices.

Zoned Storage: An Overview

The fundamental challenge of using flash memory in a solid state drive is that our computers are built around the way hard drives work, and flash memory doesn't behave like a hard drive. Flash is organized very differently from a hard drive, but the performance it offers makes it worth the trouble of optimizing our computers for it.

Magnetic platters are a fairly analog storage medium, with no inherent structure to dictate features like sector sizes. The long-lived standard of 512-byte sectors was chosen merely for convenience, and enterprise drives now support 4kB sectors as we reach drive capacities in the multi-TB range. By contrast, a flash memory chip has several levels of structure baked into the design. The most important numbers are the page size and erase block size. Data can be read with page size granularity (typically on the order of several kB) and an empty page can be written to with a program operation, but erase operations clear an entire multi-MB block. The substantial size mismatch between read/program operations and erase operations is a complication that ordinary mechanical hard drives don't have to deal with. The limited program/erase cycle endurance of flash memory adds to the challenge, since reducing the number of writes extends the drive's lifespan.
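
To make that mismatch concrete, here is a minimal C sketch of an erase block. The page and block sizes are illustrative placeholders rather than the geometry of any particular NAND part:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative geometry only; real NAND parts vary widely. */
    #define PAGE_SIZE        (16 * 1024)   /* reads and programs happen per page */
    #define PAGES_PER_BLOCK  1024          /* erases happen per ~16MB block      */

    struct erase_block {
        uint8_t  data[PAGES_PER_BLOCK][PAGE_SIZE];
        bool     programmed[PAGES_PER_BLOCK]; /* a page can be programmed once per erase */
        unsigned erase_count;                 /* flash has limited P/E cycle endurance   */
    };

    /* Programming is only allowed on a page that is still in the erased state. */
    static bool program_page(struct erase_block *b, unsigned page, const uint8_t *src)
    {
        if (page >= PAGES_PER_BLOCK || b->programmed[page])
            return false;                  /* no in-place overwrite: erase the block first */
        memcpy(b->data[page], src, PAGE_SIZE);
        b->programmed[page] = true;
        return true;
    }

    /* Erasing wipes every page in the block and consumes one P/E cycle. */
    static void erase_whole_block(struct erase_block *b)
    {
        memset(b->programmed, 0, sizeof(b->programmed));
        b->erase_count++;
    }

    int main(void)
    {
        static struct erase_block blk;     /* static storage: the array is ~16MB */
        uint8_t page_buf[PAGE_SIZE] = {0};

        program_page(&blk, 0, page_buf);            /* fine: page 0 was erased        */
        if (!program_page(&blk, 0, page_buf))       /* updating 16kB in place fails   */
            printf("overwrite refused: erase the whole %d-page block first\n",
                   PAGES_PER_BLOCK);
        erase_whole_block(&blk);           /* clears all pages, costs endurance */
        return 0;
    }

An SSD's firmware spends much of its effort hiding exactly this restriction: updating a single page means writing the new data somewhere else and eventually garbage collecting the old block, which is where write amplification comes from.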

Almost all SSDs today are presented to software as an abstraction of a simple HDD-like block storage device with 512-byte or 4kB sectors. This hides all the complexities of SSDs that we’ve covered in detail over the years, such as page and erase block sizes, wear leveling and garbage collection. This abstraction is also part of why SSD controllers and firmware are so much bigger and more complicated (and more bug-prone) than hard drive controllers. For most purposes, the block device abstraction is still the right compromise, because it allows unmodified software to enjoy most of the performance benefits of flash memory, and the downsides like write amplification are manageable.
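
As a rough illustration of what that abstraction looks like from the host's side, the sketch below issues 4kB reads and writes at arbitrary logical block addresses and leaves the pages and erase blocks to the drive's FTL. The device path is only an example, and the write would clobber whatever is stored at that address, so treat it purely as a sketch:

    /* Illustration only: random 4kB accesses to a raw block device.
     * /dev/nvme0n1 is an example path; don't run this against real data. */
    #define _GNU_SOURCE                    /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define SECTOR 4096ULL

    int main(void)
    {
        int fd = open("/dev/nvme0n1", O_RDWR | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, SECTOR, SECTOR)) return 1; /* O_DIRECT needs alignment */
        memset(buf, 0, SECTOR);

        /* To the host these are just sector reads and writes at arbitrary LBAs;
         * the drive's FTL maps them onto flash pages and erase blocks and absorbs
         * any resulting write amplification behind the scenes. */
        pread(fd,  buf, SECTOR, 7 * SECTOR);
        pwrite(fd, buf, SECTOR, 123456 * SECTOR);

        free(buf);
        close(fd);
        return 0;
    }

Nothing in that interface tells the drive which writes belong together or how long the data will live; the FTL has to guess.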

For years, the storage industry has been exploring alternatives to the block storage abstraction. There have been several proposals for Open Channel SSDs, which expose many of the gory details of flash memory directly to the host system, moving many of the responsibilities of SSD firmware over to software running on the host CPU. The various open channel SSD standards that have been promoted have struck different balances along the spectrum between a typical SSD with a fully drive-managed flash translation layer (FTL) and a fully software-managed solution. The industry consensus was that some of the earliest standards, like the LightNVM 1.x specification, exposed too many details, requiring software to handle differences between different vendors' flash memory, or between SLC, MLC, TLC, etc. Newer standards have sought to find a better balance and a level of abstraction that will allow for easier mass adoption while still allowing software to bypass the inefficiencies of a typical SSD.

Tackling the problem from the other direction, the NVMe standard has been gaining features that allow drives to share more information with the host about optimal patterns for data access and layout. For the most part, these are hints and optional features that software can take advantage of, while software that isn't aware of them continues to function as usual. Directives and Streams, NVM Sets, Predictable Latency Mode, and various alignment and granularity hints have all been added over the past few revisions of the NVMe specification to make it possible for software and SSDs to better cooperate.
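
One example that is already reachable from ordinary applications is the per-file write-lifetime hint interface Linux has exposed since kernel 4.13 via fcntl(); on kernels and drives that support it, these hints can be passed down as NVMe stream identifiers, and otherwise they are simply ignored. In the sketch below, the file name is a placeholder and the fallback constants are only there in case the system headers predate the feature:

    /* Minimal sketch: advisory write-lifetime hints on Linux. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #ifndef F_SET_RW_HINT               /* fall back to the raw values if the
                                           libc headers predate the feature  */
    #define F_SET_RW_HINT 1036
    #define RWH_WRITE_LIFE_SHORT 2
    #endif

    int main(void)
    {
        int fd = open("journal.log", O_WRONLY | O_CREAT, 0644);  /* example file */
        if (fd < 0) { perror("open"); return 1; }

        uint64_t hint = RWH_WRITE_LIFE_SHORT;  /* "this data will be overwritten soon" */
        if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
            perror("fcntl(F_SET_RW_HINT)");    /* purely advisory: failure is harmless */

        /* Subsequent writes carry the hint; an SSD with Streams support can
         * group short-lived data into its own erase blocks. */
        write(fd, "hello\n", 6);
        close(fd);
        return 0;
    }

The hint being purely advisory is what makes this class of features easy to deploy: drives and kernels that don't understand it just ignore it.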

Lately, a third approach has been gaining momentum, influenced by the hard drive market. Shingled Magnetic Recording (SMR) is a technique for increasing storage density by partially overlapping tracks on hard drive platters. The downside of this approach is that it's no longer possible to directly modify arbitrary bytes of data without corrupting adjacent overlapping tracks, so SMR hard drives group tracks into zones and only allow sequential writes within a zone. This has severe performance implications for workloads that include random writes, which is part of why drive-managed SMR hard drives have seen a mixed reception at best in the marketplace. However, in the server storage market, host-managed SMR is also a viable option: it requires the OS, filesystem and potentially the application software to be directly aware of zones, but making the necessary software changes is not an insurmountable challenge when working with a controlled environment.

The zoned storage model used for SMR hard drives turns out to also be a good fit for use with flash, and is a precursor to NVMe Zoned Namespaces. The zone structure of SMR hard drives mirrors the page and erase block structure of an SSD. The restrictions on writes aren't an exact match, but they come close enough.
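
A minimal sketch of that zoned model, with illustrative sizes, looks something like this: each zone tracks a write pointer, accepts writes only at that pointer, and can only be reused after a whole-zone reset, which is the direct analogue of erasing a flash block.

    #include <stdbool.h>
    #include <stdint.h>

    #define ZONE_SIZE (256u * 1024 * 1024)  /* illustrative zone size */
    #define LBA_SIZE  4096u

    struct zone {
        uint64_t start_lba;      /* first LBA of the zone        */
        uint64_t write_pointer;  /* next LBA that may be written */
    };

    /* A write is accepted only if it lands exactly on the write pointer. */
    static bool zone_write(struct zone *z, uint64_t lba, uint32_t nblocks)
    {
        uint64_t end = z->start_lba + ZONE_SIZE / LBA_SIZE;
        if (lba != z->write_pointer || z->write_pointer + nblocks > end)
            return false;        /* random writes are rejected, as with SMR zones */
        z->write_pointer += nblocks;
        return true;
    }

    /* Resetting a zone makes the whole zone writable again from the start. */
    static void zone_reset(struct zone *z)
    {
        z->write_pointer = z->start_lba;
    }

    int main(void)
    {
        struct zone z = { .start_lba = 0, .write_pointer = 0 };
        zone_write(&z, 0, 8);    /* ok: sequential from the write pointer */
        zone_write(&z, 100, 8);  /* rejected: not at the write pointer    */
        zone_reset(&z);          /* whole-zone reset, like a block erase  */
        return 0;
    }

That write-pointer rule is what allows a zoned SSD to get by with far less mapping-table DRAM and overprovisioning than a conventional drive, as noted above.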

In this article, we’ll cover what NVMe Zoned Namespaces are, and why they matter.

45 Comments

  • Carmen00 - Friday, August 7, 2020 - link

    Fantastic article, both in-depth and accessible, a great primer for what's coming up on the horizon. This is what excellence in tech journalism looks like!
  • Steven Wells - Saturday, August 8, 2020 - link

    Agree with @Carmen00. Super well written. Fingers crossed that one of these “Not a rotating rust emulator” architectures can get airborne. As long as the flash memory chip designers are unconstrained to do great things to reduce cost generation to generation, with the SSD maintaining the fixed abstraction, I’m all for this.
  • Javier Gonzalez - Friday, August 7, 2020 - link

    Great article Billy. A couple of pointers to other parts of the ecosystem that are being upstreamed at the moment are:

    - QEMU support for ZNS emulation (several patches posted in the mailing list)
    - Extensions to fio: currently posted, waiting for append support in the kernel to stabilize
    - nvme-cli: Several patches for ZNS management are already merged

    Also, a comment to xZTL is that it is intended to be used on several LSM-based databases. We ported RocksDB as a first step, but other DBs are being ported on top. xZTL gives the necessary abstractions for the DB backend to be pretty thin - you can see the RocksDB HDFS backend as an example.

    Again, great article!
  • Billy Tallis - Friday, August 7, 2020 - link

    Thanks for the feedback, and for your presentations that were a valuable source for this article!
  • Javier Gonzalez - Friday, August 7, 2020 - link

    Happy to hear that it helped.

    Feel free to reach out if you have questions on a follow-up article :)
  • jabber - Friday, August 7, 2020 - link

    And for all that, it will still slow to Kbps and take two hours when copying a 2GB folder full of KB-sized microfiles.

    We now need better, more efficient file systems, not hardware.
  • AntonErtl - Friday, August 7, 2020 - link

    Thank you for this very interesting article.

    It seems to me that ZNS strikes the right abstraction balance:

    It leaves wear leveling to the device, which probably does know more about wear and device characteristics, and the interface makes the job of wear leveling more straightforward than the classic block interface.

    A key-value interface would cover a significant part of what a file system does, and it seems to me that after all these years, there is still enough going on in this area that we do not want to bake it into drive firmware.
  • Spunjji - Friday, August 7, 2020 - link

    Everything up to the "Supporting Multiple Writers" section seemed pretty universally positive... and then it all got a bit hazy for me. Kinda seems like they introduced a whole new problem, there?

    I guess if this isn't meant to go much further than enterprise hardware then it likely won't be much of an issue, but still, that's a pretty keen limitation.
  • Spunjji - Friday, August 7, 2020 - link

    Great article, by the way. Realised I didn't mention that, but I really appreciate the perspective that's in-depth but not too-in-depth for the average tech-head 😁
  • AntonErtl - Saturday, August 8, 2020 - link

    As long as the zone is not divided between file systems, or direct-access databases, it is natural that writes are synchronized and sequenced. And talking to the SSD through one NVMe/PCIe interface means that all writes (even to multiple zones) are sent to the drive in sequence.

    OTOH, you have software and hardware with synchronous interfaces (waits for some feedback before sending the next request), and in such a setting doing everything through one thread costs throughput.

    So you can either design everything to work with asynchronous interfaces (e.g., SCSI tagged command queuing), at least at all single-thread levels, or you design synchronous interfaces that work with multiple threads. The "write it somewhere, and then tell where you wrote" approach seems to be along the latter lines. What's the status of asynchronous interfaces for NVMe?
