Community FAQ

Compilation of Q&A with the engineers.

NVMe: Different from traditional storage protocols
The role of NVMe-oF as the next storage network
Aligning NVMe-oF to your workloads

NVMe: Different from traditional storage protocols

Even if it is the same flash (and often it is not, due to something called flash binning, where better-performing flash can be sold for higher profits), the components around the flash and the interface used have a massive impact on performance. NVMe is a much simpler, lower-latency protocol than either SATA or SAS, and it runs directly over PCIe. NVMe flash controllers are often significantly more powerful than SATA or SAS controllers, and they can communicate with many flash chips in parallel, allowing for higher throughput and lower latency.
The Linux kernel gained NVMe support in 2012, so it is safe to say that any major Linux distribution from the last five years supports it out of the box. Red Hat Enterprise Linux (RHEL) 7.x and 8.x, CentOS 7.x and 8.x, and Ubuntu LTS releases from 16.04 and 18.04 onward all support it.
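As a quick sanity check, here is a minimal sketch that looks for NVMe controllers on a Linux host. It assumes the in-kernel nvme driver, which exposes one entry per controller under /sys/class/nvme:

```python
# List NVMe controllers by inspecting sysfs (Linux, in-kernel nvme driver).
from pathlib import Path

def list_nvme_controllers(sysfs_root: str = "/sys/class/nvme"):
    root = Path(sysfs_root)
    if not root.exists():
        return []  # driver not loaded or no NVMe hardware present
    controllers = []
    for ctrl in sorted(root.iterdir()):
        model_file = ctrl / "model"
        model = model_file.read_text().strip() if model_file.exists() else "unknown"
        controllers.append((ctrl.name, model))
    return controllers

if __name__ == "__main__":
    ctrls = list_nvme_controllers()
    if not ctrls:
        print("No NVMe controllers found (or the nvme driver is not loaded).")
    for name, model in ctrls:
        print(f"{name}: {model}")
```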

VMware ESXi 6.5 and later also support NVMe. For NVMe over Fabrics, VMware ESXi 7.0 is required (expected to launch in early 2020).

Microsoft Windows Server 2016 and later, as well as Windows 10 on the client, also support NVMe.
Not really. While technically it is possible to have an external PCIe chassis with special controller cards in both the server and JBOD, it is uneconomical and impractical (due to signaling limitations on the PCIe bus). Other ways of attaching NVMe over distances, like NVMe over Fabrics, are much more practical and commonplace.
The root cause is that in a cloud application, the data needs to be highly available. With a DAS configuration, though, data in any one server can’t be accessed by another server. In a traditional enterprise application, this would be worked around by using a central SAN, but cloud applications rarely have one available. Instead, an application keeps copies of one or more other servers’ data. Should the remote server go down, this copy can be used to spin up a replacement instance of the application on a surviving server. Often two additional copies of the data are made to allow for greater than 99% availability, so each server in the cluster has one-third of its storage filled by its own (unique) data and the remaining two-thirds filled with copies of other servers’ data.
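To make that arithmetic concrete, here is a minimal sketch of the per-server layout, assuming an illustrative 12TB of flash per server and the three-way layout (one unique copy plus two replicas) described above:

```python
# Illustrative arithmetic for 3-way replicated DAS (1 primary + 2 copies).
# The per-server capacity is a made-up number for the example.
per_server_flash_tb = 12          # raw flash installed in each server
replication_factor = 3            # each piece of data exists on 3 servers

unique_data_per_server = per_server_flash_tb / replication_factor
replica_data_per_server = per_server_flash_tb - unique_data_per_server

print(f"Unique data per server:  {unique_data_per_server:.1f} TB (one-third of the flash)")
print(f"Replica data per server: {replica_data_per_server:.1f} TB (two-thirds of the flash)")
```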
Not at all. Most enterprise-class applications, such as scale-up databases, use disaggregated storage in the form of a SAN or NAS.
It’s not really like SAS, with a fixed connection to a set of two servers. It’s more like iSCSI multipathing over Ethernet, or old-school Fibre Channel SAN multipathing. Normally two Ethernet ports are connected to two separate switches, each of which in turn has wired connections to the NVMe-oF array. Through the magic of Ethernet multipathing (standard in Windows, Linux, and ESXi), if a link failure is detected, the other link takes over the IP address of the failed connection and everything migrates seamlessly. An application may experience a wait of 1-2 seconds while this transition happens, but I/Os are not failed and the application generally doesn’t even know the link has failed over.
Actually, it may need significantly less (up to two-thirds less). In a DAS scale-out DB like MongoDB or Apache Cassandra, each server keeps a couple of replicas of other servers’ data to preserve availability in the case of server failure (where the failed server’s DAS is obviously not accessible). With NVMe-oF, these replicas are no longer necessary, and a new server instance can immediately be mapped to the volume holding the failed server’s data. So no replication is needed for high availability, which means big savings in flash costs.
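A rough sketch of the resulting flash savings, assuming an illustrative 100TB unique dataset and ignoring any RAID overhead inside the array:

```python
# Compare total flash needed for the same unique dataset:
# DAS with 3-way replication vs. disaggregated NVMe-oF where the array
# itself provides availability. The dataset size is an assumption.
unique_dataset_tb = 100

das_flash_needed = unique_dataset_tb * 3       # primary + 2 replicas
nvmeof_flash_needed = unique_dataset_tb * 1    # replicas no longer required

savings = 1 - nvmeof_flash_needed / das_flash_needed
print(f"DAS (3-way replicated): {das_flash_needed} TB of flash")
print(f"NVMe-oF (no replicas):  {nvmeof_flash_needed} TB of flash")
print(f"Flash saved:            {savings:.0%}")  # roughly two-thirds
```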
It’s absolutely possible, and often the norm in organizations rolling out 100G-backbone Ethernet. With a spine-and-leaf topology, RDMA traffic can move between racks almost as easily as it moves within a rack, with minimal overhead. Alternatively, NVMe-oF has a pure TCP-based mode that routes exactly the same as any other TCP/IP connection over existing backbones, even if they are not RDMA-enabled.
While NVMe SSDs are present in most major cloud vendors’ portfolios, NVMe-oF is a little harder to find. You really need to ask your vendor what they can provide.

The role of NVMe-oF as the next storage network

There are a lot of choices, but in reality it normally boils down to either RDMA over Converged Ethernet (RoCE) or InfiniBand. Fibre Channel and iWARP may be supported by a small subset of vendors trying to repurpose legacy hardware for NVMe-oF, but there just isn’t much market traction for them. In HPC and academic environments where InfiniBand is already deployed, it’s a no-brainer to use NVMe-oF over InfiniBand (since RDMA has always been a major part of InfiniBand’s utility). Otherwise, 100G-and-above Ethernet with NVMe-oF over RoCE is what you’ll find today.
SAS expanders are fine for things like hard drive arrays, where individual hard drives can’t come anywhere near saturating the available bandwidth. But with NVMe, an individual SSD can push over 3GB/s to a server. With a single active drive, a PCIe expander will be fine, but the whole purpose of an expander is to connect many, many drives. If you have 24 SSDs (easily accommodated in a 2U box), that’s over 70GB/s. A single 16-lane PCIe 4.0 (bleeding edge) slot can only theoretically transfer 64GB/s, and the more common 16-lane PCIe 3.0 slot can only theoretically hit 32GB/s. So it’s obvious that PCIe expanders become a serious bottleneck with NVMe SSDs.
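The same back-of-the-envelope math as a quick calculation, using the round numbers above (3GB/s per SSD, 64GB/s and 32GB/s theoretical slot limits):

```python
# Aggregate NVMe SSD bandwidth vs. the theoretical limit of a single x16 slot,
# using the round numbers from the answer above.
ssd_count = 24
per_ssd_gbps = 3.0                  # GB/s a single NVMe SSD can sustain

aggregate = ssd_count * per_ssd_gbps           # ~72 GB/s
pcie4_x16_limit = 64.0                         # GB/s, theoretical
pcie3_x16_limit = 32.0                         # GB/s, theoretical

print(f"24 SSDs can supply ~{aggregate:.0f} GB/s")
print(f"PCIe 4.0 x16 tops out at {pcie4_x16_limit:.0f} GB/s "
      f"({aggregate / pcie4_x16_limit:.1f}x oversubscribed)")
print(f"PCIe 3.0 x16 tops out at {pcie3_x16_limit:.0f} GB/s "
      f"({aggregate / pcie3_x16_limit:.1f}x oversubscribed)")
```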
It depends on the I/O profile: for throughput-limited applications, NVMe over TCP can provide a good portion of the RDMA bandwidth; but for IOPS-limited applications, the overhead and extra latency of TCP can severely hamper performance. What’s more, at 100Gbit, every switch and NIC I’ve seen has full support for RDMA baked in at no additional cost.
It relies on a custom PCIe fabric that allows multiple CPUs to work on the rebuild in parallel. Many CPUs are needed because rebuilding a RAID-6 set requires recalculating parity by reading out the entire RAID set’s contents. A single CPU can quickly become a bottleneck, but if portions of the SSDs are assigned to different CPUs, the work can be done in an embarrassingly parallel manner.
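A simplified sketch of why the rebuild parallelizes so cleanly; it uses single XOR parity (RAID-5-style) instead of the full RAID-6 P+Q math, and the stripe sizes and worker counts are illustrative:

```python
# Parallel rebuild sketch: each worker process rebuilds an independent
# range of stripes. Real RAID-6 uses P and Q (Reed-Solomon) syndromes;
# single XOR parity is enough to show the structure of the work.
from functools import reduce
from multiprocessing import Pool

STRIP_SIZE = 4096  # bytes per strip, illustrative

def rebuild_stripe(surviving_strips):
    """Recreate the missing strip by XOR-ing the surviving strips."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                  surviving_strips)

def rebuild_range(stripes):
    """Rebuild a contiguous batch of stripes; runs in its own process."""
    return [rebuild_stripe(strips) for strips in stripes]

if __name__ == "__main__":
    # Fake contents: 8 stripes, each with 3 surviving 4 KiB strips.
    stripes = [[bytes([i + d]) * STRIP_SIZE for d in range(3)]
               for i in range(8)]
    # Assign disjoint stripe ranges to 4 worker processes.
    chunks = [stripes[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        rebuilt = pool.map(rebuild_range, chunks)
    print(f"Rebuilt {sum(len(r) for r in rebuilt)} stripes in parallel")
```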
It depends, but not in most cases. WAN replication is often done using proprietary TCP- or even UDP-based protocols that better handle the longer latencies you see on WAN links.

There are a couple of things in play here. First, if an application really needed 500GB of space today, you’d overprovision 25-30% just to be safe when it (inevitably) grows. Thin provisioning lets you hide that unused overprovisioning until it is actually needed.

Second, while a single node may need 500GB of space, in many cases if you have a cluster of dozens of servers in a scale-out application, not every server will have the exact same data needs (and you generally can’t predict which nodes will fill up first). In this case, thin provisioning lets you exploit the differences between nodes and only allocate flash for the nodes that really need it, on the fly.

Finally, there is the issue of how storage needs vary over time. Most applications start with smaller datasets, and the “data requirements” are guesstimates about the final state (months or years in the future) which may or may not come to pass. Thin provisioning lets you give applications as much space as they think they’ll need, but doesn’t actually dedicate the flash expense until it is really needed.
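A toy model of the idea, with made-up names and capacities: volumes advertise their full provisioned size, but physical flash is only consumed as data is actually written:

```python
# Minimal thin-provisioning model: provisioning costs nothing up front;
# flash is allocated from the shared pool only on write.
class ThinPool:
    def __init__(self, physical_capacity_gb):
        self.physical_capacity_gb = physical_capacity_gb
        self.allocated_gb = 0
        self.volumes = {}          # name -> [provisioned_gb, written_gb]

    def create_volume(self, name, provisioned_gb):
        self.volumes[name] = [provisioned_gb, 0]   # no flash consumed yet

    def write(self, name, gb):
        provisioned, written = self.volumes[name]
        gb = min(gb, provisioned - written)        # can't exceed the volume size
        if self.allocated_gb + gb > self.physical_capacity_gb:
            raise RuntimeError("pool exhausted -- time to add flash")
        self.volumes[name][1] += gb
        self.allocated_gb += gb

pool = ThinPool(physical_capacity_gb=1000)
for node in range(10):
    pool.create_volume(f"node-{node}", provisioned_gb=500)   # 5 TB promised
pool.write("node-0", 120)
pool.write("node-1", 40)
print(f"Provisioned: {sum(v[0] for v in pool.volumes.values())} GB, "
      f"physically allocated: {pool.allocated_gb} GB")
```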

Aligning NVMe-oF to your workloads

AI obeys the GIGO principle: Garbage In, Garbage Out. A significant part of the AI data pipeline simply involves processing massive amounts of data on standard servers: collecting raw data from masses of sensors (with something like Apache Spark), translating that raw input data into the proper format (ETL-like), cleaning the properly formatted data to ensure invalid inputs are filtered out before they’re presented to the training array, and so on.

Another trend we’re seeing is that while early AI training was GPU-based and really didn’t have massive input throughput requirements, enterprises are moving to dedicated hardware AI accelerators that are orders of magnitude faster (and so need much more data, faster, or they will sit idle). Couple that with the increase in the average size of individual training vectors, and the throughput needs only increase.
Kubernetes doesn’t natively understand NVMe-oF, but it does have a standardized API, the Container Storage Interface (CSI), that lets it connect to any persistent storage provider (assuming the array vendor provides a compatible plug-in). Using that, Kubernetes can automatically provision persistent storage for containers on the fly using NVMe-oF, and migrate that storage around as containers move.
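A minimal sketch of what that looks like in practice, using the official Kubernetes Python client; the "nvmeof-fast" StorageClass name is hypothetical and assumes the vendor's CSI plug-in and StorageClass are already installed:

```python
# Request NVMe-oF-backed persistent storage from Kubernetes via a
# PersistentVolumeClaim. The CSI driver behind the (hypothetical)
# "nvmeof-fast" StorageClass carves the volume out of the array.
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() inside a pod

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="db-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="nvmeof-fast",           # hypothetical StorageClass
        resources=client.V1ResourceRequirements(
            requests={"storage": "500Gi"},
        ),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc,
)
# Pods that mount the claim "db-data" get a volume provisioned from the
# NVMe-oF array, and it follows the pod if it is rescheduled elsewhere.
```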
Snapshots and backups are very different things. If the data center gets destroyed in a natural disaster, having a snapshot of your database in it isn’t going to allow you to bring the database up at another site. However, snapshots can save on backup expenses because they let you back up data from a point in time without having to stop your application. Using NVMe-oF for the snapshot infrastructure can also increase the speed at which you can get data out of the array, minimizing the backup window.
Normally you do not need SEDs if you want to use encryption, but you should check with your vendor just to make sure. By using standard drives and doing software-level encryption of the data on them, an array can save the end user money (since SEDs are normally 20-40% more expensive than non-encrypting drives). It also allows the use of an external key-management server infrastructure, which can be critical for certain industries.
Good catch. CRCs don’t contain enough information to recover data, but they can confirm that no bits have flipped in a block. When bit corruption is detected, the array can treat it as a disk error and regenerate the block using the RAID-5/6 parity calculations. Once regenerated in memory, the block can be rewritten to the SSD. Any array should, of course, log the fact that bit corruption was detected, as this may indicate other potential issues with a specific SSD.
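A simplified sketch of that detect-and-repair path; it uses a CRC-32 check and single XOR parity (RAID-5-style) rather than any particular array's implementation:

```python
# Detect a corrupted block with its stored CRC, then regenerate it from
# the surviving members of the RAID stripe (single XOR parity for brevity).
import zlib
from functools import reduce

def xor_blocks(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def read_block(block, stored_crc, peer_blocks):
    """Return the block, repairing it from its peers if the CRC mismatches."""
    if zlib.crc32(block) == stored_crc:
        return block                       # no bit flips detected
    # Treat the mismatch like a disk error: rebuild from the rest of the
    # stripe, then the repaired block can be rewritten to the SSD.
    repaired = xor_blocks(peer_blocks)
    print("CRC mismatch: block regenerated from parity; logging SSD for review")
    return repaired

# Tiny demo: a 3+1 stripe where one data block is silently corrupted.
data = [bytes([i]) * 16 for i in range(3)]
parity = xor_blocks(data)
crc = zlib.crc32(data[0])
corrupted = b"\xff" + data[0][1:]          # simulate a flipped byte
recovered = read_block(corrupted, crc, [data[1], data[2], parity])
assert recovered == data[0]
```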