NVMe is the SSD.

NVMe-oF is the storage network.

101: NVMe Explained

An overview of today’s hottest interface for flash storage, NVMe. This class gives background on the SATA III, SAS, and NVMe interfaces and how the NVMe interface evolved. It also shows why applications that require high performance and continuous operation should use the NVMe interface instead of the traditional SATA III and SAS interfaces.

Learning Objectives

  • Flash interface evolution
  • What is the NVMe interface
  • Why NVMe is ideal for flash

Even if it is the same flash (and often it is not, due to something called flash binning, where better-performing flash can be sold at higher margins), the components around the flash and the interface used matter enormously. NVMe is a much simpler, lower-latency protocol than either SATA or SAS, built on PCIe. NVMe flash controllers are often significantly more powerful than SATA or SAS ones, and they can communicate with many flash chips in parallel, allowing higher throughput and lower latency.
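
One concrete reason NVMe parallelizes so well is command queuing: AHCI/SATA exposes a single queue of 32 commands, while the NVMe spec allows up to roughly 64K queues of 64K commands each. A minimal sketch of that arithmetic (spec maximums, not what any one drive ships with):

```python
# Back-of-the-envelope: outstanding commands each interface can queue.
# Figures are commonly cited spec maximums, not per-drive guarantees.
ahci_sata = {"queues": 1, "depth": 32}                 # AHCI/SATA: one queue, NCQ depth 32
nvme = {"queues": 64 * 1024, "depth": 64 * 1024}       # NVMe: up to ~64K queues x ~64K commands

for name, iface in [("SATA (AHCI)", ahci_sata), ("NVMe", nvme)]:
    print(f"{name}: {iface['queues'] * iface['depth']:,} outstanding commands max")
# SATA (AHCI): 32
# NVMe: 4,294,967,296
```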

The Linux kernel gained NVMe support in 2012, so it is safe to say that any major Linux distribution from the last five years supports it out of the box. Red Hat Enterprise Linux (RHEL) 7.x and 8.x, CentOS 7.x and 8.x, and Ubuntu LTS releases from 16.04 onward all support it.
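
If you want to verify that support on a given box, here is a minimal sketch that reads the kernel’s sysfs entries directly (the nvme-cli package’s `nvme list` reports the same information):

```python
import glob
import os

# List the NVMe controllers the running Linux kernel has enumerated via sysfs.
def read_attr(ctrl: str, attr: str) -> str:
    try:
        with open(os.path.join(ctrl, attr)) as f:
            return f.read().strip()
    except OSError:
        return "?"

for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
    print(os.path.basename(ctrl), read_attr(ctrl, "model"), read_attr(ctrl, "firmware_rev"))
```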

VMware ESXi 6.5 and later also support NVMe. For NVMe over Fabrics, VMware ESXi 7.0 (expected to launch in early 2020) is required.

Microsoft Windows Server 2016 and later, as well as Windows 10 on the client, also support NVMe.

Not really. While it is technically possible to have an external PCIe chassis with special controller cards in both the server and the JBOD, it is uneconomical and impractical (due to signaling limitations on the PCIe bus). Other ways of attaching NVMe over distance, like NVMe over Fabrics, are far more practical and commonplace.

102: Why Disaggregation

Industry experts predict that to support the growth of web-scale applications, compute and storage must be disaggregated and accessed with low-latency protocols like NVMe-oF instead of direct-attached storage. This class describes the benefits of disaggregating storage from compute for today’s web-scale applications.

Learning Objectives

  • Define CDI (composable disaggregated infrastructure)
  • Common enterprise features needed
  • Disaggregated storage vs SAN & Cloud

The root cause is that in a cloud application, the data needs to be highly available. With a DAS configuration, though, data in any one server can’t be accessed by another server. A traditional enterprise application would work around this with a central SAN, but cloud applications rarely have one available. Instead, each application keeps copies of one or more other servers’ data. Should a remote server go down, this copy can be used to spin up a replacement instance of the application on a surviving server. Often two additional copies of the data are kept to achieve greater than 99% availability, so each server in the cluster has one-third of its storage filled with its own (unique) data and the remaining two-thirds filled with copies of other servers’ data.
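
A quick worked version of that arithmetic, assuming a hypothetical 6TB of flash per server:

```python
# Worked example of the split described above: replication factor 3 means each
# server holds its own shard plus replicas of two other servers' shards.
replication_factor = 3
server_flash_tb = 6.0                    # hypothetical flash per server
unique = server_flash_tb / replication_factor
print(f"unique data per server: {unique:.1f} TB (1/3 of its flash)")
print(f"replicas held:          {server_flash_tb - unique:.1f} TB (2/3 of its flash)")
```
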
Not at all. Most enterprise-class applications, such as scale-up databases, already use disaggregated storage in the form of a SAN or NAS.

It’s not really like SAS, with a fixed connection to a set of two servers. It’s more like iSCSI multipathing over Ethernet, or old-school Fibre Channel SAN multipathing. Normally two Ethernet ports are connected to two separate switches, each of which in turn has wired connections to the NVMe-oF array. Through the magic of Ethernet (standard in Windows, Linux, and ESXi), if a link failure is detected, the other link takes over the IP address of the failed connection and everything migrates seamlessly. An application may see a pause of 1-2 seconds while this transition happens, but I/Os are not failed and the application generally doesn’t even know the link has failed over.
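
Conceptually, the failover logic looks like the sketch below; in practice the OS multipath layer (for example, native NVMe multipathing in Linux) does this for you, and the addresses here are hypothetical:

```python
import socket

# Conceptual sketch only: two hypothetical array ports, reached via two switches.
PATHS = [("192.0.2.10", 4420), ("192.0.2.11", 4420)]  # 4420 = standard NVMe-oF port

def connect_any(paths, timeout=2.0):
    """Return a connection to the first path that answers; I/O is retried, not failed."""
    for addr in paths:
        try:
            return socket.create_connection(addr, timeout=timeout)
        except OSError:
            continue  # link or switch down: fail over to the next path
    raise ConnectionError("all paths down")
```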

103: DAS, Cloud, NVMe-oF

Industry experts predict that to support the growth of web-scale applications, compute and storage must be disaggregated and accessed with low-latency protocols like NVMe-oF instead of direct-attached storage. Attend this short class to learn more about disaggregating storage from compute.

Learning Objectives

  • What DAS and cloud storage are and how they are used
  • Benefits of using NVMe-based disaggregated storage
  • Which storage to use for web-scale apps

Actually, it may need significantly less (up to two-thirds less). In a DAS scale-out database like MongoDB or Apache Cassandra, each server keeps replicas of other servers’ data to preserve availability in the case of server failure (where the failed server’s DAS is obviously not accessible). With NVMe-oF, these replicas are no longer necessary: a new server instance can be mapped immediately to the volume holding the failed server’s data. So, no replication needed for high availability, and big savings in flash costs.
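
The savings claim is easy to sanity-check; the dataset size below is hypothetical:

```python
# Sanity-checking the "up to 2/3 less" claim.
dataset_tb, replication_factor = 100, 3
das_flash = dataset_tb * replication_factor     # DAS: every byte stored 3 times
nvmeof_flash = dataset_tb                       # NVMe-oF: one copy, remapped on failure
print(f"DAS flash needed:     {das_flash} TB")
print(f"NVMe-oF flash needed: {nvmeof_flash} TB ({1 - nvmeof_flash / das_flash:.0%} less)")
```
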
It’s absolutely possible, and often the norm in organizations rolling out 100G-backbone Ethernet. With a spine-and-leaf design, RDMA traffic can move between racks easily and with minimal overhead. Alternatively, NVMe-oF has a pure TCP-based mode that routes exactly like any other TCP/IP connection over existing backbones, even if they are not RDMA-enabled.

While NVMe SSDs are present in most major cloud vendors’ portfolios, NVMe-oF is a little harder to find. You really need to ask your vendor what they can provide.

104: NVMe-oF and AFAs

This short class introduces NVMe over Fabrics (NVMe-oF), why it’s important, and its major use cases. We’ll also explore the important features NVMe-oF brings to all-flash arrays (AFAs) and why AFAs need them to support these use cases.

Learning Objectives

  • NVMe-oF compared to NVMe
  • How many AFAs compromise on NVMe SSD performance
  • Using NVMe-oF/RDMA vs NVMe-oF/TCP

It’s a lot of choices, but in reality it normally boils down to either RDMA over Converged Ethernet (RoCE) or InfiniBand. Fibre Channel and iWARP may be supported by a small subset of vendors trying to repurpose legacy hardware for NVMe-oF, but there just isn’t much market traction for them. In HPC and academic environments where InfiniBand is already deployed, it’s a no-brainer to use NVMe-oF over InfiniBand (RDMA has always been a major part of InfiniBand’s utility). Otherwise, 100G-and-above Ethernet with NVMe-oF over RoCE is what you’ll find today.

SAS expanders are fine for things like hard drive arrays, where individual hard drives can’t come anywhere near saturating the shared bandwidth. But with NVMe, a single SSD can pump over 3GB/s to a server. With one active drive, a PCIe expander will be fine too, but the whole purpose of an expander is to connect many, many drives. With 24 SSDs (easily accommodated in a 2U box), that’s over 70GB/s. A single 16-lane PCIe 4.0 (bleeding edge) slot can only theoretically transfer 64GB/s, and the more common 16-lane PCIe 3.0 slot can only theoretically hit 32GB/s. So it’s obvious that PCIe expanders are a horrible bottleneck with NVMe SSDs.
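
Here is that arithmetic spelled out, using round spec-sheet numbers rather than measurements:

```python
# Aggregate SSD bandwidth vs. what a single host slot can theoretically carry.
drives, gbs_per_drive = 24, 3.0
aggregate = drives * gbs_per_drive    # GB/s the SSDs can source together
pcie3_x16 = 16 * 1.0                  # ~1 GB/s per PCIe 3.0 lane, per direction
pcie4_x16 = 16 * 2.0                  # ~2 GB/s per PCIe 4.0 lane, per direction
print(f"24 SSDs can source ~{aggregate:.0f} GB/s")
print(f"PCIe 3.0 x16 tops out near {2 * pcie3_x16:.0f} GB/s (both directions combined)")
print(f"PCIe 4.0 x16 tops out near {2 * pcie4_x16:.0f} GB/s (both directions combined)")
```
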
It depends on the I/O profile. For throughput-limited applications, NVMe over TCP can deliver a good portion of the RDMA bandwidth; but for IOPS-limited applications, the overhead and extra latency of TCP can severely hamper performance. What’s more, at 100Gbit, every switch and NIC I’ve seen has full support for RDMA baked in at no additional cost.
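
As a purely illustrative sketch of the latency side (the 20 microseconds of TCP overhead is a hypothetical round number, not a benchmark):

```python
# Illustrative only: extra transport latency caps IOPS per outstanding I/O.
base_us, tcp_overhead_us = 100.0, 20.0    # hypothetical round numbers
for name, lat_us in [("RDMA", base_us), ("TCP", base_us + tcp_overhead_us)]:
    iops = 1_000_000 / lat_us             # queue depth 1: one I/O per round trip
    print(f"{name}: {lat_us:.0f} us/IO -> {iops:,.0f} IOPS per outstanding I/O")
```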

105: NVMe-oF Arrays

This short class builds on Storage Academy 104 and covers additional features an NVMe-oF-based all-flash array must provide.

Learning Objectives

  • Enterprise storage management features used by an NVMe SSD array and how they benefit applications
    • RAID
    • Snapshot
    • Replication
    • Thin Provisioning

It depends on a custom PCIe fabric that allows multiple CPUs to work on the rebuild in parallel. Many CPUs are needed because rebuilding a RAID-6 set means recalculating parity by reading out the entire RAID set’s contents. A single CPU quickly becomes the bottleneck, but if portions of the SSD are assigned to different CPUs, the work can be done in an embarrassingly parallel manner.
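
A toy model of that parallel rebuild, using simple XOR parity (RAID-5 math; real RAID-6 adds a second, Reed-Solomon-style parity term):

```python
from concurrent.futures import ProcessPoolExecutor
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def rebuild_chunk(surviving: list) -> bytes:
    # The lost chunk is the XOR of every surviving data/parity chunk in its stripe.
    return reduce(xor, surviving)

def rebuild(stripes, workers=8):
    # Stripes are independent, so they fan out cleanly across CPU cores.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(rebuild_chunk, stripes))
```
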
It depends, but not in most cases. WAN replication is often done using proprietary TCP- or even UDP-based protocols that better handle the long latencies of WAN links.

There are a couple of things in play here. First, if an application really needs 500GB of space today, you’d overprovision 25-30% just to be safe as it (inevitably) grows. Thin provisioning lets you hide that unused overprovisioning until it is actually needed.

Second, while a single node may need 500GB of space, in many cases if you have a cluster of dozens of servers in a scale-out application, not every server will have the same exact data needs (and you can’t generally predict which nodes will fill up first). In this case, using thin provisioning lets you exploit the difference between nodes and only allocate flash for nodes that really need it, on-the-fly.

Finally, there is the issue of how storage needs vary over time. Most applications start with smaller datasets, and the “data requirements” are guesstimates about a final state (months or years in the future) that may or may not happen. Thin provisioning lets you give applications as much space as they think they’ll need, without actually dedicating the flash expense until it is really needed.
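
A minimal sketch of the mechanism behind all three points: the volume advertises its full size, but backing flash is only allocated the first time a block is written.

```python
class ThinVolume:
    """Advertises its full size; allocates backing blocks only on first write."""

    def __init__(self, advertised_blocks: int):
        self.advertised_blocks = advertised_blocks
        self.mapping = {}                    # logical block -> physical block

    def write(self, lba: int, data: bytes):
        if lba not in self.mapping:          # allocate on first write only
            self.mapping[lba] = len(self.mapping)
        # ... write `data` to physical block self.mapping[lba] ...

    @property
    def allocated_blocks(self) -> int:
        return len(self.mapping)

vol = ThinVolume(advertised_blocks=500 * 1024**3 // 4096)   # a "500GB" volume
vol.write(0, b"first block of real data")
print(vol.allocated_blocks, "of", vol.advertised_blocks, "blocks actually backed by flash")
```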

106: Managing NVMe-oF

This class describes how many of today’s HPC applications for Artificial Intelligence, Machine Learning, and High-Frequency Trading create an insatiable need for performance. This short class covers how legacy storage architectures can choke NVMe performance at the controller or force inefficient and uneconomical workarounds such as Software-Defined Storage (SDS) or Direct-Attached Storage (DAS), and why this is no longer necessary.

Learning Objectives

  • How modern applications, including ML/AI and HFT, benefit from using NVMe SSDs
  • How to lower cost for modern applications by using storage disaggregation
  • How NVMe can be used by containers

AI obeys the GIGO principle: Garbage In, Garbage Out. A significant part of the AI data pipeline simply involves processing massive amounts of data on standard servers: collecting raw data from masses of sensors (with something like Apache Spark), translating that raw input into the proper format (ETL-like work), cleaning the formatted data so invalid inputs are filtered out before they’re presented to the training array, and so on.

Another trend we’re seeing is that while early AI training was GPU based and really didn’t have massive input throughput requirements, enterprises are moving to dedicated hardware AI accelerators that are orders of magnitude faster (and so need much more data, faster, or will sit idle). Couple that with the increase in average size of individual training vectors, and the throughput needs only increase.

Kubernetes doesn’t natively understand NVMe-oF, but it does have a standardized API, the Container Storage Interface (CSI), that lets it connect to any persistent storage provider (assuming the array vendor provides a compatible plug-in). Using that, Kubernetes can automatically provision persistent storage for containers on the fly over NVMe-oF, and migrate that storage around as containers move.
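
Here is a sketch of the Kubernetes objects involved, assuming a hypothetical vendor CSI driver name (`csi.example-array.com`) and storage class; substitute whatever your array vendor ships:

```python
import json

# A StorageClass pointing at an NVMe-oF CSI driver, plus a claim a pod can mount.
# Provisioner and class names are hypothetical; use your vendor's values.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "nvmeof-fast"},
    "provisioner": "csi.example-array.com",
}
claim = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "db-data"},
    "spec": {
        "storageClassName": "nvmeof-fast",
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "500Gi"}},
    },
}
print(json.dumps([storage_class, claim], indent=2))  # kubectl apply accepts JSON too
```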

107: HA for NVMe-oF

This short class teaches you about the features disaggregated storage provides to many web-scale applications, covering continuous operations, data protection, and security (at rest and in flight). It should be taken by anyone looking to disaggregate a web-scale application.

Learning Objectives

  • Benefits of NVMe disaggregated storage for containers
  • Advanced enterprise storage management features used by an NVMe SSD array and how they benefit applications
    • Snapshots & Clones
    • Multi-pathing
    • Encryption
    • Ensuring data integrity

Snapshots and backups are very different things. If the data center is destroyed in a natural disaster, having a snapshot of your database in it isn’t going to let you bring the database up at another site. However, snapshots can cut backup costs, because they let you back up data from a point in time without stopping your application. Using NVMe-oF as the transport for backup reads also increases the speed at which you can get data out of the array, minimizing the backup window.
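
As a workflow sketch, with method names standing in for a vendor’s actual snapshot API (all names here are hypothetical):

```python
# All object and method names below are hypothetical stand-ins for a vendor API.
def backup_from_snapshot(array, volume, backup_target):
    snap = array.create_snapshot(volume)      # point-in-time; the app keeps running
    try:
        export = array.export_snapshot(snap)  # mapped over NVMe-oF for fast reads
        backup_target.copy_from(export)       # backup window = copy time only
    finally:
        array.delete_snapshot(snap)           # a snapshot is not a backup: drop it
```
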
Normally you do not need SEDs to use encryption, but you should check with your vendor just to make sure. By using standard drives and encrypting the data on them in software, an array can save the end user money (SEDs normally run 20-40% more than non-encrypting drives). It also allows the use of an external key management server infrastructure, which can be critical in certain industries.
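
A minimal sketch of that software-level approach using the Python cryptography package, with `fetch_key_from_kms()` as a hypothetical stand-in for a KMIP or vendor-specific key lookup:

```python
from cryptography.fernet import Fernet  # pip install cryptography

def fetch_key_from_kms() -> bytes:
    # Hypothetical: a real array would retrieve this from an external key server.
    return Fernet.generate_key()

cipher = Fernet(fetch_key_from_kms())
block = b"user data destined for a plain (non-SED) drive"
stored = cipher.encrypt(block)            # what actually lands on flash
assert cipher.decrypt(stored) == block    # readable only while the KMS serves the key
```
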
Good catch. CRCs don’t carry enough information to recover data, but they can confirm that no bits have flipped in a block. When bit corruption is detected, the array can treat it as a disk error and regenerate the block using the RAID-5/6 parity calculations. Once regenerated in memory, the block can be rewritten to the SSD. Any array should, of course, log the fact that bit corruption was detected, since it may indicate other problems with that specific SSD.
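
The detect-and-repair path can be sketched in a few lines: a per-block CRC32 spots the flip, and XOR parity across the stripe (RAID-5 style; RAID-6 adds a second parity) regenerates the block.

```python
import zlib

def block_ok(data: bytes, stored_crc: int) -> bool:
    return zlib.crc32(data) == stored_crc   # detects flipped bits, can't fix them

def regenerate(stripe_peers: list) -> bytes:
    # XOR of the surviving chunks in the stripe reproduces the corrupt block.
    out = bytes(len(stripe_peers[0]))
    for chunk in stripe_peers:
        out = bytes(a ^ b for a, b in zip(out, chunk))
    return out  # rewrite to the SSD, and log the corruption event
```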