Date: 2018-05-30
Categories: research; filesystems

Potential improvements in filesystem performance

I am currently trying to build a new filesystem, provisionally called "ImpFS". There are a number of motivations behind this work.

One motivation that I haven't emphasized till now is to try to compete with -- and hopefully even surpass -- existing filesystems in terms of performance.

A priori, this goal seems unlikely to be met when you consider that existing filesystems have had decades of work put into them, including significant tuning for the underlying hardware. Why is it reasonable to suppose that we might be able to build something better?

In this post, I want to try to draw out the reasons why I think this is possible. Even though I may not achieve the end result with ImpFS, the arguments themselves may be worth considering.

Existing filesystem landscape

Hardware is very diverse: from servers and phones down to the smallest embedded devices, there is a huge range of capabilities. For small devices, the typical choice of filesystem is exFAT / VFAT / MSDOS (i.e., Microsoft's traditional FAT family) because it is simple to implement and doesn't require many resources. However, FAT has significant reliability problems, and other drawbacks (some of which are understandable given how old it is).

For phones, Apple has recently developed APFS (to run on all its devices), and for Android the site https://www.all-things-android.com/content/understanding-android-file-hierarchy reports that phones run a variety of filesystems, such as exFAT, F2FS, JFFS2, YAFFS2, EXT4, MSDOS and VFAT.

For servers, ZFS is probably the usual filesystem of choice (although alternatives such as XFS are also common). Companies like NetApp have their own proprietary filesystems.

It is worth stressing that developing a modern filesystem is a huge amount of effort. For example, BTRFS, the current next-gen Linux filesystem, has been under development for close to 10 years.

Reason 1: hardware diversity, custom filesystems

My work involves building a library of filesystem components. The idea is that these components can be assembled to produce the final filesystem. The components can be chosen to exactly match the hardware.

For example, at the moment a desktop system might have a large HDD for data and a smaller SSD which holds the main OS and applications. These are typically mounted as two separate filesystems. On Linux it is possible (e.g. using technologies such as bcache) to use the SSD as a cache for the HDD. However, this configuration is rare. With the introduction of an even faster storage layer above SSD (NVRAM), a desktop may even have three types of persistent storage, and the obvious desire is to use the NVRAM as a cache for the SSD, which is in turn a cache for the HDD.

Setting all this up using existing technologies is hard. The underlying reason is that the existing technologies are not intended to span such a diverse range of storage devices.

With a library filesystem, which provides separate components for caching and so on, configuring such a setup should be straightforward. Thus, a library filesystem is a much better fit for the diverse range of hardware that we see in systems today.
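To make this concrete, here is a minimal sketch of the idea in OCaml, under the assumption of a component library along these lines (BLOCK_DEV, Cached, Nvram, Ssd and Hdd are hypothetical names, not the actual ImpFS API): given a block-device signature and a cache combinator, stacking NVRAM over SSD over HDD is just a composition of modules.

    (* Hypothetical sketch: a block-device signature and a cache combinator.
       These names are illustrative only, not the ImpFS API. *)
    module type BLOCK_DEV = sig
      type t
      val read  : t -> blk_id:int -> bytes
      val write : t -> blk_id:int -> bytes -> unit
    end

    (* A cache combinator: a fast device caching a slower one. We assume the
       fast device raises Not_found on a cache miss. *)
    module Cached (Fast : BLOCK_DEV) (Slow : BLOCK_DEV) = struct
      type t = { fast : Fast.t; slow : Slow.t }

      let read t ~blk_id =
        try Fast.read t.fast ~blk_id
        with Not_found ->
          let blk = Slow.read t.slow ~blk_id in
          Fast.write t.fast ~blk_id blk;
          blk

      (* write-through: keep both devices in step *)
      let write t ~blk_id blk =
        Fast.write t.fast ~blk_id blk;
        Slow.write t.slow ~blk_id blk
    end

    (* Three tiers of persistent storage are then a one-liner, e.g.
       module Tiered = Cached (Nvram) (Cached (Ssd) (Hdd)) *)

Whether a tier is write-through or write-back, and how eviction works, would themselves be pluggable components; the point is only that the stacking is cheap to express.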

As another example of hardware diversity, consider concurrency. Early storage devices accepted a single command at a time. Modern devices have queues of pending commands, and performance is best if those queues can be kept reasonably full. Existing filesystems, and intermediate kernel layers, may need to be radically re-engineered to take advantage of these low-level features. In contrast, the hope is that a filesystem library could be configured to take advantage of these fairly directly.
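As a rough sketch of what this might look like, suppose the library exposed a queue-aware device signature (QUEUED_DEV, submit, completions and queue_depth are hypothetical names, not a real API); a layer above could then simply keep the number of in-flight requests close to the device's queue depth.

    (* Hypothetical sketch: a queue-aware device interface, and a driver loop
       that tries to keep the device's command queue close to full. *)
    module type QUEUED_DEV = sig
      type t
      type req
      val queue_depth : t -> int          (* commands the device can have pending *)
      val submit      : t -> req -> unit  (* non-blocking submit *)
      val completions : t -> req list     (* reap completed requests (may block) *)
    end

    module Pump (D : QUEUED_DEV) = struct
      let run dev pending =
        let rec loop in_flight pending =
          match pending with
          | [] when in_flight = 0 -> ()   (* everything submitted and completed *)
          | r :: rest when in_flight < D.queue_depth dev ->
              D.submit dev r;             (* room in the queue: submit more work *)
              loop (in_flight + 1) rest
          | _ ->
              (* queue full, or nothing left to submit: reap completions *)
              let completed = D.completions dev in
              loop (in_flight - List.length completed) pending
        in
        loop 0 pending
    end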

Now consider the operation of flushing data to disk (via the sync command). In the past, devices supported a single sync command that flushed all pending data to disk before returning (we leave aside the question of whether devices actually implemented this correctly). However, this is clearly not optimal. Consider the scenario where one process is streaming a huge media file to disk and doesn't need to sync the data, while another process is writing a few bytes to record a transaction and does need to sync the data. If we only have a per-device sync command, then the second process may issue the sync and have to wait a very long time for all the media data to be flushed to disk.

Currently, one can usually execute a sync command against a file or directory. However, exactly how this is translated to the lower levels is often unclear: it must depend on the sync capabilities of the underlying device, and also on the ability of the OS and filesystem to track data through the OS layers and relate it back to the high-level file that the user has synced.

In this scenario, we need ways to specify more precisely which data needs to be put on disk. The command queues mentioned above offer one possibility: if there is a per-queue sync command, then the filesystem could potentially route each process's data to a separate queue, allowing a database process to sync independently of a media-streaming process. However, this clearly involves quite complicated re-engineering.
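To picture the per-queue idea (MULTI_QUEUE_DEV and its operations are hypothetical names used purely for illustration): writes carry a queue id, and sync flushes a single queue, so the filesystem can route the database's blocks and the media stream's blocks to different queues.

    (* Hypothetical sketch: writes are tagged with a queue id, and sync flushes
       only the named queue; data pending on other queues is left alone. *)
    module type MULTI_QUEUE_DEV = sig
      type t
      type queue_id = int
      val write : t -> queue:queue_id -> blk_id:int -> bytes -> unit
      (* returns once everything previously written on [queue] is durable *)
      val sync  : t -> queue:queue_id -> unit
    end

With something like this, the database's few bytes can be made durable without waiting behind gigabytes of streaming media data.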

Reason 2: end-to-end design

As outlined above, we need:

  1. An API which allows the user to state more precisely which bits of data need to be flushed to disk.

  2. Some way for this information to be reliably translated to commands the low-level device can understand.

In ImpFS, the components talk about objects (files, directories, symlinks, etc), and at all levels of the stack we keep track of the object id associated with data. Thus, when a user syncs a particular object (a file, a directory), at each layer we know which data has to be flushed down to the device, and which can be left e.g. in cache.
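As a small sketch of what this tracking buys us (object_id, dirty_block and write_block here are simplified illustrations, not the real ImpFS data structures): if every dirty block in a cache layer is tagged with the object it belongs to, then syncing one object touches only that object's blocks.

    (* Hypothetical sketch: dirty blocks are tagged with the object they belong
       to, so syncing one object flushes only that object's blocks. *)
    type object_id = int

    type dirty_block = {
      oid    : object_id;  (* which file/directory/symlink owns this block *)
      blk_id : int;
      data   : bytes;
    }

    (* Flush just the named object's dirty blocks; everything else stays cached. *)
    let sync_object ~write_block ~(cache : dirty_block list ref) oid =
      let mine, rest = List.partition (fun b -> b.oid = oid) !cache in
      List.iter (fun b -> write_block ~blk_id:b.blk_id b.data) mine;
      cache := rest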

We accurately track all the data dependencies as well, and only write the minimum data required. This is possible because of the very high-level view we take, and the fact that we have complete control over all layers, and do not have to reuse e.g. the Linux VFS or the Linux block layer. The disadvantage, of course, is that we have to write all these layers ourselves.

Reason 3: close attention to minimizing writes

When a user issues a sync on a dirty object, at least one device write must occur. We have architected our system so that, as far as possible, this lower bound is actually achieved. This is made possible by a particular approach to persistence, which we outline in the following section.

Reason 4: alternative design choices (no log, no BSD soft updates)

Existing systems tend to use a transaction log (for journalling filesystems, see https://en.wikipedia.org/wiki/Journaling_file_system), a log-structured filesystem approach (https://en.wikipedia.org/wiki/Log-structured_file_system), or techniques such as "soft updates" (https://en.wikipedia.org/wiki/Soft_updates).

Soft updates were an attempt to guarantee consistent on-disk metadata (not data) by closely examining the intermediate states that an FS passes through, and then making sure that a crash at any point did not leave the metadata in an inconsistent state. The approach is clearly complicated (you have to look at all intermediate states) and application to different filesystems is difficult (your reasoning is specialized to the operations of a particular filesystem). It seems this approach never made it beyond the BSD fast file system, and is now largely abandoned.

An alternative approach, which most filesystems follow, is to first write changes into a transaction log, and then execute these changes against the disk itself. If operations are idempotent, then a crash at any point can be dealt with by replaying the log and re-executing the changes. The downside is clearly the overhead of all the logging, compared to doing no logging at all. In fact, the logging is so costly that most systems only log metadata changes, not data changes. This makes it hard to be sure what guarantees such a scheme provides (presumably only the weakest guarantee: that the on-disk structures are internally consistent, which says nothing about the actual values of the metadata or data).
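For concreteness, the pattern looks roughly like this (append_to_log, sync_log and write_block are hypothetical stand-ins; real journals also batch and checksum their records): every change is written twice, once as a log record and once in place, which is where the overhead comes from.

    (* Hypothetical sketch of journalling: record the intent in the log, make
       the log durable, then perform the real writes; after a crash, recovery
       replays [apply] on each log entry. *)
    type op = Write of { blk_id : int; data : bytes }

    (* idempotent, so replaying it after a crash is always safe *)
    let apply ~write_block (Write { blk_id; data }) = write_block ~blk_id data

    let journal_commit ~append_to_log ~sync_log ~write_block ops =
      List.iter append_to_log ops;        (* 1. write the changes to the log ... *)
      sync_log ();                        (*    ... and make the log durable *)
      List.iter (apply ~write_block) ops  (* 2. then execute them against the disk *)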

So on the one hand, logging/journalling is relatively well understood and generally applicable across lots of filesystems. On the other hand, it is costly enough that full data journalling is usually turned off in practice (even for production databases). And alternatives, such as soft updates, have proven to be very complicated.

For ImpFS, I chose to look at things in a different way. The result is a technique that is generally applicable, simple to understand, and resembles logging; unlike logging, however, the cost is minimal. It relies on the whole filesystem being copy-on-write (CoW): updates to the state are performed by swinging a root pointer. ImpFS uses a clever combination of logging and pointer swinging to ensure that all metadata and data is consistent and correct, without the overhead of a traditional journal.
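To give a flavour of the root-pointer part without giving the whole game away (cow_commit and its parameters are illustrative names only, not the ImpFS code): updated blocks are written out of place, and the change becomes visible only when a single small root write succeeds, so a crash at any earlier point leaves the old, consistent state untouched.

    (* Hypothetical sketch of a copy-on-write commit: new block versions go to
       fresh locations; one small root write then "swings" to the new state. *)
    let cow_commit ~alloc_block ~write_block ~write_root ~sync ~make_root new_blocks =
      (* 1. write the new versions of all changed blocks out of place *)
      let placed =
        List.map
          (fun data ->
             let blk_id = alloc_block () in
             write_block ~blk_id data;
             blk_id)
          new_blocks
      in
      sync ();                             (* new blocks durable; old root still live *)
      (* 2. one small write swings the root pointer to the new state *)
      write_root ~root:(make_root placed);
      sync ()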

"What is this fantastic technique???" I hear you ask. That is a subject for another post!

