Tom Ridge
2015-04-21
(with Thomas Tuerk, David Sheets, Andrea Giugliano, Anil Madhavapeddy, Peter Sewell; talk at Cambridge REMS meeting; 20 mins talk+10 mins qus)
A formal specification of filesystem behaviour (POSIX, Linux and Mac)
A test oracle (based directly on the spec) that can determine whether an observed trace of a real-world system conforms to the spec
An extensive test suite of >20k scripts, and the results of running those tests on about 40 different combinations of libc+OS+filesystem
Test scripts mostly generated automatically, with additional hand-written scripts
Test scripts are fed to a test executor which executes the scripts on a real-world libc+OS+filesystem stack
Results are recorded as traces
Traces are checked by SibylFS (the spec) to determine whether they represent allowable behaviour or not; the output is in the form of a checked trace
A test script is essentially a list of libc calls to execute on a real-world system
@type script
#####################################
# Test link___link_nonempty_dir1__d2__sl_dotdot_d2___nonempty_dir1__nonexist_4
#####################################
mkdir "empty_dir1" 0o777
mkdir "empty_dir2" 0o777
mkdir "nonempty_dir1" 0o777
mkdir "nonempty_dir1/d2" 0o777
open "nonempty_dir1/d2/f3.txt" [O_CREAT;O_WRONLY] 0o666
write! (FD 3) "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor inc" 83
close (FD 3)
... // further setup commands
link "nonempty_dir1/d2/sl_dotdot_d2" "nonempty_dir1/nonexist_4"
...
A trace records the libc calls from the test script, and the responses received from the real-world system
# processing file 'link___link_nonempty_dir1__d2__sl_dotdot_d2___nonempty_dir1__nonexist_4-int.trace' ...
@type trace
#####################################
# Test link___link_nonempty_dir1__d2__sl_dotdot_d2___nonempty_dir1__nonexist_4
#####################################
5: mkdir "empty_dir1" 0o777
Tau
RV_none
6: mkdir "empty_dir2" 0o777
Tau
RV_none
7: mkdir "nonempty_dir1" 0o777
Tau
RV_none
8: mkdir "nonempty_dir1/d2" 0o777
Tau
RV_none
9: open "nonempty_dir1/d2/f3.txt" [O_CREAT;O_WRONLY] 0o666
Tau
RV_num(3)
10: write! (FD 3) "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor inc" 83
Tau
RV_num(83)
...
28: link "nonempty_dir1/d2/sl_dotdot_d2" "nonempty_dir1/nonexist_4"
Tau
RV_none
...
A specification of the behaviour of libc+OS+filesystem, i.e., the behaviour a user process (application) sees when linked to libc
Written in Lem; available in HOL4, Isabelle/HOL and OCaml; about 6k lines of Lem/HOL for the specification (approx 2.5 person-years of work for spec; 0.5-1 year for test infrastructure)
Form: labelled transition system (see next slide)
HTML version of main file here
A labelled transition system (LTS): for states \(s,s'\) and label \(lbl\), the spec tells you when \(s \overset{lbl}\rightarrow s'\)
rename p1 p2
etc) type os_label =
| OS_CALL of (ty_pid * ty_os_command)
| OS_RETURN of (ty_pid * error_or_value ret_value)
| OS_CREATE of (ty_pid * uid * gid)
| OS_DESTROY of ty_pid
| OS_TAU
We specify the behaviour of the stack under any sequence of labels; multiple processes can execute concurrently, calls can overlap in time
What are the libc calls we handle? No memory-mapped files, no *-at forms of libc calls, no "filesystem out of space" errors, but mostly everything else related to files:
type ty_os_command =
| OS_CLOSE of ty_fd (* per-process file descriptors *)
| OS_LINK of (cstring * cstring) (* hard links *)
| OS_MKDIR of (cstring * file_perm) (* cstring can include null *)
| OS_OPEN of (cstring * int_open_flags * maybe file_perm) (* open flags as fixed-width int, platform specific encoding *)
| OS_PREAD of (ty_fd * size_t * off_t)
| OS_READ of (ty_fd * size_t)
| OS_READDIR of ty_dh
| OS_OPENDIR of cstring
| OS_REWINDDIR of ty_dh
| OS_CLOSEDIR of ty_dh
| OS_READLINK of cstring (* symlinks *)
| OS_RENAME of (cstring * cstring)
| OS_RMDIR of cstring
| OS_STAT of cstring
| OS_LSTAT of cstring
| OS_SYMLINK of (cstring * cstring)
| OS_TRUNCATE of (cstring * off_t)
| OS_UNLINK of cstring
| OS_PWRITE of (ty_fd * ty_bytes * size_t * off_t)
| OS_WRITE of (ty_fd * ty_bytes * size_t)
| OS_UMASK of file_perm (* permissions *)
| OS_CHMOD of (cstring * file_perm)
| OS_CHOWN of (cstring * uid * gid) (* users, groups *)
| OS_CHDIR of cstring (* model of per-process state *)
| OS_LSEEK of (ty_fd * off_t * int_seek_command)
| ...
Interpreting POSIX
Writing the spec so that it can be used to efficiently check real-world traces of behaviour (the problem is nondeterminism and state space explosion)
Dealing with the complexity of observed real-world behaviour
A scenario: 18 months into the project, the POSIX and Linux variants of the spec were mostly complete; we wanted to extend the spec to cover Mac
We ran the existing test scripts on Mac and checked them using the initial version of the Mac spec, derived from POSIX
>
20k test scripts; of the resulting traces, thousands failed to check (>5000 for open
alone)
These failing traces have to be examined and the spec updated so that the spec behaviour matches the real-world behaviour
To update the spec for open
on the Mac took about 1.5 months
This is all very painful and takes a long time, but it only has to be done once, then everyone can benefit
What helped: modularity; we tried to make the spec as modular as possible, with each component having a clearly defined role, with clearly defined interfaces etc.
The spec is reasonably large as a specification (c. 6000 lines); how can we gain confidence that it is correct?
From the beginning we wanted to extensively validate the spec, by using it to check traces of real-world behaviour
This form of testing also uncovers bugs in real-world systems
Existing filesystem test suites hardcode the expected answers for a given libc call; practically, they tend to have a relatively small number of tests
Our approach is different: a test script is just a sequence of libc calls (we don't need to say what should happen after each call - the spec already contains this knowledge)
This enables randomized and combinatorial testing
We try to exhaustively combinatorially test the libc interface using tests that are generated automatically
The spec defines those sequences of transitions \(s_0 \overset{lbl_1}\rightarrow s_1 \overset{lbl_2}\rightarrow s_2 \ldots\) that are allowable
OS_CALL
and OS_RETURN
constructors are omitted, pids are omitted if there is only one pid involved)NetSem gave a specification of UDP and TCP/IP as an LTS which was then used as a test oracle; our approach is broadly based on the NetSem approach
NetSem took 2500 CPU-hours to check 1000 traces; this is at the limit of practicality; the cost of checking made it very difficult to update and revise the spec
Checking a trace against an LTS is a very general problem which I think deserves a bit more attention
The SibylFS spec was designed from the start for efficient trace checking: checking >20k tests on a 4-core i7 takes about 79s (it takes 152s to execute the tests on an in-memory tmpfs filesystem)
Our testing involves a very large number of test scripts; checking is extremely fast; indeed SibylFS is fast enough that it could be used to check behaviour "online"
Error codes are quite often non-POSIX
Path resolution, particularly when a trailing slash is involved, is variable, non-POSIX
Treatment of paths referencing symlinks, particularly when the path ends in a trailing slash, is highly variable
Various overlay filesystems, and FUSE filesystems, mess up things like permissions
More serious: posixovl/VFAT (posix emulation on top of VFAT) gets link count wrong when rename overwrites a file that is linked elsewhere; possible to get to a state where the filesystem contains no files, but there is no free space (space leak)
OpenZFS on Linux 3.13.0-34 (Ubuntu Trusty): files opened with O_APPEND
would not seek to the end of the file before write
or pwrite
(probably causing applications that use this functionality to fail)
OpenZFS on OS/X: possible to execute a sequence of calls which leads to the calling process hanging using 100% CPU, unresponsive to signals; volume cannot be unmounted, machine cannot be shut down; force unmounting may cause storage device to become unusable until next system restart
Permissions: the Linux implementation of permissions should give the same behaviour from one kernel version to the next; however, we found test scripts involving file and directory access that failed on Linux kernel 3.13 and succeeded on 3.14 which we believe is due to a buggy 3.13 implementation of permissions in edge cases
Huge number of tests, mostly automatically generated; tested on a large number of stacks
SibylFS is very efficient at checking tests
Excellent trace acceptance and coverage figures
In edge cases there are numerous differences between POSIX, Linux and Mac; most are not very interesting, but the spec gives a complete description of them; the testing even uncovered some relatively serious bugs (which we didn't expect)
The real result of testing is that we have confidence in the spec
Documentation (formal counterpart to POSIX)
Testing existing/new filesystems (particularly non-traditional filesystems)
Reference specification for verified filesystem implementations
Reference model (e.g. for systems research - a lot of papers construct their own little models of parts of the filesystem)
Basis for formal proof (Thomas Tuerk did some initial work on this in Isabelle/HOL) e.g. of properties of filesystems, or as a basis for soundness proofs of program logics etc.
To provide a POSIX compatiblity layer e.g. on Windows, or non-traditional settings such as Mirage
For analysis of applications: e.g. do any applications use libc calls in edge cases where the behaviour of Linux and Mac differs?
For analysis of filesystems: e.g. do Linux and Mac behave the same when restricted to some subset of libc?
To identify gaps in existing test suites
etc
We need a new spec of timestamps to enable efficient checking.
A spec of filesystem behaviour over system crashes is desirable; a problem here is that it is not clear what this behaviour is; it seems at least possible that most filesystems can be trashed by crashing the recovery routine at key points (see Yang et al. "Using model checking to find serious file system errors")
A prerequisite for the crash spec is probably a dependency model: a model of which filesystem modifications can be carried out independently of others (optimizing filesystems presumably perform some of these optimizations, and the effects of these optimizations are likely visible after a crash); this would also help understand the space of optimizations that a filesystem might employ
A verified implementation; the problem here is that to do something that competes with the performance of current filesystems is a challenging multi-year project