System structure

Test scripts mostly generated automatically, with additional hand-written scripts
Test scripts are fed to a test executor which executes the scripts on a real-world libc+OS+filesystem stack
Results are recorded as traces
Traces are checked by SibylFS (the spec) to determine whether they represent allowable behaviour or not; the output is in the form of a checked trace

Example test script

A test script is essentially a list of libc calls to execute on a real-world system

 @type script
 #####################################
 # Test link___link_nonempty_dir1__d2__sl_dotdot_d2___nonempty_dir1__nonexist_4
 #####################################
 mkdir "empty_dir1" 0o777
 mkdir "empty_dir2" 0o777
 mkdir "nonempty_dir1" 0o777
 mkdir "nonempty_dir1/d2" 0o777
 open "nonempty_dir1/d2/f3.txt" [O_CREAT;O_WRONLY] 0o666
 write! (FD 3) "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor inc" 83
 close (FD 3)
 ... // further setup commands
 link "nonempty_dir1/d2/sl_dotdot_d2" "nonempty_dir1/nonexist_4"
 ...

Example trace (behaviour of real-world system)

A trace records the libc calls from the test script, and the responses received from the real-world system

# processing file 'link___link_nonempty_dir1__d2__sl_dotdot_d2___nonempty_dir1__nonexist_4-int.trace' ...
        @type trace
        #####################################
        # Test link___link_nonempty_dir1__d2__sl_dotdot_d2___nonempty_dir1__nonexist_4
        #####################################
     5: mkdir "empty_dir1" 0o777
        Tau
        RV_none
     6: mkdir "empty_dir2" 0o777
        Tau
        RV_none
     7: mkdir "nonempty_dir1" 0o777
        Tau
        RV_none
     8: mkdir "nonempty_dir1/d2" 0o777
        Tau
        RV_none
     9: open "nonempty_dir1/d2/f3.txt" [O_CREAT;O_WRONLY] 0o666
        Tau
        RV_num(3)
    10: write! (FD 3) "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor inc" 83
        Tau
        RV_num(83)

...

    28: link "nonempty_dir1/d2/sl_dotdot_d2" "nonempty_dir1/nonexist_4"
        Tau
        RV_none

...

Example checked trace

See e.g. here (local) or here (online) BROKEN LINK

The specification

A specification of the behaviour of libc+OS+filesystem, i.e., the behaviour a user process (application) sees when linked to libc
Written in Lem; available in HOL4, Isabelle/HOL and OCaml; about 6k lines of Lem/HOL for the specification (approx 2.5 person-years of work for spec; 0.5-1 year for test infrastructure)
Form: labelled transition system (see next slide)
3 main variants for POSIX, Linux and Mac
- for POSIX, a spec of allowable behaviours
- for Linux and Mac, a spec of existing real-world behaviours (with some looseness)
- the parts of the spec that are Linux/Mac weird behaviours are clearly identifiable

HTML version of main file here

Form of the specification: LTS

A labelled transition system (LTS): for states \(s,s'\) and label \(lbl\), the spec tells you when \(s \overset{lbl}\rightarrow s'\)
What are the labels? (Labels correspond to the events we are interested in)
- Interactions between a user process and the libc+OS+filesystem stack at the libc interface (rename p1 p2 etc)
- Process creation and destruction
- Tau events (e.g. internal OS processing)

  type os_label =  
    | OS_CALL of (ty_pid * ty_os_command)
    | OS_RETURN of (ty_pid * error_or_value ret_value)
    | OS_CREATE of (ty_pid * uid * gid)
    | OS_DESTROY of ty_pid
    | OS_TAU

We specify the behaviour of the stack under any sequence of labels; multiple processes can execute concurrently, calls can overlap in time
What are the libc calls we handle? No memory-mapped files, no *-at forms of libc calls, no "filesystem out of space" errors, but mostly everything else related to files:

  type ty_os_command = 
    | OS_CLOSE of ty_fd                                         (* per-process file descriptors *)
    | OS_LINK of (cstring * cstring)                            (* hard links *)
    | OS_MKDIR of (cstring * file_perm)                         (* cstring can include null *)
    | OS_OPEN of (cstring * int_open_flags * maybe file_perm)   (* open flags as fixed-width int, platform specific encoding *)
    | OS_PREAD of (ty_fd * size_t * off_t) 
    | OS_READ of (ty_fd * size_t) 
    | OS_READDIR of ty_dh
    | OS_OPENDIR of cstring
    | OS_REWINDDIR of ty_dh
    | OS_CLOSEDIR of ty_dh
    | OS_READLINK of cstring                                    (* symlinks *)
    | OS_RENAME of (cstring * cstring)
    | OS_RMDIR of cstring
    | OS_STAT of cstring
    | OS_LSTAT of cstring
    | OS_SYMLINK of (cstring * cstring) 
    | OS_TRUNCATE of (cstring * off_t)
    | OS_UNLINK of cstring
    | OS_PWRITE of (ty_fd * ty_bytes * size_t * off_t)
    | OS_WRITE of (ty_fd * ty_bytes * size_t) 
    | OS_UMASK of file_perm                                     (* permissions *) 
    | OS_CHMOD of (cstring * file_perm)
    | OS_CHOWN of (cstring * uid * gid)                         (* users, groups *)
    | OS_CHDIR of cstring                                       (* model of per-process state *)
    | OS_LSEEK of (ty_fd * off_t * int_seek_command)
    | ...

The main challenges when writing the spec

Interpreting POSIX
Writing the spec so that it can be used to efficiently check real-world traces of behaviour (the problem is nondeterminism and state space explosion)
Dealing with the complexity of observed real-world behaviour

The complexity of real-world behaviour

A scenario: 18 months into the project, the POSIX and Linux variants of the spec were mostly complete; we wanted to extend the spec to cover Mac
We ran the existing test scripts on Mac and checked them using the initial version of the Mac spec, derived from POSIX
>20k test scripts; of the resulting traces, thousands failed to check (>5000 for open alone)
These failing traces have to be examined and the spec updated so that the spec behaviour matches the real-world behaviour

We need to:
- figure out why the traces are failing
- figure out what changes are needed to the spec; the changes should preserve the structure, readability and concision of the spec
To update the spec for open on the Mac took about 1.5 months
This is all very painful and takes a long time, but it only has to be done once, then everyone can benefit
What helped: modularity; we tried to make the spec as modular as possible, with each component having a clearly defined role, with clearly defined interfaces etc.

Testing

The spec is reasonably large as a specification (c. 6000 lines); how can we gain confidence that it is correct?
From the beginning we wanted to extensively validate the spec, by using it to check traces of real-world behaviour
This form of testing also uncovers bugs in real-world systems

Test oracle enables combinatorial testing

Existing filesystem test suites hardcode the expected answers for a given libc call; practically, they tend to have a relatively small number of tests
Our approach is different: a test script is just a sequence of libc calls (we don't need to say what should happen after each call - the spec already contains this knowledge)
This enables randomized and combinatorial testing
We try to exhaustively combinatorially test the libc interface using tests that are generated automatically

Use of LTS for testing real-world systems

The spec defines those sequences of transitions \(s_0 \overset{lbl_1}\rightarrow s_1 \overset{lbl_2}\rightarrow s_2 \ldots\) that are allowable
A trace is a sequence of labels \(lbl_1,lbl_2,\ldots\)
- the trace from earlier is here BROKEN!; for this trace, labels are printed in a "concrete" form (e.g. OS_CALL and OS_RETURN constructors are omitted, pids are omitted if there is only one pid involved)
To check the trace, we need to determine whether whether there exists a sequence \(s_0 \overset{lbl_1}\rightarrow s_1 \overset{lbl_2}\rightarrow s_2 \ldots\)
- \(s_0\) is given, and corresponds to an empty filesystem; the spec needs to determine whether any such \(s_1,\ldots\) exist
Note that the states are effectively abstract: you give me a sequence of labels, and I tell you whether they represent allowable behaviour or not
- we treat the libc+OS+filesystem stack as a black box: we can't observe the state, only the interactions at the libc interface

Difficulty of LTS trace checking

NetSem gave a specification of UDP and TCP/IP as an LTS which was then used as a test oracle; our approach is broadly based on the NetSem approach
NetSem took 2500 CPU-hours to check 1000 traces; this is at the limit of practicality; the cost of checking made it very difficult to update and revise the spec
Checking a trace against an LTS is a very general problem which I think deserves a bit more attention
Things that make it difficult:
- the trace may include only partial information about states (NetSem), or even no information at all (SibylFS)
- some events (e.g. Tau events) may not be observable; NetSem had to resort to backtracking search in an attempt to insert such events at various places in the trace
- nondeterminism results in state space explosion if not handled carefully
The SibylFS spec was designed from the start for efficient trace checking: checking >20k tests on a 4-core i7 takes about 79s (it takes 152s to execute the tests on an in-memory tmpfs filesystem)
Our testing involves a very large number of test scripts; checking is extremely fast; indeed SibylFS is fast enough that it could be used to check behaviour "online"

Test results, stats

Trace acceptance
- Linux: 21061 accepted by spec; 9 rejected (21070 total)
- Mac: 21036 accepted by spec, 34 rejected (21070 total)
- essentially no barrier to getting 100% trace acceptance
Coverage: >98% (of the spec)
- by way of comparison, a paper (Groce et al., 2007, "Randomized differential testing...") from NASA scientists that applied randomized testing to a model of a filesystem achieved 89% coverage

Test results, strange behaviours

Error codes are quite often non-POSIX
Path resolution, particularly when a trailing slash is involved, is variable, non-POSIX
Treatment of paths referencing symlinks, particularly when the path ends in a trailing slash, is highly variable
Various overlay filesystems, and FUSE filesystems, mess up things like permissions
More serious: posixovl/VFAT (posix emulation on top of VFAT) gets link count wrong when rename overwrites a file that is linked elsewhere; possible to get to a state where the filesystem contains no files, but there is no free space (space leak)
OpenZFS on Linux 3.13.0-34 (Ubuntu Trusty): files opened with O_APPEND would not seek to the end of the file before write or pwrite (probably causing applications that use this functionality to fail)
OpenZFS on OS/X: possible to execute a sequence of calls which leads to the calling process hanging using 100% CPU, unresponsive to signals; volume cannot be unmounted, machine cannot be shut down; force unmounting may cause storage device to become unusable until next system restart
Permissions: the Linux implementation of permissions should give the same behaviour from one kernel version to the next; however, we found test scripts involving file and directory access that failed on Linux kernel 3.13 and succeeded on 3.14 which we believe is due to a buggy 3.13 implementation of permissions in edge cases

Testing summary

Huge number of tests, mostly automatically generated; tested on a large number of stacks
SibylFS is very efficient at checking tests
Excellent trace acceptance and coverage figures
In edge cases there are numerous differences between POSIX, Linux and Mac; most are not very interesting, but the spec gives a complete description of them; the testing even uncovered some relatively serious bugs (which we didn't expect)
The real result of testing is that we have confidence in the spec

Possible uses of the spec

Documentation (formal counterpart to POSIX)
Testing existing/new filesystems (particularly non-traditional filesystems)
Reference specification for verified filesystem implementations
Reference model (e.g. for systems research - a lot of papers construct their own little models of parts of the filesystem)
Basis for formal proof (Thomas Tuerk did some initial work on this in Isabelle/HOL) e.g. of properties of filesystems, or as a basis for soundness proofs of program logics etc.
To provide a POSIX compatiblity layer e.g. on Windows, or non-traditional settings such as Mirage
For analysis of applications: e.g. do any applications use libc calls in edge cases where the behaviour of Linux and Mac differs?
For analysis of filesystems: e.g. do Linux and Mac behave the same when restricted to some subset of libc?
To identify gaps in existing test suites
etc

Future work

We need a new spec of timestamps to enable efficient checking.
A spec of filesystem behaviour over system crashes is desirable; a problem here is that it is not clear what this behaviour is; it seems at least possible that most filesystems can be trashed by crashing the recovery routine at key points (see Yang et al. "Using model checking to find serious file system errors")
A prerequisite for the crash spec is probably a dependency model: a model of which filesystem modifications can be carried out independently of others (optimizing filesystems presumably perform some of these optimizations, and the effects of these optimizations are likely visible after a crash); this would also help understand the space of optimizations that a filesystem might employ
A verified implementation; the problem here is that to do something that competes with the performance of current filesystems is a challenging multi-year project

SibylFS: formal specification and oracle-based testing
for POSIX and real-world filesystems

Overview

What have we done?

How does everything fit together?

System structure

Example test script

Example trace (behaviour of real-world system)

Example checked trace

The specification

The specification

Form of the specification: LTS

The main challenges when writing the spec

The complexity of real-world behaviour

Testing

Testing

Test oracle enables combinatorial testing

Use of LTS for testing real-world systems

Difficulty of LTS trace checking

Test results, stats

Test results, strange behaviours

Testing summary

Conclusion

Possible uses of the spec

Future work

SibylFS: formal specification and oracle-based testing for POSIX and real-world filesystems

Overview

What have we done?

How does everything fit together?

System structure

Example test script

Example trace (behaviour of real-world system)

Example checked trace

The specification

The specification

Form of the specification: LTS

The main challenges when writing the spec

The complexity of real-world behaviour

Testing

Testing

Test oracle enables combinatorial testing

Use of LTS for testing real-world systems

Difficulty of LTS trace checking

Test results, stats

Test results, strange behaviours

Testing summary

Conclusion

Possible uses of the spec

Future work

SibylFS: formal specification and oracle-based testing
for POSIX and real-world filesystems