File systems

Disks

Basics:

Access in the unit of sectors (512B)
arm movement & disk spinning
100MB/s for sequential access but extremely slow for random access.
DRAM: tens of GB/s -- much faster
SSDs don't have arms and disks. High-end SSDs can reach 500MB/s.

IDE interface: One request at a time. Uses IO ports to communicate with the hardware.

Simple driver code in ide.c
Fake ide driver that uses an in-memory copy of the disk image (memide.c)
IDE specs: https://wiki.osdev.org/PCI_IDE_Controller

Essential file system concepts

Two concepts for managing data on the disk:

file: contains a sequence of bytes of any length.
directory: a list of names. a name can be either a regular file or another directory (sub-directory).

Virtual file system (VFS)

The tree structure can be used to index the devices and store the metadata of the OS.

Windows has multiple trees to manage different kind of resources:

One tree for each disk
Devices
Registry
services
etc..

*NIX uses one tree, aka. the VFS, to organize everything that can be exposed to the user space:

all concrete file systems on your disks
volatile file systems (e.g., tmpfs)
devices
OS metadata (procfs, sysfs) (on recent Linux: $ head /sys/devices/system/cpu/vulnerabilities/*)

We use "mount" to link of create a file system in VFS, at a certain path.

$ mount # without parameters, you see all current mount pointers.
specify mount points in /etc/fstab

Mount point:

Specifies the API (the file operation functions) to call when accessing the files in the subtree starting from the mountpoint.

xv6's file system implementation

Related source code:

fs (fs.h contains the structures for the on-disk file system layout.
file
log
bio
ide
spinlock and sleeplock

Your task: read the code and get familiar with the core operations of read and write.

mkfs.c: A standalone tool for creating file system images. It's NOT part of the kernel file system code.

The structure of xv6's concrete file system (fs.{h,c})

On the disk: [ boot block (the mbr) | file system ]

In a real system, the MBR also contains a partition table. While the partition table is intended to be interpreted by the OS, xv6's mbr does not initialize it.

The superblock contains the metadata of all other parts of the file system.

inode

An inode corresponds to a file. A file has a number of bytes.

The file system uses some special files to manage file-system metadata, including directories and shortcuts (softlinks).

In a file system, each inode is assigned a unique number, the "inode number" (inum)

There two inode structures in xv6:

inode, file.h -- this is the inode structure used but the "vfs"
dinode, fs.h -- this is the inode structure stored on the disk.

In the source code, most of the complexity comes from maintaining the consistency between the on-disk fs and the in-memory copy.

Directory -- a special file

struct dirent { // directory entry

ushort inum; // a number

char name[]; // a string

}

the content of a directory file is effectively a struct dirent[]

However, the concept of inode and the inode number is completely hidden from user applications.

When you list a directory by giving its name to ls, it shows a bunch of other names.
Then you can specify a name to access an regular file in that directory.

Inside the file system, the media is the inode number:

Directory files contain the "name-to-inum" mapping information.
A file does NOT know its own name!
The fs needs to translate a file name to a inum using the dirent.
Think of DNS: people almost always access the internet with a (domain) name like uic.edu. But your brower has to access the websites using an ip address.

Example: open the file "/proc/1/pid"

sys_open
namei() -- "translate name to inode"
namex() -- path = "/proc/1/pid", name is a temporary buffer
skipelem() -- extract the directory/file name from the path, one at a time.
dirlookup() -- find the corresponding inum under the directory ip, return the inode corresponding to name.

The root directory's inum is hardcoded to 1 (see ROOTINO in fs.h)

softlink vs hardlink

softlink is a special file. It contains a path, works like a shortcut.

The target can be a regular file or a directory. The link itself does not care.
The target can be missing, that's why you sometimes see "broken link" in Linux.

hardlink is actually no different than any other files. See this example:

When you create a file, the FS allocate an inode, and create a name-to-inum mapping in the directory file.
When you create a hardlink, the FS creates a name-to-inum mapping in the directory file, using an existing inode.
There is only one physical file. but can be found from two paths.
with a refcount==2, to remove the file you have to call "unlink" on both paths.

softlink can be created for anything. hardlink cannot be created across FS boundaries.

Maintaining consistency on persistent storage (the disks)

What is "inconsistency"?

$ mv -r Documents/my_thesis Archive/
system crashed.
What would you find in Documents and Archive? The worst case could be my_thesis is missing!

old systems:

fsck -- check and maybe "recover" the file system to a usable condition. This does not guarantee the recovery of files/data.

new systems:

Consistency of the FILE SYSTEM is in general guaranteed. no fsck needed.
your data may still lose for many different reasons (disk failure, power failure)

key technologies:

Journalling (the most commonly used: ext4, xfs, ntfs)
COW/snapshots (better functionality, but usually slower: zfs, btrfs)

In most Linux distros, the default setting only enables metadata-journaling. For example, rewriting of a file does not use the journal. You can see a mix of old/new contents in the file after a crash.

xv6 logging

Let's start from filewrite() in file.c -- the "VFS" write().

begin_op() and end_op() mark the critical section. There is no concurrency between write operations.

in writei():

call bread to obtain the buffer (struct buf) of a block.
* bmap() checks the inode's mapping table and performs the offset-to-sector-number translation.
new data goes to the buffer (in memory), and is appended to the log (also in memory)
calls brelse() "block release" to decrement the reference count of the buffer. As the real write will happen in the commit phase, the buffer is never really release here.
if the file size is increased, it also need to update the inode with iupdate. Similarly iupdate() logs the write.
writei() can append many blocks to the in-memory log.

end_op() calls commit() to do the actual writes.

commit():

write_log()
write_head() -- flip the "committed" bit
install_trans() -- write to the home locations
now the write-back is done.
write_head() -- The log is no longer needed. Discard the log by rewrite the log head with a "log.lh.n ==0"

When calling commit(), the log.lock is not held!

Instead of using the lock, the log.committing blocks other writers from entering the critical section. See begin_op().

xv6's write() is not scalable.

Every operation is on the critical path -- full of slow I/O operations
There is no parallelism -- many expensive flush operations (see bwrite()->iderw()->idestart())
It's OK for an IDE hard disk.

opportunities to improve:

buffer the updates in the buffer cache -- no I/O unless instructed by the user. extra benefit: write absorption
batched commit -- commit-per-write is naive
multiple logging areas for write operations -- don't always have to wait for others when editing different parts of the FS
delayed write-back to home locations -- your data is secure once persisted in the log. You have to keep the data in the log until written back

Tree indexes and their overhead

B+-tree

Excellent read
expensive write. only sequential writes is better.

B-epsilon tree

buffered write at each tree node. random write is better than B+-tree
When a buffer is full, the updates are pushed down to its children nodes.
read is more expensive than B+-tree.

For consistency, the above structures may have to employ expensive logging or COW mechanisms.

LSM-tree (log-structured merge tree)

Log write
multiple sorted layers.
Merge-sorts lazily.
Read can be even more expensive.
Write is OK at the beginning -- all log-writes. But it becomes worse as the tree grows.