There are a large number of file systems derived from the original Unix file system. They all have a great many characteristics in common, so we can discuss them somewhat generically. This will be something of a lowest common denominator, simplified discussion; after we've given the basic idea we can go on to some of the details that make them practical.
An inode-based file system will divide a disk partition into five parts: the superblock, a bunch of "index nodes," or inodes, the data blocks, and two groups of bitmaps used to represent allocated and freed inodes and data blocks, respectively. Here's a picture showing the layout of the five regions on the disk.
We won't be describing these areas in the order shown above; we'll talk about the superblock, then the inodes, then the data blocks, and then come back to the bitmaps.
The first block of the filesystem is called the superblock. The superblock gives information regarding the tunable parameters of the filesystem: the number of inodes, the number of data blocks, the size of the data blocks, and so on. It may also include information such as a volume name to identify the partition.
Following the superblock is a set of data structures called inodes, with exactly one inode per disk file. So far as the filesystem data structures are concerned, the inodes represent the files.
A file's inode is quite small; typically somewhere between 64 and 256 bytes (the Ext2fs default is 128). It includes all the information regarding the file it describes. Taking a look at the documentation for the Linux ext2 file system, this includes:
... pointers to the filesystem blocks which contain the data held in the object and all of the metadata about an object except its name. The metadata about an object includes the permissions, owner, group, flags, size, number of blocks used, access time, change time, modification time, deletion time, number of links, fragments, version (for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).
So, it contains a lot of permission, ownership, and file type information, as well as pointers to all the data blocks containing the data making up the file (for a regular file or a directory). It looks something like this:
Everything towards the top of the inode is "metadata": the UID (User ID) and Group ID (GID) of the file owner, the modification time of the file, the file type (regular file, directory, character special, block special....), the length of the file (not relevant if it's a special file), and a bunch more (the etc entries). Exactly what all is in the inode varies with which flavor of Unix-like file system you're looking at.
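On a running system, most of these fields can be read back through the stat system call. Here is a minimal sketch in Python, assuming a POSIX system; the helper name inode_summary is made up for this example. Notice that the file's name is not among the fields.

```python
import os
import stat
import tempfile

def inode_summary(path):
    """Return a few of the metadata fields the kernel reports for path."""
    st = os.lstat(path)  # lstat describes the object itself, not a link's target
    if stat.S_ISDIR(st.st_mode):
        kind = "directory"
    elif stat.S_ISREG(st.st_mode):
        kind = "regular"
    else:
        kind = "other"
    return {
        "inode": st.st_ino,
        "type": kind,
        "permissions": oct(stat.S_IMODE(st.st_mode)),
        "uid": st.st_uid,
        "gid": st.st_gid,
        "size": st.st_size,
        "links": st.st_nlink,
        "mtime": st.st_mtime,
    }

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "example.txt")
        with open(path, "w") as f:
            f.write("hello\n")
        print(inode_summary(path))
```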
The pointers, indirect pointer, and double indirect pointer will be described later, when we talk about regular files.
The data blocks are just what the name implies: they contain the actual data for the filesystem object.
The bitmaps keep track of allocated and free inodes (for the inode bitmap) and data blocks (for the data block bitmap). Every inode (and every data block) has a corresponding bit in the bitmaps. The state of that bit describes the state of its object: allocated or free.
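The bitmap idea can be sketched in a few lines of Python. This is an illustration only, not how any particular file system lays out or searches its bitmaps; the class name Bitmap and the first-free search policy are assumptions made for the example.

```python
class Bitmap:
    """One bit per object: 1 = allocated, 0 = free."""

    def __init__(self, nbits):
        self.nbits = nbits
        self.bits = bytearray((nbits + 7) // 8)

    def is_allocated(self, n):
        return bool(self.bits[n // 8] & (1 << (n % 8)))

    def allocate(self):
        """Find the first free object, mark it allocated, return its number."""
        for n in range(self.nbits):
            if not self.is_allocated(n):
                self.bits[n // 8] |= 1 << (n % 8)
                return n
        raise OSError("no free objects left")

    def free(self, n):
        """Clear the bit for object n so it can be handed out again."""
        self.bits[n // 8] &= ~(1 << (n % 8)) & 0xFF
```

The same structure serves for both bitmaps; only the meaning of the object number (inode number or data block number) differs.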
A regular file is represented by an inode and a bunch of data blocks. As stated above, the inode contains almost all the metadata for the file, and pointers to the file data. These pointers are simply data block numbers, and tell where the data for the file is. So the first data pointer points to the first block with data for the file, the second points to the second block, and so forth.
In designing the inode, this gives us a tradeoff we have to make: to represent a large file, we need to have a data pointer for each block. But the vast majority of files are small, so this results in a lot of wasted space. In the case of Ext2fs, there are twelve data block pointers in the inode; with a 4K block size, this limits files to only 48K. That's unacceptable.
The solution is to create indirect, double indirect, and triple indirect pointers. When we run out of pointers in the inode, we point to a data block that is, itself, full of pointers to data blocks. With 32-bit pointers this gives 1024 pointers in a 4K data block, so an indirect block increases the possible file size to a little over 4 MB.
When 4MB is too small for the file, we go on to a double-indirect pointer. This points to a data block full of pointers to indirect blocks; this gives us 1048576 (1024 * 1024) blocks, for a maximum file size of 4 gigabytes.
There are even a few applications that need files bigger than this! For instance, movie files can easily be bigger than this. The solution is to go another step to a triple-indirect pointer; it points to a data block full of double-indirect pointers. At this point a file can be 4 terabytes. Note that some Unix-like file systems don't go as far as triple-indirect pointers (Ext2fs does).
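The arithmetic above can be checked with a short Python sketch, using the same assumptions as the text (twelve direct pointers, 4K blocks, 32-bit block numbers). The function names are invented for the example. The exact limits come out slightly larger than the round figures, because each level adds to the blocks reachable through the previous ones.

```python
BLOCK_SIZE = 4096                        # 4K blocks, as in the text
POINTER_SIZE = 4                         # 32-bit block numbers
DIRECT = 12                              # direct pointers in the inode
PER_BLOCK = BLOCK_SIZE // POINTER_SIZE   # 1024 pointers per block

def max_blocks(levels):
    """Blocks addressable using direct pointers plus `levels` of indirection."""
    return DIRECT + sum(PER_BLOCK ** k for k in range(1, levels + 1))

def locate(n):
    """Say which pointer chain reaches block n of a file (counting from 0).

    Returns (level, indices): level 0 is a direct pointer; otherwise
    indices lists the slot used in each pointer block along the way.
    """
    if n < DIRECT:
        return 0, [n]
    n -= DIRECT
    for level in (1, 2, 3):
        span = PER_BLOCK ** level
        if n < span:
            indices = []
            for k in range(level - 1, -1, -1):
                indices.append(n // PER_BLOCK ** k)
                n %= PER_BLOCK ** k
            return level, indices
        n -= span
    raise ValueError("block number beyond triple-indirect range")

if __name__ == "__main__":
    for levels in range(4):
        print(levels, "levels of indirection:", max_blocks(levels) * BLOCK_SIZE, "bytes")
```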
Here's a picture showing the structure of a file in a Unix-like file system:
As you can see, the inode serves as the root of a tree. There are several blocks pointed to directly by the inode; there is an indirect block, and there is a double indirect block. The figure doesn't show a triple indirect block because it was getting too messy already!
The other basic file type is the directory. A directory has an inode (just like any other file), but the data in its data blocks is structured as several entries. Here's a picture.
So it's just a lookup table, mapping filenames to the inode numbers of
the files in the directory. The
first two entries are always "." and "..".
. is a pointer back to this very directory, and
.. is a pointer back to this directory's parent in the
directory structure.
Notice some of the things that are not in the directory: ownership, file size, file type, file permissions. These are all over in the inode itself.
So here's a typical (incomplete) Unix directory structure. The first
figure shows an intermediate level of detail: the idea here is to
show the directories. The root of the structure is "/"
(notice that this name doesn't actually appear). The top-level
directories are bin, etc, home,
and usr. bin contains ls, and
home contains a single subdirectory, pfeiffer.
Normally, we refer to the files by their full pathnames, not just by
their directory entries. So ls located in
bin located in / is called
/bin/ls.
Now, remember that each of these directories, and the single regular
file in the figure (/bin/ls) are represented by inodes
and data blocks. Going one level of detail down in the figure looks
like this.
In this figure, files are in the dashed boxes. Every file consists of
an inode and its associated data blocks. For the five directories in
the figure, the inode is marked as being a directory (the
D), and its data blocks are structured. The indices in
the entries refer to the inodes of the files in the figure. For the
one regular file, the inode is marked as being a regular file (the
R), and the data blocks simply contain data.
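Looking up a pathname is then just a walk through these tables, one directory at a time. The following Python sketch models the example tree with a toy inode table; the inode numbers and the placeholder contents of ls are invented for the illustration.

```python
# Toy inode table: inode number -> (type, contents).
# "D" inodes hold name -> inode number entries; "R" inodes hold plain data.
INODES = {
    2: ("D", {".": 2, "..": 2, "bin": 3, "etc": 4, "home": 5, "usr": 6}),
    3: ("D", {".": 3, "..": 2, "ls": 8}),
    4: ("D", {".": 4, "..": 2}),
    5: ("D", {".": 5, "..": 2, "pfeiffer": 7}),
    6: ("D", {".": 6, "..": 2}),
    7: ("D", {".": 7, "..": 5}),
    8: ("R", b"(contents of ls)"),
}
ROOT = 2   # the root directory; its ".." entry points back to itself

def lookup(path):
    """Resolve an absolute pathname to an inode number."""
    ino = ROOT
    for name in path.split("/"):
        if not name:
            continue                      # skip the empty pieces around slashes
        kind, entries = INODES[ino]
        if kind != "D":
            raise NotADirectoryError(path)
        if name not in entries:
            raise FileNotFoundError(path)
        ino = entries[name]
    return ino
```

Because "." and ".." are ordinary entries, a path like /home/pfeiffer/../../bin/ls needs no special handling in this walk.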
Normally, the directory structure is shown without the implementation details. This structure would be drawn like this.
One last thing to mention about directory entries is that their format is another of the distinctions between various members of the Unix file system family. Some of the earliest ones just had a fixed-size filename and the inode number in each entry. The current implementation of ext2 uses a singly-linked list of directory entries with variable-size file names (up to 255 bytes long).
The pointers from the directories to directories and files are called "links", or sometimes "hard links" (to distinguish them from soft links, which we'll be covering later).
An obvious question to ask here is, "what's to stop two directory entries from linking to the same file?" And the answer turns out to be "absolutely nothing." It really is possible to do just that.
Now you can see why a lot of information that you might expect to be in the directory entry is in the file itself instead: so if there are two links to the file, the file still has exactly the same permissions regardless of which one you open it with, it has the same owner, and so forth.
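This is easy to try for yourself. The Python sketch below (assuming a POSIX system and a file system that supports hard links; the function name is made up) creates a file, gives it a second name, and returns what stat reports through each name.

```python
import os
import tempfile

def make_two_names(directory):
    """Create a file, give it a second name, and return stat results for both."""
    first = os.path.join(directory, "f")
    second = os.path.join(directory, "g")
    with open(first, "w") as fh:
        fh.write("same data\n")
    os.chmod(first, 0o640)
    os.link(first, second)      # add a second directory entry for the same inode
    return os.stat(first), os.stat(second)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        a, b = make_two_names(d)
        print(a.st_ino == b.st_ino, oct(a.st_mode), oct(b.st_mode))
```

Both names report the same inode number and the same mode, since there is only one inode to ask.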
The figure above shows a file with two names, in two directories, at
two different levels in the directory hierarchy: /d1/e1/f
and /d2/g (for simplicity, the figure doesn't show the
. or the .. links). Which is its "real"
name? Both are. Neither name is more real than the other, and there
is no way to determine which link was set up first. The file could
quite easily have even been created as yet another name, which has
been deleted in the meantime!
One restriction on this, by the way, is directories themselves. If a
directory has entries in two parent directories, which one does the
.. pointer point to? Since there is no right answer,
this is forbidden.
The file's inode contains a "link count" which tells how many directory entries are currently pointing to the file. The kernel separately keeps track, in memory, of how many processes have the file open. The inode and its data blocks are only freed when both counts have dropped to zero. So you can actually remove a file from all directories, but it won't go away until the last process that had it open closes it.
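Here is a small Python sketch of a file outliving its last name, assuming a typical local file system on Linux (network file systems such as NFS may behave differently). The function name is invented for the example.

```python
import os
import tempfile

def unlink_while_open(directory):
    """Remove a file's only name while it is open; report what is left."""
    path = os.path.join(directory, "scratch")
    with open(path, "w+") as fh:
        fh.write("still here\n")
        fh.flush()
        os.unlink(path)                               # last directory entry is gone
        links_after = os.fstat(fh.fileno()).st_nlink  # link count seen via the open file
        fh.seek(0)
        data = fh.read()                              # the data is still reachable
    return links_after, os.path.exists(path), data

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(unlink_while_open(d))
```

Once the with block closes the file, nothing refers to the inode any more and the file system is free to reclaim it.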
While having multiple links to a file has a few uses, nearly all of them are better served today by symbolic (or soft) links, which we'll discuss in a second. Multiple hard links to a file were developed back when Unix was first developed, while symbolic links came along much later. I'll go out on a limb and say that at this point they are pretty much a historical artifact (and figure that if I'm too far off base, somebody who comes across these notes on my web site will correct me!).
Over time, a number of extra capabilities have been developed as part of Unix-like file systems. Some of these have actually been around since the beginning, but I'm including them here because they are providing features that you wouldn't really expect from a file system.
The normal Unix access to devices is also through the file system.
The inode's type field can specify that, instead of a file
or a directory, the inode is the hook to access a block or a character
device. In this case, the device major and minor numbers are also in
the inode, and when the file is opened the file operations tables from
that device driver are connected to the process, instead of the
filesystem's normal file operations. By convention, Unix systems put
special files in the /dev directory; they could actually
go anywhere.
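The type field and the device numbers are visible from user space through stat. A minimal Python sketch, assuming a POSIX system; the helper names file_kind and device_numbers are made up for the example.

```python
import os
import stat

def file_kind(path):
    """Classify path by the type field of its inode."""
    mode = os.lstat(path).st_mode
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISBLK(mode):
        return "block device"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISFIFO(mode):
        return "fifo"
    if stat.S_ISLNK(mode):
        return "symbolic link"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"

def device_numbers(path):
    """Return the (major, minor) numbers stored for a device special file."""
    st = os.lstat(path)
    return os.major(st.st_rdev), os.minor(st.st_rdev)

if __name__ == "__main__":
    print(file_kind("/dev/null"), device_numbers("/dev/null"))
```

The device numbers are only meaningful for character and block special files; for other file types st_rdev carries no useful information.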
This use of the file system has been around since the very first Unix versions.
Another use of the file system is to provide an interprocess
communication method. We can create a "file" of type
pipe (also called a FIFO, for first-in-first-out). Once
we do this, we can use the "file" to communicate between two
processes. Even though the file is there in the filesystem, when you
write to it the writes only go to internal buffers in the kernel;
nothing ever goes out to disk. And then the other program can read
them.
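A minimal sketch of this in Python, assuming a POSIX system. To keep the example self-contained, two threads stand in for the two processes; the function name is invented for the example.

```python
import os
import tempfile
import threading

def fifo_round_trip(message):
    """Send message through a named pipe from one thread to another."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "pipe")
        os.mkfifo(path)                       # creates an inode of type FIFO

        def writer():
            with open(path, "w") as out:      # blocks until a reader opens the pipe
                out.write(message)

        t = threading.Thread(target=writer)
        t.start()
        with open(path) as inp:               # blocks until the writer opens the pipe
            received = inp.read()             # returns at EOF, when the writer closes
        t.join()
        size_afterwards = os.stat(path).st_size   # the "file" itself holds no data
    return received, size_afterwards

if __name__ == "__main__":
    print(fifo_round_trip("hello through the pipe"))
```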
This is a really easy way to do IPC on a single computer, without learning about sockets and things. But it doesn't work across several computers (not even if you've got NFS) so you're probably better off just biting the bullet and learning how to use sockets.
In modern Unix systems, anonymous pipes between processes use the same code as named pipes.
A very common thing to do these days is to create a "file" that is a symbolic link (also called a soft link). A symbolic link is just an alias for a file: the link puts the name of the target file in the link's inode or a data block. (Ext2fs does either, automatically: if the name is short enough it goes in the inode; if it won't fit there, it goes in a data block.)
This turns out to be just incredibly useful. For example, my class
directory structure takes the form
~pfeiffer/classes/classnum/sem/semester;
so, the files for CS 474 in Fall, 2005 are in
~pfeiffer/classes/474/sem/f05. I have quite a bit of
software that I keep using from semester to semester, so I've got a
symbolic link in ~pfeiffer/classes/474/sem/ called
this, which hooks up to f05. Now my
software can all just refer to
~pfeiffer/classes/474/sem/this, and as long as I remember
to update the link every semester, everything works.
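Updating such a link each semester could be scripted along these lines. This is a sketch only, assuming a POSIX system; the function name point_this_at and the temporary name this.new are made up for the example.

```python
import os
import tempfile

def point_this_at(sem_dir, semester):
    """Make sem_dir/this a symbolic link to the given semester directory."""
    link = os.path.join(sem_dir, "this")
    tmp = link + ".new"
    os.symlink(semester, tmp)   # the target name is stored as text in the link
    os.replace(tmp, link)       # rename over the old link in a single step

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as sem:
        os.mkdir(os.path.join(sem, "f05"))
        point_this_at(sem, "f05")
        print(os.readlink(os.path.join(sem, "this")))
```

Creating the new link under a temporary name and renaming it into place means there is never a moment when this is missing.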
You can see several fundamental differences between hard links and symbolic links here.
When there are two hard links to a file, both are equally the "real" name of the file. When you've got a hard link and a soft link, the hard link is the real name, and the soft link is an alias.
Because a soft link is just an alias, it's possible to have a soft link to a directory (in fact, my example does just that).
You can actually have more complex structures than I've just
described. To take another example, there are several different
programs that provide the yacc parser generator: the
original AT&T yacc, and the GNU bison.
In the Debian Linux distribution, there is a directory called
/etc/alternatives which is used to keep track of
which alternative version of a program is used to provide these
functions. On my machine at home, /usr/bin/yacc is a
soft link to /etc/alternatives/yacc, which in turn is
a symbolic link to /usr/bin/bison.yacc. Replacing
one alternative version of a program with another is much easier
to maintain this way.
Because a soft link is just an alias, it's entirely possible to have a soft link to a file that doesn't exist. This is called a dangling link, and is almost always the result of an error (and I only say "almost" because I'll bet somebody, somewhere has done it on purpose for some good reason. I just don't know what it is).
For that matter, it's possible to create all kinds of trouble for yourself with soft links. You can do things like create a cycle between two soft links, and never have an actual link to a real file! This would be extremely difficult to detect reliably; what's done instead is to count how many symbolic links have been followed while resolving a name, and decide at some point that it must be in a cycle (Linux gives up after 40 links, and the operation fails with the error ELOOP).
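Both a dangling link and a cycle are easy to produce. The Python sketch below assumes a POSIX system; the file names and the function name are invented for the example.

```python
import errno
import os
import tempfile

def dangling_and_cycle(directory):
    """Create a dangling link and a two-link cycle; report how each behaves."""
    dangling = os.path.join(directory, "dangling")
    os.symlink("no-such-file", dangling)    # creating the link succeeds anyway

    a = os.path.join(directory, "a")
    b = os.path.join(directory, "b")
    os.symlink("b", a)                      # a -> b
    os.symlink("a", b)                      # b -> a

    try:
        open(a).close()
        cycle_error = None
    except OSError as e:
        cycle_error = e.errno               # the kernel gave up following links

    return {
        "link_exists": os.path.lexists(dangling),    # the link itself is there
        "target_exists": os.path.exists(dangling),   # following it finds nothing
        "cycle_error": cycle_error,
    }

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        result = dangling_and_cycle(d)
        print(result, result["cycle_error"] == errno.ELOOP)
```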
While nearly all Unix-like file systems support symbolic links, support elsewhere came much later: NTFS, for instance, has only had them since Windows Vista.
Unix uses a simple, straightforward protection mechanism for files. In essence, we can define a level of protection for the account that created the file, another level of protection for an arbitrary set of accounts (called a group), and a third level of protection for all users.
A nine-bit bitmask defines these protections: one bit each for read, write, and execute permission for the owner; then three similar bits for the group; then three similar bits for all other accounts. We frequently refer to this bitmask with a three-digit octal number; 640 means "read and write permission for the owner, read-only for the group, and no access for anybody else."
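Decoding the bitmask by hand is a handy exercise. A minimal Python sketch (the function name rwx is made up; Python's standard library offers stat.filemode for the same job):

```python
def rwx(mode):
    """Turn a nine-bit permission mask into the familiar rwxrwxrwx string."""
    letters = "rwxrwxrwx"          # owner, group, other: three bits each
    return "".join(
        letter if mode & (1 << (8 - i)) else "-"
        for i, letter in enumerate(letters)
    )

if __name__ == "__main__":
    print(rwx(0o640))
```

Each octal digit covers exactly one of the three groups of bits, which is why octal is the customary notation.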
There are frequent claims that more fine-grained control than this is required; we might want some level of permission (rwx maybe) for the owner, a different level for some group (w maybe, for someone who could write to the file but could not read or execute it), a different level (rw maybe) for some other group, and so forth.
So, today most Unix-like filesystems let you attach to each file an arbitrary list of accounts and groups, each with its own permissions. This is called an access control list (ACL).
Frankly, I'm skeptical of just how necessary ACLs really are. I've certainly never had an occasion to have to learn how to set one up. However, NTFS has supported them for a long time, and it seemed like there were frequent complaints that Linux didn't have them, so projects had to use Windows. As a result, Ext2fs has them.
An early decision in the development of Unix was to make extensive use of buffering in the kernel and of reordering of writes to improve performance. Unfortunately, this means that at any given time, it's nearly certain that the actual data structures on disk are in an inconsistent state. For instance, we might have a directory entry pointing to an inode, but the inode may not actually have been allocated and set up for a file yet. Or a file may have been deleted from all directories, but might still have an inode. And there are just plain more possible examples here than I can even pretend to list.
The "right" way to handle this is that before we shut down the computer,
we synchronize the file system so it is internally consistent and all
writes have actually happened. There is even a sync
command to do exactly this.
Unfortunately, we can't always be sure the file system is synced when
the machine is shut down. It may have shut down because somebody
tripped over the power cord. It may have even shut down because of a
kernel bug. In these cases, the traditional answer is to do a
filesystem check when the machine is rebooted; there is a command
called fsck to do just this. This takes a long
time, since it basically has to traverse the entire directory structure
of the file system and compare it against all the inodes to find
inconsistencies.
An alternative approach is to use a journalling filesystem, which keeps a log of changes for use in fixing a broken filesystem at boot time. An example of a journalling filesystem is Ext3fs, the Linux third extended filesystem.
The first thing to point out about Ext3fs is that it serves as an example of the flexibility of Ext2fs: it isn't actually a new filesystem. One of the nice features of Ext2fs is the possibility of extensions: part of the superblock is a description of what extensions are enabled for the filesystem. Ext3fs is actually an Ext2fs extension: you can mount a cleanly unmounted Ext3fs filesystem on a machine that only understands Ext2fs, and you just won't have the journalling capabilities.
Ext3fs actually has several different modes of operation, which trade off performance against consistency. The safest (and slowest) mode performs all writes to the journal first, and then to the actual filesystem. Once the write has been made to the actual filesystem, it is deleted from the journal.
On a crash, only the journal needs to be inspected. Anything that appears in the journal but not in the actual filesystem is copied out to it, very quickly bringing the actual filesystem up to date without doing a full filesystem check.
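The write-then-replay cycle can be modelled in a few lines of Python. This is a toy sketch of the idea only: the class and method names are invented, the crash is simulated with a flag, and a real journal groups changes into transactions rather than logging single blocks.

```python
class JournaledDisk:
    """Toy model: every write is logged before it touches the 'real' blocks."""

    def __init__(self):
        self.blocks = {}     # block number -> data, the actual filesystem
        self.journal = []    # pending (block number, data) records

    def write(self, block, data, crash_before_apply=False):
        self.journal.append((block, data))      # 1. log the change
        if crash_before_apply:
            return                              # simulated power failure
        self.blocks[block] = data               # 2. update the filesystem
        self.journal.remove((block, data))      # 3. retire the log record

    def recover(self):
        """Replay whatever is still in the journal; return how many records."""
        replayed = len(self.journal)
        for block, data in self.journal:
            self.blocks[block] = data
        self.journal.clear()
        return replayed

if __name__ == "__main__":
    disk = JournaledDisk()
    disk.write(7, b"old")
    disk.write(7, b"new", crash_before_apply=True)
    print(disk.recover(), disk.blocks[7])
```

Recovery time depends on the size of the journal, not the size of the filesystem, which is the whole point.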
One thing to mention here is that the system, with commendable
paranoia, assumes the authors aren't perfect and will do a
fsck occasionally on general principles (typically every
30 reboots or six months). I've seen an occasional claim that this
means NTFS is more reliable than Ext3fs, since it never does a check
unless there was an unclean shutdown. No, it's just a measure of the
authors' confidence, not whether that confidence is justified.