fsinterface, section 4.

4. Name translation issues

Each of the systems described includes a mechanism for performing pathname-to-internal-representation translation. The style of the name translation function differs considerably among the three systems. As described above, the AT&T and DEC systems retain the namei function. The two are quite different, however, as the ULTRIX interface uses the namei calling convention introduced in 4.3BSD. The parameters and context for the name lookup operation are collected in a nameidata structure which is passed to namei. Intent to create or delete the named file is declared in advance, so that the final directory scan in namei may retain information such as the offset in the directory at which the modification will be made. Filesystems that use such mechanisms to avoid redundant work must therefore lock the directory to be modified so that it may not be modified by another process before the operation completes. In the System V filesystem, as in previous versions of UNIX, this information is stored in the per-process user structure by namei for use by a low-level routine called after performing the actual creation or deletion of the file itself. In 4.3BSD and in the GFS interface, these side effects of namei are stored in the nameidata structure given as argument to namei, which is also presented to the routine implementing file creation or deletion.
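The idea of declaring intent in advance and carrying the side effects of the final directory scan in the lookup structure can be illustrated with a small sketch. Everything here is hypothetical: the structure nameidata_sketch, its field names, the intent codes, and the toy fixed-slot directory are invented for illustration and are not the 4.3BSD or ULTRIX definitions. The sketch assumes the directory stays locked between the scan and the creation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical intent codes; the real names and values may differ. */
enum ni_op { NI_LOOKUP, NI_CREATE, NI_DELETE };

/* Illustrative stand-in for a nameidata-style structure. */
struct nameidata_sketch {
    const char *ni_name;   /* final component to translate (input) */
    enum ni_op  ni_op;     /* intent declared in advance (input) */
    int         ni_found;  /* result: 1 if the name exists */
    long        ni_offset; /* side effect: slot where the change applies */
};

/* A toy directory: fixed-size slots; an empty name marks a free slot. */
#define SLOTS 4
struct toy_dir { char names[SLOTS][16]; };

/* Scan the directory once, recording where a create or delete applies. */
static void scan_final(const struct toy_dir *d, struct nameidata_sketch *ni)
{
    long free_slot = -1;

    ni->ni_found = 0;
    ni->ni_offset = -1;
    for (long i = 0; i < SLOTS; i++) {
        if (d->names[i][0] == '\0') {
            if (free_slot < 0)
                free_slot = i;          /* remember the first free slot */
        } else if (strcmp(d->names[i], ni->ni_name) == 0) {
            ni->ni_found = 1;
            ni->ni_offset = i;          /* useful for a later delete */
            return;
        }
    }
    if (ni->ni_op == NI_CREATE)
        ni->ni_offset = free_slot;      /* useful for the later create */
}

/* Creation reuses the saved offset instead of scanning again. */
static int create_entry(struct toy_dir *d, const struct nameidata_sketch *ni)
{
    assert(ni->ni_op == NI_CREATE);
    if (ni->ni_found || ni->ni_offset < 0)
        return -1;
    strncpy(d->names[ni->ni_offset], ni->ni_name, 15);
    d->names[ni->ni_offset][15] = '\0';
    return 0;
}
```

The creation routine performs no directory scan of its own; it relies entirely on the offset left behind by the lookup, which is why the directory must remain locked in the interval.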

The ULTRIX namei routine is responsible for the generic parts of the name translation process, such as copying the name into an internal buffer, validating it, interpolating the contents of symbolic links, and indirecting at mount points. As in 4.3BSD, the name is copied into the buffer in a single call, according to the location of the name (user or kernel address space). After determining the type of the filesystem at the start of translation (the current directory or root directory), namei calls that filesystem's namei entry with the same structure it received from its caller. The filesystem-specific routine translates the name, component by component, as long as no mount points are reached. It may return after any number of components have been processed. Namei performs any processing at mount points, then calls the correct translation routine for the next filesystem. Network filesystems may pass the remaining pathname to a server for translation, or they may look up the pathname components one at a time. The former strategy would be more efficient, but the latter scheme allows mount points within a remote filesystem without server knowledge of all client mounts.
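The division of labor between the generic routine and the per-filesystem routines might be sketched as follows. This is a toy model under stated assumptions, not the ULTRIX code: the mount table, the structure lookup_state, and the functions fs_translate and namei_sketch are all hypothetical, filesystems are identified by small integers, and the existence of files is not checked. The point is only that the per-filesystem routine may consume many components per call and returns to the generic layer at a mount point.

```c
#include <assert.h>
#include <string.h>

struct lookup_state {
    const char *rest;  /* untranslated remainder of the pathname */
    int fs;            /* filesystem currently being searched */
    int calls;         /* number of per-filesystem calls made */
};

/* Hypothetical mount table: component `name` in `from` leads into `to`. */
struct mount_sketch { int from; const char *name; int to; };
static const struct mount_sketch mounts[] = {
    { 0, "usr", 1 },
    { 1, "remote", 2 },
};
#define NMOUNTS (sizeof mounts / sizeof mounts[0])

/*
 * Per-filesystem routine: consumes components until a mount point is
 * crossed or the path is exhausted. Returns the index of the mount
 * crossed, or -1 when translation is complete.
 */
static int fs_translate(struct lookup_state *st)
{
    st->calls++;
    for (;;) {
        while (*st->rest == '/')
            st->rest++;
        size_t len = strcspn(st->rest, "/");
        if (len == 0)
            return -1;                       /* nothing left to translate */
        for (size_t m = 0; m < NMOUNTS; m++) {
            if (mounts[m].from == st->fs &&
                strlen(mounts[m].name) == len &&
                strncmp(mounts[m].name, st->rest, len) == 0) {
                st->rest += len;
                return (int)m;               /* hand back to the generic layer */
            }
        }
        st->rest += len;                     /* ordinary component: keep going */
    }
}

/* Generic driver: handles mount points, then calls the next filesystem. */
static int namei_sketch(const char *path, int *calls)
{
    struct lookup_state st = { path, 0, 0 };
    int m;

    while ((m = fs_translate(&st)) >= 0)
        st.fs = mounts[m].to;
    *calls = st.calls;
    return st.fs;
}
```

In this model a four-component path that crosses two mount points costs three per-filesystem calls, and a path that crosses none costs one, regardless of its length.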

The AT&T namei interface is presumably the same as that in previous UNIX systems, accepting the name of a routine to fetch pathname characters and an operation (one of: lookup, lookup for creation, or lookup for deletion). It translates, component by component, as before. If it detects that a mount point crosses to a remote filesystem, it passes the remainder of the pathname to the remote server. A pathname-oriented request other than open may be completed within the namei call, avoiding return to the (unmodified) system call handler that called namei.

In contrast to the first two systems, Sun's VFS interface has replaced namei with lookupname. This routine simply calls a new pathname-handling module to allocate a pathname buffer and copy in the pathname (copying a character per call), then calls lookuppn. Lookuppn performs the iteration over the directories leading to the destination file; it copies each pathname component to a local buffer, then calls the filesystem lookup entry to locate the vnode for that file in the current directory. Per-filesystem lookup routines may translate only one component per call. For creation of new files and deletion of existing files, the lookup operation is unmodified; the lookup of the final component only serves to check for the existence of the file. The subsequent creation or deletion call, if any, must repeat the final name translation and associated directory scan. For new file creation in particular, this is rather inefficient, as file creation requires two complete scans of the directory.
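The component-at-a-time iteration described for lookuppn can be sketched as a simple loop. The code below is an illustration only: the toy tree, the integer "vnodes", the call counter, and the names fs_lookup and lookuppn_sketch are hypothetical and do not reproduce Sun's implementation. It shows that the generic layer makes one call into the filesystem for every component of the pathname.

```c
#include <assert.h>
#include <string.h>

#define MAXCOMP 32

/* A toy namespace: each node names its parent; node 0 is the root. */
struct node { int parent; const char *name; };
static const struct node tree[] = {
    { -1, "" },     /* 0: root */
    {  0, "usr" },  /* 1 */
    {  1, "src" },  /* 2 */
    {  2, "sys" },  /* 3 */
};
#define NNODES ((int)(sizeof tree / sizeof tree[0]))

static int lookup_calls;  /* counts calls into the filesystem */

/* Per-filesystem lookup entry: exactly one component per call. */
static int fs_lookup(int dir, const char *comp)
{
    lookup_calls++;
    for (int i = 1; i < NNODES; i++)
        if (tree[i].parent == dir && strcmp(tree[i].name, comp) == 0)
            return i;
    return -1;
}

/* Generic iteration: copy out each component, then call the filesystem. */
static int lookuppn_sketch(const char *path)
{
    char comp[MAXCOMP];
    int cur = 0;  /* start at the root */

    while (cur >= 0) {
        while (*path == '/')
            path++;
        size_t len = strcspn(path, "/");
        if (len == 0)
            break;                    /* pathname exhausted */
        if (len >= sizeof comp)
            return -1;                /* component too long for the buffer */
        memcpy(comp, path, len);
        comp[len] = '\0';
        path += len;
        cur = fs_lookup(cur, comp);
    }
    return cur;
}
```

A three-component pathname therefore costs three filesystem calls plus three component copies, even when all three directories reside in the same filesystem.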

Several of the important performance improvements in 4.3BSD were related to the name translation process [McKusick85][Leffler84]. The following changes were made:

1.: A system-wide cache of recent translations is maintained. The cache is separate from the inode cache, so that multiple names for a file may be present in the cache. The cache does not hold ``hard'' references to the inodes, so that the normal reference pattern is not disturbed.
2.: A per-process cache is kept of the directory and offset at which the last successful name lookup was done. This allows sequential lookups of all the entries in a directory to be done in linear time.
3.: The entire pathname is copied into a kernel buffer in a single operation, rather than using two subroutine calls per character.
4.: A pool of pathname buffers is held by namei, avoiding allocation overhead.
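The effect of the second change, the per-process cache of the last lookup position, can be demonstrated with a small model. The sketch below is an assumption-laden illustration rather than the 4.3BSD code: it models a single directory as an array of names, the structure scan_cache and the probe counter are hypothetical, and a real cache would also record which directory the saved offset belongs to.

```c
#include <assert.h>
#include <string.h>

#define NENT 5
static const char *dirents[NENT] = { "a", "b", "c", "d", "e" };

/* Hypothetical per-process cache: where the last successful lookup ended. */
struct scan_cache { int valid; int next; };

static long probes;  /* directory entries examined so far */

/* Search from the cached position, wrapping around the directory once. */
static int dir_lookup(struct scan_cache *c, const char *name)
{
    int start = c->valid ? c->next : 0;

    for (int n = 0; n < NENT; n++) {
        int i = (start + n) % NENT;
        probes++;
        if (strcmp(dirents[i], name) == 0) {
            c->valid = 1;
            c->next = (i + 1) % NENT;  /* resume just past this entry */
            return i;
        }
    }
    c->valid = 0;                      /* miss: forget the saved position */
    return -1;
}
```

Looking up all five names in directory order examines five entries in total when the cache is carried from call to call, and fifteen (1+2+3+4+5) when every lookup starts from the beginning of the directory.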

All of these performance improvements from 4.3BSD are well worth using within a more generalized filesystem framework. The generalization of the structure may otherwise make an already-expensive function even more costly. Most of these improvements are present in the GFS system, as it derives from the beta-test version of 4.3BSD. The Sun system uses a name-translation cache generally like that in 4.3BSD. The name cache is a filesystem-independent facility provided for the use of the filesystem-specific lookup routines. The Sun cache, like that first used at Berkeley but unlike that in 4.3, holds a ``hard'' reference to the vnode (increments the reference count). The ``soft'' reference scheme in 4.3BSD cannot be used with the current NFS implementation, as NFS allocates vnodes dynamically and frees them when the reference count returns to zero rather than caching them. As a result, fewer names may be held in the cache than (local filesystem) vnodes, and the cache distorts the normal reference patterns otherwise seen by the LRU cache. As the name cache references overflow the local filesystem inode table, the name cache must be purged to make room in the inode table. Also, to determine whether a vnode is in use (for example, before mounting upon it), the cache must be flushed to free any cache reference. These problems should be corrected by the use of the soft cache reference scheme.
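The distinction between hard and soft cache references can be made concrete with a sketch. One plausible way to implement a soft reference, assumed here for illustration, is to store a generation number alongside the pointer and to compare it against the object's current generation on each cache hit. The structures inode_sketch and nc_entry and all function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-core inode: `id` changes whenever the slot is reused. */
struct inode_sketch { unsigned id; int refcount; };

/* Hypothetical name cache entry holding a soft reference. */
struct nc_entry {
    const char *name;
    struct inode_sketch *ip;
    unsigned id;  /* generation of *ip when the entry was made */
};

/* Enter a name without incrementing the reference count. */
static void nc_enter(struct nc_entry *e, const char *name,
                     struct inode_sketch *ip)
{
    e->name = name;
    e->ip = ip;
    e->id = ip->id;
}

/* A hit is honored only if the inode has not been reused since entry. */
static struct inode_sketch *nc_lookup(struct nc_entry *e, const char *name)
{
    if (e->ip == NULL || strcmp(e->name, name) != 0)
        return NULL;
    if (e->ip->id != e->id) {
        e->ip = NULL;  /* stale: the slot now describes another file */
        return NULL;
    }
    return e->ip;
}

/* Reclaim an unreferenced inode slot for another file. */
static void inode_reuse(struct inode_sketch *ip)
{
    assert(ip->refcount == 0);  /* cache entries do not pin the inode */
    ip->id++;
}
```

Because the cache entry never raises the reference count, the inode table's own replacement policy sees only genuine references, and no cache flush is needed to discover whether an object is in use. The scheme does require that inode storage persist after the count reaches zero, which is the property the text notes is missing from the current NFS vnode allocation.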

A further observation on the efficiency of name translation in the current Sun VFS architecture is that the number of subroutine calls used by a multi-component name lookup is dramatically larger than in the other systems. The name lookup scheme in GFS suffers from this problem much less, and does so without violating layering.

A final problem to be considered is synchronization and consistency. As the filesystem operations are more stylized and broken into separate entry points for parts of operations, it is more difficult to guarantee consistency throughout an operation and/or to synchronize with other processes using the same filesystem objects. The Sun interface suffers most severely from this, as it forbids the filesystems from locking objects across calls to the filesystem. It is possible that a file may be created between the time that a lookup is performed and a subsequent creation is requested. Perhaps more strangely, after a lookup fails to find the target of a creation attempt, the actual creation might find that the target now exists and is a symbolic link. The call will either fail unexpectedly, as the target is of the wrong type, or the generic creation routine will have to note the error and restart the operation from the lookup. This problem will always exist in a stateless filesystem, but the VFS interface forces all filesystems to share the problem. This restriction against locking between calls also forces duplication of work during file creation and deletion. This is considered unacceptable.
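The race between lookup and creation, and the restart it forces on the generic layer, can be simulated in a few lines. The sketch below is a toy model under stated assumptions: the directory is a small array, the "other process" is simulated by a hypothetical variable rival_name that causes one intervening creation, and open_create stands in for a generic routine with open-and-create semantics. None of these names come from the systems discussed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SLOTS 4
struct toy_dir { char names[SLOTS][16]; };

/* If non-NULL, a simulated rival process creates this name once. */
static const char *rival_name;

static int find(const struct toy_dir *d, const char *name)
{
    for (int i = 0; i < SLOTS; i++)
        if (d->names[i][0] != '\0' && strcmp(d->names[i], name) == 0)
            return i;
    return -1;
}

/* Filesystem create entry: must rescan, as nothing was kept from lookup. */
static int fs_create(struct toy_dir *d, const char *name)
{
    if (find(d, name) >= 0)
        return EEXIST;
    for (int i = 0; i < SLOTS; i++) {
        if (d->names[i][0] == '\0') {
            strncpy(d->names[i], name, sizeof d->names[i] - 1);
            d->names[i][sizeof d->names[i] - 1] = '\0';
            return 0;
        }
    }
    return ENOSPC;
}

/* The directory is unlocked between calls, so a rival may run here. */
static void run_rival(struct toy_dir *d)
{
    if (rival_name != NULL) {
        fs_create(d, rival_name);
        rival_name = NULL;
    }
}

/*
 * Generic open-with-create: returns 0 if the file was created, 1 if an
 * existing file was found, or a negative errno value on failure.
 */
static int open_create(struct toy_dir *d, const char *name, int *attempts)
{
    *attempts = 0;
    for (;;) {
        ++*attempts;
        if (find(d, name) >= 0)
            return 1;                 /* lookup succeeded: open it */
        run_rival(d);                 /* window between lookup and create */
        int err = fs_create(d, name);
        if (err == 0)
            return 0;
        if (err != EEXIST)
            return -err;
        /* target appeared after the lookup: restart from the lookup */
    }
}
```

Even in the uncontended case the directory is searched twice, once by the lookup and once by the create; with an intervening creation the whole sequence is repeated.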