6.  The Berkeley Proposal

      The Sun VFS interface has been the most widely used of the three described here. It is also the most general of the three, in that filesystem-specific data and operations are best separated from the generic layer. Although it has several disadvantages, which were described above, most of them may be corrected with minor changes to the interface (and, in a few areas, philosophical changes). The DEC GFS has other advantages, in particular the use of the 4.3BSD namei interface and its optimizations. It allows single or multiple components of a pathname to be translated in a single call to the specific filesystem, and thus accommodates filesystems with either preference. The FSS is the least well understood of the three, as there is little public information about the interface; however, its design goals are the least consistent with those of the Berkeley research groups.

      Accordingly, a new filesystem interface has been devised to avoid some of the problems in the other systems. The proposed interface derives directly from Sun's VFS but, like GFS, uses a 4.3BSD-style name lookup interface. Additional context information has been moved from the user structure to the nameidata structure so that name translation may be independent of the global context of a user process. This is especially desirable in any system where kernel-mode servers operate as light-weight or interrupt-level processes, or where a server may store or cache context for several clients. This calling interface has the additional advantage that the call parameters need not all be pushed onto the stack for each call through the filesystem interface; instead, they may be accessed using short offsets from a base pointer (unlike global variables in the user structure).

      The proposed filesystem interface is described very tersely here. For the most part, the data structures and procedures are analogous to those used by VFS, and only the changes are treated in detail. See [Kleiman86] for complete descriptions of the vfs and vnode operations in Sun's interface.

      The central data structure for name translation is the nameidata structure. The same structure is used to pass parameters to namei, to pass these same parameters to filesystem-specific lookup routines, to communicate completion status from the lookup routines back to namei, and to return completion status to the calling routine. For creation or deletion requests, the parameters for the filesystem operation that completes the request are also passed in this same structure. The form of the nameidata structure is:

/*
 * Encapsulation of namei parameters.
 * One of these is located in the u. area to
 * minimize space allocated on the kernel stack
 * and to retain per-process context.
 */
struct nameidata {
		/* arguments to namei and related context: */
	caddr_t	ni_dirp;		/* pathname pointer */
	enum	uio_seg ni_seg;		/* location of pathname */
	short	ni_nameiop;		/* see below */
	struct	vnode *ni_cdir;		/* current directory */
	struct	vnode *ni_rdir;		/* root directory, if not normal root */
	struct	ucred *ni_cred;		/* credentials */

		/* shared between namei, lookup routines and commit routines: */
	caddr_t	ni_pnbuf;		/* pathname buffer */
	char	*ni_ptr;		/* current location in pathname */
	int	ni_pathlen;		/* remaining chars in path */
	short	ni_more;		/* more left to translate in pathname */
	short	ni_loopcnt;		/* count of symlinks encountered */

		/* results: */
	struct	vnode *ni_vp;		/* vnode of result */
	struct	vnode *ni_dvp;		/* vnode of intermediate directory */

/* BEGIN UFS SPECIFIC */
	struct diroffcache {		/* last successful directory search */
		struct	vnode *nc_prevdir;	/* terminal directory */
		long	nc_id;			/* directory's unique id */
		off_t	nc_prevoffset;		/* where last entry found */
	} ni_nc;
/* END UFS SPECIFIC */
};
/*
 * namei operations and modifiers
 */
#define	LOOKUP	0	/* perform name lookup only */
#define	CREATE	1	/* setup for file creation */
#define	DELETE	2	/* setup for file deletion */
#define	WANTPARENT	0x10	/* return parent directory vnode also */
#define	NOCACHE	0x20	/* name must not be left in cache */
#define	FOLLOW	0x40	/* follow symbolic links */
#define	NOFOLLOW	0x0	/* don't follow symbolic links (pseudo) */

      As in current systems other than Sun's VFS, namei is called with an operation request, one of LOOKUP, CREATE or DELETE. For a LOOKUP, the operation is exactly like the lookup in VFS. CREATE and DELETE allow the filesystem to ensure consistency by locking the parent inode (private to the filesystem), and (for the local filesystem) to avoid duplicate directory scans by storing the new directory entry and its offset in the directory in the nameidata structure. This information is intended to be opaque to the filesystem-independent levels. Not all lookups for creation or deletion are actually followed by the intended operation: permission may be denied, the filesystem may be read-only, etc. Therefore, an entry point to the filesystem is provided to abort a creation or deletion operation and to allow release of any locked internal data. After a namei with a CREATE or DELETE flag, the pathname pointer is set to point to the last filename component. Filesystems that choose to implement creation or deletion entirely within the subsequent call to a create or delete entry are thus free to do so.
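
      As an illustration, a creation sequence using this interface might look like the sketch below. The entry names VOP_CREATE and VOP_ABORTOP stand in for the filesystem-specific create and abort entries described above, and the vattr structure describing the desired attributes is an assumption; error handling is abbreviated.

/*
 * Sketch: create a file through the proposed interface.
 * VOP_CREATE and VOP_ABORTOP are assumed names for the
 * filesystem-specific create and abort entry points, and
 * namei is assumed to return an error code.
 */
int
sample_create(pathname, cdir, vap, cred, vpp)
	char *pathname;
	struct vnode *cdir;		/* current directory for translation */
	struct vattr *vap;		/* desired attributes (assumed) */
	struct ucred *cred;
	struct vnode **vpp;
{
	struct nameidata nd;
	int error;

	nd.ni_nameiop = CREATE | WANTPARENT;
	nd.ni_dirp = (caddr_t)pathname;
	nd.ni_seg = UIO_USERSPACE;	/* pathname is in user space */
	nd.ni_cdir = cdir;
	nd.ni_rdir = (struct vnode *)0;	/* use the normal root */
	nd.ni_cred = cred;
	if (error = namei(&nd))
		return (error);
	if (nd.ni_vp != (struct vnode *)0) {
		VOP_ABORTOP(&nd);	/* release locks held for the create */
		return (EEXIST);
	}
	/*
	 * ni_ptr now addresses the last pathname component;
	 * the filesystem completes the creation from the
	 * context saved in the nameidata.
	 */
	return (VOP_CREATE(&nd, vap, vpp));
}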

      The nameidata structure is also used to store the context used during name translation. The current and root directories for the translation are stored here. For the local filesystem, the per-process directory offset cache is also kept here. A file server could leave the directory offset cache empty, could use a single cache for all clients, or could hold caches for several recent clients.
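
      For example, a kernel-mode server might perform a translation on behalf of a client as in the following sketch. The svclient structure and its fields are hypothetical, and namei is again assumed to return an error code.

/*
 * Sketch: per-client name translation in a kernel-mode server.
 * The svclient structure is hypothetical per-client state.
 */
struct svclient {
	struct	vnode *cl_cdir;		/* client's current directory */
	struct	vnode *cl_rdir;		/* client's root, if restricted */
	struct	ucred *cl_cred;		/* client's credentials */
	struct	diroffcache cl_nc;	/* client's directory offset cache */
};

int
server_lookup(cl, path, vpp)
	struct svclient *cl;
	char *path;			/* pathname already in kernel space */
	struct vnode **vpp;
{
	struct nameidata nd;
	int error;

	nd.ni_dirp = (caddr_t)path;
	nd.ni_seg = UIO_SYSSPACE;
	nd.ni_nameiop = LOOKUP | FOLLOW;
	nd.ni_cdir = cl->cl_cdir;
	nd.ni_rdir = cl->cl_rdir;
	nd.ni_cred = cl->cl_cred;
	nd.ni_nc = cl->cl_nc;		/* reuse this client's offset cache */
	if (error = namei(&nd))
		return (error);
	cl->cl_nc = nd.ni_nc;		/* save the cache for the next request */
	*vpp = nd.ni_vp;
	return (0);
}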

      Several other data structures are used in the filesystem operations. One is the ucred structure, which describes a client's credentials to the filesystem. This is modified slightly from the Sun structure; the "accounting" group ID has been merged into the groups array. The actual number of groups in the array is given explicitly to avoid use of a reserved group ID as a terminator. Also, the typedefs introduced in 4.3BSD for user and group ID's have been used. The ucred structure is thus:

/*
 * Credentials.
 */
struct ucred {
	u_short	cr_ref;			/* reference count */
	uid_t	cr_uid;			/* effective user id */
	short	cr_ngroups;		/* number of groups */
	gid_t	cr_groups[NGROUPS];	/* groups */
	/*
	 * The following either should not be here,
	 * or should be treated as opaque.
	 */
	uid_t   cr_ruid;		/* real user id */
	gid_t   cr_svgid;		/* saved set-group id */
};
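
      The explicit group count allows a membership test to scan exactly cr_ngroups entries, with no reserved terminator value. A minimal sketch follows; the function name groupmember is an assumption.

/*
 * Sketch: test whether gid is among the credentials' groups,
 * using the explicit count rather than a terminator.
 */
int
groupmember(gid, cred)
	gid_t gid;
	struct ucred *cred;
{
	register gid_t *gp;

	for (gp = cred->cr_groups; gp < &cred->cr_groups[cred->cr_ngroups]; gp++)
		if (*gp == gid)
			return (1);
	return (0);
}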

      A final structure used by the filesystem interface is the uio structure mentioned earlier. This structure describes the source or destination of an I/O operation, with provision for scatter/gather I/O. It is used in the read and write entries to the filesystem. The uio structure presented here is modified from the one used in 4.2BSD to specify the location of each vector of the operation (user or kernel space) and to allow an alternate function to be used to implement the data movement. The alternate function might perform page remapping rather than a copy, for example.

/*
 * Description of an I/O operation which potentially
 * involves scatter-gather, with individual sections
 * described by iovec, below.  uio_resid is initially
 * set to the total size of the operation, and is
 * decremented as the operation proceeds.  uio_offset
 * is incremented by the amount of each operation.
 * uio_iov is incremented and uio_iovcnt is decremented
 * after each vector is processed.
 */
struct uio {
	struct	iovec *uio_iov;
	int	uio_iovcnt;
	off_t	uio_offset;
	int	uio_resid;
	enum	uio_rw uio_rw;
};

enum	uio_rw { UIO_READ, UIO_WRITE };
/*
 * Description of a contiguous section of an I/O operation.
 * If iov_op is non-null, it is called to implement the copy
 * operation, possibly by remapping, with the call
 *	(*iov_op)(from, to, count);
 * where from and to are caddr_t and count is int.
 * Otherwise, the copy is done in the normal way,
 * treating base as a user or kernel virtual address
 * according to iov_segflg.
 */
struct iovec {
	caddr_t	iov_base;
	int	iov_len;
	enum	uio_seg iov_segflg;
	int	(*iov_op)();
};
/*
 * Segment flag values.
 */
enum	uio_seg {
	UIO_USERSPACE,		/* from user data space */
	UIO_SYSSPACE,		/* from system space */
	UIO_USERISPACE		/* from user I space */
};
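
      The semantics above may be made concrete with a short sketch of a data-movement routine driven by the uio structure. The routine below is illustrative only; the name uiomove and the copy primitives bcopy, copyin and copyout follow 4.3BSD conventions, and UIO_USERISPACE is treated like UIO_USERSPACE for brevity.

/*
 * Sketch: move n bytes between the kernel buffer cp and the
 * areas described by uio.  uio_resid is decremented, uio_offset
 * is incremented, and exhausted vectors are advanced past, as
 * described above.  If an iovec supplies iov_op, it is called
 * as (*iov_op)(from, to, count) in place of the normal copy.
 */
int
uiomove(cp, n, uio)
	register caddr_t cp;
	register int n;
	register struct uio *uio;
{
	register struct iovec *iov;
	caddr_t from, to;
	int cnt, error = 0;

	while (n > 0 && uio->uio_resid > 0) {
		iov = uio->uio_iov;
		cnt = iov->iov_len;
		if (cnt == 0) {
			uio->uio_iov++;		/* this vector is exhausted */
			uio->uio_iovcnt--;
			continue;
		}
		if (cnt > n)
			cnt = n;
		if (uio->uio_rw == UIO_READ) {
			from = cp;		/* kernel buffer to iovec */
			to = iov->iov_base;
		} else {
			from = iov->iov_base;	/* iovec to kernel buffer */
			to = cp;
		}
		if (iov->iov_op != (int (*)())0)
			(*iov->iov_op)(from, to, cnt);	/* e.g. page remapping */
		else if (iov->iov_segflg == UIO_SYSSPACE)
			bcopy(from, to, (unsigned)cnt);
		else if (uio->uio_rw == UIO_READ)
			error = copyout(from, to, (unsigned)cnt);
		else
			error = copyin(from, to, (unsigned)cnt);
		if (error)
			return (error);
		iov->iov_base += cnt;
		iov->iov_len -= cnt;
		uio->uio_offset += cnt;
		uio->uio_resid -= cnt;
		cp += cnt;
		n -= cnt;
	}
	return (error);
}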