paper, section 6.

6. Internal layering

The internal structure of the network system is divided into three layers. These layers correspond to the services provided by the socket abstraction, those provided by the communication protocols, and those provided by the hardware interfaces. The communication protocols are normally layered into two or more individual cooperating layers, though they are collectively viewed in the system as one layer providing services supportive of the appropriate socket abstraction.

The following sections describe the properties of each layer in the system and the interfaces to which each must conform.

6.1. Socket layer

The socket layer deals with the interprocess communication facilities provided by the system. A socket is a bidirectional endpoint of communication which is ``typed'' by the semantics of communication it supports. The system calls described in the Berkeley Software Architecture Manual [Joy86] are used to manipulate sockets.

A socket consists of the following data structure:

struct socket {
	short	so_type;		/* generic type */
	short	so_options;		/* from socket call */
	short	so_linger;		/* time to linger while closing */
	short	so_state;		/* internal state flags */
	caddr_t	so_pcb;			/* protocol control block */
	struct	protosw *so_proto;	/* protocol handle */
	struct	socket *so_head;	/* back pointer to accept socket */
	struct	socket *so_q0;		/* queue of partial connections */
	short	so_q0len;		/* partials on so_q0 */
	struct	socket *so_q;		/* queue of incoming connections */
	short	so_qlen;		/* number of connections on so_q */
	short	so_qlimit;		/* max number queued connections */
	struct	sockbuf so_rcv;		/* receive queue */
	struct	sockbuf so_snd;		/* send queue */
	short	so_timeo;		/* connection timeout */
	u_short	so_error;		/* error affecting connection */
	u_short	so_oobmark;		/* chars to oob mark */
	short	so_pgrp;		/* pgrp for signals */
};

Each socket contains two data queues, so_rcv and so_snd, and a pointer to routines which provide supporting services. The type of the socket, so_type is defined at socket creation time and used in selecting those services which are appropriate to support it. The supporting protocol is selected at socket creation time and recorded in the socket data structure for later use. Protocols are defined by a table of procedures, the protosw structure, which will be described in detail later. A pointer to a protocol-specific data structure, the ``protocol control block,'' is also present in the socket structure. Protocols control this data structure, which normally includes a back pointer to the parent socket structure to allow easy lookup when returning information to a user (for example, placing an error number in the so_error field). The other entries in the socket structure are used in queuing connection requests, validating user requests, storing socket characteristics (e.g. options supplied at the time a socket is created), and maintaining a socket's state.

Processes ``rendezvous at a socket'' in many instances. For instance, when a process wishes to extract data from a socket's receive queue and it is empty, or lacks sufficient data to satisfy the request, the process blocks, supplying the address of the receive queue as a ``wait channel' to be used in notification. When data arrives for the process and is placed in the socket's queue, the blocked process is identified by the fact it is waiting ``on the queue.''

6.1.1. Socket state

A socket's state is defined from the following:

#define	SS_NOFDREF	0x001	/* no file table ref any more */
#define	SS_ISCONNECTED	0x002	/* socket connected to a peer */
#define	SS_ISCONNECTING	0x004	/* in process of connecting to peer */
#define	SS_ISDISCONNECTING	0x008	/* in process of disconnecting */
#define	SS_CANTSENDMORE	0x010	/* can't send more data to peer */
#define	SS_CANTRCVMORE	0x020	/* can't receive more data from peer */
#define	SS_RCVATMARK	0x040	/* at mark on input */

#define	SS_PRIV	0x080	/* privileged */
#define	SS_NBIO	0x100	/* non-blocking ops */
#define	SS_ASYNC	0x200	/* async i/o notify */

The state of a socket is manipulated both by the protocols and the user (through system calls). When a socket is created, the state is defined based on the type of socket. It may change as control actions are performed, for example connection establishment. It may also change according to the type of input/output the user wishes to perform, as indicated by options set with fcntl. ``Non-blocking'' I/O implies that a process should never be blocked to await resources. Instead, any call which would block returns prematurely with the error EWOULDBLOCK, or the service request may be partially fulfilled, e.g. a request for more data than is present.

If a process requested ``asynchronous'' notification of events related to the socket, the SIGIO signal is posted to the process when such events occur. An event is a change in the socket's state; examples of such occurrences are: space becoming available in the send queue, new data available in the receive queue, connection establishment or disestablishment, etc.

A socket may be marked ``privileged'' if it was created by the super-user. Only privileged sockets may bind addresses in privileged portions of an address space or use ``raw'' sockets to access lower levels of the network.

6.1.2. Socket data queues

A socket's data queue contains a pointer to the data stored in the queue and other entries related to the management of the data. The following structure defines a data queue:

struct sockbuf {
	u_short	sb_cc;		/* actual chars in buffer */
	u_short	sb_hiwat;	/* max actual char count */
	u_short	sb_mbcnt;	/* chars of mbufs used */
	u_short	sb_mbmax;	/* max chars of mbufs to use */
	u_short	sb_lowat;	/* low water mark */
	short	sb_timeo;	/* timeout */
	struct	mbuf *sb_mb;	/* the mbuf chain */
	struct	proc *sb_sel;	/* process selecting read/write */
	short	sb_flags;	/* flags, see below */
};

Data is stored in a queue as a chain of mbufs. The actual count of data characters as well as high and low water marks are used by the protocols in controlling the flow of data. The amount of buffer space (characters of mbufs and associated data pages) is also recorded along with the limit on buffer allocation. The socket routines cooperate in implementing the flow control policy by blocking a process when it requests to send data and the high water mark has been reached, or when it requests to receive data and less than the low water mark is present (assuming non-blocking I/O has not been specified).*

When a socket is created, the supporting protocol ``reserves'' space for the send and receive queues of the socket. The limit on buffer allocation is set somewhat higher than the limit on data characters to account for the granularity of buffer allocation. The actual storage associated with a socket queue may fluctuate during a socket's lifetime, but it is assumed that this reservation will always allow a protocol to acquire enough memory to satisfy the high water marks.

The timeout and select values are manipulated by the socket routines in implementing various portions of the interprocess communications facilities and will not be described here.

Data queued at a socket is stored in one of two styles. Stream-oriented sockets queue data with no addresses, headers or record boundaries. The data are in mbufs linked through the m_next field. Buffers containing access rights may be present within the chain if the underlying protocol supports passage of access rights. Record-oriented sockets, including datagram sockets, queue data as a list of packets; the sections of packets are distinguished by the types of the mbufs containing them. The mbufs which comprise a record are linked through the m_next field; records are linked from the m_act field of the first mbuf of one packet to the first mbuf of the next. Each packet begins with an mbuf containing the ``from'' address if the protocol provides it, then any buffers containing access rights, and finally any buffers containing data. If a record contains no data, no data buffers are required unless neither address nor access rights are present.

A socket queue has a number of flags used in synchronizing access to the data and in acquiring resources:

#define	SB_LOCK	0x01	/* lock on data queue (so_rcv only) */
#define	SB_WANT	0x02	/* someone is waiting to lock */
#define	SB_WAIT	0x04	/* someone is waiting for data/space */
#define	SB_SEL	0x08	/* buffer is selected */
#define	SB_COLL	0x10	/* collision selecting */

The last two flags are manipulated by the system in implementing the select mechanism.

6.1.3. Socket connection queuing

In dealing with connection oriented sockets (e.g. SOCK_STREAM) the two ends are considered distinct. One end is termed active, and generates connection requests. The other end is called passive and accepts connection requests.

From the passive side, a socket is marked with SO_ACCEPTCONN when a listen call is made, creating two queues of sockets: so_q0 for connections in progress and so_q for connections already made and awaiting user acceptance. As a protocol is preparing incoming connections, it creates a socket structure queued on so_q0 by calling the routine sonewconn(). When the connection is established, the socket structure is then transferred to so_q, making it available for an accept.

If an SO_ACCEPTCONN socket is closed with sockets on either so_q0 or so_q, these sockets are dropped, with notification to the peers as appropriate.

6.2. Protocol layer(s)

Each socket is created in a communications domain, which usually implies both an addressing structure (address family) and a set of protocols which implement various socket types within the domain (protocol family). Each domain is defined by the following structure:

struct	domain {
	int	dom_family;		/* PF_xxx */
	char	*dom_name;
	int	(*dom_init)();		/* initialize domain data structures */
	int	(*dom_externalize)();	/* externalize access rights */
	int	(*dom_dispose)();	/* dispose of internalized rights */
	struct	protosw *dom_protosw, *dom_protoswNPROTOSW;
	struct	domain *dom_next;
};

At boot time, each domain configured into the kernel is added to a linked list of domain. The initialization procedure of each domain is then called. After that time, the domain structure is used to locate protocols within the protocol family. It may also contain procedure references for externalization of access rights at the receiving socket and the disposal of access rights that are not received.

Protocols are described by a set of entry points and certain socket-visible characteristics, some of which are used in deciding which socket type(s) they may support.

An entry in the ``protocol switch'' table exists for each protocol module configured into the system. It has the following form:

struct protosw {
	short	pr_type;		/* socket type used for */
	struct	domain *pr_domain;	/* domain protocol a member of */
	short	pr_protocol;		/* protocol number */
	short	pr_flags;		/* socket visible attributes */
/* protocol-protocol hooks */
	int	(*pr_input)();		/* input to protocol (from below) */
	int	(*pr_output)();		/* output to protocol (from above) */
	int	(*pr_ctlinput)();	/* control input (from below) */
	int	(*pr_ctloutput)();	/* control output (from above) */
/* user-protocol hook */
	int	(*pr_usrreq)();		/* user request */
/* utility hooks */
	int	(*pr_init)();		/* initialization routine */
	int	(*pr_fasttimo)();	/* fast timeout (200ms) */
	int	(*pr_slowtimo)();	/* slow timeout (500ms) */
	int	(*pr_drain)();		/* flush any excess space possible */
};

A protocol is called through the pr_init entry before any other. Thereafter it is called every 200 milliseconds through the pr_fasttimo entry and every 500 milliseconds through the pr_slowtimo for timer based actions. The system will call the pr_drain entry if it is low on space and this should throw away any non-critical data.

Protocols pass data between themselves as chains of mbufs using the pr_input and pr_output routines. Pr_input passes data up (towards the user) and pr_output passes it down (towards the network); control information passes up and down on pr_ctlinput and pr_ctloutput. The protocol is responsible for the space occupied by any of the arguments to these entries and must either pass it onward or dispose of it. (On output, the lowest level reached must free buffers storing the arguments; on input, the highest level is responsible for freeing buffers.)

The pr_usrreq routine interfaces protocols to the socket code and is described below.

The pr_flags field is constructed from the following values:

#define	PR_ATOMIC	0x01		/* exchange atomic messages only */
#define	PR_ADDR	0x02		/* addresses given with messages */
#define	PR_CONNREQUIRED	0x04		/* connection required by protocol */
#define	PR_WANTRCVD	0x08		/* want PRU_RCVD calls */
#define	PR_RIGHTS	0x10		/* passes capabilities */

Protocols which are connection-based specify the PR_CONNREQUIRED flag so that the socket routines will never attempt to send data before a connection has been established. If the PR_WANTRCVD flag is set, the socket routines will notify the protocol when the user has removed data from the socket's receive queue. This allows the protocol to implement acknowledgement on user receipt, and also update windowing information based on the amount of space available in the receive queue. The PR_ADDR field indicates that any data placed in the socket's receive queue will be preceded by the address of the sender. The PR_ATOMIC flag specifies that each user request to send data must be performed in a single protocol send request; it is the protocol's responsibility to maintain record boundaries on data to be sent. The PR_RIGHTS flag indicates that the protocol supports the passing of capabilities; this is currently used only by the protocols in the UNIX protocol family.

When a socket is created, the socket routines scan the protocol table for the domain looking for an appropriate protocol to support the type of socket being created. The pr_type field contains one of the possible socket types (e.g. SOCK_STREAM), while the pr_domain is a back pointer to the domain structure. The pr_protocol field contains the protocol number of the protocol, normally a well-known value.

6.3. Network-interface layer

Each network-interface configured into a system defines a path through which packets may be sent and received. Normally a hardware device is associated with this interface, though there is no requirement for this (for example, all systems have a software ``loopback'' interface used for debugging and performance analysis). In addition to manipulating the hardware device, an interface module is responsible for encapsulation and decapsulation of any link-layer header information required to deliver a message to its destination. The selection of which interface to use in delivering packets is a routing decision carried out at a higher level than the network-interface layer. An interface may have addresses in one or more address families. The address is set at boot time using an ioctl on a socket in the appropriate domain; this operation is implemented by the protocol family, after verifying the operation through the device ioctl entry.

An interface is defined by the following structure,

struct ifnet {
	char	*if_name;		/* name, e.g. ``en'' or ``lo'' */
	short	if_unit;		/* sub-unit for lower level driver */
	short	if_mtu;			/* maximum transmission unit */
	short	if_flags;		/* up/down, broadcast, etc. */
	short	if_timer;		/* time 'til if_watchdog called */
	struct	ifaddr *if_addrlist;	/* list of addresses of interface */
	struct	ifqueue if_snd;		/* output queue */
	int	(*if_init)();		/* init routine */
	int	(*if_output)();		/* output routine */
	int	(*if_ioctl)();		/* ioctl routine */
	int	(*if_reset)();		/* bus reset routine */
	int	(*if_watchdog)();	/* timer routine */
	int	if_ipackets;		/* packets received on interface */
	int	if_ierrors;		/* input errors on interface */
	int	if_opackets;		/* packets sent on interface */
	int	if_oerrors;		/* output errors on interface */
	int	if_collisions;		/* collisions on csma interfaces */
	struct	ifnet *if_next;
};

Each interface address has the following form:

struct ifaddr {
	struct	sockaddr ifa_addr;	/* address of interface */
	union {
		struct	sockaddr ifu_broadaddr;
		struct	sockaddr ifu_dstaddr;
	} ifa_ifu;
	struct	ifnet *ifa_ifp;		/* back-pointer to interface */
	struct	ifaddr *ifa_next;	/* next address for interface */
};
#define	ifa_broadaddr	ifa_ifu.ifu_broadaddr	/* broadcast address */
#define	ifa_dstaddr	ifa_ifu.ifu_dstaddr	/* other end of p-to-p link */

The protocol generally maintains this structure as part of a larger structure containing additional information concerning the address.

Each interface has a send queue and routines used for initialization, if_init, and output, if_output. If the interface resides on a system bus, the routine if_reset will be called after a bus reset has been performed. An interface may also specify a timer routine, if_watchdog; if if_timer is non-zero, it is decremented once per second until it reaches zero, at which time the watchdog routine is called.

The state of an interface and certain characteristics are stored in the if_flags field. The following values are possible:

#define	IFF_UP	0x1	/* interface is up */
#define	IFF_BROADCAST	0x2	/* broadcast is possible */
#define	IFF_DEBUG	0x4	/* turn on debugging */
#define	IFF_LOOPBACK	0x8	/* is a loopback net */
#define	IFF_POINTOPOINT	0x10	/* interface is point-to-point link */
#define	IFF_NOTRAILERS	0x20	/* avoid use of trailers */
#define	IFF_RUNNING	0x40	/* resources allocated */
#define	IFF_NOARP	0x80	/* no address resolution protocol */

If the interface is connected to a network which supports transmission of broadcast packets, the IFF_BROADCAST flag will be set and the ifa_broadaddr field will contain the address to be used in sending or accepting a broadcast packet. If the interface is associated with a point-to-point hardware link (for example, a DEC DMR-11), the IFF_POINTOPOINT flag will be set and ifa_dstaddr will contain the address of the host on the other side of the connection. These addresses and the local address of the interface, if_addr, are used in filtering incoming packets. The interface sets IFF_RUNNING after it has allocated system resources and posted an initial read on the device it manages. This state bit is used to avoid multiple allocation requests when an interface's address is changed. The IFF_NOTRAILERS flag indicates the interface should refrain from using a trailer encapsulation on outgoing packets, or (where per-host negotiation of trailers is possible) that trailer encapsulations should not be requested; trailer protocols are described in section 14. The IFF_NOARP flag indicates the interface should not use an ``address resolution protocol'' in mapping internetwork addresses to local network addresses.

Various statistics are also stored in the interface structure. These may be viewed by users using the netstat(1) program.

The interface address and flags may be set with the SIOCSIFADDR and SIOCSIFFLAGS ioctls. SIOCSIFADDR is used initially to define each interface's address; SIOGSIFFLAGS can be used to mark an interface down and perform site-specific configuration. The destination address of a point-to-point link is set with SIOCSIFDSTADDR. Corresponding operations exist to read each value. Protocol families may also support operations to set and read the broadcast address. In addition, the SIOCGIFCONF ioctl retrieves a list of interface names and addresses for all interfaces and protocols on the host.

6.3.1. UNIBUS interfaces

All hardware related interfaces currently reside on the UNIBUS. Consequently a common set of utility routines for dealing with the UNIBUS has been developed. Each UNIBUS interface utilizes a structure of the following form:

struct	ifubinfo {
	short	iff_uban;			/* uba number */
	short	iff_hlen;			/* local net header length */
	struct	uba_regs *iff_uba;		/* uba regs, in vm */
	short	iff_flags;			/* used during uballoc's */
};

Additional structures are associated with each receive and transmit buffer, normally one each per interface; for read,

struct	ifrw {
	caddr_t	ifrw_addr;			/* virt addr of header */
	short	ifrw_bdp;			/* unibus bdp */
	short	ifrw_flags;			/* type, etc. */
#define	IFRW_W	0x01				/* is a transmit buffer */
	int	ifrw_info;			/* value from ubaalloc */
	int	ifrw_proto;			/* map register prototype */
	struct	pte *ifrw_mr;			/* base of map registers */
};

and for write,

struct	ifxmt {
	struct	ifrw ifrw;
	caddr_t	ifw_base;			/* virt addr of buffer */
	struct	pte ifw_wmap[IF_MAXNUBAMR];	/* base pages for output */
	struct	mbuf *ifw_xtofree;		/* pages being dma'd out */
	short	ifw_xswapd;			/* mask of clusters swapped */
	short	ifw_nmr;			/* number of entries in wmap */
};
#define	ifw_addr	ifrw.ifrw_addr
#define	ifw_bdp	ifrw.ifrw_bdp
#define	ifw_flags	ifrw.ifrw_flags
#define	ifw_info	ifrw.ifrw_info
#define	ifw_proto	ifrw.ifrw_proto
#define	ifw_mr	ifrw.ifrw_mr

One of each of these structures is conveniently packaged for interfaces with single buffers for each direction, as follows:

struct	ifuba {
	struct	ifubinfo ifu_info;
	struct	ifrw ifu_r;
	struct	ifxmt ifu_xmt;
};
#define	ifu_uban	ifu_info.iff_uban
#define	ifu_hlen	ifu_info.iff_hlen
#define	ifu_uba		ifu_info.iff_uba
#define	ifu_flags	ifu_info.iff_flags
#define	ifu_w		ifu_xmt.ifrw
#define	ifu_xtofree	ifu_xmt.ifw_xtofree

The if_ubinfo structure contains the general information needed to characterize the I/O-mapped buffers for the device. In addition, there is a structure describing each buffer, including UNIBUS resources held by the interface. Sufficient memory pages and bus map registers are allocated to each buffer upon initialization according to the maximum packet size and header length. The kernel virtual address of the buffer is held in ifrw_addr, and the map registers begin at ifrw_mr. UNIBUS map register ifrw_mr[-1] maps the local network header ending on a page boundary. UNIBUS data paths are reserved for read and for write, given by ifrw_bdp. The prototype of the map registers for read and for write is saved in ifrw_proto.

When write transfers are not at least half-full pages on page boundaries, the data are just copied into the pages mapped on the UNIBUS and the transfer is started. If a write transfer is at least half a page long and on a page boundary, UNIBUS page table entries are swapped to reference the pages, and then the initial pages are remapped from ifw_wmap when the transfer completes. The mbufs containing the mapped pages are placed on the ifw_xtofree queue to be freed after transmission.

When read transfers give at least half a page of data to be input, page frames are allocated from a network page list and traded with the pages already containing the data, mapping the allocated pages to replace the input pages for the next UNIBUS data input.

The following utility routines are available for use in writing network interface drivers; all use the structures described above.

if_ubaminit(ifubinfo, uban, hlen, nmr, ifr, nr, ifx, nx);
if_ubainit(ifuba, uban, hlen, nmr);

if_ubaminit allocates resources on UNIBUS adapter uban, storing the information in the ifubinfo, ifrw and ifxmt structures referenced. The ifr and ifx parameters are pointers to arrays of ifrw and ifxmt structures whose dimensions are nr and nx, respectively. if_ubainit is a simpler, backwards-compatible interface used for hardware with single buffers of each type. They are called only at boot time or after a UNIBUS reset. One data path (buffered or unbuffered, depending on the ifu_flags field) is allocated for each buffer. The nmr parameter indicates the number of UNIBUS mapping registers required to map a maximal sized packet onto the UNIBUS, while hlen specifies the size of a local network header, if any, which should be mapped separately from the data (see the description of trailer protocols in chapter 14). Sufficient UNIBUS mapping registers and pages of memory are allocated to initialize the input data path for an initial read. For the output data path, mapping registers and pages of memory are also allocated and mapped onto the UNIBUS. The pages associated with the output data path are held in reserve in the event a write requires copying non-page-aligned data (see if_wubaput below). If if_ubainit is called with memory pages already allocated, they will be used instead of allocating new ones (this normally occurs after a UNIBUS reset). A 1 is returned when allocation and initialization are successful, 0 otherwise.

m = if_ubaget(ifubinfo, ifr, totlen, off0, ifp);
m = if_rubaget(ifuba, totlen, off0, ifp);

if_ubaget and if_rubaget pull input data out of an interface receive buffer and into an mbuf chain. The first interface passes pointers to the ifubinfo structure for the interface and the ifrw structure for the receive buffer; the second call may be used for single-buffered devices. totlen specifies the length of data to be obtained, not counting the local network header. If off0 is non-zero, it indicates a byte offset to a trailing local network header which should be copied into a separate mbuf and prepended to the front of the resultant mbuf chain. When the data amount to at least a half a page, the previously mapped data pages are remapped into the mbufs and swapped with fresh pages, thus avoiding any copy. The receiving interface is recorded as ifp, a pointer to an ifnet structure, for the use of the receiving network protocol. A 0 return value indicates a failure to allocate resources.

if_wubaput(ifubinfo, ifx, m);
if_wubaput(ifuba, m);

if_ubaput and if_wubaput map a chain of mbufs onto a network interface in preparation for output. The first interface is used by devices with multiple transmit buffers. The chain includes any local network header, which is copied so that it resides in the mapped and aligned I/O space. Page-aligned data that are page-aligned in the output buffer are mapped to the UNIBUS in place of the normal buffer page, and the corresponding mbuf is placed on a queue to be freed after transmission. Any other mbufs which contained non-page-sized data portions are copied to the I/O space and then freed. Pages mapped from a previous output operation (no longer needed) are unmapped.