sysperf, section 4.

4. Performance Improvements

This section outlines the changes made to the system since the 4.2BSD distribution. The changes reported here were made in response to the problems described in Section 3. The improvements fall into two major classes; changes to the kernel that are described in this section, and changes to the system libraries and utilities that are described in the following section.

4.1. Performance Improvements in the Kernel

Our goal has been to optimize system performance for our general timesharing environment. Since most sites running 4.2BSD have been forced to take advantage of declining memory costs rather than replace their existing machines with ones that are more powerful, we have chosen to optimize running time at the expense of memory. This tradeoff may need to be reconsidered for personal workstations that have smaller memories and higher latency disks. Decreases in the running time of the system may be unnoticeable because of higher paging rates incurred by a larger kernel. Where possible, we have allowed the size of caches to be controlled so that systems with limited memory may reduce them as appropriate.

4.1.1. Name Cacheing

Our initial profiling studies showed that more than one quarter of the time in the system was spent in the pathname translation routine, namei, translating path names to inodes[note 15] . An inspection of namei shows that it consists of two nested loops. The outer loop is traversed once per pathname component. The inner loop performs a linear search through a directory looking for a particular pathname component.

Our first idea was to reduce the number of iterations around the inner loop of namei by observing that many programs step through a directory performing an operation on each entry in turn. To improve performance for processes doing directory scans, the system keeps track of the directory offset of the last component of the most recently translated path name for each process. If the next name the process requests is in the same directory, the search is started from the offset that the previous name was found (instead of from the beginning of the directory). Changing directories invalidates the cache, as does modifying the directory. For programs that step sequentially through a directory with [equation] files, search time decreases from to .

The c [equation] the cache is ab c(aband 16 bytes per prst Iuser vect

As a quick benchmark t [equation] y the maximum effectiveness the cache we ran ``ls -l'' iles. Befused 22.3 sec system time. After adding the cache the pr user time, but the system time dr

This change pr [equation] iled system Inamei. The results sh Inamei drstill acc the system call time, 18% the kernel, all the machine cycles. This am rThe results are sh

D'l |192u 0'
parttime%  kernel
D'l |192u 0'
self11.0 ms/call9.2%
child10.6 ms/call8.9%
D'l |192u 0'
t        D'l 0 |0u-1v'
D'l 0 |0u-1v'
Table 9. Call times f
Inamei with per-pr

The small perfwas caused by a lAlth [equation] fective when hit, it was the names being translated. An additi alth time spent in namei itself decreased substantially, msince each direct rand r

Frequent requests f [equation] names are best handled with a cache recent name translatiThis has the effect eliminating the inner l namei. Fnamei first l recent translatifIf it exists, the direct

The system already maintained a cache recently accessed insmaintained a simple name-incheck each c [equation] a path name during name translatiWe cwith its mbut eventually decided tkept names with pTagging inmany inthe in time, but are never lup by name. Other infrequently by many different names (e.g. ``..''). By keeping a separate table names, the cache can truly reflect the mAn added benefit is that the table can be sized independently the in [equation] memcan reduce the size the cache (with ying the in

Anh [equation] erences tN erences'' by incrementing the reference c erence. Since the system reuses erence ca hard reference insures that the inH the name cache h erences, it is limited t racti the size the insince s t free f iles. It als the kernel t y s a device ile. These reas erences with [equation] fecting the behavi the inThus, we ch t references'' prby a capability - a 32-bit number guaranteed tWhen an entry is made in the name cache, the capability its inWhen an inWhen a name cache hit the capability the name cache entry is cwith the capability the in erences. If the capabilities dSince the name cache h [equation] t references, it may be sized independent the size the inA final benefit using capabilities is that all cached names fsearching thrinstead all y

The c [equation] the name cache is ab c(aband 48 bytes per cache entry. Depending the system, ab igured, using 10-50 kil physical memThe name cache is resident in mem

After adding the system wide name cache we reran ``ls -l'' The user time remained the same, hThis was n [equation] Inamei nbut was never able t it.

An [equation] iled system was created and measurements were csh Inamei, with namei acc the system call time, 13% the time in the kernel, all the machine cycles. System time dr rThe results are sh

D'l |192u 0'
parttime%  kernel
D'l |192u 0'
self4.2 ms/call6.2%
child4.4 ms/call6.6%
D'l |192u 0'
t        D'l 0 |0u-1v'
D'l 0 |0u-1v'
Table 10.  Call times f
Inamei with b

On [equation] ind that during the twelve h rname translatiStatistics bthe large perfcaused by the high hit ratiThe name cache has a hit rate 70%-80%; the direct fset cache gets a hit rate 5%-15%. The c the twWith the additi the twthe percentage system time devdr rWhile the system wide cache reduces b [equation] time in the r Inamei calls as well as namei itself (since fewer directit is interesting t system time spent in namei itself increases even thactual time per call decreases. This is because less thence a smaller abs

4.1.2. Intelligent Aut

Mit can either generate an interrupt each time a character is received, T [equation] la silAscii terminals nan input rate less than 30 characters per secAt this input rate they are m ficiently handled with interrupt per character msince this generates fewer interrupts than draining the input sil the terminal multiplexWhen input is being generated by an unctithe input rate is usually mIt is m [equation] ficient tsince this generates fewer interrupts than handling each character as a separate interrupt. Since a given dialup pit is imp ficient input m

We thereft [equation] the sil per-character interrupts. At linterrupt basis, av checking each interface During peri sustained input, the handler enables the siland starts a timer tThis timer runs less frequently than the cland is used input. The transiti rdamped t transiti fic (such as in netwInput characters serve tlush the silBy switching between these tw [equation] the checking the silwhen necessary.

In additithe cla stware interrupt after each hardware interrupt tThe stware-interrupt level p [equation] the clneeded when timers expire an executi ile. Thus, the number interrupts attributable tis substantially reduced.

4.1.3. Pr

As systems have gr [equation] the prhas gr ar past 200 entries. With large tables, linear searches must be eliminated fr requently used facility. The kernel pr active and zA third list threads unused prFree slfr r the free list. The number prSince the 4.2BSD release, the kernel maintained linked lists the descendents each prThis linkage is nparents seeking the exit status children n the prIn additi [equation] inding all descendents an exiting pr the prThis has been changed t

When fthe system must assign it a unique pr [equation] ier. The system previa new pr ier that was nN the system c unused identifiers that can be directly assigned. Only when the set identifiers is exhausted is anscan required.

4.1.4. Scheduling

PreviPr [equation] priPrbeen sleeping fOn systems running many prthe scheduler represented nearly 20% the system time. Tthe scheduler has been changed trunnable prT still get their apprtheir priSince the set runnable pr racti the t prthe c inv

4.1.5. Cl

The hardware clat high priAs m [equation] the clthe system schedules a l tware interrupt ttime-critical events such as cpu scheduling and timeOften there are n tware interrupt handler finds nThe high pri there are levents tif there is n tware interrupt is nOften, the high primachine had been running at lRather than p tware interrupt that wsthe hardware cland calls the stware clBetween these tw [equation] the 100 stware interrupts per sec

4.1.6. File System

The file system uses a large blT [equation] iles t ficiently, the large blbe br ragments, typically multiples 1024 bytes. T full-sized blintragments, the file system uses a best fit strategy. Pr iles using write 1024 bytes can f ile system tsuccessively larger and larger fragments until it finally gr ull sized blThe file system still uses a best fit strategy the first time a fragment is written. H [equation] irst time that the file system is ffragment it places it at the beginning a full sized blC urther cby using up the rest the blIf the file ceases t the blavailable f ragments.

When creating a new file name, the entire directtFBecause there was n [equation] a direct illed will increase the c file creati ter the illing is cThus, fafter it is cleared up. Tthat it finds at the end a directscan t

4.1.7. Netw

The default am [equation] buffer space all pipes) has been increased tStream s fer sizes in the bl ield the stat structure. This inf fering. Unix din the stat structure twith The TCP maximum segment size is calculated accand interface in use; nf

On multiply-ht [equation] ace that will be used in transmitting data packets fcSeveral bugs in the calculati rTCP n ails, ICMP srate TCP streams by temp icially small send windrather than resending all queued data. A send pthat decreases the number small packets f fic [Nagle84], pr netwThe packet rc

The buffer management strategy implemented by s P has been changed t the increased size the s fers and a better tuned delayed acknR ied t the last rMultiple messages send with the same destinatiPerf leither the sender hAlsthe thrWe have their cycles transmitting netw

4.1.8. Exec

When exec-ing a new prprfragain [equation] the newly created prThese tware nThis an argument list by a fact ten; the average time t Iexec call decreased by 25%.

4.1.9. C

The kernel used t [equation] tware event when it wanted ta prOften the pr exiting the kernel, delaying the event trap. At sbe selected tfinally causing the event tThe event wselecting the same prThe fix t tware reschedule events when saving a prThis change dcan synchr

4.1.10. Setjmp/L

The kernel r [equation] Isetjmp, that saves the current system c registers than necessary under mBy trimming its the 13%.

4.1.11. C C

The current c [equation] d icant Gthe C language is nbecause its rampant use unbThus, many classical analysis and selecti register variables must be dby hand using ``exteri when such e.

Anis inline expansi [equation] small requently used rIn past Berkeley systems this has been d Ised trun r the r ten a single VAX instructiWhile this the subrcall and return, it did n several arguments tThe sed script has been replaced by a minline, that merges the pushes and pF the C c

if (scanc(map[i], 1, 47, i - 63))

is cin the left hand c [equation]

Table 11. The sed inline expander changes this cshThe newer [equation]

the stack

D'l |408u 0'
Alternative C Language CD'l |408u 0'
ccsedinline
D'l |408u 0'
subl3$64,_i,-(sp)subl3$64,_i,-(sp)subl3$64,_i,r5
pushl$47pushl$47mpushl$1pushl$1pushl$1
mull2$16,_i,r3mull2$16,_i,r3mull2$16,_i,r3
pushl-56(fp)[r3]pushl-56(fp)[r3]m
p)[r3],r2
calls$4,_scancmtstlr0mjeqlL7mmscancr2,(r3),(r4),r5
tstlr0
jeqlL7
            D'l 0 |0u-1v'
      D'l 0 |0u-1v'
                 D'l 0 |0u-1v'
D'l 0 |0u-1v'
Table 11. Alternative inline c

Anexisting data structures in the c [equation] the current system. F fer hashing was implemented when the system typically had thirty tifty buffers. M fers. C the hash chains cten t fers each! The running time the l fer management primitives was dramatically impr the hash table.

4.2. Impr

Intuitively, changes tpayf since they affect all prH [equation] icant imprBy c the libraries and utilities had never been tuned. F their running time dChanging the utility trunning time by a fact five! Thus, while m m the speedups are because impr the system. S the m subsecti

4.2.1. Hashed Databases

UNIX pr [equation] database management r Idbm, that can be used t iles with an external hashed index file. The dbm was designed tdatabase at a time. These rmultiple database files, enabling them t the passw ile lused t ile significantly imprtime many impthe C-shell (in d Ils -l, etc.

4.2.2. Buffered I/O

The new filesystem with its larger blperf [equation] by perf ers rather than using appr fers. The standard I/O library aut fer size f ile. SI/O fering, hSeveral impand were buffering I/O using the fer size, 1Kbytes; the pr fer I/O acc ile system blThese include the edit ile cthe text f

The standard err [equation] fered tand t r buffers are n lushed. The in sending single-byte packets thrthe netw fering scheme errWithin a single call t Ifprintf, all fered tempBef lushed and the stream is again marked unbuffered. As bef fering mechanisms can be used instead the default behavi

It is p [equation] defeat the standard I/O library's ch I/O buffer size by using the setbuf call t fer. Because p ault buffer size prby setbuf is 1024 bytes; this can lead, One such pr Icat; there are undas the system has changed much since they were

4.2.3. Mail System

The pr [equation] icant w irst pr ied was a bug in the sysl P pr Isendmail lc acilities. Sysl P then rec a l ile. Unf Isysl P was perf Isync ter each message it received, whether it was l ile fectiveness the buffer cache and explained, textent, why sending mail theavy l message recipient causing alm [equation] sync

The hashed data base files were installed in all mail pr [equation] magnitude speedup I/bin/mail that n ies the c P pra user was changed tspeedup

Next, the file l [equation] acilities pr Ifl P(2), were used in place the lThe mail system previ Ilink and unlink in implementing file lBecause these y the c directthey require synchradvantage the name cache maintained by the system. Unlink requires that the entry be fit can be remlink requires that the directdBy c [equation] acility in 4.2BSD is efficient because it is all dThus, the mail system was m ied t ile lThis yielded an delivering mail. Extensive priling and tuning sendmail and c

4.2.4. Netw

With the intr [equation] the netw acilities in 4.2BSD, a myriad services became available, each which required its Many these daem ever used, yet they lay asleep in the prsystem resRather than having many servers started at binetd was substituted. This pr igurati ile that specifies the services the system is willing tand listens fWhen a client requests service the apprand passed a service cthat require the identity their client may use the getpeername system call; likewise gets P may be used tind a server's l iles. This scheme is attractive f

it eliminates as many as a dall ile and text tables t
servers need nqueueing, simplifying the pr
installing and replacing servers bec

With an increased numbers netwwe f [equation] the rinSeveral changes were made in the rR ault gateway. This reduces the am netw fic and the time required tIn additi iled and functi time were The maj aster hashing scheme, and inline expansi the ubiquit uncti

Under certain circumstances, when attempts by the remtalth [equation] Iselect call had indicated that data cThis resulted in cuser restarted This prand the

4.2.5. The C Run-time Library

Several pe [equation] in frequently used r In particular the running time the string rcut in half by rewriting them using the VAX string instructiThe memmem twCertain library r ile input in have been cOther library r Ifread and fwrite have been rewritten f ficiency.

4.2.6. Csh

The C-shell was cwriting a set rWhile this pr [equation] unctiit was gr ficient, generating up tThe C-shell has been m ied tfacilities directly, cutting the number system calls per pr . Additi priling t frequently used facilities.