- fl ‘— 1 v—x‘ .‘ ~_,’-' OVERDUE FINES: 25¢ per dew per in: RETURNING LIBRARY MATERIALS: Place in book return to renew charge fro: circulation record: THO EXTENSIONS TO THE ARCHITECTURE OF THE CONTROL DATA CORPORATION 6000 SERIES By John Dykstra A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Department of Computer Science 1979 ABSTRACT THO EXTENSIONS TO THE ARCHITECTURE OF THE CONTROL DATA CORPORATION 6000 SERIES By John Dykstra The computer architecture first developed for the Control Data Corporation 6600 has not been modified in subsequent members of the family. This thesis examines two possible extensions to the 6000 architecture to determine whether it is unusually resistant to modification. One of the extensions considered provides a virtual memory feature. The other supports operation of the system in virtual machine mode. In addition, one possible implementation of the virtual memory feature is described at the register transfer level. No significant problems are detected in the virtual memory scheme, but the inclusion of peripheral processors in the 6000 architecture makes the virtual machine extension extremely awkward. ACKNOWLEDGEMENTS This thesis was prepared with the EASYDOC text formatting program, written by David Roderick and maintained by Douglas Nelson. Computer time for the use of the program was supplied through the assistance of Dr. Julian Kateley, Director of the MSU Computer Laboratory. Two people have contributed not only to the work reported in this thesis, but also to the many other ‘important things in my life. Dr. Lewis Greenberg changed the direction of my career with a job offer, and since then has become a valued professional adviser and personal friend. Pat Seger has helped in so many ways to make each suceeding year of our friendship "better and better." I offer to both of them my sincerest thanks. ii TABLE OF CONTENTS TABLE OF FIGURES CHAPTER ONE: INTRODUCTION AND OBJECTIVES CHAPTER THO: VIRTUAL MEMORY FEATURE Benifits and Penalties Basic Paging Operation Page Tables Page Size Page Table Addressing Fault Mechanism Processor Usage Inter-Address Space References . Monitor-Mode Paging 2.10 ECS 2.11 Peripheral Processor Memory Accesses 2.12 Access Controls ‘OW‘JOU‘IS‘WN-b NNNNNNNNN CHAPTER THREE: VIRTUAL MEMORY IMPLEMENTATION 3.1 System Design 3.2 CPU Paging Mechanism 3.3 PP Paging Mechanism CHAPTER FOUR: VIRTUAL MACHINE FEATURE 4.1 Benifits and Penalties 4.2 CPU Mode Extensions 4.3 CPU Memory Relocation and Protection 4.4 Virtual PP's 4.5 Virtual Operator's Console 4. Virtual Deadstart CHAPTER FIVE: CONCLUSIONS REFERENCES iii iv Figure Figure Figure Figure Figure Figure TABLE OF FIGURES The Hub Concept Unmoddfied System Modified System CPU Paging Mechanism PP Paging Mechanism Mode Combinations iv 26 52 54 56 61 67 EUAEIEB-Q!§ 1518929QIIQ!-&Nfi-Q§1E£IIVES Hhen the 6600 was introduced by Control Data Corporation in 1964, it was by far the fastest computer available. During the following decade-and-a-half, CDC has introduced a number of successor systems in three different series, all with the same basic architecture. During the same period, the architectures of other major computer families (DEC POP-11, IBM System 360/370) have undergone many changes and extensions. It is natural to ask why, through several changes in implementation and realization, the architecture first developed for the 6600 has remained essentially unchanged. Several possible hypotheses can be advanced: 1. The original architecture was so good that there is little that can be done to improve it; 2 2. The market served by this family is atypical in that the users of the machine have not desired extensions; 3. There is a need and a desire for extensions, but the architecture presents unusual obstacles to the addition of these extensions; 4. The architecture is amenable to extension, but the implementations and realizations used by CDC for the family do not easily support these extensions; 5. CDC just hasn't had the corporate resources to design a new CPU with extensions. This hypothesis is supported by the length of time CDC stayed with the cordwood CPU designs. One of the objectives of this study was to test the validity of hypothesis (3) by actually attempting (on paper) to extend the architecture. The other objective was to develop designs for two extensions which would alleviate what the author considers to be serious limitations of the present architecture. The two extensions selected for detailed consideration are provisions for "virtual memory” and "virtual machine" Operation. The ways in which these extensions improve the architecture under consideration are discussed in their 3 respective sections. They are appropriate for testing the extensibility of the architecture because they affect many areas of the architecture, including instruction set, context switching and memory access. CDC has marketed systems with this architecture under three different names (6000-series, CYBER 70, CVBER 170), and there is no commonly-accepted name for the family. In this paper we shall call it the 6000 family, after the first product line to incorporate the architecture. Chapter Two of this thesis examines the changes neccessary to add a virtual memory feature to the 6000 architecture, and Chapter Three describes one possible implementation of this extended architecture at the register-transfer level. Chapter Four considers the virtual machine feature and its impact on system architecture. Conclusions from this study are presented in Chapter Five. QEAEIEB-IHQ [IEIQAL-!§!QBI-£§AIUBE The term "virtual memory" has come to denote a memory system which appears to the processors and I/O devices connected to it to be much larger than it really is. The "virtual” memory apparent to external devices is simulated within the system by a smaller "real" memory through a technique called "paging." In the classic paging design, both virtual and real memory are divided into blocks called "pages.” The real memory of the system is used to hold various pages of the virtual memory, without regard for the real addresses of the pages. Virtual memory pages not currently resident in primary memory are stored on random-access mass storage devices. 'A part of the hardware called the "paging mechanism" maps the virtual addresses generated by the processor onto the pages of the real memory. If a virtual page referenced by the processor is not currently present in primary memory, the instruction in progress is interrupted, and a "page fault" is generated. This page fault results in a call to the Operating system, which reads the virtual page into 5 real memory from mass storage, and then restarts the interrupted program. The instruction which caused the page fault will complete normally, because its data is now resident in primary memory. 2.1 BENEFITS AND PENALTIES The possible benefits obtainable from adding a virtual memory capability to the 6000 architecture include: Permitting real memory in the system configuration to exceed 262K. Probably the most serious drawback of the current architecture is its limitation to this quite small memory size. A virtual memory mechanism is one, but not the only, way to evade this limit. Permitting increased multiprogramming without excessive main memory requirements or swapping overhead. As memory sizes get larger, the proportion of the data in memory that will be accessed before the address space is swapped out gets smaller. This useless swapping traffic ties up the swapping device(s) and contributes to memory interference. Capability of running all sizes of programs on any machine that is available, or when large amounts of 7. 6 memory are defective or unavailable due to maintenance. Permitting sharing of data and program storage between several jobs. Possible support for a virtual machine capability (see Chapter Four). Capability of protecting memory and controlling its usage at the page level. Improved flexibility in re-configuring around failing memory modules. Possible sales advantage in matching features of other manufacturer's machines. Note that increasing the amount of real memory in a system ((1) above) will have no effect on the amount of memory that can be accessed at any one time by a user job. This is the processor address space, and is limited by the width of the registers and instruction fields of the CPU. Expanding this address space is a non-trivial task, and outside the scape of this study. Possible disadvantages of a virtual memory capability include: 7 Slight increase in the memory access time due to the paging hardware. At one time, any proposed increase in access time would have been immediately rejected on the grounds that the performance of "scientific" computers is frequently limited by their memory bandwidth, and that therefore every spare nanosecond must be sqeezed out of the memory cycle time. However, it is very possible that the benefits of a virtual memory system would out-weigh the penalty of additional memory access delays. A detailed examination of the tradeoffs depends upon the details of an actual implementation, and so will not be attempted here. Possible thrashing by programs that use a lot of array space. There is an extensive literature on the theoretical and empirical behavior of paging systems, but only a few studies [for example, 1, pp. 405-415] of systems whose memory references include extensive accessing of array-structured data. It would seem that some such programs could cause excessive paging to occur. The studies mentioned above have identified ways of reducing the severity of this problem, but most of these techniques require additional hardware. This topic strays from the objectives of this thesis, and will not be considered further. S. 8 Paging is really advantagous only in a multiprogramming system, where other useful work can be done while a needed page is moved into real memory. A sizable fraction of 6000 family installations use their machines essentially uniprogrammed, and many of the advantages given for virtual memory would not apply to them. Notice, however, that a uniprogrammed virtual memory system would not show significantly poorer performance as compared to a standard machine of the same real memory size. This is because once all pages of the program were in real memory they would stay there, incurring no additional paging time. The only performance penalty would be due to initial page fetches, and the longer memory cycle time due to paging. Increased PP usage, central memory bandwidth consumption and bank conflicts, due to the paging traffic. Increased hardware and operating system complexity and cost. In the 6000 architecture there are three sources of references to central memory: the CPU, PP's, and ECS control. (Although ECS control may be considered to be a 9 part of the CPU, its memory usage characteristics are sufficiently different from those of the CPU for it to be treated separately here.) The first sections of this chapter will develop the design of the paging hardware exclusively around CPU memory accesses. Later sections will extend this design to handle memory references from PP's and ECS. 5 2.2 BASIC PAGING OPERATION The methodology used to develop the design will consist of beginning with a conventional paging scheme, and modifying it as needed to fit the unusual aspects of the 6000 architecture. The initial structure is similar to many well-known commercial systems, and therefore will be described only briefly. The paging mechanism may be either enabled or disabled, under the control of a (newly-allocated) bit in the exchange package. when the mechanism is disabled, memory references proceed exactly the way they do in the current architecture. when it is enabled, all memory addresses are interpreted according to a virtual address space implemented by the paging mechanism. 10 To the program this address space appears contiguous, but the hardware breaks it up into fixed-size blocks called pages. Each page is either resident somewhere in real central memory, resident on some random-access mass storage device, or non-existent. In the later case, a reference to it must be resolved by the operating system. A memory reference to a page that is resident in main memory is redirected to the proper real memory address by the paging hardware. To locate the page in real memory it uses a data base called the page table, which is kept in real memory. (There is a one-to-one correspondence between virtual address spaces and page tables.) The page table also indicates which pages currently exist only on secondary I/O devices. These are called "non-resident" pages- A reference to a non-resident page results in a “page fault," which calls a portion of the operating system called the "paging manager." This software moves the page into real memory and restarts the interrupted program so that the memory reference can be retried. 11 2.3 PAGE TABLES A page table describes the mapping between virtual and real pages for one virtual address space. Each entry in the table corresponds to one page, and contains a flag indicating whether the page is currently in real memory, and if it is, a pointer to the first word of the page. There may also be additional flags to record page usage and limit the ways in which the page may be accessed. The basic task of the paging hardware, given a virtual address, is to select from the current page table the entry for the page being referenced, and to generate the real address of the referenced word from the pointer in the entry and the original virtual address. 2.3.1 Organization The first design decision to be made concerns how the page table entries are to be organized, and by implication, how the paging hardware will locate the proper page table entry, given a virtual address. There are two common schemes for page table organization, which here shall be called "associative" and "indexed." Both begin by partitioning the virtual address into two fields. One, called the "page ID," specifies which page contains that address. The other field, called the "word offset," specifies which word within that page should be 12 referenced. Typically, the low-order bits of the virtual address are used as the word offset, so that a page will contain a contiguous portion of the virtual memory. In the associative paging scheme, each page table entry contains its associated page ID. In most implementations of this scheme, pages that are not in primary memory or are non-existent have no entries in the page table, and page table entries can be in any order. Given a virtual address, the paging hardware generates the real address by associatively searching the page table for an entry that matches the page 10, and then adding the real address contained in that entry to the word offset of the virtual address. The virtual memory mechanism [2, pg. 4-46] of the CDC STAR uses this scheme. In the indexed scheme, the page table entries are arranged in order of page 10's. The page ID field from the virtual address is used as an index to select the proper entry. From then on, the procedure is identical to that of the associative scheme. There must be entries in the page table for all pages, even those not in primary memory or those that do not exist altogether (although if there is a page table length register then there need not be entries for non-existent pages at the high end of the address space). Many commercial machines have used this scheme. 13 Perhaps the best known is the IBM System 370 [3, pp. 99-1301. A paging mechanism must operate extremely quickly to avoid degrading system performance, and several existing designs incorporate features toward this end. For example, the associative method as described above requires an associative memory large enough to contain the entire page table.‘ Unfortunately, memories this large are quite expensive, and not exceedingly fast. Instead, the STAR uses a linear search procedure to locate the proper page table entry [2, P9. 4-53]. It compares the page ID field from the virtual address with the contents of the corresponding field from each page table entry in turn until it finds a match or the end of the table is encountered. If a match is found, that entry is moved to the beginning of the page table, and the entries between the first and the original position of the matching entry are moved down one position. The entry is also used to generate the real address for the reference. (To improve efficiency, the STAR implementation keeps the tap 16 page table entries in processor registers rather than memory. This scheme essentially implements a cache whose maintenance is automatically performed as part of the overall paging mechanism.) 14 There is no clear-cut winner in comparisons between the indexed and the associative schemes. The indexed approach is fastest and easiest to implement, but it may require large page tables, only a few of whose entries are actually used. (This characteristic is not a serious problem if the page tables themselves can be paged.) The associative scheme is much more complicated to implement, but will result in smaller page tables, especially if many of the pages in the address space are non-existent or not currently resident in real memory. If a push-down algorithm like the STAR's is implemented, it is simple for the operating system to identify the least-recently-used page within that address space (not across all system address spaces), simply by choosing the last entry in the page table. This operation is necessary when the paging manager must select a page to muve out to secondary storage, and in the indexed scheme is usually approximated through the use of page table entry flags. There is, of course, a penalty for the convienience of the push-down scheme, in the form of additional hardware complexity. This algorithm might also very well be inappropriate for implementations that provide an associative cache for the page table. As indicated above, the indexed method requires the least amount of hardware, and its only disadvantages are larger 15 page tables, and the need for use-bits in the page table entries for identification of seldom-used pages. These disadvantages are less severe than those of the associative method, which include additional hardware complexity and unsuitability for some implementation approaches. Therefore, the indexed scheme will be selected for use in this design. 2.3.2 Page table entries The most important component of each page table entry is, of course, the real memory address of the page (assuming that the page is currently resident in real memory). Since one of the goals of this design effort is expanding the allowable system configuration to include more than 262K of real memory, the page address field should be wider than 18 bits. Two possible candidates for the size are 24 bits (two PP bytes) and 30 bits (one-half of a memory word). In the interests of maximum flexibility, and since it is unlikely that PP's will do much manipulation of page table entries, the later size will be chosen. As mentioned in the previous section, efficiency of the paging mechanism is an important design goal. The paging hardware would Operate much faster if the real address for 16 a memory reference could be formed by concatenating the word offset from the virtual address with the page address from the page table entry, rather than by doing an addition. This is possible, of course, only if the low-order bits of the page address corresponding to the word offset are zero. If they are always to be zero, there is no point in including them in the page table entry. Thus, the 30 bit page address in the page table entry will include only the high-order bits of the real address of the page. Each page table entry must also contain a flag which indicates that the associated page is not currently in real memory. This flag is called the fault flag. Note that when this bit is set, the paging hardware will immediately signal a page fault, and will not use the page address field in any way. This field is therefore free for the monitor software to use in any way that it sees fit. One possible use is to contain the secondary memory address of the page. when the paging manager moves a page—out of real memory to make room for another page, it is best to select a page that will not be referenced again for some time in the future. The best available approximation to this ideal is selection of the least-recently-used (LRU) page. Efficient implementation of this scheme requires the 17 paging hardware to maintain some information about page references. This information takes the form of use counts in each page table entry. Every time a reference is made to a word within a given page, the use count of the appropriate page table entry is incremented by the paging hardware. If the count was already as high as can be represented in the count field, it is kept at that value, rather than rolled-over to zero. To approximate a LRU algorithm when selecting a page to move out, the paging manager chooses the page whose use count is lowest. It then resets all use counts in the page table to zero, to prevent the counts from preserving information about events in the distant past. The correct size for the use count field in the page table entries depends upon the page size selected for the design and the page fault characteristics of the running system. It seems sufficient at this time to allocate 12 bits for the count, since this is far more space than will actually be needed. Particular implementations can use only a part of this field for the count, as long as the remainder is kept zero. 18 As an additional aid to the paging manager, the hardware should remember whether any word within a particular page has been written into by the CPU. This function is performed by the write flag, which is initialized as clear when the page is first moved into real memory, and set by the paging hardware if a write to that page takes place. Hhen the paging manager selects this page to be moved out it checks the state of the write flag. If the flag is clear, and a copy of the page already exists in secondary memory, there is no need to write out the page. So far we have defined 44 bits within the page table entry. ‘These fields all fit within a 60 bit memory word with room left over, so the size of each page table entry will be defined to be one memory word. 2.3.3 New CPU registers Of course, the paging hardware must have some way of finding the page table in memory. The obvious solution is the addition to the CPU of a "page table pointer register," which contains the address of the first word of the page table. The paging hardware adds the page ID to the contents of this register to find the address of the proper page table entry. 19 CPU exchange jumps usually involve a change in address space, so it is natural to include the page table pointer register in the exchange package. Virtual memory makes obsolete the function performed by the RA register, so its space within the package can be reused for the pointer. Most programs will use only a portion of the 262K word virtual address space, and it is desirable to shorten the page tables of these address spaces accordingly. A straight-forward solution is to add a "page table length register“ to the CPU. Each page ID is compared to the contents of this register before the page table entry is fetched. If the page ID is too large, a bounds error is signaled and the reference abandoned. Thus, the page table need have entries only up to the highest page in the address space that will be referenced. The page table length register is similar in purpose to, and can replace in the exchange package, the FL register of the current architecture. 2.4 PAGE SIZE Page size is a design parameter that must be determined through the resolution of performance tradeoffs. It is usually chosen to be a power of two to facilitate the partitioning of the virtual address, and the page size of 20 existing machines range from 64 words [4, pg. 150] to 65K words (the larger of two possible page sizes in the CDC STAR [2, pg. 4'463). Expected addressing patterns are relevant to one trade-off: If references are fairly sparse across the address space within a moderate interval of time, a small page size avoids the overhead (both in 110 traffic and in real memory space) wasted by bringing in large pages, only a few of whose words will be referenced before they are moved out again. Page size also affects the cost of the paging hardware. For a given virtual address space size, a small page size results in a large page ID field. It also reduces the number of address bits that need not pass through the paging mechanism, and are therefore immediately available to the rest of the memory system. Use of these ”early" bits can often improve the performance of an implementation. Quantitative analysis of these tradeoffs is beyond the scape of this study. Therefore, a page size of 4096 words will be chosen rather arbitrarily. This size does have the advantage of corresponding to the 12-bit byte size of the PP's, which might simplify some of the PP software. 21 2.5 PAGE TABLE ADDRESSING Section 2.3.1 mentioned lengthy page tables as one disadvantage of the indexed method, and proposed paging the page table as a solution. However, the moderately-large page size we have selected reduces the significance of this problem. A 262K virtual field length (the maximum allowed) will require only 64 page table entries. Therefore, it is practical to keep the entire page table in real memory. Since page tables are so small, there is no need for the page table length register described in Section 2.3.3. It is quite practical for the system to use a fixed page table size of 64 words for all address spaces, and simply set the fault bit of all entries corresponding to out-of-bounds entries. Eliminating the page table length register also frees a 18 bit field in the exchange package for other uses. Section 2.3.3 described how the page table pointer register is included in the exchange package, but left the exact interpretation of the address contained in that register unspecified. Since we have elected not to page the page table, this address can simply be the real address of the table. If the contents of this register are to replace those of the FL register in the exchange 22 package, the register width will be limited to 18 bits. Thus, all page tables would have to be located in the first 262K of real memory.. There is space available in the exchange package with which the register could be widened somewhat, but to expand it to the 30 bit size of the real address field in the page table entries would be difficult. In any event, there is no real need to expand this register beyond 18 bits, for it is unlikely that the entire 262K space would ever be filled with page tables, and with the virtual addressing scheme there is not much else that must go there. 2.6 FAULT MECHANISM Proper design of the interface between the paging hardware and the monitor at page fault time is crucial to the success of the project. It must be efficient, and it should also promote good structuring of the monitor. 2.6.1 Exchange Address In the current architecture, exchange jumps are already used to signal other asynchronous events to the CPU, and it seems natural to use this mechanism for page faults too. The first question raised by this choice concerns the exchange address; i.e., the location from which the 23 exchange package for the paging manager is loaded, and the exchange package of the faulting program stored. This location might be the same as used for other types of exchange jumps (the Monitor Address), or could be separate and reserved exclusively for page fault exchange jumps. This question is closely related to decisions about the structure of the monitor, because exchange packages are equivalent to the entry and exit points of this software. Having one exchange package implies that the monitor has a single entry point, and appears (to user jobs, at least) to be a single block of code. If there is a specialized exchange package to be used after page faults, by implication the system software is divided into two parts--the paging manager and the remainder of the monitor. The two-exchange package scheme is initially attractive. It isolates all of the support software for the virtual memory feature (the paging manager) from the remainder of the monitor. This would be especially useful if the monitor itself was paged. In addition, page faults could be processed quicker since the paging manager was entered directly rather than through some intermediate code. Despite these advantages, the single-exchange package scheme will be chosen for this design. This choice is 24 based on an analysis of the best structure for the monitor software. 2.6.2 The HUb The code that makes up the CPU portion of the monitor can be divided into several ”tasks." Each task is concerned with a particular aspect of the monitor's Operation, and typically maintains its own tables and local variables. If this concept is expanded to include user job's as tasks, all of the programs to which a CPU can be allocated are tasks. The form of this allocation is either subroutine-like, in which the task keeps the CPU until it is completed, or coroutine-like, in which the task may give up the CPU at various points in its execution, in expectation of getting the CPU back at a later time. The code that allocates the CPU's to various tasks has several functions to perform besides the elementary one of controiling the CPU's. First, this code implements the various wait states that tasks may pass through as their execution progresses, including waiting for a page fault to be resolved. Secondly, since some monitor tasks may be reentrant, the CPU allocation code must not create a new incarnation of such a task if another incarnation of the same task is waiting on a page fault. Thirdly, in a ZS multiple-CPU system the CPU allocation code is responsible for implementing mutual exclusion between the various CPU's through keeping the CPU away from a task until another CPU is done with it. Thus, the CPU allocation code controls the Operation of the rest of the software, both user and system. It is also one terminus for all transfers of control between the various modules, as shown in Figure 1. This code will therefore be called the "Hub." Since the CPU allocation module implements software interlocks, simulaneous execution of it by different CPU's must be prevented. This can be achieved through use of the monitor flag already included in the 6000 architecture. This hardware flag is cleared at deadstart time, and toggled by each subsequent exchange jump. If one CPU tries to do an exchange jump while another CPU's monitor flag is set, the exchange jump is delayed until the other's monitor flag is cleared. If the monitor flag is set during execution of the CPU allocation code, simultaneous execution by multiple CPU's is prevented. Since the tasks called by the CPU allocation code are protected through its software interlock function, they can run with the monitor flag 26 a... .eaa =2: “gasp SEE: =5: ...o=ea as: ..p u. ..=u_. .2 .am: a2. 2: .2— 3.2.... as"... .2 .o»: am.— .»253. a: 27 cleared. Thus, the amount of time that the Hub is occupied by a CPU is minimized. This results in the following structure: When a user job is interrupted, the exchange jump begins execution of the Hub, with the monitor flag set. The Hub then selects a new task to execute, and calls that task with an exchange jump, which clears the monitor flag. When the task is completed, or is interrupted for any reason, the CPU returns to the Hub, which selects a new task to run. Thus, the Hub must be called by the page fault mechanism, just as it is called after any event that causes a monitor exchange jump. The Hub should have single entry point, so page faults will use the same exchange package address as other types of exchange jumps. (Note: A variation of the Hub concept was independently developed by Lawrence J. Kingsbury, et. al. [5, pg. 2].) 2.6.3 Provisions for Restart Most processors that use paging execute their instruction streams serially; i.e., each instruction's execution is completed before the next's is begun. Memory reference instructions also frequently include some sort of data 28 maniputation along with their memory reference function. In such systems, an instruction that causes a page fault must be interrupted, and restarted from its beginning when the referenced page has been fetched from secondary memory. This is easily accomplished in such systems by setting the instruction pointer of the interrupted program to point to the instruction that caused the page fault. This method of restarting after a page fault is not appropriate for the 6000 architecture, because this architecture was designed from the beginning to support parallel execution of instructions. By the time a page fault is recognized, some implementations might have already completed several of the instructions that follow the memory reference in the instruction stream. If the P register was reset after the page fault to point to the faulting instruction, these instructions would be re-executed, with disasterous results. Fortunately, the parallel design of the processor implies that only msm9£1-£sistsnsss, not insinustiens, need be restarted. When a page fault occurs, the CPU can complete_ any non-memory reference instructions that it is currently executing, and leave the P register in its exchange package pointing to the next sequential instruction word. In addition, the CPU must somehow identify in the exchange package the memory reference that caused the page fault. 29 when this exchange package is reloaded into the CPU after the page fault has been processed, that reference will be restarted, and instruction execution resumed at the next sequential instruction. All that need be added to the exchange package to provide this restart information is a "reissue” field of 7 bits, each of which is associated with an AIX register pair. (AD/x0 are not used for memory references, and therefore need no flag bit.) A particular bit is set if a memory reference from that register pair caused the page fault. When the exchange package is reloaded into the CPU, the memory reference is reissued, using the address and data information still present in the A/x register pair. There are several complications that force modifications to this scheme. Most importantly, some parallel implementations of the architecture may allow more than one memory reference to be active at the time a page fault is detected. Also, there may be more than one instruction in a memory word. In the modified fault sequence, the CPU finishes instructions that were in progress when the fault was detected, but does not begin execution of any more. A new two bit field in the exchange package is used to indicate where (in the memory word pointed to by the P register) execution should resume. The system then waits for all outstanding memory references to be completed. If 30 any of these references cause additional page faults, their reissue bits are set in the exchange package, along with that of the reference that caused the original fault. When there are no more active memory references, the page fault exchange jump occurs. when the exchange package is reloaded into the CPU, the memory reference that caused the page fault will complete normally, as will any other reissued references that access the same page. If any of the reissued references access another page, which is still not in primary memory, another page fault will immediatly occur so that this page can be fetched. waiting for all memory references to complete before doing the page fault exchange jump might be a serious waste of time in some implementations. It can be avoided by aborting all outstanding memory references when a page fault is first discovered, and setting the reissue bits so that they all are restarted when the exchange package is ultimately reloaded into the CPU. 31 2.6.4 Supplying Fault Information Once the page fault has occured, the paging manager must be able to identify which page it should cOpy from secondary storage into central memory. This information could be presented in several different forms. The most basic would be the virtual address that caused the fault. Since the word offset is of no use to the paging manager, this could be simplified to the page ID portion of the address. The paging manager would add this to the address of the page table (obtained from the user's exchange package) to determine the address of the faulting page table entry. This, however, is exactly what has already been done by the paging hardware, and if this address is provided to the monitor by the hardware, the software can go directly to the page table entry. This method will be chosen for the design. There are several possible methods for transmitting the page table entry address to the paging manager software. It could be placed in a particular CPU register when the Hub's exchange package is loaded. This has the disadvantage of possibly constraining the Hub software in its use of registers. In addition, the Hub has no need for the detailed information about the page fault: All it 32 needs to know is that a page fault has occured, so that it can call the paging manager. If it must pass the detailed information to the paging manager, the length of time the monitor flag is set will increase. This could reduce the performance of the system. The alternative is providing a new CPU instruction, which loads the address of the faulting page table entry into a register specified by the instruction. Although this method is slightly slower than automatically loading the information into a register, it has none of that method's disadvantages. If the instruction is executed when there .is no page fault outstanding, the value loaded into the register will be zero. Thus, the Hub can determine whether a page fault has occured with a simple zero-test. 2.6.5 The CPU Error Sequence The current architecture defines an error sequence that is performed by the CPU when it detects an error in the program it is executing (undefined operand, memory bounds error, etc.). This sequence includes a write tp the first word of the user address space to store information about the error. This write presents a problem if the page containing this word is not resident in real memory, because there is no way to set the user's exchange package 33 so that the error will be repeated after the page fault is serviced. It seems that the best way to handle this problem is to require the monitor to keep in real memory the first page of any user who is eligible for execution. 2.7 PROCESSOR USAGE So far we have spoken of the paging manager as a part of the monitor running in the CPU. However, the original system concept for the 6000 family [6, pg. 4] put most of the operating system functions into the PP's, and reserved the CPU for user program execution. Since it is desirable to minimize the amount of CPU time used in paging overhead, and since page fault handling is mainly an I/O operation anyway, it is reasonable to consider assigning primary responsibility for servicing page faults to the PP's. The initial problem in implementing this scheme concerns how to notify a PP that a page fault has occured. In the current architecture there is no way to "interrupt“ a PP. If such a mechanism was added, some method would have to be developed for the hardware to determine which PP's were busy with other tasks and which could be interrupted to perform the paging. Complications such as this make this approach unattractive. 34 The alternative is the method currently used by 6000 Operating systems to handle asynchronous external events--dedicating a PP to watch for event to occur. The falling cost of hardware probably makes this lavish use of PP's acceptable. Note, however, that not much is gained by handling page faults in the PP's. If the page fault was caused by the CPU (there will be other types of page faults-see below--but the CPU is the most likely source), the CPU might as well be exchange jumped by the fault signal, since the job in execution will have to be blocked until the paging I/O is completed anyway. Once the CPU is exchange jumped into the monitor, it can probably notify the paging service PP of the fault almost as quickly as a direct hardware connection could. In addition, the CPU could direct the paging request to whichever PP it chose. In summary, there is no significant advantage to allowing PP's to directly receive paging requests. Accordingly, this design will direct the fault information to the CPU, as described in the previous section. 35 2.8 INTER-ADDRESS SPACE REFERENCES In the virtual memory system, pages belonging to user address spaces are scattered over real memory, which may exceed the 262K span that the CPU can reference using real addresses, and some user pages may not be resident in real memory at all. This presents a problem to the CPU and PP portions of the monitor, who are often called upon to reference various portions of a user address space in performing monitor functions. Although both the CPU and the PP portions of the monitor need to reference user address spaces, there are significant differences in the way that the two will perform these references. The CPU portion of the Operating system will be executing from its own address space, and most of its data will be located there too. The CPU's references to other address spaces will mostly be limited to fetching and storing operands for system action requests. On the other hand, the PP's execute out of their own memories. They also reference system tables in the monitor address space, but due to their function as IID processors for the system, their references to user memory frequently take the form of block transfers. 36 Due to these differences, this section will only consider the CPU side of the problem. A later section will extend the concepts developed herein to the PP's. Hithout some assistance from the architecture, the CPU monitor would face serious difficulties when it tried to reference a user address space. Addresses given to the monitor by the user or that the monitor maintains itself would be in virtual form, and these would first have to be translated to real addresses. As the program performed the translation it wOuld have to ensure that the page was currently in real memory, page it in if it was not, and then lock it in primary memory to ensure that its real address did not change. To perform the actual reference the CPU would need several new instructions that would accept real addresses larger than 18 bits. If the CPU subsequently needed to reference additional words (perhaps part of the same table), it would have to check to see if these other words were in the same page, and repeat the entire procedure if they were not. If the monitor had to follow this procedure for every reference to a user address space, the software would be both complex and inefficient. The alternative is to use the virtual memory mechanism for these references. This requires some modifications to the Hub/paging manager software discussed in Section 2.6.2, to process page 37 faults from monitor tasks. Once these modifications have been made, entire monitor tasks, not just the user address space references, can run in virtual memory mode, and this will be assumed in the discussion that follows. One possible scheme using the virtual memory mechanism would consist of duplicating some of the user's page table entries in the monitor's page table. Upon being given an address in the user's address space, the monitor would locate the corresponding entry in the user's page table, and copy this entry into a pre-arrangefl location in the monitor's page table. It would then be a simple matter for the monitor to compute the address of the target word in its own address space and perform the reference. Such complications as non-resident pages would be automatically handled by the usual virtual memory mechanisms. If the monitor anticipated referencing more than one word within the same area, it could simply copy enough page table entries to assure that all references would lie inside the block. It could then ignore the page boundaries. The above scheme would still require the CPU program to do a significant amount of work to perform the cross-address space access. This work could be entirely shifted to the hardware through the implementation of a set of "alternate 38 address space” instructions. One of the operands of these privileged instructions would be a pointer to the page table of the address space the monitor wishes to access. These instructions would execute similarly to the increment instructions, but the value set into the A register would be interpreted by the paging hardware using the alternate page table. Both non-resident pages and page boundaries would be handled automatically by the paging hardware. This is the scheme that shall be adapted for cross-address space references. Further details of the alternate address space instructions are simple to develop, and will not be considered here. 2.9 MONITOR-MODE PAGING The solution to the problem of inter-address space references prOposed in the previous section requires that the monitor tasks run with the virtual memory feature enabled. However, it is not immediately obvious whether this is practical, or what changes to the architecture might be necessary to make it practical. 39 There are, of course, other advantages to using virtual memory in the monitor besides facilitating inter-address space references. Perhaps the most important benefit is programming simplicity, since the monitor code can be written with minimal concern for memory management or the overlaying of program segments. An additional benefit is a reduction in the amount of real memory consumed by the monitor, since the size of the working set will almost certainly be smaller than the storage required by an equivalent non-paged monitor. There will undoubtedly be some monitor tasks that cannot run in virtual memory mode. Deadstart processing and recovery from hardware failure are two of these. However, the Multics operating system [7] has shown that an Operating system can be written that normally uses only virtual memory, so the number of these tasks should be quite small. One potential defect of a paged operating system is inefficiency due to frequent waits for paging. The harmful effects of paging can be minimized, however, by keeping frequently used pages permanently resident in real memory 0 Paging monitor tasks puts some new restraints on the interface between these tasks and the Hub. These tasks 40 can now become unrunable for the duration of the paging I/O, and it is up to the Hub to recognize this, and hold subsequent calls to a blocked task until the first task can be completed. Thus, no call to a task may assume that the task will immediately be run. If either the Hub or the paging manager caused a page fault, there would be no way to process that fault without falling into infinite recursion. Thus, both the paging manager and the Hub, and all data referenced by these modules, must kept resident in real memory at all times. 2.10 ECS The CPU instruction that starts an ECS transfer uses two addresses--one for central memory and one for extended core storage. He shall leave the ECS addressing scheme as it is, but must decide whether the CM addresses used in ECS transfers will be real or virtual addresses when the CPU is running in virtual mode. Since ECS is controlled by CPU instructions, it seems most consistent for it to use virtual addresses just as the rest of the CPU instructions do. However, before we accept this choice we must make sure that it does not result in problems when ECS is used as a paging device. 41 It is true that when the page fault occurs there is no page table entry pointing to the location to which the page is to be read. However, such an entry must be created by the paging manager anyway, to be inserted into the page table of the faulting program. If the entry is formed before the page is read in, it can be used for the ECS transfer too.’ Thus, ECS can use the virtual memory mechanism without problems. Since ECS is so closely related to the CPU, it can share the CPU's page table pointer register. ECS transfers never overlap exchange jumps, so the addresses specified by ECS instructions will always be interpreted according to the proper address space. The fault mechanism developed for the CPU can also be used for ECS transfers. However, the current architecture makes no provision for restarting an ECS transfer at the point at which is was interrupted. Instead, after the paging I/O is completed, the ECS transfer must start again from its beginning. This situation results in two problems. The first is that if a large transfer encounters several page faults, it will have to be restarted from the beginning several times, resulting in much wasted memory traffic. 42 The second problem is a result of the first. There is the possibility that the paging manager, in processing one page fault, will remove a page from real memory that was referenced earlier in the transfer. If the amount of real memory available is smaller than the size of the transfer, ‘ the transfer can never complete without causing a page fault, and therefore will lOOp forever. A solution to the second problem is to prohibit ECS transfers larger than the amount of real memory available to the job. This may or may not be practical, depending upon the design of the operating system. It also does nothing about the waste of repeated transfers. An alternative that solves both problems is to modify the architecture so that interrupted ECS transfers can be restarted from the place at which they were stopped. This can be effected by adding a new field to the exchange package to contain the number of words already transfered. This field would have to be at least 18 bits wide to accomodate the maximum word count that can be specified for these instructions. Hhen the ECS transfer is restarted, the contents of this field would be added to the beginning central memory and ECS addresses to locate the next word to be processed. These addresses would be available in A0 and x0, and the final word count for the 43 transfer could be computed from the ECS transfer instruction, pointed to by the P register. This change in the architecture has benificial effects aside from its connection with the virtual memory feature, because other types of exchange jumps can now occur during an ECS transfer. Thus, the responsiveness of the CPU part of the system is improved, and possible wasted memory traffic is eliminated. 2.11 PERIPHERAL PROCESSOR MEMORY ACCESSES So far, it has been decided that both the CPU's and ECS will use virtual addresses for normal memory accesses. It is not immediately obvious whether the same policy should be followed for the third source of memory accesses--the peripheral processors. The most compelling benefit of passing PP memory references through the paging mechanism would be software simplicity. The entire operating system could operate with virtual addresses, and all real addresses would be confined to the paging manager. The type of address translation software described in Section 2.8 would be unnecessary. 44 One of the tasks of the PP's is moving memory pages from real memory to secondary memory and vice versa, and the objection may be made that paging I/D cannot take place using virtual addresses. This objection is similar to the one made concerning ECS in the previous section, and as described there, the virtual memory mechanism is no obstacle. On the whole, then, it appears advantagous for the PP's to use the virtual memory mechanism. The next step is to see whether any problems arise in adding this feature to the architecture. Each PP in the system can potentially be working with a different address space (and thus a different page table) at the same time. Thus, each PP must have its own page table pointer register. Athough new PP instructions could be defined to load and read this register, it is more efficient to make use Of the large number of I/O channel numbers that do not correspond to actual channels. Two of these “pseudo-channels" (to accomodate the 18 bit page table address-~see Section 2.5) could be used to both load and read the page table pointer register with the existing PP I/O instructions. To simplify the PP software, the same channel numbers would be used by all PP's, but each PP could reference only its own page table pointer register. 45 As indicated in Section 2.9, there are undoubtely some situations in which the monitor, including the PP's, must run in real memory mode. Thus, there must be a hardware flag to control which mode the PP's use for central memory accesses. To provide maximum flexiblity, each PP can be given its own flag, and this flag can be set and cleared through the same pseudo-channels used to load the page table pointer registers. 2.11.1 Page Fault Mechanism It would be possible, of course, to establish the convention that PP's could not reference pages that are not in real memory, and thus could not cause page faults. However, this would force the PP's to invoke the paging manager by software to get the pages they needed to reference. Thus, the paging manager would be invoked by two very different paths to perform the same function. This is bad software design, and should be avoided if possible. If PP's are to be allowed to cause page faults, it must be decided how the hardware is to process these faults. One approach would make each PP wholly responsible for handling its own page faults. An error exit or indication could be added to each instruction that accessed central 46 memory, which would be invoked if the referenced page was not available. Unfortunately, it is quite difficult to add error exits to the existing PP architecture, and any such scheme would require major changes to existing software. Another alternative is to suspend any PP memory access that referred to a non-resident page, and convey the fault to another PP for processing. This scheme is essentially the same as that discussed in Section 2.7, and the same objections apply. The third alternative seems to be by far the best. .It is to handle page faults caused by PP memory accesses in essentially the same way as page faults from the CPU or ECS. Hhen a PP referenced a non-resident page, the reference would be suspended, and the CPU exchange-jumped to the paging manager, who would then fetch the page from secondary memory in the usual way. The only issue that remains to be decided concerns how a PP.memory reference is to be suspended if it causes a page fault. One alternative is retry the reference, perhaps with a delay between tries, until the referenced page is resident in real memory. Of course, the CPU would be exchange-jumped only after the original try. This scheme has the advantage of requiring minimal changes in the 47 architecture, but leads to two types of inefficiency. The repeated tries are inefficient, because they increase the workload of the paging hardware, and if there is a delay inserted in the retry cycle, there will be wasted time between when the page is made resident and the next retry. The alternative requires the PP to wait for a signal from the CPU before retrying the memory access. This signal would be generated by a new instruction added to the CPU's instruction set. Since more than one page fault could be outstanding at any one time, this instruction would have to specify which PP is to be restarted. This in turn implies that the identity of the faulting PP is included in the information passed to the CPU by the hardware when the page fault occurs. 2.11.2 Alternate Address Space References The PP's share with the CPU portion of the monitor a need to reference two different address spaces--that of the user, and that of the monitor. This requirement can be satisfied with an adaptation of the alternate address space instructions proposed for the CPU in Section 2.8. Two new instructions must be added to the PP instruction set to perform memory reads and writes to the alternate address space. To permit block transfers, these 48 instructions operate similarly to the existing CRM and CUM instructions rather than the CR0 and CH0 instructions. Since the PP's only user-programmable register, the accumulator, is already used by these instructions, the page table of the alternate address space must be specified by another page table pointer register, loaded through I/D channels just as the primary one is. 2.12 ACCESS CONTROLS A paging system provides the opportunity to control usage of memory by the system on a page-by-page basis, through access control flags in each page table entry. Access modes that have been supported in commercial systems include execute-only, data-only (no execute), and read-only. In systems that share pages between address spaces a write-only mode is also useful. There are plenty of unused bits in the page table entries, so it is simple to add access control flags to the entries. The choice of CPU action after a violation of the access controls, and how these controls should be extended to ECS and the PP's, seems less clear. Uhen the CPU violates the access controls, it could either perform an error sequence similar to that executed after 49 mode errors, or it could cause a page fault exchange jump. In either case, the hardware would have to provide some way for the monitor to determine the cause of the error. Since access violations would seem in most cases to be valid errors rather than expected events, the error sequence seems most appropriate. As part of the existing error sequence the CPU stores information about the error in memory, and there are sufficient unallocated bits in this word to accomodate the additional error information. Since ECS transfers are completely controlled by the CPU program, it seems appropriate to apply access controls to ECS also, and to follow the CPU error sequence if the controts are violated. Access controls might pose a bit of a problem to the paging manager if it used ECS for swapping, since the manager should be able to read and write user pages even if they are marked write-only or execute only. However, the paging manager already has the ability to modify the page table entries, and it can clear these control bits before beginning the transfer. The benefits of applying access controls consistently seems to outweight this slight inconvienience. It is significantly more difficult to apply access controls to memory accesses from PP's, because these processors do not have a hardware mechanism for aborting 50 their programs. Because of this difficulty, it appears that PP's must excluded from memory access controls. Execute-only mode usually implies read-only mode. In the 6000 architecture this results in difficulties for code that includes subroutine entry points, because the RJ subroutine call instruction stores its return pointer in the first word of the called subroutine. However, it is a simple matter to exempt this instruction from the read-only restriction. QUAEIER-IUBE§ IIBIUAL-EE!QBI-I!ELE!ENIAIIQN A machine architecture is not of much value if it cannot be implemented efficiently and economically. This chapter will describe one possible implementation of the virtual memory system described in Chapter Two. The description will be at the register-transfer level, and will ignore considerations related to technology and CPU design. 3.1 SYSTEM DESIGN Figure 2 depicts the functional blocks of a 6000-architecture machine without the virtual memory capability. All of these blocks have been mentioned earlier except Central Memory Control, labeled as CMC in the diagram. CHC accepts memory references from the CPU's and PP's, and distributes them to the memory array. The primary change which must be made in this system for the virtual memory feature is the insertion of the paging mechanism, which translates virtual addresses to real addresses, somewhere in the path between each processor and memory. 52 ECS C MD PP's Memory Figure 2: Unmodified System 53 If the paging manager (which will be abrieviated PM) is inserted between CMC and the memory, only one PM will be needed for the entire system. This arrangement has two primary difficulties. One is that all events in the processors that might change their address spaces (exchange jumps, loads of the PP page table pointer registers, etc.) must be transmitted through CMC to the PM. This increases the complexity of both the CMC and connections to it. The second difficulty is that a centralized PM will contain some components that must be duplicated for each processor in the system. Since different systems might be configured with different numbers of processors, the PM would either have to be customized for each system, or it would have to be designed to accomodate the largest possible system. Either alternative would increase the cost of the unit. Instead, the arrangement (Figure 3) that will be considered here provides a PM for each CPU, and another PM that is shared by all PP's. Although it might be possible to make all PM's in the system identical, we shall develop different designs for the CPU and the PP PM's, to correspond to the differing characteristics of the two types of processors. ECS CPU CPU PM PM Memory PP's Figure 3: Modified System 55 3.2 CPU PAGING MECHANISM A register-transfer-level diagram of the PM to be connected between CPU's and CMC is presented in Figure 4. Connections to the CPU enter at the tap of the diagram, and connections to CMC are at the bottom. Only the major data paths are depicted; There are control signals that connect all units within the diagram and the CPU and CMC. The PM is organized around a high-speed memory of 65 words by 60 bits, called the table memory. This memory is used to contain a copy of the current page table, loaded at exchange jump time. The 65 words of the table are sufficient to contain the page table, plus one additional word used for the alternate address space instructions (see below). 3.2.1 Exchange Jump Sequence when the CPU begins an exchange jump, the paging mechanism must save the contents of the previous address space's page table, contained in the table memory. Each word of the table is sent to CMC with an address formed by adding the page table address supplied by the CPU to the contents of the counter. This counter is incremented for each word. Since the real addresses accepted by the CMC are 42 CPll CPll CPll Page To?“ Adr. Address llata Page ID E i 12 Mord foset - x “\ load Memory d Table Memory (65 entries) 2 "so mm COM" ”3' L NRlTE>—v Fault Detect Save Memory 4% A ,1... Access Control Page “m“ 39‘ ll \\ 24 w Zeroes - Page Table Entry Address ‘2 CTN] Ila a Fault Address Figure l= CPll Paging Mechanism 57 bits wide (30 bit page pointer plus 12 bit word offset), the output of the adder is concatenated with 24 zero bits. After the old page table has been saved in memory, the CPU examines the paging mode bit of the new exchange package. If the bit is clear, the PM is disabled, and its part of the exchange jump is finished. If the paging mode bit is set, the PM reads the new page table into table memory. The addresses send to CMC for each read are formed in the way described in the last paragraph. 3.2.2 Normal Memory References Normal memory references begin with the CPU supplying the virtual address on the address bus, and if this is a write operation, the data word on the data bus. The PM splits the virtual address into its word offset and page ID components, and the word offset is immediately sent to the CMC. The page ID simultaneously goes to several destinations. The page ID is used to address one of the page table entries contained in the table memory. The access control bits from the entry are checked against the type of memory reference in progress. Any access violation is signaled to the CPU, and aborts the memory reference. 58 The fault bit from the page table entry is then checked. If it is set, a page fault is signaled to the CPU. An adder computes the address of the faulting page table entry from the page ID and the contents of the page table pointer register. This address is sent to the CPU over the utility data path, and will ultimately be passed to the paging manager software. If none of these errors has been detected, the page address field from the page table entry is sent to CMC. This field, concatenated with the word offset, forms the real address of the referenced memory word. Two fields of the table memory word addressed by the page ID are updated by every memory reference. The use count is increased by one, assuming the maximum use count has not already be reached. If the CPU memory reference was a write, the write flag in the entry is set. If the CPU memory reference was a read, the current value of the write flag is retained. 3.2.3 Alternate Address Space References The CPU alternate address space instructions do not use the standard page table. Instead, during execution of these instructions the CPU places on the utility data path 59 the address of the alternate page table, specified as an operand of the instruction. Address and write data information is transmitted from the CPU as for normal memory references. The word offset and write data from the CPU are temporarily blocked from the CMC. Instead, a real address formed by adding the page table address and the page 10 is sent to the CMC with a read request. This fetches the appropriate page table entry, which is returned on the data bus. The PM stores this entry into the 65th word of table memory. It then performs a normal memory reference cycle, except that the page ID is ignored, and the 65th table entry is always used. Since the alternate address space instructions will probably be used only by the monitor, it is not absolutely necessary for the page table entry used by the alternate address space instructions to be updated. If at a later time it does become necessary, this update can be performed through a simple extension of the above mechanism. 60 3.3 PP PAGING MECHANISM PP memory references will tend to be mOre localized within an address space than CPU references. In addition, there are far more PP's than CPU's in a system. For that reason, a different paging mechanism (Figure 5) is used to connect the PP's to CMC than is used by the CPU's. The data received from the PP chassis for each memory reference include the number of the PP originating the reference, the contents of the page table pointer register (normal or alternate) to be used, and the virtual address and write data. The information sent to the CMC is identical to that generated by the CPU PM; i.e., real address and write data. There is also a path to carry the addresses of faulting page table entries to the CPU. The PM is organized around memories for the page table pointers, page 10's and page table entries. These memories contain one entry for each PP in the PP chassis, and are addressed by the PP number. They are used to remember the last page table entry used by each PP. A memory reference begins by comparing the current page table pointer and page ID aginst the contents of the corresponding memory entries. If either does not match, a new page table entry must be fetched. First, the entry 61 PP pp Page T IrP e Adr. Number Address Data 181 Y-—->(te all memones) I Word '2 x Ego Ill 9 Offset 4 load Memory g __L_ Table Page Pointer Memory Memory l W Save Memory ~ '3'" r m an .2: Detect age ress ‘ - Page Table Entry Address 1 CPU Fault Address Figure 5: PP Paging Mechanism 62 currently contained in the memories is written back to real memory, using an address formed by adding the current contents of the page table pointer and the page ID registers. The new values for page table pointer and page ID are then written into their appropriate memory slots, and a read request sent to CMC using an address formed from the new contents of the two memories. The page table entry returned by CMC is placed in the PM memory. Uhether or not a new page table entry was fetched, the next step in the memory reference is to examine the page fault flag in the entry. If this bit is set, a page fault signal is sent to the CPU, accompanied by the address of the entry as computed by the adder. Otherwise, the real address for the memory reference is formed by concatenating the page address from the page table entry with the word offset from the virtual address. This address is sent with the write data to CMC to perform the memory access. SEAEIEB-EQUB VIBIUAL-!A£EINE-EEAIUBE A virtual machine system is one in which each user is presented with an interface that in almost all aspects is identical to that of a dedicated hardware system. The resemblance between each user's virtual machine and a real machine is close enough that the user can utilize a standard operation system to control his virtual machine. 4.1 BENEFITS AND PENALTIES Possible benefits of a virtual machine capability include: 1. Easing or eliminating the need for program conversion between various Operating systems; 2. Atlowing system software to be develOped and tested without tying up an entire machine; 3. Facilitating achievment of security and privacy goals by providing a well-defined and easily-controllable interface to user jobs. [8, pp. 270-278] 64 4. Matching features of other manufacturer's machines. Some of the possible disadvantages of such an extension include: 1. Additional hardware and software complexity. (Software support for the I/O portion of a virtual machine capability would be especially complex--see below.) 2. Significantly increased system overhead; 3. Possible constraints on the characteristics of I/O devices that could be attached to system (see section on virtual I/O); 4. Difficulty in supporting the system console of virtual machines. Despite the similarity in names, it is not absolutely necessary for a system with the virtual machine capability described in this chapter to have the virtual memory feature described in Chapter Two. However, as noted in that chapter, a machine without the virtual memory feature can have a maximum of 262K of memory. This limits both the memory size that can be configured into a virtual 65 machine, and the number of virtual machine memories that can be present in central memory at the same time. This in turn impacts system peformance by reducing the level of multiprogramming possible. In addition, a non-virtual memory system requires that the memory of each virtual machine be transfered in and out of central memory as a whole. This is quite inefficient when only a small portion of that memory may be accessed before the memory is swapped out again. Despite these considerations, the discussion that follows will assume that neither the real machine or the virtual machines to be simulated on it will have virtual memory. This is done to avoid needless complexdty, and so that both architectual extensions considered in this thesis will begin from the same base machine. The modifications needed for virtual memory are straight-forward. The goal of virtual machine support features is to present the full architecture to each virtual machine, while protecting other virtual machines and the real machine's monitor from all users. For the CPU, the extensions consist primarily of memory relocation/protection and mode extensions. The PP's require load/dump facilities, virtual I/O support, and central memory relocation/protection. 66 4.2 CPU MODE EXTENSIONS Since it is intended to be capable of running standard operating systems, the virtual CPU has both user and monitor modes. There is also a mode under which the software implementing the virtual machines can run, which will be called "real monitor mode." To allow the hardware to run unmodified operating systems in non-virtual mode, there is also a mode, called "real user mode,“ which operates identically to user mode in the current architecture. (See Figure 6) There are also two varieties of exchange jumps, differing in purpose and in the contents of their exchange packages. A virtual exchange jump moves the machine between virtual user and virtual monitor modes. Since this exchange jump implements part of the virtual machine architecture, its exchange package is identical to that of the current architecture. A real exchange jump moves the CPU between virtual mode (either virtual user or virtual monitor) and real monitor mode. Here the exchange package may be extended to contain information relevant to the operation of the virtual machine support features. An exchange jump between real monitor and real user modes will be considered to be a special case of the real exchange jump in which the virtual machine information in the exchange package defaults to zero. 67 Monitor llser lleal Com atallle Real Monitor ser Program Program User‘s llser's Virtual Monitor Problem Program Program Figure 6: Mode Comllioations 68 Relatively little state information need be added to the CPU to implement this system of modes. Hithout constraining actual implementations, a flip-flOp within the CPU can be considered to indicate whether the processor is in user or monitor mode (either virtual or real). A virtual exchange jump complements the value of this flag. Hhen a real exchange jump occurs, the old value of the flag is stored as part of the exchange package put into memory, and the flag is set according to the contents of the new exchange package. This flag is used to determine which form the CPU Central Exchange Jump instruction takes. There is also a "virtual" flip-flop which is set when the CPU is in either virtual user or virtual monitor modes, and cleared when the CPU is in either of the other two modes. Its value is neither stored into nor set from an exchange package, but is complemented by a real exchange jump. This flag performs the CPU interlock function assigned to the monitor mode flag in the current architecture, i.e., only one CPU in a multiple-CPU configuration can have its virtual mode flag cleared simultaneously. The flag also controls Operation of the memory relocation and protection circuitry. 69 4.3 CPU MEMORY RELOCATION AND PROTECTION To allow efficient multi-programming of the real system, it must be possible for more than one virtual machine's memory to be present in real central memory at the same time. In addition, the contents of each one of these virtual memories must be protected from the actions of other virtual machines. This implies that some sort of additional relocation and protection mechanism must be available to the real machine. The mechanism already present in the architecture (implemented using the RA and FL registers) cannot be used because it must remain available for use by the virtual machines. The most straight-forward solution to this problem consists of adding another RA and another FL register to the processor, and using these for all memory references from a given virtual machine. These registers, which will be referred to as the "super-RA" and “super-FL" registers, are used as follows: If the machine is in real mode, neither register is used. If the machine is in virtual monitor mode, the contents of the super-FL register are compared to each memory address. An out-of-bounds address results in a real exchange jump. If the address is valid, the contents of the super-RA register are added to the memory address and the result is sent to real central memory I 70 If the processor is in virtual user mode, the contents of the FL register are compared to each memory address. If the limit is exceeded, a virtual exchange jump results. Otherwdse, the contents of the RA register are added to the address, and the result compared to the contents of the super-FL. Again, if this limit is exceeded, a real exchange jump is generated. If the limit is not exceeded, the contents of the super-FL register are added to the result of the previous addition, and the final sum passed on to real central memory as the reference address. Since the contents of the super-FL and super-RA register , are associated with a given virtual machine, it is natural to load and store them as part of the real exchange package. There are two 18 bit fields available in the exchange package for this purpose. 4.4 VIRTUAL PP'S The 6000 architecture is rather unique in that a system contains multiple processors of two quite different types. This poses some difficult problems for the designer of a virtual machine feature for this architecture. A similar case is that of VM/370, whose channels can be considered to be separate processors. Even though these processors are quite limited in power, and execute out of the same 71 , memory as the CPU, their support in IBM's virtual machine system requires a significant amount of software. For example, virtual channel programs must be "translated" before they can be executed by real channels. [3, pg. 124] The goal of permitting all present 6000-family operating systems to run without modification on a virtual machine puts rather stringent restrictions on the implementation of virtual PP's. The hardware cannot make any assumptions about the contents or behavior of the PP software, for example, nor significantly slow its execution. The execution time requirement seems to rule out interpretation of PP software by the CPU or another PP. The only alternative remaining is direct execution of PP software by real PP's, in a manner similar to that of virtual CPU implementation. when a virtual machine is scheduled for execution on the real machine, some of the real PP's will be assigned to emulate the virtual machine's PP's, just as the real CPU will be assigned to emulate the virtual machine's CPU. Since one real PP is needed to emulate each virtual PP, and the virtual machine must have a full complement (10 or 20) of PP's, and since the real machine must have some PP's left over for its own use, the real machine must have a large number of PP's. 72 Many changes must be made to the PP architecture to support virtual mode execution on PP's. First, the system must have some way to load and store the contents of all virtual PP memories, their A and P registers, and the status of the channels, all without cOOperation from the PP software. This mechanism will be used to swap a virtual machine in and out of the real machine, and simulate the deadstart sequence. The system must also have some way to intercept transactions on the virtual I/D channels and redirect and interpret them. Also needed is relocation/protection hardware, similar to that used on the CPU, to limit central memory references to this virtual machine's central memory. Each real PP must contain a virtual flag whose meaning is similar to the CPU's virtual mode flag. Uhen set, this flag indicates to the PP hardware that it should use the hardware features added to support virtual operation. 4.4.1 Virtual PP Load/Dump Instructions I The dump and load operations are used to prepare the PP portion of a virtual machine for execution on the real machine, and to terminate that execution. The dump operation stops execution of all PP's devoted to the virtual machine (if they have not already been 73 stopped--see below), and saves the state of those PP's in central memory. The load operation takes the information saved by a dump and reloads it into real PP's. This dump/load cycle must be invisible to the virtual PP software. The destination of the information from a PP dump could be an I/O device or ECS. However, the most general design, and also the one most consistent with the remainder of the architecture, is for the transfer to reference a buffer in central memory. From there the data can be moved to an external I/O device if necessary. An important question to be decided concerns how the dump - operation is initiated. The CPU analog of the PP dump is the exchange jump, since it too saves part of the state of the virtual machine, and it therefore appears attractive to link PP dumps with CPU exchange jumps. If a dump operation was initated when a real exchange jump moved the CPU from virtual to real mode, the entire processor state of the virtual machine would be saved in a single operation. Unfortunately, many needless dump/load cycles would occur if this approach was adopted. Many exchange jumps into the real monitor would not result in a change in the 74 virtual machine, and would end with the same machine being loaded and execution resumed. These wasted dump/load cycles would harm system performance in two ways. First, both the dump and the load involve the transfer of a sizable amount of information between the PP's and central memory, and thus would reduce the memory bandwidth available to the rest of the system. Secondly, while the dump operation could presumably be overlapped in time with execution of the CPU monitor program, it is doubtful that the load operation could be. Thus, the duration of the load, which due to the number of memory references required is not small, would have to be added to the amount of time required for a task switch. This would have a severe impact on the efficiency of the system software. The alternative of letting the CPU monitor invoke the dump operation through execution of a new instruction seems to avoid these difficulties. The major disadvantage of this approach is that due to the delayed start of the dump Operation (as compared with starting it with the monitor exchange jump), the monitor may have to wait for the dump to be completed before beginning the loading of a new virtual machine. However, the duration of this delay could probably be minimized through proper design of the monitor program. 75 If a virtual machine's PP's are not automatically dumped when a real CPU exchange jump occurs, that exchange jump leaves the virtual system with its PP's still executing, but its CPU effectively stalled. Most operating systems that might be run on the virtual machine could tolerate this state of affairs, but some, especially those that assume that the CPU immediately responds to a PP-initiated exchange jump, might fail. Since we cannot make any assumptions about the software to be run on the virtual machine, and since it is more consistent with the architecture to do so, we will halt virtual PP execution when a real exchange jump takes the CPU away from the virtual machine, and resume PP execution when the CPU is returned. This can easily be done by allowing virtual PP execution only when the CPU's virtual mode flag is set. Also to be decided is how the hardware locates the area in memory to/from which the dump/load transfer is to be done. Since the Operation is to be initiated by a CPU instruction, it seems natural for that instruction to specify the memory block as one of its Operands. The format of the load/dump block is not a major concern. The space occupied by one PP's record should probably be a integer number of central memory words to simplify the job of any program that must analyze these records. The data to be included for each PP are the PP memory (4096 12-bit 76 bytes), the contents of the P register (12 bits) and the A register (18 bits). For each virtual machine, the state of the I/O channels (2 bits for active and full flags and 12 bits for contents per channel) and the contents of the Status and Control Register must be saved. 4.4.2 Central Memory Relocation/Protection The address space seen by the virtual PP should be the same seen by the CPU in virtual mode. Therefore it is natural to pass PP memory references through the protection/relocation mechanism already used by the CPU. Since virtual PP execution occurs only when the real CPU is running in virtual mode, this presents no implementation difficulties. 4.4.3 Virtual I/D A virtual machine requires I/O devices for input, output and mass storage. Since it is usually not practical to dedicate an entire real I/O device to a single virtual machine, these I/O devices must be virtual too. For example, the real operating system might split a real disk unit into many smaller virtual disks. Virtual unit-record I/O might be spooled onto the real system's mass storage. 77 (Note that there are occasions on which a virtual machine will operate dedicated real I/O devices. For example, a virtual machine might operate a graphic display unit or line concentrator just as it would if it was‘executing in real mode. Real devices dedicated to a virtual machine must not share the I/O channel with devices used by other virtual machines or the real monitor to avoid protection problems.) There are rather difficult problems to be overcome in providing virtual I/O, because the PP software that will be running on the virtual machine is rather intimately involved in 110 operations. Transactions over the PP's I/O channels are at a relatively low level, and the PP software enforces strict timing constraints. To ”translate" these transactions into a form that can be used by real I/O devices, the real system must intercept the transactions sent over the virtual channels, piece together the low-level transactions into higher-level operations, and finally convert these operations into a form that can be executed on the real machine. While the real system is performing these operations, it must simulate the timing that would be seen by the virtual PP if a real device was actually doing the I/O. There are several possible approaches to satisfying these requirements. The one most similar to methods used by 78 other architectures is to intercept I/O transactions sent over virtual channels, probably through a real PP external to the virtual machine's PP set. (The "interception" would resemble conventional inter-PP communication through channels.) The I/O transactions are then redirected by the "translator PP" to the real I/O device that is simulating the virtual one. 4.4.3.1 Translator PP's In considering the translator PP method, the first matter to be decided concerns channel numbers. As noted above, the easiest way for the translator PP to intercept the I/D transactions of the virtual PP is for it to simply read them off the channel. Unfortunately, the two PP's cannot use the same number for this channel. The virtual PP's will always do 110 on channels numbered in the low end of the possible range, in order to be compatable with existing software. For the same reason, real PP's will have to use these channel numbers for real IIO to external devices. Thus, the low-numbered virtual channels must be mapped by the hardware to high numbers for the real PP's. 79 Only one addition must be made to the standard channel design for this new usage. That is some way for the translator PP to determine whether a full state on the channel is due to a function or a data output. This could be accomplished by using one of the unallocated PP instruction opcodes for a conditional jump flag that tests the function flag. The dump operation discussed in Section 4.4.1 provides a way for the state of the PP portion of the virtual machine to be saved when the machine is removed from execution. Similarly, some way must be found to save the I/O state information maintained by the translator PP's. The most obvious method is for the real operating system to send a signal to the translator PP's (via one of the usual software mechanisms) when the virtual machine is to be suspended. The PP's would then write the information to part of the central memory block reserved for the status of this machine. Unfortunately, this method is very inefficient. The principal location of the inefficiencies is the translator PP. The software in this processor would have to constantly check for an idle-down request from the Operating system. This checking would slow down the time-critical code involved in simulating the virtual devices. 80 In order to avoid having to check for idle-down between every transaction with the virtual PP, the following idle-down scheme must be used. Hhen the real operating system wants to remove a virtual machine from execution, it sets a flag notfying the translator PP's working on that machine of the idle-down. As each translator reaches a convenient stopping point, it checks this flag, writes out the status of the virtual I/O devices it is handling to the machine state block, and acknowledges the idle-down request. The CPU and PP's of the virtual machine are not stopped until the last translator PP has acknowledged the idle-down. In this way, the virtual PP's can complete all I/O operations in progress. It would be attractive to do the saving of I/O status with hardware. Unfortunately, it is quite difficult for hardware to examine the memory of translator PP's and separate virtual machine state information, which must be saved and ultimately restored, from real machine state information, which need not be saved, and must not be restored. Checking the idle-down flag will slow down the time-dependent code in the translator PP. In addition, this code has the usual interpretation characteristic of requiring many instruction executions to simulate a single virtual operation. In order to make such a system 81 practical, it seems obvious that the processors performing as translator PP's must run at a much faster instruction rate than the processors performing as virtual PP's. This should be relatively easy to achieve, since the instruction timings on which existing software is based are the result of a long-outdated technology. The exact execution time ratio between the virtual PP's and the translator PP's is an implementation matter, but must probably be at least 10:1. 4.4.3.2 Intelligent I/D Controllers The falling cost of processing power has resulted in an increasing level of intelligence being available to peripheral controllers. This trend offers an attractive alternative to the translator PP scheme presented above. In this proposal, virtual PP's communicate directly with actual IIO controllers, who take over the task of translating the IIO requests into real IIO operations. There is no need for support from real PP's during virtual machine execution. The channels of the real machine are divided into two groups. The "fixed" channels are numbered and used identically to the current 6000 architecture, and are connected to all I/O devices. The "movable" channels are 82 used only for virtual IIO, and need only be connected to the virtual PP's. Normally each movable channel is connected only to one piece of equipment--one that has a virtual IIO capability. The movable channels are so named because, under control of the real operating system, they can be mapped to any of the virtual channels. This mapping is established when the configuration of the virtual machine is first specified. More than one movable channel may be mapped to the same virtual channel. Hhen the virtual machine is readied for execution, the real operating system loads the mapping between virtual and movable channels into the hardware, and sends the saved IIO state of the virtual machine to the controllers over the fixed channels. As the virtual machine executes, the controllers translate its virtual I/O requests into real IIO operations: when the virtual machine is to be removed from execution, an idle-down request is sent to the controllers over the fixed channels. Hhen the controllers finish the IIO operations in progress, they send the updated I/O state of the virtual machine back over the fixed channels, and prepare themselves for the next virtual machine. This scheme greatly improves the efficiency of virtual I/O operations over the translator PP method. Unfortunately, there is a significant hardware cost involved. First, of 83 course, there is the requirement for new device controllers, each containing a rather high level of processing power. In addition, there are many more channels to connect the mainframe with these controllers, and the mainframe must contain some sort of “switchboard" to connect the movable channels to the virtual PP channels. A possible implementation of the real channels that would minimize the impact of these requirements is based on a fiber-optic communication path. This path would serve as all of the fixed and movable channels through a time-division-multiplexed scheme, and would be daisy-chained to all of the IIO devices. The switcthard mentioned above would be implemented by sending to each controller the time-slot to which it should listen. Each fixed channel and each virtual channel would be assigned a fixed slot in which to broadcast and listen. 4.5 VIRTUAL DPERATOR'S CONSOLE CDC machines are rather unique in the industry in having a fully-interactive graphics terminal as their standard console device, rather than a simple teletypewriter. This presents problems when running virtual systems, each Of 84 which has its own virtual console, which must be accessible to the user of the system. If the user's console has graphics capabilities, then it ’ would probably be possible for the virtual system to translate the output of the PP driving the virtUal console (DSD) into a form that could instead drive the user's terminal. Since this would be rather constraining, and would require a high-bandwidth transmission path between the system and each user, there would probably also have to be software in the virtual system to allow any type of terminal to examine simulated display screens and generate type-in's. These requirements would add a significant amount of overhead to the system. (Note that this software would have to be responsive to the user even while the virtual machine is swapped out.) Another solution to this problem would require abandonment of the customary Operation of a virtual system, in which each user has his own virtual machine. Instead, there would be only a few virtual machines running on the real system, each supporting a number of users. Since there are so few virtual machines, each could be assigned its own real console, or a single console could be switched among several machines. The primary advantages of a virtual system (no need for operating system conversion, and easy checkout of system software) would be retained. 85 4.6 VIRTUAL DEADSTART Most of the implementation of deadstart on a virtual machine is straight-forward. The primary question that must be answered concerns how the virtual PP's are to be put into their deadstart state. One alternative would be to create a status block of a standard form which results in the desired behavior, and then loading this block into the PP's with the standard load operation. The alternative is a hardware operation that sets the PP's into the standard deadstart state, except that the channels read by the PP's are virtual ones, not real ones. The tradeoffs are fairly obvious. The first scheme takes advantage of existing features of the architecture, but requires extra central memory and CPU time for creation of the state block. The second scheme requires a small amount of additional hardware, but avoids the disadvantages of the first. It would almost certainly be faster, too. These trade-offs are so closely matched that the choice would have to be made on the basis of implementation considerations. Perhaps the architectural features implied by the second scheme could be made Optional, with the real operating system adapting to the features of the machine on which it is executing. QUERIEB--£I¥§ QQNEEU§IQN§ The virtual memory and virtual machine designs presented in the previous chapters were undertaken to evaluate the flexibility of the 6000 architecture. The results were mixed. Design of the virtual memory feature was relatively straight-forward. The only unusual aspects of the design were the result of the parallel Operation of the CPU's, and the presence of two different types of processors (CPU's and PP's) referencing the same virtual memory. However, neither of these unusual aspects made the resulting design unworkable. The trial implementation presented in Chapter Three confirmed this. The design of the virtual machine feature was less successful. The CPU portion of the 6000 architecture adapted easily to the requirements of such a system, but the PP's presented major problems. Most of these problems arose from the simplicity of the I/O channels connected to the PP's, and the resulting hardware dependence of the PP software. Two approaches to the problem (translator PP's, and the use of intelligent IIO controllers) were examined, 87 but both appeared to be unacceptably complex and expensive. Due to these problems, a trial implementation of the virtual machine feature was not attempted. The use of programmable peripheral processors to control IIO devices was an effective design choice in the early 1960's when the 6000 architecture was first developed. However, the presence of these highly hardware-dependent processors now poses the most serious obstacle to expansion of the architecture to take advantage of state-of-the-art technology. E [1" If" BENCEfi Fischer, Patrick C., and Robert L. Probert. Storage Reorganization Techniques for Matrix Computation in a Paging Environment. Communications of the ACM, 22, 7 (JULY, 1979), ppm ‘05-‘15- Control Data Corporation. antcgl-21§a-§l&8:199 Cementsc-§xstsm-flacdnace-8stccence-flaoual- Publication Number 60256000, Revision 5, 1973. Parmelee, R. P., T. I. Peterson, C. C. Tillman, and D. J. Hatfield. Virtual Storage and Virtual Machine Concepts. IBM Systems Journal, 11, 2, pp. 99-130, 1972. Digital Equipment Corporation. EDE:11L&§-E£9£:§§Q£ flandbssk- 1971- Roderick, David K. CPUMTR. Michigan State University Computer Laboratory 6000 SCOPE Memo 120, 1977. Members of the Instructional Staff of the Control Data Institute- 9999-§sciss-lotcoductieo-aod Es:inbscal-£cessssecs-Icaioins-naoual, Second Edition, Publication Number 60250400, 1968. Organick, Elliott Irving. 1hg_uultig§,§y§tem. MIT Press, 1972. Donovan, J. J. and S. E. Madnick. Virtual Machine Advantages in Security, Integrity, and Decision Support Systems. IBM Systems Journal, 15, 3, pp. 270-278, 1976. 89 HIGRN STRTE UIN V. IIIIIII IIIII3 III III IIIIIIIIIIIIIIIIIIII