Physical and Virtual Memory

 


 

Contents

  1. Overview
  2. Storage Access Patterns
  3. Useful Websites
  4. Virtual Memory Performance Implications
  5. Worst Case Performance Scenario
  6. Best Case Performance Scenario
  7. Basic Virtual Memory Concepts
  8. Virtual Memory in Simple Terms
  9. Backing Store — the Central Tenet of Virtual Memory
  10. Red Hat Linux-Specific Information
  11. The Storage Spectrum
  12. CPU Registers
  13. Cache Memory
  14. Main Memory — RAM
  15. Hard Drives
  16. Off-Line Backup Storage
  17. Virtual Memory: the Details
  18. Page Faults
  19. The Working Set
  20. Swapping

 


Overview

All present-day, general-purpose computers are of the type known as stored program computers. As the name implies, stored program computers can load instructions (the building blocks of programs) into some type of internal storage and can subsequently execute those instructions.

Stored program computers also use the same storage for data. This is in contrast to computers that use their hardware configuration to control their operation (such as older plugboard-based computers).

The place where programs were stored on the first stored program computers went by a variety of names and used a variety of different technologies, from spots on a cathode ray tube, to pressure pulses in columns of mercury. Fortunately, present-day computers use technologies with greater storage capacity and much smaller size than ever before.

 


Storage Access Patterns

One thing to keep in mind throughout this chapter is that computers tend to access storage in certain ways. In fact, most storage access tends to exhibit one (or both) of the following attributes: access tends to be sequential, and access tends to be localized.

Let us look at these points in a bit more detail.

Sequential access means that, if address N is accessed by the CPU, it is highly likely that address N+1 will be accessed next. This makes sense, as most programs consist of large sections of instructions that execute one after the other.

Localized access means that, if address X is accessed, it is likely that other addresses surrounding X will also be accessed in the future.

These attributes are crucial, because they allow smaller, faster storage to effectively buffer larger, slower storage. This is the basis for implementing virtual memory. But before we can discuss virtual memory, we must look at the various storage technologies currently in use.
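As a rough illustration of why these attributes matter, the following Python sketch counts how often a small, fast buffer already holds the address being requested. It is not part of the original text; the block size, the number of buffered blocks, and the two access traces are arbitrary assumptions chosen only to make the effect visible.

```python
from collections import OrderedDict
import random

def hit_rate(addresses, block_size=16, max_blocks=4):
    """Fraction of accesses served from a small buffer of recently used blocks."""
    buffer = OrderedDict()              # block number -> None, oldest entry first
    hits = 0
    for addr in addresses:
        block = addr // block_size      # neighbouring addresses share a block
        if block in buffer:
            hits += 1
            buffer.move_to_end(block)   # mark block as most recently used
        else:
            buffer[block] = None
            if len(buffer) > max_blocks:
                buffer.popitem(last=False)   # evict the least recently used block
    return hits / len(addresses)

# Sequential trace: address N is followed by N+1.
sequential = list(range(1024))

# Scattered trace: no sequential or localized behavior at all.
rng = random.Random(0)
scattered = [rng.randrange(1_000_000) for _ in range(1024)]

sequential_rate = hit_rate(sequential)
scattered_rate = hit_rate(scattered)
print(f"sequential: {sequential_rate:.2%}  scattered: {scattered_rate:.2%}")
```

With the sequential trace only the first access to each 16-address block misses, so a buffer of just four blocks serves the great majority of requests. The scattered trace gets almost no benefit from the same buffer.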

 


Useful Websites

 


Virtual Memory Performance Implications

While virtual memory makes it possible for computers to more easily handle larger and more complex applications, as with any powerful tool, it comes at a price. The price in this case is one of performance — a virtual memory operating system has a lot more to do than an operating system that is not capable of virtual memory. This means that performance is never as good with virtual memory as it is when the same application is 100% memory-resident.

However, this is no reason to throw up one's hands and give up. The benefits of virtual memory are too great to do that. And, with a bit of effort, good performance is possible. The thing that must be done is to look at the system resources that are impacted by heavy use of the virtual memory subsystem.

 


Worst Case Performance Scenario

For a moment, take what you have read in this chapter and consider what system resources are used by extremely heavy page fault and swapping activity: RAM, disk I/O bandwidth, and CPU.

The interrelated nature of these loads makes it easy to see how resource shortages can lead to severe performance problems.

All it takes is a system with too little RAM, heavy page fault activity, and a system running near its limit in terms of CPU or disk I/O. At this point, the system is thrashing, with performance rapidly decreasing.

 


Best Case Performance Scenario

At best, virtual memory presents only a minimal additional load to a well-configured system.

From this, the overall point to keep in mind is that the performance impact of virtual memory is minimal when it is used as little as possible. This means that the primary determinant of good virtual memory subsystem performance is having enough RAM.

Next in line (but much lower in relative importance) are sufficient disk I/O and CPU capacity. However, these resources only help the system performance degrade more gracefully from heavy faulting and swapping; they do little to help the virtual memory subsystem performance (although they obviously can play a major role in overall system performance).

A reasonably active system always experiences some page faults, if for no other reason than because a newly-launched application experiences page faults as it is first brought into memory.

 


Basic Virtual Memory Concepts

While the technology behind the construction of the various modern-day storage technologies is truly impressive, the average system administrator does not need to be aware of the details. In fact, there is really only one fact that system administrators should always keep in mind:

There is never enough RAM.

While this truism might at first seem humorous, many operating system designers have spent a great deal of time trying to reduce the impact of this very real shortage. They have done so by implementing virtual memory — a way of combining RAM with slower storage to give the system the appearance of having more RAM than is actually installed.

 


Virtual Memory in Simple Terms

Let us start with a hypothetical application. The machine code making up this application is 10000 bytes in size. It also requires another 5000 bytes for data storage and I/O buffers. This means that, in order to run this application, there must be 15000 bytes of RAM available; even one byte less, and the application will not be able to run.

This 15000 byte requirement is known as the application's address space. It is the number of unique addresses needed to hold both the application and its data. In the first computers, the amount of available RAM had to be greater than the address space of the largest application to be run; otherwise, the application would fail with an "out of memory" error.

A later approach known as overlaying attempted to alleviate the problem by allowing programmers to dictate which parts of their application needed to be memory-resident at any given time. In this way, code that was only required once for initialization purposes could be overlayed with code that would be used later. While overlays did ease memory shortages, overlaying was a very complex and error-prone process. Overlays also failed to address the issue of system-wide memory shortages at runtime. In other words, an overlayed program may require less memory to run than a program that is not overlayed, but if the system still does not have sufficient memory for the overlayed program, the end result is the same: an out of memory error.

Virtual memory turns the concept of an application's address space on its head. Rather than concentrating on how much memory an application needs to run, a virtual memory operating system continually attempts to find the answer to the question, "how little memory does an application need to run?"

While it at first appears that our hypothetical application requires the full 15000 bytes to run, think back to our discussion in Storage Access Patterns: memory access tends to be sequential and localized. Because of this, the amount of memory required to execute the application at any given time is less than 15000 bytes, and usually a lot less. Consider the types of memory accesses that would be required to execute a single machine instruction: the instruction is read from memory, the data required by the instruction is read from memory, and, after the instruction completes, its results are written back to memory.

The actual number of bytes necessary for each memory access varies according to the CPU's architecture, the actual instruction, and the data type. However, even if one instruction required 100 bytes of memory for each type of memory access, the 300 bytes required is still a lot less than the application's 15000-byte address space. If a way could be found to keep track of an application's memory requirements as the application runs, it would be possible to keep that application running while using less than its address space.

But that leaves one question:

If only part of the application is in memory at any given time, where is the rest of it?

 


Backing Store — the Central Tenet of Virtual Memory

The short answer to this question is that the rest of the application remains on disk. This might at first seem to be a very large performance problem in the making — after all, disk drives are so much slower than RAM.

While this is true, it is possible to take advantage of the sequential and localized access behavior of applications and eliminate most of the performance implications of using disk drives as backing store for RAM. This is done by structuring the virtual memory subsystem so that it attempts to ensure that those parts of the application that are currently needed — or likely to be needed in the near future — are kept in RAM only for as long as they are needed.

This is similar to the relationship between cache and RAM: making a little fast storage and a lot of slow storage look like a lot of fast storage.

 


Red Hat Linux-Specific Information

Due to the inherent complexity of being a demand-paged virtual memory operating system, monitoring memory-related resources under Red Hat Linux can be confusing. Therefore, it is best to start with the more straightforward tools, and work from there.

Using free, it is possible to get a concise (if somewhat simplistic) overview of memory and swap utilization. Here is an example:

 

             total       used       free     shared    buffers     cached
Mem:       1288720     361448     927272          0      27844     187632
-/+ buffers/cache:     145972    1142748
Swap:       522104          0     522104
      

We can see that this system has 1.2GB of RAM, of which only about 350MB is actually in use. As expected for a system with this much free RAM, none of the 500MB swap partition is in use.

Contrast that example with this one:

 

             total       used       free     shared    buffers     cached
Mem:        255088     246604       8484          0       6492     111320
-/+ buffers/cache:     128792     126296
Swap:       530136     111308     418828
      

This system has about 256MB of RAM, the majority of which is in use, leaving only about 8MB free. Over 100MB of the 512MB swap partition is in use. Although this system is certainly more limited in terms of memory than the first system, to see if this memory limitation is causing performance problems we must dig a bit deeper.
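The "-/+ buffers/cache" row in both examples can be reproduced from the "Mem:" row with simple arithmetic. The short Python sketch below shows the calculation; the function name is our own, and the relationship is inferred from the two samples above rather than taken from any free documentation.

```python
def buffers_cache_line(used, free, buffers, cached):
    """Reproduce the '-/+ buffers/cache' row from the 'Mem:' row (values in KB)."""
    used_by_applications = used - buffers - cached      # memory not reclaimable as buffers/cache
    available_to_applications = free + buffers + cached  # free memory plus reclaimable memory
    return used_by_applications, available_to_applications

# Values copied from the two free examples above.
first = buffers_cache_line(used=361448, free=927272, buffers=27844, cached=187632)
second = buffers_cache_line(used=246604, free=8484, buffers=6492, cached=111320)

print(first)    # matches the first example's "-/+ buffers/cache" row
print(second)   # matches the second example's "-/+ buffers/cache" row
```

Seen this way, the second system is less starved than its 8MB "free" figure suggests, since roughly 126MB could be reclaimed from buffers and cache if applications needed it.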

Although more cryptic than free, vmstat has the benefit of displaying more than memory utilization statistics. Here is the output from vmstat 1 10:

 

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  0  0 111304   9728   7036 107204   0   0     6    10  120    24  10   2  89
 2  0  0 111304   9728   7036 107204   0   0     0     0  526  1653  96   4   0
 1  0  0 111304   9616   7036 107204   0   0     0     0  552  2219  94   5   1
 1  0  0 111304   9616   7036 107204   0   0     0     0  624   699  98   2   0
 2  0  0 111304   9616   7052 107204   0   0     0    48  603  1466  95   5   0
 3  0  0 111304   9620   7052 107204   0   0     0     0  768   932  90   4   6
 3  0  0 111304   9440   7076 107360  92   0   244     0  820  1230  85   9   6
 2  0  0 111304   9276   7076 107368   0   0     0     0  832  1060  87   6   7
 3  0  0 111304   9624   7092 107372   0   0    16     0  813  1655  93   5   2
 2  0  2 111304   9624   7108 107372   0   0     0   972 1189  1165  68   9  23
      

During this 10-second sample, the amount of free memory (the free field) varies somewhat, and there is a bit of swap-related I/O (the si and so fields), but overall this system is running well. It is doubtful, however, how much additional workload it could handle, given the current memory utilization.
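When a sample is longer than a screenful, it can help to summarize it programmatically. The following Python sketch parses the ten data rows shown above and totals the swap columns; the field names are simply the column headings from that output, and the parsing approach is an illustration rather than a standard tool.

```python
# The ten data rows from the vmstat 1 10 output above.
SAMPLE = """\
 2  0  0 111304   9728   7036 107204   0   0     6    10  120    24  10   2  89
 2  0  0 111304   9728   7036 107204   0   0     0     0  526  1653  96   4   0
 1  0  0 111304   9616   7036 107204   0   0     0     0  552  2219  94   5   1
 1  0  0 111304   9616   7036 107204   0   0     0     0  624   699  98   2   0
 2  0  0 111304   9616   7052 107204   0   0     0    48  603  1466  95   5   0
 3  0  0 111304   9620   7052 107204   0   0     0     0  768   932  90   4   6
 3  0  0 111304   9440   7076 107360  92   0   244     0  820  1230  85   9   6
 2  0  0 111304   9276   7076 107368   0   0     0     0  832  1060  87   6   7
 3  0  0 111304   9624   7092 107372   0   0    16     0  813  1655  93   5   2
 2  0  2 111304   9624   7108 107372   0   0     0   972 1189  1165  68   9  23
"""

FIELDS = "r b w swpd free buff cache si so bi bo in cs us sy id".split()

# One dictionary per sample line, keyed by column heading.
rows = [dict(zip(FIELDS, map(int, line.split()))) for line in SAMPLE.splitlines()]

swap_in_total = sum(row["si"] for row in rows)
swap_out_total = sum(row["so"] for row in rows)
free_low = min(row["free"] for row in rows)
free_high = max(row["free"] for row in rows)

print(f"si total: {swap_in_total}, so total: {swap_out_total}")
print(f"free memory ranged from {free_low} to {free_high} KB")
```

A single burst of swap-in activity and a free-memory range of a few hundred kilobytes is consistent with the reading given above: the system is tight on memory but not in trouble.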

When researching memory-related issues, it is often necessary to see how the Red Hat Linux virtual memory subsystem is making use of system memory. By using sar, it is possible to look at this aspect of system performance in much more detail.

By reviewing the sar -r report, we can look more closely at memory and swap utilization:

 

Linux 2.4.18-18.8.0smp (raptor.example.com)     12/16/2002

12:00:01 AM kbmemfree kbmemused  %memused kbmemshrd kbbuffers  kbcached
12:10:00 AM    240468   1048252     81.34         0    133724    485772
12:20:00 AM    240508   1048212     81.34         0    134172    485600
…
08:40:00 PM    934132    354588     27.51         0     26080    185364
Average:       324346    964374     74.83         0     96072    467559
      

The kbmemfree and kbmemused fields show the typical free and used memory statistics, with the percentage of memory used displayed in the %memused field. The kbbuffers and kbcached fields show how many kilobytes of memory are allocated to buffers and the system-wide data cache.
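The %memused column appears to be nothing more than kbmemused expressed as a percentage of kbmemfree plus kbmemused. The Python sketch below checks that assumption against two rows of the report above; the function name is our own.

```python
def percent_mem_used(kbmemfree, kbmemused):
    """Used memory as a percentage of total (free + used) memory."""
    total = kbmemfree + kbmemused
    return 100.0 * kbmemused / total

# Rows taken from the sar -r report above.
midnight = percent_mem_used(kbmemfree=240468, kbmemused=1048252)
evening = percent_mem_used(kbmemfree=934132, kbmemused=354588)

print(f"{midnight:.2f}")   # compare with the 12:10:00 AM row
print(f"{evening:.2f}")    # compare with the 08:40:00 PM row
```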

The kbmemshrd field is always zero for systems (such as Red Hat Linux) that use the 2.4 Linux kernel.

The lines for this report have been truncated to fit on the page. Here is the remainder of each line, with the timestamp added to the left to make reading easier:

 

12:00:01 AM   kbswpfree kbswpused  %swpused
12:10:00 AM      522104         0      0.00
12:20:00 AM      522104         0      0.00
…
08:40:00 PM      522104         0      0.00
Average:         522104         0      0.00
      

For swap utilization, the kbswpfree and kbswpused fields show the amount of free and used swap space, in kilobytes, with the %swpused field showing the swap space used as a percentage.

To learn more about the swapping activity taking place, use the sar -W report. Here is an example:

 

Linux 2.4.18-18.8.0 (pigdog.example.com)     12/17/2002

12:00:01 AM  pswpin/s pswpout/s
12:10:01 AM      0.15      2.56
12:20:00 AM      0.00      0.00
…
03:30:01 PM      0.42      2.56
Average:         0.11      0.37
      

Here we can see that, on average, roughly one third as many pages were brought in from swap (pswpin/s) as were sent out to swap (pswpout/s).

To better understand how pages are being used, refer to the sar -B report:

 

Linux 2.4.18-18.8.0smp (raptor.example.com)     12/16/2002

12:00:01 AM  pgpgin/s pgpgout/s  activepg  inadtypg  inaclnpg  inatarpg
12:10:00 AM      0.03      8.61    195393     20654     30352     49279
12:20:00 AM      0.01      7.51    195385     20655     30336     49275
…
08:40:00 PM      0.00      7.79     71236      1371      6760     15873
Average:       201.54    201.54    169367     18999     35146     44702
      

Here we can view how many blocks per second are paged in from disk (pgpgin/s) and paged out to disk (pgpgout/s). These statistics serve as a barometer of overall virtual memory activity.

However, more knowledge can be gained by looking at the other fields in this report. The Red Hat Linux kernel marks all pages as either active or inactive. As the names imply, active pages are currently in use in some manner (as process or buffer pages, for example), while inactive pages are not. This example report shows that the list of active pages (the activepg field) averages approximately 660MB [1].
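The 660MB figure follows from the average activepg value and the 4096-byte x86 page size noted in this chapter. A minimal Python sketch of the conversion (the function name is our own, and "MB" is taken here to mean 1024 × 1024 bytes):

```python
PAGE_SIZE = 4096  # bytes per page on x86, as noted in this chapter

def pages_to_mb(pages, page_size=PAGE_SIZE):
    """Convert a page count to megabytes (1 MB = 1024 * 1024 bytes)."""
    return pages * page_size / (1024 * 1024)

average_active_mb = pages_to_mb(169367)   # Average: row, activepg column
print(f"{average_active_mb:.1f} MB")
```

The same conversion can be applied to the inadtypg, inaclnpg, and inatarpg columns, since they are also expressed in pages.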

The remainder of the fields in this report concentrate on the inactive list — pages that, for one reason or another, have not recently been used. The inadtypg field shows how many inactive pages are dirty (modified) and may need to be written to disk. The inaclnpg field, on the other hand, shows how many inactive pages are clean (unmodified) and do not need to be written to disk.

The inatarpg field represents the desired size of the inactive list. This value is calculated by the Linux kernel and is sized such that the inactive list remains large enough to act as a reasonable pool for page replacement purposes.

For additional insight into page status (specifically, how often pages change status), use the sar -R report. Here is a sample report:

 

Linux 2.4.18-18.8.0smp (raptor.example.com)     12/16/2002

12:00:01 AM   frmpg/s   shmpg/s   bufpg/s   campg/s
12:10:00 AM     -0.10      0.00      0.12     -0.07
12:20:00 AM      0.02      0.00      0.19     -0.07
…
08:50:01 PM     -3.19      0.00      0.46      0.81
Average:         0.01      0.00     -0.00     -0.00
      

The statistics in this particular sar report are unique, in that they may be positive, negative, or zero. When positive, the value indicates the rate at which pages of this type are increasing. When negative, the value indicates the rate at which pages of this type are decreasing. A value of zero indicates that pages of this type are neither increasing nor decreasing.

In this example, the last sample shows slightly over three pages per second being allocated from the list of free pages (the frmpg/s field) and nearly 1 page per second added to the page cache (the campg/s field). The list of pages used as buffers (the bufpg/s field) gained approximately one page every two seconds, while the shared memory page list (the shmpg/s field) neither gained nor lost any pages.

[1] The page size under Red Hat Linux on the x86 architecture is 4096 bytes.

 


The Storage Spectrum

Present-day computers actually use a variety of storage technologies. Each technology is geared toward a specific function, with speeds and capacities to match. These technologies are: CPU registers, cache memory, RAM, hard drives, and off-line backup storage.

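In terms of capabilities and cost, these technologies form a spectrum. For example, CPU registers are very fast, very low in capacity, very limited in their expansion capabilities, and expensive per byte.

However, at the other end of the spectrum, off-line backup storage is very slow, very high in capacity, essentially unlimited in its expansion capabilities, and very inexpensive per byte.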

By using different technologies with different capabilities, it is possible to fine-tune system design for maximum performance at the lowest possible cost. The following sections explore each technology in the spectrum.

 


CPU Registers

Every present-day CPU design includes registers for a variety of purposes, from storing the address of the currently-executed instruction to more general-purpose data storage and manipulation. CPU registers run at the same speed as the rest of the CPU; otherwise, they would be a serious bottleneck to overall system performance. The reason for this is that nearly all operations performed by the CPU involve the registers in one way or another.

The number of CPU registers (and their uses) are strictly dependent on the architectural design of the CPU itself. There is no way to change the number of CPU registers, short of migrating to a CPU with a different architecture. For these reasons, the number of CPU registers can be considered a constant, as they are unchangeable without great pain.

 


Cache Memory

The purpose of cache memory is to act as a buffer between the very limited, very high-speed CPU registers and the relatively slower and much larger main system memory — usually referred to as RAM [1]. Cache memory has an operating speed similar to the CPU itself, so that when the CPU accesses data in cache, the CPU is not kept waiting for the data.

Cache memory is configured such that, whenever data is to be read from RAM, the system hardware first checks to see if the desired data is in cache. If the data is in cache, it is quickly retrieved, and used by the CPU. However, if the data is not in cache, the data is read from RAM and, while being transferred to the CPU, is also placed in cache (in case it will be needed again). From the perspective of the CPU, all this is done transparently, so that the only difference between accessing data in cache and accessing data in RAM is the amount of time it takes for the data to be returned.

In terms of storage capacity, cache is much smaller than RAM. Therefore, not every byte in RAM can have its own location in cache. As such, it is necessary to split cache up into sections that can be used to cache different areas of RAM, and to have a mechanism that allows each area of cache to cache different areas of RAM at different times. However, given the sequential and localized nature of storage access, a small amount of cache can effectively speed access to a large amount of RAM.

When writing data from the CPU, things get a bit more complicated. There are two different approaches that can be used. In both cases, the data is first written to cache. However, since the purpose of cache is to function as a very fast copy of the contents of selected portions of RAM, any time a piece of data changes its value, that new value must be written to both cache memory and RAM. Otherwise, the data in cache and the data in RAM will no longer match.

The two approaches differ in how this is done. One approach, known as write-through cache, immediately writes the modified data to RAM. Write-back cache, however, delays the writing of modified data back to RAM. The reason for doing this is to reduce the number of times a frequently-modified piece of data will be written back to RAM.

Write-through cache is a bit simpler to implement; for this reason it is most common. Write-back cache is a bit trickier to implement: in addition to storing the actual data, it is necessary to maintain some sort of mechanism that flags the cached data as clean (the data in cache is the same as the data in RAM), or dirty (the data in cache has been modified, meaning that the data in RAM is no longer current). Because of this, it is also necessary to implement a way of periodically flushing dirty cache entries back to RAM.
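To make the difference between the two write policies concrete, here is a deliberately simplified Python sketch. It is a toy model and not how real cache hardware is built: the class and method names are hypothetical, RAM is represented by a dictionary, and the cache has unlimited capacity, so eviction is not modeled at all.

```python
class Cache:
    """Toy cache in front of a dict that stands in for RAM."""

    def __init__(self, ram, write_back):
        self.ram = ram
        self.write_back = write_back
        self.lines = {}       # address -> cached value
        self.dirty = set()    # addresses changed in cache but not yet in RAM
        self.ram_writes = 0   # how many times RAM was actually written

    def read(self, addr):
        if addr not in self.lines:             # miss: fetch from RAM, keep a copy
            self.lines[addr] = self.ram[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value               # data is always written to cache first
        if self.write_back:
            self.dirty.add(addr)               # write-back: flag as dirty, update RAM later
        else:
            self._write_ram(addr, value)       # write-through: update RAM immediately

    def flush(self):
        """Write every dirty entry back to RAM and mark it clean."""
        for addr in sorted(self.dirty):
            self._write_ram(addr, self.lines[addr])
        self.dirty.clear()

    def _write_ram(self, addr, value):
        self.ram[addr] = value
        self.ram_writes += 1


def run(write_back):
    ram = {0: 0}
    cache = Cache(ram, write_back)
    for i in range(1, 101):     # modify the same location 100 times
        cache.write(0, i)
    cache.flush()
    return ram[0], cache.ram_writes

through = run(write_back=False)
back = run(write_back=True)
print("write-through (final value, RAM writes):", through)
print("write-back    (final value, RAM writes):", back)
```

Both policies leave RAM holding the same final value, but the write-back variant touches RAM only once for a frequently modified location, at the cost of tracking dirty entries and flushing them.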

 

Cache Levels

Cache subsystems in present-day computer designs may be multi-level; that is, there might be more than one set of cache between the CPU and main memory. The cache levels are often numbered, with lower numbers being closer to the CPU. Many systems have two cache levels, L1 and L2, with L1 cache being the closer of the two to the CPU.

Some systems (normally high-performance servers) also have L3 cache, which is usually part of the system motherboard. As might be expected, L3 cache would be larger (and most likely slower) than L2 cache.

In either case, the goal of all cache subsystems — whether single- or multi-level — is to reduce the average access time to the RAM.

 


Main Memory — RAM

RAM makes up the bulk of electronic storage on present-day computers. It is used as storage for both data and programs while those data and programs are in use. The speed of RAM in most systems today lies between the speeds of cache memory and that of hard drives, and is much closer to the former than the latter.

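The basic operation of RAM is actually quite straightforward. At the lowest level, there are the RAM chips, integrated circuits that do the actual "remembering." These chips have four types of connections to the outside world: power connections (to operate the circuitry within the chip), data connections (to transfer data into or out of the chip), read/write connections (to control whether data is stored into or retrieved from the chip), and address connections (to determine where in the chip the data is read or written).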

Here are the steps required to store data in RAM:

  1. The data to be stored is presented to the data connections.

  2. The address at which the data is to be stored is presented to the address connections.

  3. The read/write connection is set to write mode.

Retrieving data is just as simple:

  1. The address of the desired data is presented to the address connections.

  2. The read/write connection is set to read mode.

  3. The desired data is read from the data connections.

While these steps are simple, they take place at very high speeds, with the time spent at each step measured in nanoseconds.
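The two sequences of steps above can be mimicked with a small Python model. This is only a sketch: the RamChip class, its attribute names, and the strobe method (which stands in for the chip acting on whatever is presented to its connections) are hypothetical simplifications, and timing is ignored entirely.

```python
class RamChip:
    """Toy model of a RAM chip's data, address, and read/write connections."""

    def __init__(self, size):
        self.cells = [0] * size
        self.data = 0             # data connections
        self.address = 0          # address connections
        self.write_mode = False   # read/write connection

    def strobe(self):
        """Act on whatever is currently presented to the connections."""
        if self.write_mode:
            self.cells[self.address] = self.data
        else:
            self.data = self.cells[self.address]


chip = RamChip(size=256)

# Storing data: present the data, present the address, set write mode.
chip.data = 0x7F
chip.address = 0x2A
chip.write_mode = True
chip.strobe()

# Retrieving data: present the address, set read mode, read the data.
chip.data = 0             # clear the data connections to show the value is really read back
chip.address = 0x2A
chip.write_mode = False
chip.strobe()
value = chip.data
print(hex(value))
```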

Nearly all RAM chips created today are sold as modules. Each module consists of a number of individual RAM chips attached to a small circuit board. The mechanical and electrical layout of the module adheres to various industry standards, making it possible to purchase memory from a variety of vendors.

The main benefit to a system that uses industry-standard RAM modules is that it tends to keep the cost of RAM low, due to the ability to purchase the modules from more than just the system manufacturer.

Although most computers use industry-standard RAM modules, there are exceptions. Most notable are laptops (and even here some standardization is starting to take hold) and high-end servers. However, even in these instances, it is likely that you will be able to find third-party RAM modules, assuming the system is relatively popular and is not a completely new design.

 


Hard Drives

All the technologies that have been discussed so far are volatile in nature. In other words, data contained in volatile storage is lost when the power is turned off.

Hard drives, on the other hand, are non-volatile — the data they contain remains there, even after the power is removed. Because of this, hard drives occupy a special place in the storage spectrum. Their non-volatile nature makes them ideal for storing programs and data for longer-term use. Another unique aspect to hard drives is that, unlike RAM and cache memory, it is not possible to execute programs directly when they are stored on hard drives; instead, they must first be read into RAM.

Also different from cache and RAM is the speed of data storage and retrieval; hard drives are at least an order of magnitude slower than the all-electronic technologies used for cache and RAM. The difference in speed is due mainly to their electromechanical nature. Here are the four distinct phases that take place during each data transfer to/from a hard drive. The times shown reflect how long it would take a typical high-performance drive, on average, to complete each phase:

Of these, only the last phase is not dependent on any mechanical operation.

 


Off-Line Backup Storage

Off-line backup storage takes a step beyond hard drive storage in terms of capacity (higher) and speed (slower). Here, capacities are effectively limited only by your ability to procure and store the removable media.

The actual technologies used in these devices can vary widely. The more popular types are magnetic tape and optical disk.

Of course, having removable media means that access times become even longer, particularly when the desired data is on media that is not currently in the storage device. This situation is alleviated somewhat by the use of robotic devices to automatically load and unload media, but the media storage capacities of such devices are finite. Even in the best of cases, access times are measured in seconds, which is a far cry even from the slow multi-millisecond access times for a high-performance hard drive.

Now that we have briefly studied the various storage technologies in use today, let us explore basic virtual memory concepts.

[1] While "RAM" is an acronym for "Random Access Memory," and a term that could easily apply to any storage technology that allowed the non-sequential access of stored data, when system administrators talk about RAM they invariably mean main system memory.

 


Virtual Memory: the Details

First, we should introduce a new concept: virtual address space. As the term implies, the virtual address space is the program's address space — how much memory the program would require if it needed all the memory at once. But there is an important distinction; the word "virtual" means that this is the total number of uniquely-addressable memory locations required by the application, and not the amount of physical memory that must be dedicated to the application at any given time.

In the case of our example application, its virtual address space is 15000 bytes.

In order to implement virtual memory, it is necessary for the computer system to have special memory management hardware. This hardware is often known as an MMU (Memory Management Unit). Without an MMU, when the CPU accesses RAM, the actual RAM locations never change — memory address 123 is always the same physical location within RAM.

However, with an MMU, memory addresses go through a translation step prior to each memory access. This means that memory address 123 might be directed to physical address 82043 at one time, and physical address 20468 another time. As it turns out, the overhead of individually tracking the virtual to physical translations for billions of bytes of memory would be too great. Instead, the MMU divides RAM into pages — contiguous sections of memory of a set size that are handled by the MMU as single entities.

Keeping track of these pages and their address translations might sound like an unnecessary and confusing additional step, but it is, in fact, crucial to implementing virtual memory. For the reason why, consider the following point.

Taking our hypothetical application with the 15000 byte virtual address space, assume that the application's first instruction accesses data stored at address 12374. However, also assume that our computer only has 12288 bytes of physical RAM. What happens when the CPU attempts to access address 12374?

What happens is known as a page fault. Next, let us see what happens during a page fault.
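The address arithmetic behind this example can be sketched in a few lines of Python. The sketch assumes 4096-byte pages (the x86 page size noted in this chapter), and the page table contents are a hypothetical mapping invented for illustration; a real MMU performs this lookup in hardware.

```python
PAGE_SIZE = 4096   # assumed page size in bytes

class PageFault(Exception):
    """Raised when a virtual page has no translation."""

def split(virtual_address):
    """Return (virtual page number, offset within the page)."""
    return divmod(virtual_address, PAGE_SIZE)

def translate(virtual_address, page_table):
    """Map a virtual address to a physical address, or raise PageFault."""
    page, offset = split(virtual_address)
    if page not in page_table:
        raise PageFault(page)
    return page_table[page] * PAGE_SIZE + offset

# 12288 bytes of RAM is three 4096-byte page frames, while the 15000-byte
# virtual address space spans four pages (0 through 3).
page_table = {0: 2, 1: 0, 2: 1}   # hypothetical: virtual page -> physical frame

print(split(12374))                # which page and offset does address 12374 fall in?
print(translate(123, page_table))  # an address whose page is resident

try:
    translate(12374, page_table)
    faulted_page = None
except PageFault as fault:
    faulted_page = fault.args[0]
    print("page fault on virtual page", faulted_page)
```

Under these assumptions address 12374 lies 86 bytes into the fourth page, one page more than the three frames of physical RAM can hold at once, which is why the access cannot be satisfied without help from the operating system.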

 


Page Faults

First, the CPU presents the desired address (12374) to the MMU. However, the MMU has no translation for this address. So, it interrupts the CPU and causes software, known as a page fault handler, to be executed. The page fault handler then determines what must be done to resolve this page fault. It can:

While the first three actions are relatively straightforward, the last one is not. For that, we need to cover some additional topics.

 


The Working Set

The group of physical memory pages currently dedicated to a specific process is known as the working set for that process. The number of pages in the working set can grow and shrink, depending on the overall availability of pages on a system-wide basis.

The working set grows as a process page faults. The working set shrinks as fewer and fewer free pages exist. In order to keep from running out of memory completely, pages must be removed from processes' working sets and turned into free pages, available for later use. The operating system shrinks processes' working sets by writing modified pages out to swap space and by marking unmodified pages as free (there is no need to write these pages to disk, as they have not changed).

In order to determine appropriate working sets for all processes, the operating system must track usage information for all pages. In this way, the operating system can determine which pages are being actively used (and must remain memory resident) and which pages are not (and therefore, can be removed from memory). In most cases, some sort of least-recently used algorithm determines which pages are eligible for removal from process working sets.
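A least-recently used policy can be sketched as follows. This Python example is a simplified illustration for a single process with a fixed number of resident pages; it is not the algorithm the Linux kernel actually uses, and the reference string is made up.

```python
from collections import OrderedDict

def count_page_faults(references, frames):
    """Simulate LRU replacement for one process limited to `frames` resident pages."""
    resident = OrderedDict()    # page -> None, least recently used first
    faults = 0
    for page in references:
        if page in resident:
            resident.move_to_end(page)       # page was just used: most recent
            continue
        faults += 1                          # page fault: the page must be brought in
        if len(resident) >= frames:
            resident.popitem(last=False)     # remove the least recently used page
        resident[page] = None
    return faults

references = [0, 1, 2, 0, 1, 3, 0, 1, 2, 3]   # hypothetical sequence of page accesses

faults_with_three = count_page_faults(references, frames=3)
faults_with_four = count_page_faults(references, frames=4)
print(faults_with_three, faults_with_four)
```

With four frames every page faults only on first use. With three frames the frequently used pages 0 and 1 stay resident while pages 2 and 3 displace each other, so the fault count rises.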

 


Swapping

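While swapping (writing modified pages out to the system swap space) is a normal part of a system's operation, it is possible to experience too much swapping. The reason to be wary of excessive swapping is that the following situation can occur, over and over again: pages from a process are swapped out, the process attempts to access a swapped page, the page is faulted back into memory (most likely forcing other pages to be swapped out), and a short time later the page is swapped out again.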

If this sequence of events is widespread, it is known as thrashing and is indicative of insufficient RAM for the present workload. Thrashing is extremely detrimental to system performance, as the CPU and I/O loads that can be generated in such a situation can quickly outweigh the load imposed by a system's real work. In extreme cases, the system may actually do no useful work, spending all its resources moving pages to and from memory.
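How abruptly performance can collapse is easy to demonstrate with a small simulation. The Python sketch below assumes a process that cycles repeatedly through a fixed set of pages under least-recently used replacement; the page counts and number of passes are arbitrary, and real workloads are rarely this regular.

```python
def fault_rate(pages_needed, frames, passes=50):
    """Fraction of accesses that page fault when cycling through `pages_needed` pages."""
    resident = []                 # least recently used page first
    faults = accesses = 0
    for _ in range(passes):
        for page in range(pages_needed):     # the process touches each page in turn
            accesses += 1
            if page in resident:
                resident.remove(page)
            else:
                faults += 1
                if len(resident) == frames:
                    resident.pop(0)          # evict the least recently used page
            resident.append(page)            # page is now the most recently used
    return faults / accesses

enough_ram = fault_rate(pages_needed=8, frames=8)
one_page_short = fault_rate(pages_needed=8, frames=7)
print(f"8 frames: {enough_ram:.0%} of accesses fault")
print(f"7 frames: {one_page_short:.0%} of accesses fault")
```

In this contrived case, being a single page short turns an occasional fault into a fault on every access, because each page is evicted just before it is needed again.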