Multiplexed Virtual Nodes in Emulab
Overview
In order to allow experiments with a very large number of nodes, we provide a multiplexed virtual node implementation. If an experiment application's CPU, memory and network requirements are modest, multiplexed virtual nodes (hereafter just "virtual nodes") allow an experiment to use 10-20 times as many nodes as there are available physical machines in Emulab. These virtual nodes can currently only run on FreeBSD 4.10, or on Linux (Fedora 8) using OpenVZ container-based virtualization.
Virtual nodes fall between simulated nodes (à la ns) and real, dedicated machines in terms of accuracy of modeling the real world. A virtual node is just a lightweight virtual machine running on top of a regular operating system. In particular, our virtual nodes are based on either FreeBSD's jail mechanism, or OpenVZ container-based virtualization on Linux. Both allow groups of processes to be isolated from each other while running on the same physical machine. Emulab virtual nodes provide isolation of the filesystem, process, network, and account namespaces. That is to say, each virtual node has its own private filesystem, process hierarchy, network interfaces and IP addresses, and set of users and groups. This level of virtualization allows unmodified applications to run as though they were on a real machine. Virtual network interfaces are used to form an arbitrary number of virtual network links. These links may be individually shaped and may be multiplexed over physical links or used to connect virtual nodes within a single physical node.
With some limitations, virtual nodes can act in any role that a normal Emulab node can: end node, router, or traffic generator. You can run startup commands, ssh into them, run as root, use tcpdump or traceroute, modify routing tables, and even reboot them. You can construct arbitrary topologies of links and LANs, even mixing virtual and real nodes.
The number of virtual nodes that can be multiplexed on a single physical node depends on a variety of factors including the resource requirements of the application, the type of the underlying node, the bandwidths of the links you are emulating and the desired fidelity of the emulation. See the Advanced Issues section for more info.
Use
Multiplexed virtual nodes are specified in an NS description by indicating that you want the pcvm node type:
set nodeA [$ns node]
tb-set-hardware $nodeA pcvm
or, if you want all virtual nodes to be mapped to the same machine type, say a pc850:
set nodeA [$ns node]
tb-set-hardware $nodeA pcvm850
that is, instead of "pcvm" use "pcvmN" where N is the node type (600, 850, 1500, 2000). That's it! With few exceptions, everything you use in an NS file for an Emulab experiment running on physical nodes will work with virtual nodes.
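For experiments with many nodes it can be convenient to generate these NS lines rather than type them. The following is a minimal sketch, not part of Emulab: the helper names `vnode_hardware` and `ns_node_lines` are hypothetical, and the only NS commands emitted are the two shown above.

```python
# Physical node types named in the text; plain "pcvm" leaves the choice
# of physical node type to the mapper.
KNOWN_PHYS_TYPES = (600, 850, 1500, 2000)


def vnode_hardware(phys_type=None):
    """Return the hardware type string for a virtual node (hypothetical helper)."""
    if phys_type is None:
        return "pcvm"
    if phys_type not in KNOWN_PHYS_TYPES:
        raise ValueError(f"unknown physical node type: {phys_type}")
    return f"pcvm{phys_type}"


def ns_node_lines(names, phys_type=None):
    """Build the NS lines that declare each named node as a virtual node."""
    hw = vnode_hardware(phys_type)
    lines = []
    for name in names:
        lines.append(f"set {name} [$ns node]")
        lines.append(f"tb-set-hardware ${name} {hw}")
    return lines


if __name__ == "__main__":
    # Print NS lines for two virtual nodes pinned to the pc850 type.
    print("\n".join(ns_node_lines(["nodeA", "nodeB"], 850)))
```

The generated text still has to be pasted into a complete NS file; this sketch only covers the node declarations.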
As a simple example, we could take the basic NS script (basic.ns) used in the tutorial, add the following lines:
tb-set-hardware $nodeA pcvm
tb-set-hardware $nodeB pcvm
tb-set-hardware $nodeC pcvm
tb-set-hardware $nodeD pcvm
and change the setting of the OS to one of:
# Set the OS on a couple.
tb-set-node-os $nodeA FBSD-JAIL
tb-set-node-os $nodeC OPENVZ-STD
and the resulting NS file can be submitted to produce the very same topology. Once the experiment has been instantiated, the experiment web page should include a listing of the reserved nodes that looks something like:
[Figure: listing of reserved nodes; see attachment vnodes-list.png]
By looking at the NodeIDs (pcvm36-NN), you can see that all four virtual nodes were assigned to the same physical node (pc36). (At the moment, control over virtual node to physical node mapping is limited. The Advanced Issues section discusses ways in which you can affect the mapping.) Clicking on the ssh icon will log you in to the virtual node. Virtual nodes do not have consoles, so there is no corresponding icon. Note that there is also an entry for the hosting physical node. You can log in to it as well, either with ssh or via the console. See the Advanced Issues section for how you can use the physical host. Finally, note that there is no delay node associated with the shaped link. This is because virtual links always use end node shaping.
Logging into a virtual node you see only the processes associated with your node:
PID TT STAT TIME COMMAND
1883 ?? SsJ 0:00.03 /usr/sbin/syslogd -ss
1890 ?? SsJ 0:00.01 /usr/sbin/cron
1892 ?? SsJ 0:00.28 /usr/sbin/sshd
1903 ?? IJ 0:00.01 /usr/bin/perl -w /usr/local/etc/emulab/watchdog start
5386 ?? SJ 0:00.04 sshd: mike@ttyp1 (sshd)
5387 p1 SsJ 0:00.06 -tcsh (tcsh)
5401 p1 R+J 0:00.00 ps ax
Standard processes include syslog, cron, and sshd along with the Emulab watchdog process. Note that the process IDs are in fact not virtualized; they are in the physical machine's name space. However, a virtual node still cannot kill a process that is part of another container.
Doing a df you see:
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/vn5c 507999 1484 496356 0% /
/var/emulab/jails/local/testbed 6903614 73544 6277782 1% /local/testbed
/users/mike 14081094 7657502 5297105 59% /users/mike
...
You will notice a private root filesystem, as well as the usual assortment of remote filesystems that all experiments in Emulab receive. Thus you have considerable flexibility in how files are shared: by all nodes (/users/yourname and /proj/projname), by all virtual nodes on a physical node (/local/projname), or private to a single virtual node (/local).
Doing ifconfig reveals:
fxp4: flags=8843 mtu 1500 rtabid 0
inet 172.17.36.5 netmask 0xffffffff broadcast 172.17.36.5
ether 00:d0:b7:14:0f:e2
media: Ethernet autoselect (100baseTX )
status: active
lo0: flags=8049 mtu 16384 rtabid 0
inet 127.0.0.1 netmask 0xff000000
veth3: flags=8843 mtu 1484 rtabid 5
inet 10.1.2.3 netmask 0xffffff00 broadcast 10.1.2.255
ether 00:00:0a:01:02:03
vethtag: 513 parent interface:
Here fxp4 is the control net interface. Due to limited routable IP address space, Emulab uses the 172.16/12 unroutable address range to assign control net addresses to virtual nodes. These addresses are routed within Emulab, but are not exposed externally. This means that you can access this node (including using the DNS name nodeC.vtest.testbed.emulab.net) from ops.emulab.net or from other nodes in your experiment, but not from outside Emulab. If you need to access a virtual node from outside Emulab, you will have to proxy the access via ops or a physical node (that is what the ssh icon in the web page does). veth3 is a virtual ethernet device (not part of standard FreeBSD, we wrote it at Utah) and is the experimental interface for this node. There will be one veth device for every experimental interface. Note the reduced MTU (1484) on the veth interface. This is because the veth device uses encapsulation to identify packets which are multiplexed on physical links. Even though this particular virtual link does not cross a physical wire, the MTU is reduced anyway so that all virtual links have the same MTU.
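The addressing in the ifconfig output above can be checked with Python's standard `ipaddress` module. This is only an illustrative sketch: the function name `is_vnode_control_addr` is hypothetical, and the addresses are the ones shown in the sample output.

```python
import ipaddress

# Unroutable range the text says is used for virtual node control net addresses.
CONTROL_NET = ipaddress.ip_network("172.16.0.0/12")


def is_vnode_control_addr(addr):
    """True if addr falls inside the 172.16/12 control net range."""
    return ipaddress.ip_address(addr) in CONTROL_NET


if __name__ == "__main__":
    # fxp4 address from the sample output: inside the control net range.
    print(is_vnode_control_addr("172.17.36.5"))
    # veth3 address: an experimental interface, not a control net address.
    print(is_vnode_control_addr("10.1.2.3"))
    # Netmask 0xffffff00 on veth3 is a /24; derive its broadcast address.
    veth3 = ipaddress.ip_interface("10.1.2.3/255.255.255.0")
    print(veth3.network, veth3.network.broadcast_address)
```

A check like this can help a script decide whether a node address is reachable directly or must be proxied via ops, as described above.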
Advanced Issues
Taking advantage of a virtual node host.
A physical node hosting one or more virtual nodes is not itself part of the topology; it exists only to host virtual nodes. However, the physical node is still set up with user accounts and shared filesystems just as a regular node is. Thus you can log in to and use the physical node in a variety of ways:
- Since the /usr file system for each node is mounted via a read-only loopback mount from the physical host, any files installed on a physical host's /usr will automatically be part of every virtual node as well. This allows for a potentially more efficient file distribution mechanism: install packages in the host's /usr and they are visible in the virtual nodes as well. Unfortunately, there is currently no "handle" for a virtual node host in the NS file, so you cannot install tarballs or RPMs on it as part of the experiment creation process. You must install them by hand after the experiment has been created, and reboot the virtual nodes. Thereafter, the packages will be available.
- The private root filesystem for each virtual node is also accessible to the host node. Thus the host can monitor log files and even change files on the fly.
- Other forms of monitoring can be done as well since all processes, filesystems, network interfaces and routing tables are visible in the host. For instance, you can run tcpdump on a virtual interface outside the node rather than inside it. You can also tcpdump on a physical interface on which many virtual nodes' traffic is multiplexed. The installed version of tcpdump understands the veth encapsulation.
We should emphasize however, that virtual nodes are not "performance isolated" from each other or from the host; i.e., a big CPU hogging monitor application in the host might affect the performance and behavior of the hosted virtual nodes.
Controlling virtual node layout.
Normally, the Emulab resource mapper, assign, will map virtual nodes onto physical nodes in such a way as to achieve the best overall use of physical resources without violating any of the constraints of the virtual nodes or links. In a nutshell, it packs as many virtual nodes onto a physical node as it can without exceeding a node's internal or external network bandwidth capabilities and without exceeding a node-type specific static packing factor. Internal network bandwidth is an empirically derived value for how much network data can be moved through internally connected virtual ethernet interfaces. External network bandwidth is determined by the number of physical interfaces available on the node. The static packing factor is intended as a coarse metric of the CPU and memory load that a physical node can support; currently it is based strictly on the amount of physical memory in each node type. The current values for these constraints are:
- Internal network bandwidth: 400Mb/sec for all node types
- External network bandwidth: 400Mb/sec (4 x 100Mb NICs) for all node types
- Packing factor: 10 for pc600s and pc1500s, 20 for pc850s and pc2000s
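As a rough pre-check before submitting an experiment, the three limits listed above can be written down as a small table and tested against a proposed placement. This is a simplified sketch, not the assign algorithm; the function name `violations` and its arguments are hypothetical, and the numbers are copied from the list above.

```python
# Limits quoted in the list above (Mb/sec and nodes per host).
INTERNAL_BW_MBPS = 400
EXTERNAL_BW_MBPS = 400
PACKING_FACTOR = {"pc600": 10, "pc850": 20, "pc1500": 10, "pc2000": 20}


def violations(host_type, n_vnodes, internal_mbps, external_mbps):
    """List which static constraints a proposed placement on one host exceeds."""
    problems = []
    if n_vnodes > PACKING_FACTOR[host_type]:
        problems.append("packing factor")
    if internal_mbps > INTERNAL_BW_MBPS:
        problems.append("internal bandwidth")
    if external_mbps > EXTERNAL_BW_MBPS:
        problems.append("external bandwidth")
    return problems


if __name__ == "__main__":
    # 12 virtual nodes on a pc600 exceed its packing factor of 10.
    print(violations("pc600", 12, internal_mbps=100, external_mbps=0))
    # 20 virtual nodes on a pc850 with modest traffic fit all three limits.
    print(violations("pc850", 20, internal_mbps=200, external_mbps=100))
```

An empty list only means the static limits are respected; it says nothing about whether the application's CPU and memory load is acceptable.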
The mapper generally produces an "unsurprising" mapping of virtual nodes to physical nodes (e.g., mapping small LANs all on the same physical host) and where it doesn't, it is usually because doing so would violate one of the constraints. One exception involves LANs.
One might think that an entire 100Mb LAN, regardless of the number of members, could be located on a single physical host since the internal bandwidth of a host is 400Mb/sec. Alas, this is not the case. A LAN is modeled in Emulab as a set of point-to-point links to a "LAN node." The LAN node will then see 100Mb/sec from every LAN member. For the purposes of bandwidth allocation, a LAN node must be mapped to a physical host just as any other node. The difference is that a LAN node may be mapped to a switch, which has "unlimited" internal bandwidth, as well as to a node. Now consider the case of a 100Mb/sec LAN with 5 members. If the LAN node is colocated with the other nodes on the same physical host, it is a violation as 500Mb/sec of bandwidth is required for the LAN node. If instead the LAN node is mapped to a switch, it is still a violation because now we need 500Mb/sec from the physical node to the switch, but there is only 400Mb/sec available there as well. Thus you can only have 4 members of a 100Mb/sec LAN on any single physical host. You can however have 4 members on each of many physical hosts to form a large LAN; in this case the LAN node will be located on the switch. Note that this discussion applies equally to 8 members on a 50Mb/sec LAN, 20 members on a 20Mb/sec LAN, or any LAN where the aggregate bandwidth exceeds 400Mb/sec. And of course, you must take into consideration the bandwidth of all other links and LANs on a node. Now you know why we have a complex program to do this!
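The arithmetic in the paragraph above can be sketched as follows. The function names are hypothetical, and the sketch considers only the 400Mb/sec limit for a single LAN, ignoring the packing factor and any other links or LANs on the host.

```python
HOST_LIMIT_MBPS = 400  # internal and external limit quoted in the text


def max_lan_members_per_host(lan_mbps, host_limit_mbps=HOST_LIMIT_MBPS):
    """Largest number of members of one LAN that fit on a single host."""
    return host_limit_mbps // lan_mbps


def hosts_needed_for_lan(members, lan_mbps, host_limit_mbps=HOST_LIMIT_MBPS):
    """Minimum hosts for a LAN, by bandwidth alone (ceiling division)."""
    per_host = max_lan_members_per_host(lan_mbps, host_limit_mbps)
    return -(-members // per_host)


if __name__ == "__main__":
    for mbps in (100, 50, 20):
        print(mbps, "Mb/sec LAN:", max_lan_members_per_host(mbps), "members per host")
    # A 12-member 100Mb/sec LAN needs at least 3 hosts of 4 members each.
    print(hosts_needed_for_lan(12, 100))
```

The per-host results match the figures in the text: 4 members at 100Mb/sec, 8 at 50Mb/sec and 20 at 20Mb/sec.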
Anyway, if you are still not deterred and feel you can do a better job of virtual to physical node mapping yourself, there are a few ways to do this. Note carefully though that none of these will allow you to violate the bandwidth and packing constraints listed above.
The NS-extension tb-set-colocate-factor command allows you to globally decrease (not increase!) the maximum number of virtual nodes per physical node. This command is useful if you know the application load you are running in the vnodes is going to require more resources per instance (e.g., a Java DHT), and that the Emulab-picked values of 10-20 per physical node are just too high. Note that currently, this is not really a "factor," it is an absolute value. Setting it to 5 will reduce the capacity of all node types to 5, whether they were 10 or 20 by default.
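To see what a given colocate factor costs in physical machines, the effect described above can be modeled as an absolute cap that never raises the default. This is a sketch under that assumption; the helper names are hypothetical, and the defaults are the packing factors listed earlier.

```python
import math

# Default packing factors per node type, as listed earlier in this section.
DEFAULT_PACKING = {"pc600": 10, "pc850": 20, "pc1500": 10, "pc2000": 20}


def effective_capacity(host_type, colocate_factor=None):
    """Virtual nodes per host after applying the colocate factor as a cap."""
    default = DEFAULT_PACKING[host_type]
    if colocate_factor is None:
        return default
    # Assumption: the factor can lower the default but never raise it.
    return min(default, colocate_factor)


def min_hosts(n_vnodes, host_type, colocate_factor=None):
    """Lower bound on physical hosts, considering only the packing limit."""
    return math.ceil(n_vnodes / effective_capacity(host_type, colocate_factor))


if __name__ == "__main__":
    # 100 virtual nodes on pc850s: default packing vs. a colocate factor of 5.
    print(min_hosts(100, "pc850"), min_hosts(100, "pc850", 5))
```

The result is only a lower bound, since the bandwidth constraints may force the mapper to use more hosts.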
If the packing factor is ok, but assign just won't colocate virtual nodes the way you want, you can resort to trying to do the mapping by hand using tb-fix-node. This technique is not for the faint of heart (or weak of stomach) as it involves mapping virtual nodes to specific physical nodes, which you must determine in advance are available. For example, the following code snippet will allocate 8 nodes in a LAN and force them all onto the same physical host (pc41):
set phost pc41       ;# physical node to use
set phosttype 850    ;# type of physical node, e.g. pc850
# Force virtual nodes in a LAN to one physical host
set lanstr ""
for {set j 1} {$j <= 8} {incr j} {
    set n($j) [$ns node]
    append lanstr "$n($j) "
    tb-set-hardware $n($j) pcvm${phosttype}
    tb-fix-node $n($j) $phost
}
set lan [$ns make-lan "$lanstr" 10Mb 0ms]
If the host is not available, this will fail. Note again, that "fixing" nodes will still not allow you to violate any of the fundamental mapping constraints.
There is one final technique that will allow you to circumvent assign and the bandwidth constraints above. The NS-extension tb-set-noshaping can be used to turn off link shaping for a specific link or LAN, e.g.:
tb-set-noshaping $lan 1
added to the NS snippet above would allow you to specify "1Mb" for the LAN bandwidth and map 20 virtual nodes to the same physical host, but then not be bound by the bandwidth constraint later. In this way assign would map your topology, but no enforcement would be done at runtime. Specifically, this tells Emulab not to set up ipfw rules and dummynet pipes on the specified interfaces. One semi-legitimate use of this command is the case where you know that your applications will not exceed a certain bandwidth, and you don't want to incur the ipfw/dummynet overhead associated with explicitly enforcing the limits. Note that, as implied by the name, this turns off all shaping of a link, not just the bandwidth constraint. So if you need delays or packet loss, don't use this.
How do I know what the right colocate factor is?
The hardest issue when using virtual nodes is determining how many virtual nodes you can colocate on a physical node, without affecting the fidelity of the experiment. Ultimately, the experimenter must make this decision, based on the nature of the applications run and what exactly is being measured. We provide some simple limits (e.g., network bandwidth caps) and coarse-grained aggregate limits (e.g., the default colocation factor) but these are hardly adequate.
One thing to try is to allocate a modest sized version of your experiment, say 40-50 nodes, using just physical nodes and compare that to the same experiment with 40-50 virtual nodes with various packing factors.
We are currently working on techniques that will allow you to specify performance constraints, and have the experiment run and self-adjust until it reaches a packing factor that doesn't violate those constraints.
Mixing virtual and physical nodes.
It is possible to mix virtual nodes and physical nodes in the same experiment. For example, we could set up a LAN, similar to the above example, such that half the nodes were virtual (pcvm) and half physical (pc):
set lanstr ""
for {set j 1} {$j <= 8} {incr j} {
    set n($j) [$ns node]
    append lanstr "$n($j) "
    if {$j & 1} {
        tb-set-hardware $n($j) pcvm
    } else {
        tb-set-hardware $n($j) pc
        tb-set-node-os $n($j) FBSD-STD
    }
}
set lan [$ns make-lan "$lanstr" 10Mb 0ms]
We have also implemented a non-encapsulating version of the virtual ethernet interface that allows virtual nodes to talk directly to physical ethernet interfaces and thus removes the reduced-MTU restriction. To use the non-encapsulating version, put:
tb-set-encapsulate 0
in your NS file.
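The practical effect of encapsulation on packet sizes can be estimated with a little arithmetic. The sketch below assumes a standard 1500-byte Ethernet MTU when encapsulation is off, the 16-byte veth overhead mentioned under Limitations, and plain IPv4 and TCP headers of 20 bytes each with no options; the function names are hypothetical.

```python
ETHERNET_MTU = 1500        # MTU shown on the physical control interface
VETH_ENCAP_OVERHEAD = 16   # bytes used by veth encapsulation, per the text


def veth_mtu(encapsulate=True):
    """MTU of a veth interface with or without encapsulation."""
    return ETHERNET_MTU - (VETH_ENCAP_OVERHEAD if encapsulate else 0)


def tcp_mss(mtu, ip_header=20, tcp_header=20):
    """Largest TCP payload per packet, assuming headers without options."""
    return mtu - ip_header - tcp_header


if __name__ == "__main__":
    for encap in (True, False):
        mtu = veth_mtu(encap)
        print("encapsulate =", encap, "mtu =", mtu, "tcp mss =", tcp_mss(mtu))
```

With encapsulation the MTU matches the 1484 shown in the earlier ifconfig output.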
Limitations
Following are the primary limitations of the Emulab virtual node implementation.
- Not a complete virtualization of a node. We make no claims about being a true x86 or even BSD/Linux virtual machine. We build on existing mechanisms with the primary goal of providing functional transparency to applications. We are even more lax in that we assume that all virtual nodes on a physical host belong to the same experiment. This reduces the security concerns considerably. For example, if a virtual node is able to crash the physical machine or is able to see data outside its scope, it only affects the particular experiment. This is not to say that we are egregious in our violation. A particular example is that virtual nodes are allowed to read /dev/mem. This made it much easier as we did not have to either virtualize /dev/mem or rewrite lots of system utilities that use it. The consequence is that virtual nodes can spy on each other if they want. But then, if you cannot trust yourself, who can you trust?
- Not a complete virtualization of the network. This is another aspect of the previous bullet, but bears special note. While we have virtual interfaces and routing tables, much of the network stack of a physical host remains shared, in particular all the resources used by the higher level protocols. For example, all of the statistics reported by "netstat -s" are global to the node.
- No resource guarantees for CPU and memory on nodes. We also don't provide complete performance isolation. We currently have no virtual node aware CPU scheduling mechanisms. Processes in virtual nodes are just processes on the real machine and are scheduled according to the standard BSD scheduler. There are also no limits on virtual or physical memory consumption by a virtual node.
- Nodes must run a specific version of FreeBSD or Fedora. We have hacked the FreeBSD 4.10 kernel mightily to support virtual nodes. See the Jail wiki page (wiki:Jail) for details, but suffice it to say, making these changes to other versions of FreeBSD would be a huge task. The OpenVZ version runs on Fedora 8 and we do plan to keep that reasonably up to date.
- Will only scale to low 1000s of nodes. We currently have a number of scaling issues that make it impractical to run experiments of more than 1000-2000 nodes. These range from algorithmic issues in the resource mapper and route calculator, to physical issues like too few and too feeble physical nodes, to user interface issues like how to present a listing or visualization of thousands of nodes in a useful way.
- Virtual nodes are not externally visible. Due to a lack of routable IP space, virtual nodes are given non-routable control net addresses and thus cannot be accessed directly from outside Emulab. You must use a suitable proxy or access them from the Emulab user-login server.
- Virtual ethernet encapsulation reduces the MTU. This is a detail, but of possible importance to people since they are doing network experiments. By default, the veth device reduces the MTU by 16 bytes to 1484. As mentioned, we have a version of the interface which does not use encapsulation.
- Only 400Mb of internal "network" bandwidth. This falls in the rinky-dink node category. As most of our nodes are based on ancient 100MHz FSB, sub-GHz technology, they cannot host many virtual nodes or high capacity virtual links. The next wave of cluster machines will be much better in this regard.
- No node consoles. Virtual nodes do not have a virtual console. If we discover a need for one, we will implement it.
- Must use "linkdelays." To enable topology-on-a-single-node configurations and to conserve physical resources in the face of large topologies, we use on-node traffic shaping rather than dedicated traffic shaping nodes. This increases the overhead on the host machine slightly. To improve the fidelity of delays and bandwidth shaping, virtual node hosts run their kernel at 1000Hz rather than 100Hz. One potentially serious side-effect of vnode traffic shaping, and linkdelays in general, is that dummynet on FreeBSD induces a minimum one clock tick (1ms for 1000Hz kernel) delay for any form of traffic shaping. For example, if you had 10 machines connected point to point in a "line", you would incur a 10ms delay from one end to the other, even if you were only shaping the bandwidth of the links.
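The minimum delay added by linkdelays, described in the last item above, scales with the number of shaped hops and the kernel clock rate. The following sketch takes the text's example at face value as ten shaped hops; the function names are hypothetical, and only the one-tick-per-hop floor is modeled.

```python
def tick_ms(kernel_hz):
    """Length of one kernel clock tick in milliseconds."""
    return 1000.0 / kernel_hz


def min_added_delay_ms(shaped_hops, kernel_hz=1000):
    """Lower bound on delay added by shaping: one clock tick per shaped hop."""
    return shaped_hops * tick_ms(kernel_hz)


if __name__ == "__main__":
    # Ten shaped hops on a 1000Hz virtual node host kernel.
    print(min_added_delay_ms(10))
    # The same path on a stock 100Hz kernel, for comparison.
    print(min_added_delay_ms(10, kernel_hz=100))
```

This floor applies even when only bandwidth is being shaped, so it is worth subtracting from end-to-end latency measurements taken across many virtual links.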
Known Bugs
There is currently a problem with the "loopback" (nullfs) filesystem mechanism we use to export filesystems to virtual nodes. It is prone to deadlock under load. To be safe, you should do all your logging and heavy file activity inside the "file" disk (e.g., in /var).
Technical Details
There is an online document covering some of the details of the FreeBSD implementation of virtual nodes. There is a more detailed document in the Emulab source code in the file doc/vnode-impl.txt.
Attachments
- vnodes-list.png (6.5 kB) - added by stoller on 06/24/08 12:36:31.
- vnode-example.ns (0.6 kB) - added by stoller on 06/24/08 12:39:23.
