MOSIX Frequently Asked Questions - Flat listing
Copyright © 1999 - 2009 A. Barak. All rights reserved.
Question:
What is MOSIX
Answer:
MOSIX is a management system targeted for High Performance Computing (HPC)
on clusters and organizational grids with multiple clusters.
MOSIX incorporates automatic resource discovery and dynamic workload
distribution, commonly found on single computers with multiple processors.
More information can be found in the
About web page and
"The MOSIX2 Management System for Linux Clusters
and Organizational Grids" white paper.
Question:
Why this name
Answer:
MOSIX stands for a Multicomputer Operating System for UnIX.
MOSIX® is a registered
trademark of Amnon Barak and Amnon Shiloh.
Question:
Who is it suitable for
Answer:
MOSIX is suitable for running compute-intensive applications with
moderate amounts of I/O over fast, secure networks, in a trusted
environment (where all remote nodes are trusted),
e.g., as in private clusters and organizational grids.
Question:
What are the main benefits of MOSIX
Answer:
Users can login on any node and do not need to know where their
programs run.
In a MOSIX cluster/grid there is no need to modify or to link
applications with any library, copy files or login to remote nodes,
or even assign processes to different nodes, including nodes in
different clusters - it is all done automatically.
The outcome is ease of use, better utilization of resources and
near maximal performance.
Question:
How is this accomplished
Answer:
By a software layer that allows applications to run on
remote computers as if they were running locally.
Users can run their regular sequential and parallel applications
as if they were using one computer (node), while MOSIX automatically
(and transparently) seeks resources and migrates processes among
nodes to improve the overall performance.
This is accomplished by on-line algorithms that monitor the state
of the system-wide resources and the running processes, then,
whenever appropriate, initiate process migration to:
- Balance the load;
- Move processes from slower to faster nodes;
- Move processes from nodes that run out of free memory;
- Preserve long-running guest processes when clusters are
about to be disconnected from the grid.
Question:
Which hardware platforms are supported
Answer:
The latest production distribution of MOSIX runs on all x86-compatible
computers (both 32-bit and 64-bit architectures).
Question:
Which software platforms are supported
Answer:
The latest production distribution of MOSIX runs on Linux-2.6.
MOSIX can also run in Virtual Machines over most operating systems,
including Windows.
Question:
Is MOSIX a cluster or a grid technology
Answer:
Both.
MOSIX version 2 (MOSIX2) for Linux-2.6 can manage a cluster
as well as a multi-cluster organizational grid with several
homogeneous clusters.
MOSIX version 1 for Linux-2.4 can manage a single cluster.
Question:
Why must all remote nodes be trusted
Answer:
To ensure that migrated (guest) processes in a multi-cluster grid
are not tampered with while running in remote (hosting) clusters.
Note that guest processes run in a sandbox, which prevents such
processes from accessing local resources in the hosting nodes.
Question:
History of MOSIX
Answer:
The
History of MOSIX web page provides information about all the
versions of MOSIX.
Question:
MOSIX related papers and reports
Answer:
These can be found at this link.
Question:
What are the main features of MOSIX
Answer:
The main features are listed in the
MOSIX for Linux-2.6
and the
MOSIX for Linux-2.4
web pages.
Question:
What aspects of a single-system image are supported
Answer:
The main aspects are:
- Users can login on any node and do not need to know where
their programs run.
- No need to modify or link applications with special libraries.
- No need to copy files to remote nodes.
- Automatic resource discovery: whenever clusters or nodes
join or disconnect, all the active nodes are updated.
- Automatic workload distribution by process migration,
including load balancing, process migration from slower
to faster nodes and from nodes that run out of free memory.
- Preservation of the user's "login-node" run-time environment.
Question:
How does MOSIX support Virtual Organizations (VOs)
Answer:
A VO is a set of clusters (servers and workstations) whose owners
wish to share their computing resources from time to time
in a flexible way.
MOSIX2 provides the following features to manage VOs:
- Support of disruptive configurations:
clusters can join or leave the grid at any time.
- Clusters can be shared symmetrically or asymmetrically. For
example, the owner of cluster A can allow processes originating from
cluster B to move in but not processes originating from cluster C.
- A run-time priority for flexible use of nodes within and among
groups. For example, to partition a cluster among different users.
- Each cluster owner can assign priorities to processes from other clusters.
For example, the owner of cluster A can assign higher priority
to processes from cluster B and lower priority to processes from
cluster C. This way, when guest processes from cluster B wish to
move to cluster A, they will push out guest processes from cluster C
(if any).
- Local and higher priority processes force out lower priority processes.
- Migrated processes to/from a disconnecting cluster are moved
out/back, so that long-running migrated processes are not killed.
Question:
What is the architecture of a MOSIX configuration (cluster, grid)
Answer:
The architecture of a MOSIX configuration is homogeneous:
all nodes must be x86-based and run (nearly) the same version of MOSIX
(see the question about mixing different versions of MOSIX).
However, individual nodes may have a different number of processors (cores),
different speeds, different memory sizes or I/O devices.
Question:
Which types of processes are available in MOSIX
Answer:
MOSIX2 recognizes two types of processes: Linux and MOSIX processes.
Linux processes are not affected by MOSIX2 - they run as
they do on any Linux system, but cannot be migrated.
MOSIX processes are run in an environment that allows them
to migrate from one node to another.
Linux processes usually include administrative and other tasks that
are not suitable for migration, whereas MOSIX processes are selected
user-applications that are suitable and can benefit from migration.
Apart from process-migration that is available only to MOSIX processes,
MOSIX2 includes batch mechanisms that can queue and assign new jobs
to begin on the best available node: these batch mechanisms are available
for both Linux and MOSIX jobs.
Unlike MOSIX1, in MOSIX2 you need to invoke "mosrun" in order to use
MOSIX - otherwise you run your programs on your standard Linux platform.
If you want to make use of the MOSIX batch mechanisms for Linux
(non-migratable) processes, use the "mosrun -E" option.
This can be summarized in the following table:
| Process type | Migratable (MOSIX) | Non-Migratable (Linux) |
| Batch | mosrun -M [-b] | mosrun -E [-b] |
| Fully-interactive | mosrun [-b] | (do not use "mosrun") |
where "-b" selects the best location to run the job.
Question:
Does MOSIX support checkpoint/restart
Answer:
Yes, most CPU-intensive MOSIX processes can be checkpointed.
When a checkpoint is performed, the image of the process is saved to a
file. The process can later recover itself from that file and continue to
run from that point.
For successful checkpoint and recovery, a process must not depend heavily
on its Linux environment. For example, for security reasons processes with
setuid/setgid privileges or processes with open pipes or sockets can't be
checkpointed.
Checkpoints can be triggered by a program, by a manual request
and/or automatically - at regular time intervals.
Question:
What are the options of "live-queuing" in MOSIX
Answer:
MOSIX2 supports "live-queuing" that allows queued jobs to preserve
their full connection with their Linux environment.
This includes controlling terminal, parent-process, signals, pipes,
sockets, shared file-descriptors,
etc. The queuing system includes tools for tracing queued jobs, setting
and changing their priorities or the order of execution, and for running
parallel jobs.
Question:
How does the queuing system of MOSIX work
Answer:
In a MOSIX grid, each cluster has its own queue and this queue is shared
by all the users of that cluster.
The number of jobs that can be placed in the queue is limited by the
number of Linux processes (about 30000 for all users). To queue a
larger number of jobs, there is an option to run multiple command-lines
from a file, each with its own arguments. This option is commonly used
to run the same program with many different sets of arguments.
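As a rough illustration of preparing such a file, the sketch below prints one command-line per parameter set. The program name "./simulate", its "--seed" argument and the file name "jobs.txt" are hypothetical, and the mosrun option that reads the file is deliberately not shown; consult the mosrun manual for it.

```shell
# Print one command-line per parameter set; redirect the output to a file.
# "./simulate" and "--seed" are placeholder names, not part of MOSIX.
make_job_lines() {
    i=$1
    last=$2
    while [ "$i" -le "$last" ]; do
        printf './simulate --seed %s\n' "$i"
        i=$((i + 1))
    done
}

# Typical use: make_job_lines 1 1000 > jobs.txt
make_job_lines 1 3
```

Generating the file from a loop keeps the argument sets in one place and makes it easy to rerun only a sub-range.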
Another option allows setting an upper limit on the number of
simultaneous jobs that are allowed to run. This option combines well
with the queuing system, which runs jobs based on the availability of
grid/cluster resources.
There is an argument to inform the queuing system that the job may
split into a number of parallel processes, so that more resources
are reserved for it. Another argument allows bundling for easy
identification of several instances of a job by a single job-ID.
Jobs can also be handled as a group and be killed collectively.
Question:
How does MOSIX manage batch jobs
Answer:
In MOSIX2 batch jobs can be sent to any node in the local cluster
(as opposed to non-batch jobs that require the specific environment
of their dispatching node).
There are two types of batch jobs: Linux and MOSIX. Linux batch
processes do not migrate, while MOSIX batch processes can migrate,
but their home-node can be different than their dispatching node.
MOSIX can assist both types by:
- Queuing the job until resources are available
(using "mosrun -q", "mosrun -S" or both);
- Selecting the best initial assignment for the job.
Batch jobs are started from binaries in another node and preserve only
some of the caller's environment:
- They receive the environment variables.
- They can read from their standard-input and write to their standard
output and error, but not from/to other open files.
- They receive signals, but if they fork, signals are delivered to the
whole process-group rather than just the parent.
- They cannot communicate with other processes on the calling node using
pipes and sockets (other than standard input/output/error), semaphores,
messages, etc.
- They can only receive signals, but not send them to processes on the
calling node.
The main advantage of batch jobs is that
they save time by not needing to refer to the dispatching-node to perform
system-calls, and that temporary files can be created on the node where
they start, preventing the dispatching node from becoming a bottleneck.
This approach is therefore recommended for programs that
perform a significant amount of I/O.
Question:
How does MOSIX handle temporary files
Answer:
To reduce the I/O overhead, MOSIX2 has an option to
migrate (private) temporary files with the process.
Question:
Can MOSIX run in a Virtual Machine (VM)
Answer:
Yes.
MOSIX can run in a virtual machine on any platform
that supports virtualization (including Windows).
The MOSIX web site provides a free evaluation copy of MOSIX2 on a
pre-installed virtual-disk image
that can be used to create a MOSIX virtual cluster
on Linux and/or Windows computers.
Question:
Is it possible to install and run more than one VM with MOSIX on the same node
Answer:
Yes, this is especially useful on multi-core computers.
Note that the total number of processors used by the VMs should not
exceed the number of physical processors.
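This rule can be checked with a few lines of shell. The sketch below is only an illustration: the function name and the example numbers are made up, and on Linux the first argument could be taken from "nproc" (which counts the available processing units).

```shell
# Return success if the virtual CPUs requested by all VMs fit within the
# physical processors. Arguments: physical count, then one vCPU count per VM.
vcpus_fit() {
    physical=$1
    shift
    total=0
    for n in "$@"; do
        total=$((total + n))
    done
    [ "$total" -le "$physical" ]
}

# Example: an 8-core host with three VMs of 4, 2 and 2 virtual CPUs.
if vcpus_fit 8 4 2 2; then
    echo "OK: VMs fit within the physical processors"
else
    echo "Too many virtual CPUs for this host"
fi
```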
Question:
Can MOSIX run on an unmodified Linux kernel
Answer:
Yes, within a Virtual Machine.
Question:
Why migrate processes when one can move a whole VM with a process inside
Answer:
Mainly because it is expensive, both in terms of time and the required
memory, to create a VM for each process.
Specifically:
- Migrating a whole VM requires the transfer of much more memory.
Even in the case of "live-migration" (that works for certain types
of processes, not all), this can overload the network more.
- Once in a VM, a process that splits (using "fork") cannot get
independent resources for each split process: the original process
with all its children will have to remain together on the same VM.
- Processes within a VM cannot maintain most of their connections
(pipes, signals, parents/children, IPC, etc.) with other processes,
either on the generating host or in other VMs.
- Allocating a full virtual-disk image for each process can consume
a large amount of disk space.
- Current VM technology doesn't support migration between different
clusters that are on different switches.
Question:
How to find the latest release and change-log of MOSIX
Answer:
The latest release of MOSIX and its change-log are available at
this link.
Question:
Is technical support available
Answer:
Yes.
Technical support is available for a fee.
It includes configuration and installation assistance
as well as upgrades to new releases.
For details please follow this link.
Question:
How to install MOSIX
Answer:
An installation script and instructions are included
in all MOSIX distributions.
Question:
After installing MOSIX in one node, how do I install it on the other nodes
Answer:
The best way is to use a cluster installation package (such as OSCAR).
If you use a common NFS root directory for your cluster,
you can install MOSIX in that directory.
Otherwise, on a small cluster, you can install MOSIX node by node.
Question:
Why did the installer fail to patch my kernel
Answer:
You probably specified a wrong version of the Linux kernel sources.
You must use only the
official Linux kernel sources for your specific MOSIX distribution.
Note: do not use the kernel sources supplied with commercial Linux
distributions - they were modified and could cause the MOSIX patch
to fail.
Question:
Why did I get a kernel panic when trying to boot the MOSIX kernel
Answer:
This is not because of MOSIX, but simply because you have prepared
your own Linux kernel, which is probably misconfigured
(if you need to be convinced, try a plain, non-MOSIX, kernel from
http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.x).
When you use a standard Linux package (such as RedHat, SuSE, or Debian),
your kernel (and/or kernel modules) would already be configured by that
package, but when you compile your own kernel - as you do when installing
MOSIX, you need to make sure that the kernel configuration suits your
hardware and contains all the necessary device-drivers and file-systems
that you are using.
One approach that often helps in constructing the correct kernel configuration
is to use the output of "gzip -cd < /proc/config.gz", produced on the
originally-supplied kernel, as a basis for the new configuration
(but note that not every Linux distribution has "/proc/config.gz").
This output may not be totally accurate because it comes from a different
(usually older) Linux kernel-version, but is a good place to start:
place it in the file ".config" of the kernel-source directory,
then adjust it by running "make menuconfig".
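The steps above can be sketched as a small helper. This is only an illustration: the function name is made up, and the kernel-source path in the usage comment is an example rather than a required location.

```shell
# Seed a kernel-source directory with the configuration of an existing kernel.
# $1: kernel-source directory; $2: compressed config (default /proc/config.gz).
seed_kernel_config() {
    src_dir=$1
    config_gz=${2:-/proc/config.gz}
    if [ ! -r "$config_gz" ]; then
        echo "cannot read $config_gz" >&2
        return 1
    fi
    gzip -cd < "$config_gz" > "$src_dir/.config"
}

# Typical use (the source path is only an example):
#   seed_kernel_config /usr/src/linux-2.6.x && cd /usr/src/linux-2.6.x && make menuconfig
```

The explicit readability check covers the case noted above, where the distribution does not provide "/proc/config.gz".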
Another tip that may help to configure the kernel correctly is that
unless you are a very experienced Linux system-administrator, you should
probably avoid the "initrd" hassles and configure all the drivers and
file-systems that you need in order to get the system to start within
the kernel itself rather than as kernel modules.
Question:
After I installed MOSIX, "mosrun" produces "Not Super User" and exits
Answer:
The file "/bin/mosrun" (and a few others) must have setuid-root
permissions. If for any reason it does not, then run:
> chown root /bin/mosrun /bin/mosq /bin/mosps
> chmod 4755 /bin/mosrun /bin/mosq /bin/mosps
Question:
May I mix different versions of MOSIX in the same cluster or grid
Answer:
The MOSIX version has 4 digits. It is OK to mix versions when
only the last digit is different, but not otherwise.
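As a minimal sketch of this rule (the function name is made up, and the version numbers other than the 4-part form itself are only illustrative), two versions can be compared by dropping their last part:

```shell
# Succeed if two 4-part MOSIX versions differ at most in the last part.
versions_compatible() {
    # ${1%.*} strips the final ".N" component, leaving the first three parts.
    [ "${1%.*}" = "${2%.*}" ]
}

versions_compatible 2.24.0.0 2.24.0.3 && echo "2.24.0.0 and 2.24.0.3: can be mixed"
versions_compatible 2.24.0.0 2.25.0.0 || echo "2.24.0.0 and 2.25.0.0: do not mix"
```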
Question:
How can I see the state of my cluster or grid
Answer:
Type "mon" (the MOSIX monitor).
It can display the number of active nodes (type t),
loads (l), size of total/used memory (m),
dead nodes (d) and relative CPU speeds (s).
Question:
Is it necessary to restart MOSIX in order to change the configuration
Answer:
No.
Once you modify configuration files, the changes will take effect
within a minute. After editing the list of nodes in your cluster
("/etc/mosix/mosix.map") you need to run "setpe", but if you are
using "mosconf" to modify the local configuration, then there is
no need to run "setpe".
Question:
How do I know that the process migration works
Answer:
Run "mon" in one screen.
Then run several copies of a test (CPU bound) program,
e.g.,
mosrun -e awk 'BEGIN {for(i=0;i<100000;i++)for(j=0;j<100000;j++);}'
First you should see an increase of the load in one node.
After a few seconds, if the process migration works you will
see how the load is spread among the nodes.
If your nodes are not of the same speed then more processes
will run in the faster nodes.
Question:
What is the maximal number of multi-cores supported
Answer:
MOSIX2 supports whatever hardware is supported by the Linux kernel
that it runs under, including multi-cores (dual, quad, 8-way, etc.)
and SMPs.
Question:
Is Hyper-threading supported
Answer:
Yes.
Question:
What are the port numbers used by MOSIX
Answer:
TCP ports 249 - 253.
UDP ports 249 - 250.
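In a trusted environment these ports would typically be opened only to the other nodes. A minimal sketch, assuming iptables is the firewall in use and that 192.168.1.0/24 (a made-up example) is the cluster subnet; the function only prints the rules so that they can be reviewed before being run as root:

```shell
# Print iptables rules that accept MOSIX traffic from one trusted subnet.
# The rules are only printed; review them before running them as root.
mosix_fw_rules() {
    subnet=$1
    echo "iptables -A INPUT -p tcp -s $subnet --dport 249:253 -j ACCEPT"
    echo "iptables -A INPUT -p udp -s $subnet --dport 249:250 -j ACCEPT"
}

mosix_fw_rules 192.168.1.0/24
```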
Question:
What happens when a node crashes
Answer:
All processes that were running on or originated from that node
are killed.
To minimize the damage for long-running processes, it is recommended
to use the MOSIX checkpoint facility.
Question:
Does the traffic among MOSIX nodes pass safely through the IPSec tunnels
Answer:
Yes. MOSIX works on top of TCP and UDP, obviously above IP.
Question:
Is it possible to run MOSIX over a WAN or the Internet
Answer:
Yes.
However, opening the grid over the Internet without a VPN is a
security hazard.
Question:
How to run MOSIX processes on idle workstations
Answer:
MOSIX can take advantage of idle workstations (when no one is logged in),
with the option that upon a login, all MOSIX processes are moved out and the
MOSIX activities are stopped.
- In the login script add the commands:
> mosctl block
> mosctl expel &
The "mosctl block" command prevents new remote processes from migrating
to that workstation.
The "mosctl expel &" command moves out MOSIX guest processes.
Note that an & is used after the expel command, since
expelling processes may take some time and we don't want the user login
process to hang. The processes are expelled while the user logs in.
- On logout, run the command:
> mosctl noblock
This command allows remote processes to migrate to the workstation.
On a Debian system using GDM, the appropriate file in which to add this command
is /etc/gdm/PostSession/Default.
Note that when adding the mosctl commands to the GDM script, you should not
interfere with the correct operation of GDM.
32-bit and 64-bit applications
Question:
How do I inform MOSIX whether I use 32-bit or 64-bit systems
Answer:
There is no need to do so - the MOSIX installation script will
automatically detect the type of system that you have and install
the appropriate binaries.
Question:
Can I mix 32-bit and 64-bit nodes in the same cluster
Answer:
Yes you can, but performance can be better if your situation allows you
to set the 32-bit and 64-bit nodes as separate clusters within a
multi-cluster grid.
Question:
Can I run 32-bit programs on 64-bit nodes
Answer:
Yes, 32-bit programs can migrate to 64-bit nodes (and even start there),
but the home-node of 32-bit programs must be on a 32-bit computer.
Thus, if
you want to run 32-bit programs on predominantly 64-bit cluster(s), you
may consider leaving aside a few 32-bit computers as part of your
cluster and/or multi-cluster grid, from where you can start 32-bit programs.
Question:
Can I run 64-bit programs on 32-bit nodes
Answer:
No, the hardware does not support it
(and even when it does, a 32-bit Linux kernel doesn't).
Question:
Can I have MOSIX running under a 64-bit kernel, but a 32-bit Linux installation, utilities and libraries (because it is so much easier to upgrade only the kernel)
Answer:
No. While Linux allows this combination, neither the 32-bit nor the
64-bit variant of the current MOSIX version supports it yet,
so MOSIX will fail to start.
Question:
What happens if I attempt to run a 32-bit executable from a 64-bit node
Answer:
It will run correctly for the sake of transparency, but as a "native"
Linux process, so the program will not be able to migrate or use special
MOSIX features (not even its child processes, not even if they later
execute a 64-bit binary).
Question:
If a child process is spawned from a parent, must they migrate together
Answer:
No. Each process is managed independently.
Question:
Why is shared-memory not supported
Answer:
Because it is not scalable,
i.e., it is impossible to change the contents of a memory in one
node and expect that the same change will be reflected instantly
in the memory of the remaining nodes (with which memory is shared),
e.g., as in an SMP or a multi-core.
Question:
How to run a threaded application
Answer:
Threaded applications are created by the "clone" system-call with the
"CLONE_VM" flag, which makes threads share memory, and thus are not suitable
for distributed-memory architectures.
In MOSIX it is possible to run threaded applications as standard Linux
processes. Such applications cannot be migrated, but can still benefit
from MOSIX features such as queuing and best initial-assignment.
To launch threaded applications use "mosrun -E".
Question:
How to run a script where one of the commands is a threaded application
Answer:
By using the "native" utility in your script:
> native {threaded_program} [program-args]...
Question:
Must all migratable executables be started under "mosrun"
Answer:
To be migratable, either the executables, or the shell (or other program)
that called them must be run under "mosrun". Once a shell runs under
"mosrun", all its descendants will also be under "mosrun"
(but there is a way to request explicitly that a particular child
will NOT run under "mosrun").
Question:
Are there any limitations on I/O that can be performed by migrated processes
Answer:
Usually, remote I/O done by migrated processes on remote nodes
is performed via the respective home-node of each process.
While this does not limit the allowed operations, it may slow-down
such processes. Thus, if the amount of I/O is significant, it will
often cause the process to migrate back to its home-node.
Note that the amount and frequency of I/O is taken into account and
weighted against other considerations in making such a decision.
The direct-communication (migratable socket) can remove this slow-down
effect for I/O between communicating processes.
Question:
Which IPC mechanism should be used between processes to get the best performance
Answer:
The most efficient mechanism is the direct-communication, see the next
questions.
Otherwise, MOSIX is not different from Linux:
depending on the particular needs of the process,
whatever approach (other than shared-memory) that is best in Linux
is best on MOSIX. It could be pipes, SYSV-messages, UNIX-sockets,
TCP-sockets and files.
Obviously, files can be slow because they usually require writing on
a physically-moving surface and/or networking. On the other hand,
Linux has very good caching mechanisms for local files.
Question:
Can MOSIX support migratable sockets
Answer:
Yes, direct-communication provides an effective migratable socket between
migrated processes.
Question:
How can direct-communication improve the performance of communicating processes
Answer:
Normally, MOSIX processes do all their I/O and (most) system-calls via
their respective home-nodes.
This can be slow because operations are limited by the network speed and
latency.
Direct communication allows migrated processes to exchange messages
directly, bypassing their home-nodes.
Question:
How to run MATLAB Version 7.4 (or older) jobs in MOSIX
Answer:
Jobs running MATLAB Version 7.4 (or older) can automatically migrate
among nodes of a cluster/multi-cluster.
First, tune MATLAB to MOSIX by the following 3 steps:
- Find where MATLAB is installed on your system by
> which matlab
/usr/local/bin/matlab
- Backup the matlab program to another location
> cp /usr/local/bin/matlab /tmp/mos-matlab
- Comment-out the following 2 lines in the mos-matlab script:
LD_ASSUME_KERNEL=2.4.1
export LD_ASSUME_KERNEL
the result should be :
#LD_ASSUME_KERNEL=2.4.1
#export LD_ASSUME_KERNEL
You can now run MATLAB jobs in a cluster/multi-cluster using mosrun.
Example: to run the following MATLAB test.m program:
a=randn(3000);
b=svd(a);
use:
> mosrun -e mos-matlab -nojvm -nodesktop -nodisplay < test.m
Question:
How to run MATLAB Version 7.5 (or newer) jobs
Answer:
MATLAB Version 7.5 (or newer) applications use a library which
uses threads (the "CLONE_VM" flag of the "clone" system-call) incorrectly.
To overcome this problem we added to mosrun the -i flag,
which should be used with the -E flag.
This means that MATLAB jobs can be queued and assigned by MOSIX to nodes
as regular Linux processes, but they can't migrate afterwards.
The MOSIX version should be at least MOSIX-2.24.0.0
and jobs should be started by:
> mosrun -E -b -i matlab ....
MOSIX will assign each job to the best node in the local cluster.
Example: to run the following MATLAB test.m program:
a=randn(3000);
b=svd(a);
use:
> mosrun -E -b -i matlab -nojvm -nodesktop -nodisplay < test.m
Question:
How to run JAVA programs
Answer:
JAVA supports (shared-memory) threads (created with the "CLONE_VM" flag of
the "clone" system-call), which are not suitable for distributed-memory
architectures (clusters).
This means that JAVA jobs can be queued and assigned by MOSIX to nodes
only as regular Linux processes (with the mosrun -E flag).
A JAVA job should be started by:
> mosrun -E -b java job
MOSIX will assign each job to the best node in the local cluster.
Question:
Can MOSIX migrate MPI processes
Answer:
Yes.
MPI allocates processes to slave nodes of a cluster in a Round-Robin fashion,
without checking the state of the resources, e.g. speed, current load and
available memory.
Process migration can improve the performance by load-balancing, by
migration of processes from slower to faster nodes and to nodes with
sufficient free memory, as well as by migration of MPI processes to
grid nodes which are not part of the user's cluster.
Question:
What is HUGI
Answer:
The
Hebrew University Grid (HUGI) is a production multi-cluster
(organizational grid) with 15 MOSIX clusters.
Most clusters are private. They are made of production servers
that belong to research groups in various departments.
Four clusters are made of workstations in student labs.
Processes of users are allowed to migrate to idle workstations
and among nodes in the private clusters, subject to the priorities
among the different groups.
For example, since the workstations belong to the CS department,
processes that are started in a CS private cluster have a higher
priority to move to a workstation over already running processes
from the Chemistry cluster.
Due to the increased computing demands by our researchers,
the amount of installed memory in the workstations was increased
(beyond the needs of the students), to allow large guest processes
from the private clusters to run in these workstations.
Question:
How is HUGI managed
Answer:
None of the nodes in the HUGI clusters rely on local
disks for booting and running.
Local disks may be used for temporary storage.
User-files and home directories are located on central NFS servers.
Question:
What are the rules and policies for running applications on HUGI
Answer:
Rules:
- Users are requested to login and start their jobs in the cluster
of their group.
- Remote logins to the student workstations are not permitted.
- Users that submit jobs with a large number of sequential
processes are requested to use either the -q or the -S queuing
options of mosrun.
- Users are requested to use the -m parameter of mosrun, predicting
the amount of memory they will require.
Policies:
- All the workstations are rebooted every night.
- Before rebooting a workstation, all guest processes are moved out.
Those processes can move to other nodes in the grid - if no other
nodes are available, they are frozen in the home node.
- Processes will automatically migrate to the best available (grid-wide)
nodes (subject to the priority of their home cluster).
- Students have the highest priority over their workstations. Whenever a
student logs in, all guest processes are moved out from that workstation.
Question:
Who is responsible for allocating freeze space
Answer:
Cluster owners should designate sufficient freeze space for
processes originating from their cluster.