MOSIX FAQ - Flat listing

MOSIX Frequently Asked Questions - Flat listing

General

Question:

What is MOSIX

Answer:

MOSIX is a cluster management system targeted for parallel computing on Linux clusters and multi-cluster private clouds.

MOSIX supports automatic resource discovery and dynamic workload distribution. It provides users and applications with the illusion of running on a single computer with multiple processors.

More information can be found in the About web page and the MOSIX white paper.

Question:

Why this name

Answer:

MOSIX stands for a Multicomputer Operating System for UnIX.

MOSIX® is a registered trademark of Amnon Barak and Amnon Shiloh.

Question:

Who is it suitable for

Answer:

MOSIX is suited to running compute-intensive applications with small to moderate amounts of I/O over fast, secure networks, in a trusted environment (where all remote nodes are trusted). A typical situation is several private clusters, each connected internally by Infiniband and externally by Ethernet, forming a private (intra-organization) cloud.

Question:

What are the main benefits of MOSIX

Answer:

A single-system image - users can log in on any node and do not need to know the state of the (cluster) resources or where their programs run.

In a MOSIX cluster/multi-cluster there is no need to modify applications or link them with any library, to copy files or log in to remote nodes, or even to assign processes to different nodes, including nodes in different clusters - it is all done automatically.

The outcome is ease of use, better utilization of resources and near maximal performance.

Question:

How this is accomplished

Answer:

By a software layer that allows applications to run in remote computers as if they run locally.

Users can run their regular sequential and parallel applications as if they were using one computer (node), while MOSIX automatically (and transparently) seeks resources and migrates processes among nodes to improve the overall performance.

This is accomplished by on-line algorithms that monitor the state of the system-wide resources and the running processes, then, whenever appropriate, initiate process migration to:

Balance the load;
Move processes from slower to faster nodes;
Move processes from nodes that run out of free memory;
Preserve long-running guest processes when clusters are about to be disconnected from the multi-cluster.
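
As a rough illustration of the first three criteria above, the following C sketch picks the node on which one more process would see the lowest load relative to CPU speed, skipping nodes without enough free memory. It is not the actual MOSIX algorithm: the struct, its fields and the function name are hypothetical, and the cost formula is an assumption made only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-node snapshot; not a MOSIX data structure. */
struct node_info {
    double load;        /* number of runnable processes */
    double speed;       /* relative CPU speed, 1.0 = reference node */
    long   free_mem_mb; /* free memory, in MB */
};

/*
 * Return the index of the node on which a process needing `need_mb`
 * of memory would see the lowest speed-normalised load, or -1 if no
 * node has enough free memory. Illustration only.
 */
int pick_target(const struct node_info *nodes, size_t n, long need_mb)
{
    int best = -1;
    double best_cost = 0.0;

    assert(nodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Skip nodes that are short of memory or have no usable speed. */
        if (nodes[i].free_mem_mb < need_mb || nodes[i].speed <= 0.0)
            continue;
        /* Assumed cost: load after adding one process, scaled by speed. */
        double cost = (nodes[i].load + 1.0) / nodes[i].speed;
        if (best < 0 || cost < best_cost) {
            best = (int)i;
            best_cost = cost;
        }
    }
    return best;
}
```

With this cost, a node that is twice as fast can carry roughly twice the load before it stops being the preferred target, which matches the intent of moving processes from slower to faster nodes.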

Question:

Which CPU architectures are supported

Answer:

MOSIX runs only on the x86_64 (standard 64-bit PC) architecture.

Question:

Which networks are supported

Answer:

MOSIX can run over any network that supports TCP/IP, e.g., Ethernet and Infiniband.

Question:

Which software platforms are supported

Answer:

MOSIX distributions are provided for use with all Linux distributions; as an RPM for openSUSE; and as a pre-installed virtual-disk image that can be used to create a MOSIX virtual cluster on Linux and/or Windows computers.

Question:

Is MOSIX a cluster or a multi-cluster technology

Answer:

MOSIX versions 2, 3 and 4 can manage both clusters and multi-clusters.

MOSIX-1 for Linux-2.4 managed a single cluster.

Question:

Why all remote nodes must be trusted

Answer:

To ensure that migrated (guest) processes are not tampered with while running in remote clusters.

Note that such processes run in a sandbox, preventing them from accessing local resources in the hosting nodes.

Question:

History of MOSIX

Answer:

Is here.

Question:

MOSIX related papers

Answer:

Can be found here.

MOSIX - conceptual

Question:

What are the main features of MOSIX

Answer:

The main features are listed here.

Question:

What aspects of a single-system image are supported

Answer:

MOSIX provides users and applications with the run-time environment of the user's "login-node".

This means that:

Users can log in on any node and do not need to know where their programs run.
No need to modify or link applications with special libraries.
No need to copy files to remote nodes.
Automatic resource discovery: whenever clusters or nodes join (or disconnect), all the active nodes are updated.
Automatic workload distribution by process migration, including load balancing, process migration from slower to faster nodes and from nodes that run out of free memory.

Question:

How MOSIX supports a multi-cluster private cloud

Answer:

A multi-cluster private cloud is a set of clusters (including servers and workstations) whose owners wish to share their computing resources from time to time in a flexible way.

MOSIX provides the following features to manage such clouds:

Support of disruptive configurations: clusters can join or leave the cloud at any time.
Clusters can be shared symmetrically or asymmetrically. For example, the owner of cluster A can allow processes originating from cluster B to move in but block processes originating from cluster C.
A run-time priority for flexible use of nodes within and among groups. For example, to partition a cluster among different users.
Each cluster owner can assign priorities to processes from other clusters. For example, the owner of cluster A can assign higher priority to processes from cluster B and lower priority to processes from cluster C. This way, when guest processes from cluster B wish to move to cluster A, they will push out guest processes from cluster C (if any).
Local and higher priority processes force out lower priority processes.
Processes that migrated to or from a disconnecting cluster are moved out or back, respectively, so that long-running migrated processes are not killed.
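
The priority rule above ("local and higher priority processes force out lower priority processes") can be sketched in C as follows. This is only an illustration under stated assumptions: the struct and function are hypothetical, and the convention that a smaller number means a higher priority (with 0 standing for local processes) is assumed for the example rather than taken from this FAQ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of a guest process hosted on this node. */
struct guest {
    int pid;
    int priority; /* assumed: smaller value = higher priority, 0 = local */
};

/*
 * Return the index of the lowest-priority guest that an arriving
 * process of priority `incoming` may push out, or -1 if every guest
 * has equal or higher priority and must be left alone.
 */
int pick_victim(const struct guest *guests, size_t n, int incoming)
{
    int victim = -1;

    assert(guests != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (guests[i].priority <= incoming)
            continue; /* equal or higher priority: stays */
        if (victim < 0 || guests[i].priority > guests[victim].priority)
            victim = (int)i;
    }
    return victim;
}
```

In the example from the list, guests from cluster C would carry a larger priority value than guests from cluster B, so an arriving process from B selects a guest from C as the one to push out.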

Question:

What is the architecture of a MOSIX configuration (cluster, multi-cluster)

Answer:

The architecture of a MOSIX configuration is homogeneous: all nodes must be x86-based and run (nearly) the same version of MOSIX (see the question about mixing different versions of MOSIX).

Individual nodes may have different numbers of cores, different speeds, different memory sizes or different I/O devices.

Question:

Does MOSIX support checkpoint/restart

Answer:

Yes, most CPU-intensive MOSIX processes can be checkpointed.

When a checkpoint is performed, the image of the process is saved to a file. The process can later recover itself from that file and continue to run from that point.

For successful checkpoint and recovery, a process must not depend heavily on its Linux environment. For example, for security reasons processes with setuid/setgid privileges or processes with open pipes or sockets can't be checkpointed.

Checkpoints can be triggered by a program, by a manual request and/or automatically - at regular time intervals, see the next question.

Question:

How to trigger a checkpoint

Answer:

Checkpoints can be triggered in 3 ways:

By providing the "-C<file-name>" and "-A<integer-number>" flags to mosrun. This performs a periodic checkpoint every "integer-number" minutes, and the checkpoints are saved to the files "file-name.1", "file-name.2", etc. Read the mosrun manual for details.
By using the "migrate <pid> checkpoint" command to perform a checkpoint at a specific time, externally to the program. Read the migrate command manual for details.
From within the program, by using the MOSIX checkpoint interface.

The MOSIX checkpoint interface is documented in "man MOSIX". It contains the following files in the proc file system:

/proc/self/checkpoint
/proc/self/checkpointfile
/proc/self/checkpointlimit
/proc/self/checkpointinterval

These pseudo-files allow the process to modify its checkpoint parameters and to trigger a checkpoint operation.
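
Since successive checkpoints are saved as "file-name.1", "file-name.2", etc., a wrapper that keeps track of checkpoints may need to build these names. The helper below is a minimal sketch of that naming scheme only; the function name is hypothetical and it does not call any MOSIX interface.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "<base>.<seq>" into buf, following the "file-name.1",
 * "file-name.2" naming described above.
 * Returns 0 on success, -1 if seq < 1 or the name does not fit.
 */
int checkpoint_name(char *buf, size_t buflen, const char *base, int seq)
{
    int n;

    assert(buf != NULL && base != NULL);
    if (seq < 1)
        return -1;
    n = snprintf(buf, buflen, "%s.%d", base, seq);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

For example, checkpoint_name(buf, sizeof buf, "ccc", 1) produces "ccc.1", the name that is later passed to "mosrun -R" in the restart example.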

Question:

Example how to perform a checkpoint from within a program

Answer:

The following program performs 100 units of work and triggers a checkpoint right after the unit given by the checkpoint-unit argument. The checkpoint-file argument names the file in which the checkpoints of the program are saved.

#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

// Setting the checkpoint file from within the process.
// This can also be done via the -C argument to mosrun.
// Note: the new value is passed as the third argument of open();
// see "man MOSIX" for the details of this interface.
int setCheckpointFile(char *file) {
     int fd;

     fd = open("/proc/self/checkpointfile", 1|O_CREAT, file);
     if (fd == -1) {
        return 0;
     }
     return 1;
}

// Triggering a checkpoint from within the process
int triggerCheckpoint() {
     int fd;

     fd = open("/proc/self/checkpoint", 1|O_CREAT, 1);
     if (fd == -1) {
        fprintf(stderr, "Error doing self checkpoint\n");
        return 0;
     }
     printf("Checkpoint was done successfully\n");
     return 1;
}

int main(int argc, char **argv) {
     int j, unit;
     volatile int t;   // volatile, so that the busy loop is not optimized away
     char *checkpointFileName;
     int checkpointUnit = 0;

     if (argc < 3) {
        fprintf(stderr, "Usage: %s <checkpoint-file> <unit>\n", argv[0]);
        exit(1);
     }

     checkpointFileName = strdup(argv[1]);
     checkpointUnit = atoi(argv[2]);
     // Units are numbered 0..99 and the checkpoint unit must be at least 1
     if (checkpointUnit < 1 || checkpointUnit > 99) {
        fprintf(stderr, "Checkpoint unit should be >= 1 and <= 99\n");
        exit(1);
     }

     printf("Checkpoint file: %s\n", checkpointFileName);
     printf("Checkpoint unit: %d\n", checkpointUnit);

     // Setting the checkpoint file from within the process (can also be
     // done using the -C argument of mosrun)
     if (!setCheckpointFile(checkpointFileName)) {
        fprintf(stderr, "Error setting the checkpoint filename from within the process\n");
        fprintf(stderr, "Make sure you are running this program via mosrun\n");
        return 1;
     }

     // Main loop ... running for 100 units. Change this loop if you wish
     // the program to do more loops
     for (unit = 0; unit < 100; unit++) {
        // Consuming some CPU time (simulating the run of the application).
        // Change the number below to make each loop consume more (or less) time
        for (t = 0, j = 0; j < 1000000 * 500; j++) {
           t = j + unit * 2;
        }
        printf("Unit %d done\n", unit);

        // Triggering a checkpoint request from within the process
        if (unit == checkpointUnit) {
           if (!triggerCheckpoint())
              return 1;
        }
     }
     return 0;   // exit status 0 = success
}

To compile: gcc -o checkpoint_demo checkpoint_demo.c
To run: mosrun ./checkpoint_demo <checkpoint-file> <unit>

A typical run:
> mosrun ./checkpoint_demo ccc 5
Checkpoint file: ccc
Checkpoint unit: 5
Unit 0 done
Unit 1 done
Unit 2 done
Unit 3 done
Unit 4 done
Unit 5 done
Checkpoint was done successfully
Unit 6 done
Unit 7 done
Unit 8 done
^C

The program triggered a checkpoint after unit 5. The checkpointed file was saved in ccc.1.
After unit 8 the program was killed.

To restart:
> mosrun -R ccc.1
Checkpoint was done successfully
Unit 6 done
Unit 7 done
Unit 8 done
Unit 9 done
Unit 10 done
...

The program was restarted from the point right after it was checkpointed.

Question:

How MOSIX handles temporary files

Answer:

To reduce the I/O overhead, MOSIX has an option to migrate (private) temporary files with the process.

Question:

Can MOSIX run in a Virtual Machine (VM).

Answer:

Yes.

MOSIX can run in a virtual machine on any platform that supports virtualization.

Question:

Is it possible to install and run more than one VM with MOSIX on the same node

Answer:

Yes, this is especially useful on multi-core computers.

Note that the total number of processors used by the VMs should not exceed the number of physical processors.

Question:

Can MOSIX run on an unmodified Linux kernel

Answer:

Yes, MOSIX-4 can run on any Linux kernel version 3.12 or higher; on Linux distributions based on such kernels; and on the standard kernel from openSUSE version 13.1 or higher.

Question:

Why migrate processes when one can move a whole VM with a process inside

Answer:

Mainly due to overheads, both in terms of time and the required memory to create a VM for each process.

Specifically:

Migrating a whole VM requires the transfer of much more memory. Even with "live-migration" (which works only for certain types of processes, not all), this puts a heavier load on the network.
Once in a VM, a process that splits (using "fork") cannot get independent resources for each split process: the original process and all its children have to remain together in the same VM.
Processes within a VM cannot maintain most of their connections (pipes, signals, parents/children, IPC, etc.) with other processes, either on the generating host or in other VMs.
Allocating a full virtual-disk image for each process can consume a large amount of disk space.
Current VM technology doesn't support migration between different clusters that are on different switches.

MOSIX - technical

Question:

Latest release and changelog

Answer:

Are available here.

Question:

How to install

Answer:

An installation script and instructions are included in all the MOSIX distributions.

Question:

After installing MOSIX in one node, how do I install it on the other nodes

Answer:

The best way is to use a cluster installation package (such as OSCAR).

If you use a common NFS root directory for your cluster, you can install MOSIX in that directory.

Otherwise, on a small cluster, you can install MOSIX node by node.

Question:

After I installed MOSIX, "mosrun" produces "Not Super User" and exits

Answer:

The file "/bin/mosrun" (and a few others) must have setuid-root permissions. If for any reason it does not, then run:
> chown root /bin/mosrun /bin/mosq /bin/mosps
> chmod 4755 /bin/mosrun /bin/mosq /bin/mosps

Question:

May I mix different versions of MOSIX in the same cluster or multi-cluster.

Answer:

The MOSIX version has 4 digits. It is OK to mix versions when only the last digit is different, but not otherwise.
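
The rule can be expressed as a small check. The following C sketch assumes, for illustration only, that the four digits are written in dotted form ("a.b.c.d"); the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Return 1 if two versions written as "a.b.c.d" (an assumed notation)
 * may be mixed, i.e. they differ at most in the last component,
 * 0 if they may not, and -1 if either string cannot be parsed.
 */
int versions_compatible(const char *v1, const char *v2)
{
    int a[4], b[4];

    assert(v1 != NULL && v2 != NULL);
    if (sscanf(v1, "%d.%d.%d.%d", &a[0], &a[1], &a[2], &a[3]) != 4 ||
        sscanf(v2, "%d.%d.%d.%d", &b[0], &b[1], &b[2], &b[3]) != 4)
        return -1;
    /* Only the first three components have to match. */
    return a[0] == b[0] && a[1] == b[1] && a[2] == b[2];
}
```

Under this rule, two versions that share their first three digits may be mixed whatever their last digits are, while a difference in any earlier digit rules out mixing.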

Question:

How can I see the state of my cluster or multi-cluster.

Answer:

Type "mosmon" (the MOSIX monitor).

It can display the number of active nodes (type t), loads (l), size of total/used memory (m), dead nodes (d) and relative CPU speeds (s).

Question:

Is it necessary to restart MOSIX in order to change the configuration

Answer:

No.

Once you modify configuration files, the changes will take effect within a minute.

After editing the list of nodes in your cluster ("/etc/mosix/mosix.map") you need to run "mossetpe", but if you are using "mosconf" to modify the local configuration, then there is no need to run "mossetpe".

Question:

How to configure a cluster with partial Infiniband connection.

Answer:

Suppose you have a logical MOSIX cluster that consists of two physical clusters, each connected internally by both Ethernet and Infiniband, while the two physical clusters are connected to each other only by Ethernet. First, configure the nodes within each physical cluster (including the local node) by their Infiniband IP addresses and the nodes in the other physical cluster by their Ethernet addresses.

Next, using "mosconf", select: "1. Which nodes are in this cluster",

then type "+" to turn on the advanced features,

then type "a" to define aliases and, for each node in the other cluster, define its Infiniband IP address (despite being unreachable from the node that is being configured) as an alias to its Ethernet IP address.

Question:

How do I know that the process migration works

Answer:

Run "mosmon" in one screen. Then run several copies of a test (CPU bound) program, e.g.,

mosrun -e awk 'BEGIN {for(i=0;i<100000;i++)for(j=0;j<100000;j++);}'

First you should see an increase of the load in one node. After a few seconds, if the process migration works you will see how the load is spread among the nodes.

If your nodes are not of the same speed, then more processes will run on the faster nodes.

Question:

What is the maximal number of multi-cores supported

Answer:

MOSIX supports whatever hardware is supported by the Linux kernel that it runs under, including multi-cores (dual, quad, 8-way, etc.).

Question:

/proc/cpuinfo shows 8 CPUs, but MOSIX claims that there are only 4

Answer:

There are in fact only 4 real cores per node - the extra cores shown in /proc/cpuinfo reflect Hyper-Threading. You can tell that Hyper-Threading is enabled by the "ht" flag in the "flags" field of /proc/cpuinfo. Hyper-Threading may help some threaded applications, such as browsers, but may slow down compute intensive applications.
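
As a sketch of the check described above, the C function below looks for a flag such as "ht" as a whole word in a "flags" line taken from /proc/cpuinfo. The function name is hypothetical, and reading the line from the file is left to the caller.

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if `flag` appears as a whole word in a "flags" line of
 * /proc/cpuinfo, 0 otherwise. Matching whole words avoids false
 * hits on longer flag names that merely contain the same letters.
 */
int has_cpu_flag(const char *flags_line, const char *flag)
{
    size_t len;
    const char *p;

    assert(flags_line != NULL && flag != NULL);
    len = strlen(flag);
    if (len == 0)
        return 0;
    p = flags_line;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == flags_line) || p[-1] == ' ' ||
                     p[-1] == '\t' || p[-1] == ':';
        int ends = p[len] == '\0' || p[len] == ' ' ||
                   p[len] == '\t' || p[len] == '\n';
        if (starts && ends)
            return 1;
        p += len; /* keep searching after this partial match */
    }
    return 0;
}
```

Note that on some processors the "ht" flag may be present even when Hyper-Threading is not active; comparing the "siblings" and "cpu cores" fields of /proc/cpuinfo is a common cross-check.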

Question:

What are the port numbers used by MOSIX

Answer:

TCP ports 252 and 253.

UDP ports 249 and 253.
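
When checking firewall rules or packet traces, the list above can be encoded as a small lookup. This is a minimal sketch; the enum and function names are hypothetical, and only the port numbers come from this answer.

```c
#include <assert.h>

enum proto { PROTO_TCP, PROTO_UDP };

/* Return 1 if (proto, port) is one of the MOSIX ports listed above. */
int is_mosix_port(enum proto proto, int port)
{
    assert(port >= 0 && port <= 65535);
    switch (proto) {
    case PROTO_TCP:
        return port == 252 || port == 253;
    case PROTO_UDP:
        return port == 249 || port == 253;
    }
    return 0;
}
```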

Question:

What happens when a node crashes

Answer:

All processes that were running on or originated from that node are killed. To minimize the damage for long-running processes, it is recommended to use the MOSIX checkpoint facility.

Question:

Does the traffic among MOSIX nodes pass safely through the IPSec tunnels

Answer:

Yes. MOSIX works on top of TCP and UDP, which run above IP, so its traffic passes through IPSec tunnels like any other IP traffic.

Question:

How to run MOSIX processes in idle workstations

Answer:

MOSIX can take advantage of idle workstations (when no one is logged in), with the option that upon a login, all MOSIX processes are moved out and the MOSIX activities are stopped.

In the login script add the commands:
> mosctl block
> mosctl expel &
The "mosctl block" command prevents new remote processes from migrating to that workstation.
The "mosctl expel &" command moves out MOSIX guest processes. Note that an & is used after the expel command, since expelling processes may take some time and we don't want the user login process to hang. The processes are expelled while the user logs in.
On logout, run the command:
> mosctl noblock
This command allows remote processes to migrate to the workstation.
On a Debian system using GDM, the appropriate file in which to add this command is /etc/gdm/PostSession/Default.

Note that when adding the mosctl commands to the GDM scripts, you should not interfere with the correct operation of GDM.

Running applications

Question:

If a child process is spawned from a parent, must they migrate together

Answer:

No. Each process is managed independently.

Question:

Why shared-memory is not supported

Answer:

Because, unlike on a multi-core computer, it is impossible to change the contents of memory in one node and expect the same change to be reflected instantly in the memory of the other nodes (with which the memory is shared).

Question:

Can I run threaded applications

Answer:

No. Threads are created by the "clone" system call with the "CLONE_VM" flag, which makes them share memory, thus threaded applications cannot run under MOSIX.

Question:

How to run a script where one of the commands is a threaded application

Answer:

By using the "mosnative" utility in your script:

> mosnative {threaded_program} [program-args]...

Question:

Must all migratable executables be started under "mosrun"

Answer:

To be migratable, either the executables, or the shell (or other program) that called them must be run under "mosrun". Once a shell runs under "mosrun", all its descendants will also be under "mosrun" (but there is a way to request explicitly that a particular child will NOT run under "mosrun").

Question:

Are there any limitations on I/O that can be performed by migrated processes

Answer:

Usually, I/O done by migrated processes on remote nodes is performed via the respective home-node of each process. While this does not limit the allowed operations, it may slow down such processes. Thus, if the amount of I/O is significant, it will often cause the process to migrate back to its home-node.

Note that the amount and frequency of I/O is taken into account and weighted against other considerations in making such a decision.

Direct communication (migratable sockets) can reduce this slowdown for I/O between communicating processes.

Question:

Which IPC mechanism should be used between processes to get the best performance

Answer:

The most efficient mechanism is direct communication; see the next questions.

Otherwise, MOSIX is not different from Linux: depending on the particular needs of the process, whatever approach (other than shared-memory) is best in Linux is also best on MOSIX. It could be pipes, SYSV-messages, UNIX-sockets, TCP-sockets or files.

Files can obviously be slow, since they usually require writing to a physically-moving surface and/or networking. On the other hand, Linux has very good caching mechanisms for local files.

Question:

Can MOSIX support migratable socket

Answer:

Yes, direct-communication provides an effective migratable socket between migrated processes.

Question:

How direct-communication can improve the performance of communicating processes

Answer:

Normally, MOSIX processes do all their I/O and (most) system-calls via their respective home-nodes. This can be slow because operations are limited by the network speed and latency.

Direct communication allows migrated processes to exchange messages directly with each other, bypassing their home-nodes.

Question:

Can I run 32-bit programs on MOSIX

Answer:

You can start 32-bit programs from MOSIX, but they will run as standard, non-migratable Linux programs instead.

Connecting with SLURM

Question:

Why use SLURM

Answer:

SLURM can provide queuing and batch services.

Question:

Can I use other queuing or batch packages

Answer:

Yes, but MOSIX does not provide an interface to other packages, so you will need to write your own.

Question:

How to use MOSIX with SLURM

Answer:

1. Use the -O flag of "srun" (for overcommit).

2. When allocating a new job ("srun", "salloc" or "sbatch"), use the flag "-Cmosix\*{MHN}", where "MHN" is the desired number of MOSIX home-nodes for the job.

3. Precede your commands with "mosrun [mosrun-parameters]". The mosrun-parameters should normally include "-b" and "-m{mem}".

4. Do not specify memory-requirements in "srun", because this does not work with overcommit.
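
To show how steps 1 to 3 combine into a single command line, the C sketch below assembles one with snprintf. The function name and its parameters are hypothetical; only the flags themselves ("-O", "-Cmosix*{MHN}", "-b", "-m{mem}") come from the steps above, and the constraint is single-quoted here so that a shell does not expand the "*".

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build "srun -O '-Cmosix*<MHN>' mosrun -b -m<mem> <program>" in buf.
 * No memory requirement is passed to srun itself (see step 4).
 * Returns 0 on success, -1 on invalid values or if buf is too small.
 */
int build_srun_cmd(char *buf, size_t buflen, int home_nodes, int mem,
                   const char *program)
{
    int n;

    assert(buf != NULL && program != NULL);
    if (home_nodes < 1 || mem < 1)
        return -1;
    n = snprintf(buf, buflen, "srun -O '-Cmosix*%d' mosrun -b -m%d %s",
                 home_nodes, mem, program);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

The program name is inserted as-is, so a real wrapper would need to quote or validate it before handing the string to a shell.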

Question:

Why do I need more than one MOSIX home node

Answer:

1. To reduce the load on home nodes.

2. Even with overcommit, SLURM does not allow more than 128 tasks per node.

Question:

Do I have to use the prologs and the epilog provided by MOSIX

Answer:

These scripts are provided for convenience and can be adjusted to your needs.

Question:

Can I use nodes that are not part of the SLURM cluster

Answer:

Yes, this can be configured. It is also possible to use nodes in other SLURM clusters.