|
Frequently Asked Questions:
- I. General Questions
- Basics
- How does one run DMTCP?
- What types of programs can DMTCP checkpoint?
- What is the license for DMTCP?
- Do I need to re-compile or re-link my applications to use DMTCP?
- Do I need to add kernel modules or patch my kernel?
- Use Cases
- How does one use DMTCP for a distributed computation (for example, on a cluster)?
- Can I migrate my applications (restart on another host, possibly with a different Linux); and what if I see a message
illegal instruction ?
- How can my software invoke a checkpoint or restart directly under control of my program?
- What is the exit code (return status) for a target application running under DMTP?
- Does DMTCP support third-party plugins (add-ons) to change the behavior of DMTCP at run-time?
- On restart, a process sees only the environment variables from the time of checkpoint (since DMTCP restores the previous memory). What do I do if I want to see the newest environment values at the time of restart?
- What if my target application runs as root, uses the setuid bit, uses capabilities, or uses other special privileges?
- What if my target application runs as root, uses the setuid bit, uses capabilities, or uses other special privileges?
- How can I read a summary of the contents of a DMTCP checkpoint image?
- How do I debug my application under GDB when I am using DMTCP?
- Architectures Supported by DMTCP
- What CPUs has DMTCP been ported to?
- How can I compile DMTCP for mixed x86/x86_64 architectures (multi-arch/multilib: mixed 32-/64-bit)?
- What operating systems has DMTCP been ported to?
- What types of networks does DMTCP support?
- What Linux distributions do you test on?
- Performance
- My application takes several seconds
(or longer) to checkpoint and restart. What can I do
to speed it up?
- Can I tell DMTCP that some of my memory
does not need to be checkpointed (specifying cutouts,
for a smaller checkpoint image)?
- DMTCP Internals
- How does DMTCP work?
- Is there documentation on the internals of DMTCP?
- Can I enable tracing of DMTCP internals?
- I suspect a bug concerning an interaction with an internal plugin of DMTCP. How can I debug this?
- What is the run-time overhead of DMTCP?
- What O/S features are supported by DMTCP?
- About the DMTCP Project
- Who should I write to for support?
- When was work on DMTCP begun, and who works on it now?
- How should I cite DMTCP in a publication?
- II. Support for Specific Types of Software Applications
If you don't see your question here, consider
searching within the archive of past dmtcp-forum messages,
or else asking your question directly (see "Contact Us"
on the left).
-
% dmtcp_launch ./a.out arg1 arg2 ...
% dmtcp_command --checkpoint [from another terminal window on same computer]
% dmtcp_restart ckpt_a.out_*.dmtcp
If using the above recipe, make sure to first
remove old ckpt images. DMTCP also writes out
./dmtcp_restart_script.sh, which handles various
bookkeeping and is safer to use.
NOTE: Numerous configure and run-time options are
also available.
-
It checkpoints most binary programs on most Linux
distributions. Some examples on which users have
verified that DMTCP works are: Matlab, R, Java, Python,
Perl, Ruby, PHP, Ocaml, GCL (GNU Common Lisp), emacs,
vi/cscope, Open MPI, MPICH-2, MVAPICH2, Intel® MPI,
OpenMP, and Cilk. Both TCP and InfiniBand connections are
supported. See
Supported Applications for further details. Our goal is
to support DMTCP for all vanilla programs. If DMTCP does not
work correctly on your program, then this is a bug
in DMTCP. We would be appreciative if you can then
file a bug report with DMTCP.
-
It is a free software distributed under the terms of the
Lesser GNU Public License (LGPL).
This license was chosen to be non-contagious.
If you distribute a modified version of DMTCP, you must
make available to your users your modifications to DMTCP.
But proprietary or other
software may freely use the DMTCP libraries and utilities with no
restrictions on the proprietary or other software.
-
No. DMTCP is completely transparent in this sense.
-
No. DMTCP requires no root permissions. Hence, other
software packages can easily include DMTCP as part of their
distribution (subject to
DMTCP's LGPL license).
(See also setuid and other
special privileges.)
-
A DMTCP computation consists of all processes
connected to a given coordinator. To have two simultaneous
DMTCP computations on the same host, you will need two DMTCP
coordinators listening to different port numbers. The command
dmtcp_coordinator generates a new coordinator.
By default, dmtcp_launch (dmtcp_checkpoint)
will first look for
an existing coordinator on the localhost on port 7779
and join that existing DMTCP computation. If no such
coordinator is found,
dmtcp_launch (dmtcp_checkpoint) will create
a new coordinator and then join it as the first process
of that computation. A checkpoint is initiated when the
coordinator tells all its connected processes to create
a checkpoint. Each client of the coordinator writes a
file, ckpt_*.dmtcp, on its local machine, and the
coordinator writes dmtcp_restart_script.sh in its
own local directory. The script can be used to restart all
processes using the ckpt_*.dmtcp files on the various
hosts. Since the coordinator initiates all checkpoints, it is the
coordinator that remembers the checkpoint interval. Both
dmtcp_command --interval and dmtcp_launch
--interval can be used to set the checkpoint interval
on the coordinator.
-
Yes, migration works. Look at dmtcp_restart_script.sh for some
options on migrating both single-process and distributed process
computations. Homogeneous host architectures are best, but
some heterogeneity is also tolerated. This works best
if the source and destination Linux distro are both recent.
However, test the migration first. Problems are most likely
to occur when migrating from a newer Linux/CPU to an older
Linux/CPU, since an older distro is often not future-proof.
Note that in migrating between arbitrary
computers, it is possible to encounter
"illegal instruction " on restart.
This can occur if the CPUs of the source machine and
destination machine are different. Most often, this occurs
when migrating from a newer CPU to an older CPU that does not
support the full range of extensions to the CPU instruction
set. It may sometimes occur in migrating between AMD and
Intel CPUs if the application or libraries were optimized
during compilation for one CPU only. In principle, one can
get around this with gcc -mtune=generic, but
you would also have to make sure that (i) DMTCP,
(ii) your target application, and (iii) all
libraries used by it (including libc.so)
are compiled with gcc -mtune=generic.
Finally, DMTCP generally supports
process migration when migrating from an older Linux kernel
to a newer Linux kernel. This works, because the checkpoint
image of the target application contains all of the original
run-time libraries, including libc.so. So, when an older
libc.so makes a call to a newer Linux kernel, the newer
Linux kernel will generally preserve backwards compatibility.
-
The most full-featured mechanism is through
DMTCP plugins.
See especially
application-initiated checkpointing
(using dmtcp_checkpoint()), and
application-delayed checkpointing
(using dmtcp_disable_ckpt()/dmtcp_enable_ckpt()).
-
Normally, commands like dmtcp_launch a.out
(dmtcp_checkpoint a.out) and
dmtcp_restart ckpt_a.out_*.dmtcp pass on the
the exit code that is returned by a.out itself. If
dmtcp_launch or dmtcp_restart is passed an invalid
command line (e.g., no such ckpt file), then they will exit
with exit code 99 (by default),
or the integer value of DMTCP_FAIL_RC if that environment
variable has been set.
-
Please look at the directory of the
modify-env plugin. In particular, look at the
README file and the dmtcp_env.txt example
from that directory.
You can invoke this plugin with:
dmtcp_launch --with-plugin
/absolute/path/to/libdmtcp-modify-env.so
(If you use LD_LIBRARY_PATH , you can also avoid
the need for absolute pathnames.)
-
Yes. Please look at the
directory of DMTCP plugin examples for a flexible mechanism for
third-party plugins (add-ons). This includes support for:
(i) wrappers around system calls; (ii) hooks for
particular events (e.g.: startup, ckpt, resume, restart);
and (iii) for populating and querying a nameservice
database across distributed processes. Application-initiated
checkpointing is also provided. No re-compilation
or relinking of the application software is necessary.
The DMTCP source also provides a tutorial,
doc/plugin-tutorial.pdf.
-
DMTCP works internally by preloading its own library into
the target application (see
How does DMTCP Work?). Linux will not honor the setuid
bit if a foreign library is being preloaded (for obvious reasons).
So, either the application must be run in a mode not requiring
special privileges, or DMTCP must be run in a privileged manner.
This "Stack Overflow" web page
describes two strategies for allowing DMTCP to initially
run with privileges. The constructor to use in the case of DMTCP
is
dmtcp::DmtcpWorker::DmtcpWorker()
(or dmtcp::DmtcpWorker::DmtcpWorker(bool) for
earlier than DMTCP-2.4)
in DMTCP_ROOT/src/dmtcpworker.cpp .
(If you're putting this in a gdbinit script, don't forget
to use: set break pending on .)
Group permissions, and security authorizations such as
polkit, ACL and PAM may offer other options.
-
Use readdmtcp.sh. (The example below assumes only one
dmtcp ckpt file is present.)
<DMTCP_DIR>/util/readdmtcp.sh ckpt_*.dmtcp
-
Detailed advice is available in the file
doc/debugging-dmtcp.txt (although this FAQ is often more current). General advice follows.
When using GDB with DMTCP, you may find it useful to load some
utilities (available since DMTCP-2.6.1 and DMTCP-3.0):
(gdb) source util/gdb-dmtcp-utils
(gdb) dmtcp # lists the new GDB commands
For compiling DMTCP with debugging support under GNU gcc,
we recommend:
./configure --enable-debug
("-Wall -g3 -O0" for gcc and g++)
make -j clean && make -j
(Omit the "-j " if you are on a less powerful computer.)
CATCHING DMTCP INTERNAL ERRORS (CREATE A CORE DUMP):
Prior to running dmtcp_launch, do:
ulimit -c unlimited && export DMTCP_ABORT_ON_FAILED_ASSERT=1
(and also, if using GDB, set a breakpoint at _exit)
DEBUGGING DURING INITIAL LAUNCH:
Run it as:
gdb --args dmtcp_launch ./a.out
(Older versions of DMTCP use dmtcp_checkpoint instead
of dmtcp_launch.)
Then dmtcp_launch will call execvp
on ./a.out. So, try the following:
(gdb) break execvp
(gdb) run
(gdb) # [stops at execvp]
(gdb) break main
(gdb) next
(gdb) next
(gdb) ...
Note that dmtcp_launch
calls execvp to execute the main routine of the application
(a.out in the above example).
If you want to see the actions of the dmtcphijack.so library
starting after the call to execvp and before the application's main
routine, then at the beginning of your GDB session, do:
(gdb) break execvp
(gdb) run
(gdb) # 'pending on' is required if using a gdbinit script
(gdb) set breakpoint pending on
(gdb) # For DMTCP-2.4 and later in DMTCP-2.x:
(gdb) break dmtcp_initialize
(gdb) # For DMTCP-3.0 and later:
(gdb) # break dmtcp_initialize_entry_point
(gdb) # OBSOLETE: For prior to DMTCP-2.4:
(gdb) # OBSOLETE: break 'dmtcp::DmtcpWorker::DmtcpWorker(bool)'
(gdb) continue # Might need to repeat 'continue' a few times.
We have found the GDB command info proc mappings useful
for deciding if an address belongs to libc.so, dmtcphijack.so,
your target application, or other.
NOTE: If you want to trace the internals of DMTCP
(in addition to using GDB as above), see
tracing DMTCP internals.
DEBUGGING DURING CHECKPOINT:
Begin your DMTCP session under GDB (gdb --args dmtcp_launch ...),
and just run (without any checkpoints).
At the time of checkpoint, the checkpoint thread (typically thread 2 in GDB)
will send a SIGUSR2 to each user thread. By default, GDB will intercept that
signal, announce it to the user, and wait until the user executes:
(gdb) signal SIGUSR2
At this time, you can gain control with GDB. Tell GDB to switch to
thread 2, and you
can then examine the stack and set a breakpoint, before issuing the
"signal SIGUSR2" command.
DEBUGGING DURING RESTART:
To capture your process under GDB during
dmtcp_restart, you need a more roundabout strategy.
This is because
dmtcp_restart calls mmap to reload the
memory of your process. So, the best way is to use
gdb attach or
gdb ./a.out `pgrep -n a.out`
after your process has restarted.
In order to assist in using gdb to
attach, your restarted process can be forced
to pause deep within DMTCP just as it restarts,
by setting the environment variable DMTCP_RESTART_PAUSE2.
(Set DMTCP_RESTART_PAUSE2 before the original
dmtcp_launch, since
the restarted process will remember only the original environment
variables prior to checkpoint. (Prior to DMTCP-2.4, the
variable had the name MTCP_RESTART_PAUSE.)
The environment variable
DMTCP_RESTART_PAUSE is available to capture the
restart even earlier --- primarily of interest for DMTCP
developers.
On restart, dmtcp_restart will then pause
with a message to attach using a gdb command. In earlier versions
of DMTCP, the gdb attach command may fail with:
Operation not permitted. In those cases, you may execute:
echo 0 > /proc/sys/kernel/yama/ptrace_scope
(requires root privilege, and may pose a security risk).
Beginning with DMTCP-2.4, one can also
set the environment variable DMTCP_GDB_ATTACH_ON_RESTART
prior to executing dmtcp_restart. This also allows one
to use "gdb attach" on the restarted process for debugging.
Note that after DMTCP 2.0,
the MTCP directory was
merged into DMTCP itself, and it is no longer possible to run
MTCP as a standalone application. If you need that functionality,
please consider using dmtcp_launch --no-coordinator
with the latest DMTCP release. Having said that, if you are
using an older version of DMTCP, The information above
is valid.
DEBUGGING PLUGINS:
Some bugs may be produced by interactions among plugins. In such
cases, consider temporarily disabling plugins, and see if the
bug goes away. (This is similar to the standard advice often
given for web browsers.) In some exceptional cases, there can
also be a bug in the interaction with internal plugins of DMTCP.
See "debugging internal DMTCP plugins"
for more information concerning this issue.
In debugging during restart using 'gdb attach', we have
reports in early 2020 saying that under Ubuntu, we are not
seeing the symbol tables when we attach. This problem is
seen only on Ubuntu (e.g., Ubuntu 18.04). A stack is
seen with the addresses, but without the function names.
We see this on Ubuntu, but not on CentOS. We are
guessing that GDB under Ubuntu is having trouble "walking
the stack" to find the symbol table. As a workaround,
after restart, please use inside GDB:
(gdb) source DMTCP_ROOT/util/gdb-add-symbol-files-all
(or the older bash shell script from the command line,
DMTCP_ROOT/util/gdb-add-symbol-file
, if not using the latest DMTCP).
The shell script
DMTCP_ROOT/util/save-symbol-files-to-gdb-script.py
may also be useful in this setting.
-
DMTCP supported x86 and x86_64 since the beginning. It has
been ported to the 32-bit ARM CPU (armv7/armv7a), using the newer
EABI API for Linux on ARM. An experimental port
to 64-bit ARM (armv8) has been added as of DMTCP-2.4.0.
For porting to other CPUs, please see
src/mtcp/PORTING
(notes on how to port DMTCP).
DMTCP has also been verified to work on the Intel Xeon Phi
(back end, only, at this time) when built with the Intel
icc compiler. We used
./configure --host=mic CC=icc CXX=icpc CFLAGS=-mmic CXXFLAGS=-mmic LDFLAGS=-mmic
to build on the Intel MIC.
Optionally, one can also add "-static-intel" to
CFLAGS, CXXFLAGS and LDFLAGS.
-
See
doc/multi-arch.txt for details. In short,
./configure --enable-m32 && make clean &&
make -j && make install
./configure && make clean &&
make -j && make install
Use ./configure --prefix=$PWD/build if you wish to
build a local copy in $PWD/build, instead of installing globally.
Note that --enable-m32 will not create the
necessary DMTCP commands on a 64-bit Linux. So, on
a 64-bit Linux, you will still need the 64-bit install,
even if you intend only to run with 32-bit targets.
On recent versions of DMTCP (late 2019),
a --enable-multilib option has been
made available to automate this.
-
DMTCP operates only under Linux as of this writing. Because its
design stays close to the POSIX API, it can be ported to other
operating systems, given sufficient demand. If someone is
interested in doing the work, please write to us, and we will
share our ideas on how to do that port.
See also "Implementing Checkpointing for Android"
for work toward porting DMTCP to Android.
-
DMTCP works directly with both TCP and InfiniBand.
It would also be relatively easy to port it to IP, but we
haven't seen a demand for this. When using InfiniBand,
launch with: dmtcp_launch --infiniband ...
-
There are also "dmtcp" packages in
Debian (since version 7.0, "wheezy"),
Ubuntu (since version 11.10),
Fedora (since version 17), and
Red Hat (via Fedora EPEL).
In-house, we commonly use DMTCP under
Ubuntu, CentOS, and openSUSE.
On an irregular basis, we also test on other distributions.
-
Small applications should checkpoint and restart in
a second. There are several ways to speed up
checkpoint/restart for larger applications:
- The disk is usually the slowest part of
checkpoint/restart. Consider using a RamDisk.
For example:
sudo mount -t ramfs -o size=200m ramfs /path/to/dmtcp_ckpt_dir
export DMTCP_CHECKPOINT_DIR=/path/to/dmtcp_ckpt_dir
rm -f /path/to/dmtcp_ckpt_dir/ckpt_*.dmtcp
dmtcp_launch YOUR_APP
dmtcp_restart /path/to/dmtcp_ckpt_dir/ckpt_*.dmtcp
(Warning: ramfs can continue to grow with the total
size of files in the dmtcp_ckpt_dir, eventually freezing
your system if you write too much. The related
tmpfs does not suffer from this, but it uses the swapfile
on disk, which can slow it down for large writes.)
- Restart will be faster with the environment
variable DMTCP_TMPDIR (on ckpt and restart) pointing
into the RamDisk created above.
(Set DMTCP_TMPDIR before the initial launch
of your application, since the environment
variable is saved with your checkpoint image.
This policy may be changed in the future.)
- By default, DMTCP uses gzip for dynamic compression of
checkpoint images. Consider using
dmtcp_launch --no-gzip .
Alternatively, set an environment variable:
export DMTCP_GZIP="0" .
On restart, DMTCP will auto-detect whether gzip was used.
- Gzip was chosen because it is available almost
universally. Some examples of newer, faster compression
packages are:
Snappy,
LZO,
FastLZ, and
QuickLZ.
Currently, you will have to modify the DMTCP source code
to use these. With enough demand, we will make it
easy for the end user to select a different
compression package.
- Two configure options for improving checkpoint and
restart speeds are offered. Please test that these
optimizations are compatible with your application.
- ./configure --enable-forked-checkpointing :
Use fork-based copy-on-write to have a child checkpoint
while the parent continues to execute
- ./configure --enable-fast-restart :
mmap-based on-demand paging from the checkpoint image
-
If you write all zeroes to the memory region that does not need
to be checkpointed, then DMTCP will convert those pages into
zero-fill-on-demand pages in the checkpoint image. This creates
a smaller checkpoint image, and results in a faster restart.
-
A DMTCP coordinator process is created on a host (default:
localhost). As new processes are created (via fork or ssh),
the LD_PRELOAD environment variable (supported by the Linux
loader) is used to preload the DMTCP library (dmtcphijack.so).
That library runs before the routine main(). It creates a
second thread (DMTCP checkpoint thread). The checkpoint
thread then creates a socket to the DMTCP coordinator
and registers itself. The checkpoint thread also creates
a signal handler (SIGUSR2 by default). Control is then
returned to the original user thread, which executes its
standard startup routines. The DMTCP coordinator can request
a checkpoint by sending a message through the socket to the
checkpoint thread. The checkpoint thread then sends the
checkpoint signal (SIGUSR2) to each of the user threads.
(Note that the checkpoint signal is for internal DMTCP
use only. It should not be used by non-DMTCP programs.)
Note that it is shown how to see
a summary of the contents of a
DMTCP checkpoint image in a previous question.
-
Some sources of information follow. If you wish to cite DMTCP,
please cite the
paper by Ansel et al.
- The best high-level overview of the design of DMTCP
is still in the paper
DMTCP: Transparent Checkpointing for Cluster Computations
and the Desktop, by Ansel et al. (2009).
- For the design
of just the single-process checkpointing layer (MTCP), see
Transparent User-Level Checkpointing for the
Native POSIX Thread Library for Linux, by Rieker
et al. (2006).
- For a recent overview of the DMTCP architecture,
see doc/architecture-of-dmtcp.pdf
(available with the source distribution).
- For low-level documentation of the implementation of DMTCP,
see the
doc subdirectory of the source distribution.
-
Yes. You will have to re-configure and re-make:
./configure --enable-logging && make -j5 clean && make -j5
After this, you will see lots of output sent to stderr on screen.
In addition, there will be files in
$DMTCP_TMPDIR/dmtcp-$USER@$HOST
(where $DMTCP_TMPDIR is $TMPDIR or /tmp).
Look at the files jassert_*.log in the given directory.
To add low-level debugging (MTCP for single processes), change
mtcp/Makefile to uncomment:
CFLAGS += -O0 -g -DDEBUG -DTIMING -Wall
NOTE: If you also want to debug under GDB, see
debugging under GDB.
-
Normally this should not happen. However, a complex new user
plugin might uncover a bug in DMTCP itself. If so, the first
thing to look at is a bug in the interaction with an internal
DMTCP plugin. Just as browser plugins have a "safe mode",
DMTCP can be loaded without some of its internal plugins.
If one is only testing the initial launch (no checkpoint or
restart, one can safely disable all plugins for "save mode".
However, if testing checkpoint or restart, then some (but not
all) of the plugins will generally be required for correct
operation.
All plugins can be disabled by launching an application with
dmtcp_launch --disable-all-plugins .
If one wishes to disable only some of the plugins,
then one must modify the source code. The file
src/dmtcp_launch.cpp has global variables of the
form enableIPCPlugin=true, etc. Try setting some
of the internal plugins to false, re-compiling, and testing
if the bug goes away. If an interaction with an internal
plugin is uncovered, try commenting out some of the wrapper
functions in that plugin. Note that it is generally safe to
disable the internal plugins when testing only DMTCP launch
and resume after writing a checkpoint. However, in testing
DMTCP restart (from a checkpoint image file), some of of the
DMTCP internal plugins may be required for correct operation.
-
The run-time overhead of DMTCP is usually negligible. When there is
no checkpoint or restart in process, DMTCP code will run only within
DMTCP wrappers around certain less frequently used system calls.
Examples of such wrappers are wrappers for open(), getpid(), socketpair(),
etc. We explicitly do not place a wrapper around read()
or write(), since those are frequently called system calls that could
produce measurable run-time overhead.
-
Among the Linux features supported by DMTCP are open file
descriptors, pipes, sockets, signal handlers, process id and
thread id virtualization (ensure old pids and tids continue
to work upon restart), fifos, process group ids, session ids,
ptys, terminal attributes, System V shared memory (shmget,
etc.), timers, epoll, eventfd, signalfd, and
mmap/mprotect (including mmap-based shared memory).
Such common O/S daemons as NSCD and LDAP are also
transparently supported.
-
Please see the
"Contact Us" links for writing to us. That link includes
pointers to the public DMTCP forum, and a private e-mail for
private comments. Other possibilities are to add a bug report to
the bug tracker, or to write to an individual
administrator.
We are also always interested in finding others interested
in helping develop DMTCP as an open source project.
-
The origins of DMTCP lie in
a project begun in Fall, 2004, and reported on at the CCGrid-06
conference (Transparent Adaptive Library-Based Checkpointing for
Master-Worker Style Parallelism). Since then, The work has added
more ambitious goals. The list of active developers continues to
change over time. As of this writing, the sourceforge site
lists ten developers/administrators. We are always happy
to include new developers.
-
Please see the Publications page
for the standard citation at the top of that page.
-
Yes. See this page for a
general discussion. In addition, note the availability
of DMTCP plugins (add-ons)
that allow end users to easily extend the features of DMTCP.
-
Yes. For programs that spawn processes on remote
hosts, DMTCP currently assumes that this is done
internally using ssh to a "known host", and so no
password will be required. This is the case, for
example, for most TCP/IP-based MPI programs. To restart on
different hosts, edit the 'ssh' lines
in dmtcp_restart_script.sh .
See below
for further discussion on MPI. See QUICK-START.md
for additional details.
-
Yes. Starting with release 1.2.4, DMTCP supports Java.
If you are using Oracle Java (formerly Sun Java), you may
wish to use the flag java -Xmx512M to limit memory size
for a faster checkpoint. This is not needed for OpenJDK Java.
-
Yes. Please see below.
-
Yes. It passes our internal tests, but we would appreciate
having a more intensive user of R provide feedback to us.
-
Yes. See the section below on MPI.
-
Yes. DMTCP had partial support for OpenMP in the past.
Starting with release 1.2.4, DMTCP fully supports OpenMP and Cilk.
-
As of DMTCP version 2.4, DMTCP supports SLURM and Torque.
See below
for additional details.
-
Yes. See the web page on
Condor integration.
-
Yes. Starting with release 1.2.0, DMTCP supports GNU screen.
-
This is still considered experimental, but it should be reasonably
stable in DMTCP version 1.2.4 and the upcoming
DMTCP version 1.3.0.
Try: configure --enable-ptrace-support to use
this feature.
-
Partially. DMTCP supports X-Windows programs
with the help of VNC. Recall that vncserver will create
a virtual desktop in which you can run your graphics application,
and vncviewer invokes an Xserver to display the graphics.
So, the recipe is:
dmtcp_launch vncserver :1
vncviewer localhost:1
my-X-windows-app
To checkpoint, kill the vncviewer and then do:
dmtcp_command --checkpoint
vncviewer localhost:1 # to re-connect the graphics
Later, to restart from a checkpoint, do:
./dmtcp_restart_script.sh
vncviewer localhost:1 # to re-connect the graphics
Note that VNC does not support such X-Windows extensions
as 3D graphics (3D visualization such as in many scientific
graphics simulations) nor video nor sound. We are continuing
to look at improving graphics support.
-
Yes. But Matlab is a moving target, as it continues to use
newer features of Linux. It is best
to use a recent DMTCP.
DMTCP-2.4 added support for matlab-2013 and later.
DMTCP-2.3.x supported Matlab through matlab-2012.
Much earlier, DMTCP-1.2.5 had added
an enhancement for the case when Matlab is in the middle
of talking to its license server at checkpoint time.
-
Yes. And to restart on different hosts, edit the 'ssh' lines
in dmtcp_restart_script.sh .
DMTCP operates by checkpointing the sockets (or InfiniBand
connections, if using --infiniband) created by the
MPI library. Hence, it is transparent to MPI and doesn't require
any particular MPI configuration or hooks. In principle, DMTCP
should run on any MPI over TCP/IP. We usually test on
Open MPI,
MVAPICH-2 and
MPICH-2. If you
find an MPI that we don't support, this is a bug in DMTCP.
We would be appreciative if you can file a bug report.
For further details on using MPI, see
QUICK-START.md.
-
DMTCP supports the OFED implementation of InfiniBand
(librdmacm and libibverbs API).
These libraries support InfiniBand and iWARP. At this time,
we have not seen sufficient demand to add support for other
APIs like uDAPL, Mellanox verbs, or Myrinet API.
-
As of DMTCP version 2.4, DMTCP supports SLURM and Torque.
The integration of SLURM with MPI has been
especially intensively tested.
Support for both is implemented
as a plugin, and can be found in the
plugin/batch-queue directory, which includes example
submission scripts for SLURM/DMTCP and Torque/DMTCP. This is
provided by Artem Polyakov, who can be reached as artpol84
through gmail. Ports to other batch queues can be added
relatively easily using this model.
Click
|