Citing DMTCP (please cite this publication):
-
DMTCP Publications (enhancements to the DMTCP distribution)
-
Publications and Conference Presentations
using DMTCP in their work
A. DMTCP Publications (an annotated history of enhancements to the
DMTCP distribution; reverse chronological order):
Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach,
(and earlier technical report of same name at
https://arxiv.org/abs/2408.02218)
Yao Xu and Gene Cooperman,
Proc. of 2024 IEEE International Conference on Cluster Computing
(Cluster'24), Sept., 2024, IEEE Computer Society,
Bibtex.
(The MANA paper introduced "split processes"
to checkpoint transparently, independently of the underlying
network software. But that algorithm interposed an MPI barrier
in front of each collective communication, in order to ensure
a safe synchronization point (and hence a consistent snapshot
for checkpointing). This incurred high runtime overhead for
collective communication. The new sequence-number algorithm
requires zero additional inter-process communication, and so has almost
no runtime overhead.)
CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM,
(and earlier technical report of same name at
https://arxiv.org/abs/2008.10596)
Twinkle Jain and Gene Cooperman,
Proc. of the Int. Conf.
for High Performance Computing, Networking, Storage and Analysis
(SC'20), pp. 1083--1097, Nov., 2020, IEEE Computer Society,
Bibtex.
(Extending "split-process" approach to
efficient checkpointing for CUDA: This typically involves less than
1% runtime overhead, unlike the earlier approach of CRUM,
which required a separate proxy process.)
MANA for MPI: MPI-Agnostic Network-Agnostic Transparent
Checkpointing,
Rohan Garg, Gregory Price, and Gene Cooperman,
Proc. of 28th Int. Symp. on High Performance Parallel and Distributed Computing (HPDC'19),
Phoenix, AZ, USA, ACM, pp. 49--60, June, 2019,
(and tech. report at
https://arxiv.org/abs/1904.12595),
Slides (pdf),
Slides (odp),
Slides (pptx).
Bibtex.
(Novel "split-process" approach to
isolating and checkpointing just the MPI application at the
original cluster:
On restart (on the same cluster or on a different one),
the underlying network, the ranks-per-node, and even the MPI
library (e.g., MPICH or Open MPI) can each be changed.)
CRUM: Checkpoint-Restart Support for CUDA's Unified Memory,
Rohan Garg, Apoorve Mohan, Michael Sullivan, and Gene Cooperman,
Proc. of IEEE Int. Conf. on Cluster Computing (Cluster'18),
Belfast, United Kingdom, IEEE, pp. 302--313, Sept., 2018,
(and tech. report at
https://arxiv.org/abs/1808.00117),
Bibtex.
(Transparent checkpointing of CUDA, including UVM (Unified
Virtual Memory), using a proxy approach)
System-level Scalable Checkpoint-Restart for Petascale Computing.
Jiajun Cao, Kapil Arya, Rohan Garg, Shawn Matott, Dhabaleswar K. Panda,
Hari Subramoni,
Jérôme Vienne and Gene Cooperman,
Proc. of 22nd IEEE Int. Conf. on Parallel and Distributed Systems
(ICPADS'16),
Wuhan, China,
IEEE Press, pp. 932--941, Dec., 2016,
Bibtex.
(Scalable transparent checkpointing: MPI-based
HPCG using 32,752 CPU cores,
and MPI-based NAMD using 16,368 CPU cores)
Design and Implementation for Checkpointing
of Distributed Resources using Process-level Virtualization.
Kapil Arya, Rohan Garg, Artem Y. Polyakov and Gene Cooperman,
Proc. of IEEE Int. Conf. on Cluster Computing (Cluster'16),
pp. 402--412,
Taipei, Taiwan,
IEEE Press, Sept., 2016.
Bibtex.
(Extensible, adaptable, transparent checkpointing
via a plugin implementation of process virtualization)
Transparent Checkpoint-Restart over InfiniBand.
Jiajun Cao, Gregory Kerr, Kapil Arya and Gene Cooperman,
Proc. of ACM Symposium on High Performance Parallel and
Distributed Computing (HPDC'14),
12 pages, Vancouver, Canada, June, 2014.
Bibtex.
(Transparent checkpointing of MPI over
InfiniBand; no need to tear down network, and reconstruct
during restart)
Checkpoint-Restart for a Network of Virtual Machines.
Rohan Garg, Komal Sodha, Zhengping Jin and Gene Cooperman,
Proc. of 2013 IEEE Computer Society International Conference on
Cluster Computing.
8 pages, Indianapolis, USA. Sept., 2013.
Slides.
Bibtex.
(Transparent checkpointing of network of Qemu's
over KVM)
DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop.
Jason Ansel, Kapil Arya and Gene Cooperman.
Proc. of 23rd IEEE International Parallel and
Distributed Processing Symposium (IPDPS'09).
12 pages, Rome, Italy. May, 2009.
Slides.
Bibtex.
(Transparent checkpointing of
distributed processes)
Transparent Adaptive Library-Based Checkpointing for
Master-Worker Style Parallelism,
Gene Cooperman, Jason Ansel and Xiaoqin Ma,
Proc. of 6th IEEE Int. Symp. on
Cluster Computing and the Grid (CCGrid06),
pp. 283--291,
IEEE Press, 2006,
Bibtex.
(Master-worker style checkpointing experiment
in distributed checkpointing, prior to DMTCP itself)
Transparent User-Level Checkpointing for the Native POSIX Thread Library for Linux.
Michael Rieker, Jason Ansel, and Gene Cooperman.
Proc. of 2006 International Conference on Parallel and
Distributed Processing Techniques and Applications (PDPTA'06).
pp. 492--498, Las Vegas, NV. Jun., CSREA Press, 2006.
Bibtex.
(The original MTCP: Transparent checkpointing
of a single multi-threaded process)
B. Publications and Conference Presentations using DMTCP in their work:
Live Migration of Multi-Container Kubernetes Pods in Multi-Cluster Serverless Edge Systems,
Poggiani, Leonardo and Puliafito, Carlo and Virdis, Antonio and Mingozzi, Enzo,
Proceedings of the 1st Workshop on Serverless at the Edge (SEATED'24)
pp. 9--16, Sept., 2024,
Bibtex
AdapCK: Optimizing I/O for Checkpointing on Large-Scale High Performance Computing Systems,
Jia, Jie and Liu, Yi and Liu, Yanke and Chen, Yifan and Lin, Fang,
European Conference on Parallel Processing (Euro-Par'24)
Lecture Notes in Computer Science 14803,
pp. 342--355, Aug., 2024, Springer Nature,
Bibtex
Optimizing Checkpoint-Restart Mechanisms for HPC with DMTCP in Containers at NERSC,
Timalsina, Madan and Gerhardt, Lisa and Tyler, Nicholas and Blaschke, Johannes P and Arndt, William,
arXiv preprint arXiv:2407.19117,
9 pages, Jul., 2024,
Bibtex
Limitless FaaS: Overcoming serverless Functions Execution
Time Limits with Invoke Driven Architecture and Memory
Checkpoints,
Rodrigo Landa Andraca and Mahdi Zaree,
10th International Conference on Web Research (ICWR'24),
pp. 210--215, May, 2024,
Bibtex
Application-Transparent Strategies to Optimize Limited Resources in HPC and Big Data,
Twinkle Jain, PhD thesis,
Northeastern University, May, 2024,
Bibtex
DCU-CHK: Checkpointing for Large-scale CPU-DCU Heterogeneous Computing Systems,
Jia, Jie and Lin, Xinyuan and Lin, Fang and Liu, Yi,
CCF Transactions on High Performance Computing
pp. 171--187, Jan., 2024, Springer,
Bibtex
Digital Twin Migration using the OKD Platform: A Use-Case for Emergency Vehicles,
Ribeiro, Bruno and Gonçalves, Pedro and Bartolomeu, Paulo C,
33rd International Telecommunication Networks and Applications Conference (ATNAC'24),
pp. 63--69, Dec., 2023, IEEE,
Bibtex.
HPC@Cloud: a Provider-Agnostic Toolkit to Enable the Execution
of HPC Applications on Public Clouds,
Pereira Filho, Vanderlei Munhoz,
Masters Thesis, Universidade Federal de Santa Catarina, Nov. 4, 2023,
Bibtex.
Implementation-Oblivious Transparent Checkpoint-Restart for MPI,
Xu, Yao and Belyaev, Leonid and Jain, Twinkle and Schafer,
Derek and Skjellum, Anthony and Cooperman, Gene,
Proceedings of the SC'23 Workshops of The International
Conference on High Performance Computing, Network, Storage,
and Analysis (SC-W'23; SuperCheck'23),
pp. 1738--1747, Nov., 2023,
Bibtex
Migration Transparente de Conteneurs,
Hamdi Abdelaziz, Daniele Miorandi, and Guillaume Pierre,
Masters Thesis,
École nationale Supérieure de l'Informatique; ex. INI (Institut National de formation en Informatique, Alger, Algérie),
Sept., 2023,
Bibtex
Gestión del Almacenamiento para Tolerancia a Fallos
en Computación de Altas Prestaciones,
Betzabeth León, PhD thesis,
Universitat Autónoma de Barcelona, Mar., 2023,
Bibtex
Debugging MPI Implementations via Reduction-to-Primitives,
Cooperman, Gene and Li, Dahong and Zhao, Zhengji,
IEEE/ACM Third International Symposium on Checkpointing for Supercomputing (SuperCheck'22),
pp. 10--18, Nov., 2022,
Bibtex
Good Shepherds Care For Their Cattle: Seamless Pod Migration
in Geo-Distributed Kubernetes,
Paulo Souza Junior, Daniele Miorandi, and Guillaume Pierre,
6th IEEE Int. Conf. on Fog and Edge Computing (ICFEC'22),
9 pages, May, 2022,
Bibtex
A Model of Checkpoint Behavior for Applications that have I/O,
León, Betzabeth and Méndez, Sandra and Franco,
Daniel and Rexachs, Dolores and Luque, Emilio,
The Journal of Supercomputing 78,
pp. 15404--15436, Apr., 2022, Springer,
Bibtex
Provisioning Strategies for Centralized Bare-Metal Clusters,
Apoorve Mohan, PhD thesis,
Northeastern University, Dec., 2021,
Bibtex
MANA-2.0: A Future-Proof Design for Transparent Checkpointing of MPI
at Scale,
Yao Xu, Zhengji Zhao, Rohan Garg, Harsh Khetawat, Rebecca Hartman-Baker,
Gene Cooperman,
Int. Symp. on Checkpointing for Supercomputing (SuperCheck'SC-21),
(SC Workshops Supplementary Proceedings (SCWS))
pp. 68--78, Nov., 2021,
Bibtex
Assessing the Use Cases of Persistent Memory in High-Performance
Scientific Computing,
Yehonatan Fridman, Yaniv Snir, Matan Rusanovsky, Kfir Zvi,
Harel Levin, Danny Hendler, Hagit Attiya, and Gal Oren,
2021 IEEE/ACM 11th Workshop on Fault Tolerance for HPC
at eXtreme Scale (FTXS); workshop at SC21,
pp. 11--20, IEEE, Oct., 2021
Bibtex
Service Migration in a Distributed Virtualization System,
Pablo Andrés Pessolani, Luis Santiago Re,
Tomás Andrés Fleitas,
Journal of Computer Science & Technology 21(2),
pp. 171--187, Oct., 2021, Springer,
Bibtex
Performance and Energy Task Migration Model for Heterogeneous Clusters,
Esteban Stafford and José Luis Bosque,
The Journal of Supercomputing 77(9),
pp. 10053--10064, 2021, Springer,
Bibtex
MigrOS: Transparent Live-Migration Support for
Containerised RDMA Applications,
Maksym Planeta, Jan Bierbaum, Leo Sahaya Daphne Antony, Torsten Hoefler,
and Hermann Härtig,
2021 USENIX Annual Technical Conference (USENIX ATC 21),
pp. 47--63, July, 2021,
(also as an
arXiv technical report
from Oct., 2020)
Bibtex
Verifiable Application-level Checkpoint and Restart Framework
for Parallel Computing,
I. Gankevich and I. Petriakov and A. Gavrikov and D. Tereshchenko
and G. Mozhaiskii,
Proc. of 9th Int. Conf. on Distributed Computing and Grid Technologies
in Science and Education (GRID'2021),
pp. 47--63, July, 2021,
Bibtex
An Innovative Approach for Cloud-Based Web Dev App Migration,
C Sunil, KS Raghunandan, KN Ranjit, HK Chethan, and G Hemantha Kumar,
ICT with Intelligent Applications,
pp. 807--817, December, 2021, Springer
Bibtex
Determination of Suitable Resource Discovery Tool and Methodology
for High-Volume Internet of Things (IoT),
Mohd Tamizan Abu Bakar and Azrul Amri Jamal,
Journal of Physics: Conference Series:
1st Int. Recent Trends in Engineering, Advanced Computing and
Technology Conference (RETREAT),
Volume 1874, Number 1, pp. 012046 (11 pages),
June, 2021, IOP Publishing,
Bibtex
ROS Rescue: Fault Tolerance System for Robot Operating System,
Pushyami Kaveti and Hanumant Singh,
Robot Operating System (ROS),
pp. 381--397, 2021, Springer,
(and an arXiv technical report
from Oct., 2019)
Bibtex
Analysis of Parallel Application Checkpoint Storage
for System Configuration,
Betzabeth León, Daniel Franco, Dolores Rexachs, and Emilio Luque,
The Journal of Supercomputing 77(5),
pp. 4582--4617, May, 2021, Springer,
Bibtex
Transparent Checkpointing for OpenGL Applications on GPUs,
David Hou and Jun Gan and Yue Li and Younes El Idrissi Yazami and
Twinkle Jain,
First Int. Symp. on Checkpointing
for Supercomputing (SuperCheck21),
3 pages, Feb., 2021,
(with conf. program,
slides, and
video)
Bibtex
Checkpointing SPAdes for Metagenome Assembly:
Transparency versus Performance in Production,
Twinkle Jain and Jie Wang,
First Int. Symp. on Checkpointing
for Supercomputing (SuperCheck21),
4 pages, Feb., 2021,
(with conf. program,
slides, and
video)
Bibtex
Optimized Memoryless Fair-Share HPC Resources Scheduling
Using Transparent Checkpoint-Restart Preemption,
Kfir Zvi and Gal Oren,
First Int. Symp. on Checkpointing
for Supercomputing (SuperCheck21),
4 pages, Feb., 2021,
(with conf. program,
slides, and
video)
Bibtex
Improving Scalability and Reliability of MPI-agnostic Transparent
Checkpointing for Production Workloads at NERSC,
Prashant Singh Chouhan, Harsh Khetawat, Neil Resnik, Jain Twinkle,
Rohan Garg, Gene Cooperman, Rebecca Hartman-Baker, Zhengji Zhao,
First Int. Symp. on Checkpointing
for Supercomputing (SuperCheck21),
4 pages, Feb., 2021,
(with conf. program,
slides, and
video)
Bibtex
So Why Can't I Checkpoint That? (keynote talk),
Gene Cooperman,
First Int. Symp. on Checkpointing
for Supercomputing (SuperCheck21),
Feb., 2021,
(with conf. program,
slides, and
video)
Bibtex
Fault-Tolerant Computing with Heterogeneous Hardening Modes,
Florian Kriebel, Faiq Khalid, Bharath Srinivas
Prabakaran, Semeen Rehman, and Muhammad Shafique,
Dependable Embedded Systems
(Embedded Systems series, open access),
eds. Jörg Henkel and Nikil Dutt,
pp. 161--180, Jan., 2021, Springer
Bibtex.
Direct Heap Snapshotting in the Java HotSpot VM: a Prototype,
Ludvig Janiuk,
M.S. thesis,
KTH Royal Institute of Technology,
Dec., 2020,
Bibtex.
Soft Errors Detection and Automatic Recovery Based on Replication Combined with Different Levels of Checkpointing,
Diego Montezanti, Enzo Rucci, Armando De Giusti,
Marcelo Naiouf, Dolores Rexachs, and Emilio Luque,
Future Generation Computer Systems 113,
pp. 240--254, Dec., 2020, Elsevier,
Bibtex.
Déploiement Efficace d'Applications Cloud
dans les Infrastructures Fog Distribuées,
Arif Ahmed, PhD thesis,
U. Rennes 1, Dec., 2020,
Bibtex
Deploying Checkpoint/Restart for Production Workloads at NERSC,
Zhengji Zhao, Rebecca Hartman-Baker, Gene Cooperman,
Proc. of the Int. Conf. for High
Performance Computing, Networking, Storage and Analysis
(State of the Practice)
(SC'20),
3 pages, Nov., 2020, IEEE Computer Society,
Bibtex
Containers Runtimes War: A Comparative Study,
Ramzi Debab and Walid Khaled Hidouci,
Proc. of Future Technologies Conference (FTC'20),
Volume 2,
pp. 135--161, Nov., 2020,
(Advances in Intelligent Systems and Computing
book series, volume 1289, Springer)
Bibtex.
Profiles of Upcoming HPC Applications and Their Impact
on Reservation Strategies,
Ana Gainaru, Brice Goglin, Valentin Honoré,
and Guillaume Pallez,
IEEE Trans. on Parallel and
Distributed Systems 32(5),
pp. 1178--1190, Nov., 2020, IEEE Press,
Bibtex.
Improving Utilization of Heterogeneous Clusters,
Stafford, Esteban and Bosque, José Luis,
The Journal of Supercomputing 76,
pp. 8787--8800, Jan., 2020, Springer
Bibtex.
Privaros: A Framework for Privacy-Compliant Delivery Drones,
Rakesh Rajan Beck, Abhishek Vijeev, and Vinod Ganapathy,
Proc. of ACM SIGSAC Conference on Computer and Communications
Security (CCS'20),
pp. 181--194, Oct., 2020, ACM Press
(and arXiv preprint
arXiv:2002.06512v3)
Bibtex.
System-Level vs. Application-Level Checkpointing,
Jonas Posner,
Proc. of IEEE Int. Conf. on Cluster Computing (CLUSTER'20),
pp. 404--405, Sept., 2020, IEEE Computer Society,
Bibtex.
Standard Compliant Snapshotting for SystemC Virtual Platforms,
Bastian Farkas, PhD thesis,
Technische Universität Braunschweig,
Sept., 2020,
Bibtex.
Implementation of Resilience as a Service for Parallel Computing,
Revina Awalia Putri, Idris Winamo, Wiratmoko Yuwono, and
Agus Priyo Utomo,
Proc. of International Electronics Symposium (IES'20),
pp. 626--630, Sept., 2020,
Bibtex.
Docker Container Deployment in Distributed Infrastructures with
Checkpoint/Restart Fog,
Arif Ahmed, Apoorve Mohan, Gene Cooperman, and Guillaume Pierre,
Proc. of 8th IEEE Int. Conf. on Mobile Cloud Computing,
Services, and Engineering (MobileCloud'20),
pp. 55--62, Aug., 2020, IEEE Press
(and
HAL Technical Report)
Bibtex.
Analysis of Checkpoint I/O Behavior,
Betzabeth León, Pilar Gomez-Sanchez, Daniel Franco,
Dolores Rexachs, and Emilio Luque,
International Conference on Computational Science (ICCS'20),
Lecture Notes in Computer Science 12137,
pp. 191--205, Jun., 2020, Springer
(and
ICCS Camera Ready Version)
Bibtex.
Determinación de la Eficiencia en el Procesamiento sobre
Arquitecturas Multiprocesador y Estrategias de Tolerancia
a Fallos en HPC,
Jorge Rafael Osio, Diego Miguel Montezanti, Marcelo Angel Cappelletti,
Eduardo Kunysz, and Martín Morales,
XXII Workshop de Investigadores en Ciencias
de la Computación (WICC'20),
El Calafate, Santa Cruz (Argentina), May, 2020,
Bibtex.
Resumen de Tesis:
SEDAR: Detección y Recuperación Automática
de Fallos Transitorios en Sistemas de Cómputo
de Altas Prestaciones,
(Thesis summary:
SEDAR: Detection and Automatic Recovery from Transient Faults
in High Performance Computing Systems)
Diego Montezanti,
XXII Workshop de Investigadores en Ciencias
de la Computación (WICC'20),
El Calafate, Santa Cruz (Argentina), May, 2020
Bibtex
Middleware to Manage Fault Tolerance Using
Semi-Coordinated Checkpoints,
Alvaro Wong, Elisa Heymann, Dolores Rexachs and Emilio Luque,
IEEE Trans. on Parallel and
Distributed Systems 32(2),
pp. 254--268, May, 2020, IEEE Press,
Bibtex.
Elastic Execution of Checkpointed MPI Applications,
Sumeet Gajjar and Saurabh Vaidya,
6 pages, May, 2020, arXiv preprint arXiv:2005.07543,
Bibtex.
Systèmes Résilients pour l'Automobile : d'une Approche
à Composants à une Approche à Objets
de la Tolérance aux Fautes Adaptative sur ROS,
Matthieu Amy, PhD thesis,
Institut National Polytechnique de Toulouse, May, 2020
Bibtex
SEDAR: Detección y Recuperación Automática
de Fallos Transitorios en Sistemas de Cómputo
de Altas Prestaciones,
(SEDAR: Detection and Automatic Recovery from Transient Faults
in High Performance Computing Systems),
Diego Miguel Montezanti, PhD thesis,
Universidad Nacional de la Plata, Mar., 2020
Bibtex
GIFT: A Coupon Based Throttle-and-Reward Mechanism for Fair
and Efficient I/O Bandwidth Management
on Parallel Storage Systems,
Tirthak Patel, Rohan Garg, and Devesh Tiwari,
18th USENIX Conf. on File and Storage Technologies (FAST'20),
Feb., 2020,
Bibtex.
IaaS Cloud as a Virtual Environment for Experimentation
in Checkpoint Analysis,
Betzabeth León, Pilar Gomez-Sanchez, Daniel Franco,
Dolores Rexachs, and Emilio Luque,
Journal of Computer Science and
Technology (JCS&T) 19(2),
pp. 110--122, Oct., 2019,
Bibtex.
SEDAR: Detectando y Recuperando Fallos Transitorios
en Aplicaciones de HPC,
Diego Miguel Montezanti, Enzo Rucci, Dolores Rexachs del
Rosario, Emilio Luque Fadón, Marcelo Naiouf, and
Armando Eduardo De Giusti,
XXV Congreso Argentino de Ciencias
de la Computación (CACIC'19),
(Universidad Nacional de Río Cuarto Córdoba,
Oct., 2019)
Bibtex
Process
Migration-based Computational Offloading Framework
for IoT-supported Mobile Edge/Cloud Computing,
Abdullah Yousafzai, Ibrar Yaqoob, Muhammad Imran, Abdullah Gani,
and Rafidah Md Noor,
Internet of Things Journal 7(5),
pp. 4171--4182, Sept., 2019, IEEE Press,
Bibtex
Checkpointing the Un-checkpointable: MANA and the
Split-Process Approach,
Gene Cooperman,
MVAPICH User Group (MUG'19),
Columbus, Ohio, Aug. 20, 2019;
MUG'19 program,
slides;
video
(from https://insidehpc.com/2019/09/checkpointing-the-un-checkpointable-mana-and-the-split-process-approach/);
Bibtex.
Architectural-Space Exploration of Heterogeneous Reliability
and Checkpointing Modes for Out-of-Order Superscalar Processors,
Bharath Srinivas Prabakaran, Mihika Dave, Florian Kriebel,
Semeen Rehman, and Muhammad Shafique,
IEEE Access 7,
pp. 145324--145339, Jul., 2019, IEEE Press,
(Also as
arXiv tech. report)
Bibtex.
Checkpoint/Restart Approaches for a Thread-based
MPI Runtime,
Julien Adam, Maxime Kermarquer, Jean-Baptiste Besnard,
Leonardo Bautista-Gomez, Marc Pérache,
Patrick Carribault, Julien Jaeger, Allen D Malony, and
Sameer Shende,
Parallel Computing 85,
pp. 204--219, Jul., 2019, Elsevier
Bibtex.
Resilience in high-level parallel programming languages,
Sara S. Hamouda, PhD thesis,
The Australian National University, Jun., 2019,
Bibtex
Extending the Domain of Transparent Checkpoint-Restart
for Large-scale HPC,
Rohan Garg, PhD thesis,
Northeastern U., May, 2019,
Bibtex
Exploring Semantic Reverse Engineering for Software Binary
Protection,
Pengfei Sun, PhD thesis,
Rutgers, The State University of New Jersey, May, 2019
Bibtex
Active Replication
for Centrally Coordinated Teams of Autonomous Vehicles,
Nasos Grigoropoulos, Manos Koutsoubelias, and Spyros Lalis,
15th Int. Conf. on Distributed Computing
in Sensor Systems (DCOSS'19),
pp. 114--122, May, 2019, IEEE Press,
Bibtex
Prediction of Energy Consumption by Checkpoint/Restart
in HPC,
Marina Morán, Javier Balladini, Dolores Rexachs, and
Emilio Luque,
IEEE Access 7,
pp. 71791--71803, May, 2019, IEEE Press,
Bibtex.
Determinación de la eficiencia y estrategias de tolerancia
a fallos en arquitecturas multiprocesador para aplicaciones
de procesamiento de datos,
Jorge Rafael Osio, Diego Miguel Montezanti, Eduardo Kunysz, and
Daniel Martin Morales,
XXI Workshop de Investigadores en Ciencias
de la Computación (WICC'19), Apr., 2019,
Universidad Nacional de San Juan (Argentina),
Bibtex
Job Migration in HPC Clusters by Means
of Checkpoint/Restart,
Manuel Rodríguez-Pascual, Jiajun Cao,
José A. Moríñigo,
Gene Cooperman, Rafael Mayo-García,
The Journal of Supercomputing 75(10),
pp. 6517--6541, online Apr., 2019, Springer,
Bibtex
Resource Manager for Scalable Performance in ROS
Distributed Environments,
Daisuke Fukutomi, Takuya Azumi, Shinpei Kato, and Nobuhiko Nishio,
Design, Automation & Test in Europe
Conference & Exhibition (DATE'19),
pp. 1088--1093,
Mar., 2019, IEEE Press,
Bibtex
Checkpointing and Migration of IoT Edge Functions,
Pekka Karhula, Jan Janak, and Henning Schulzrinne,
Proc. of 2nd Int. Workshop on Edge Systems, Analytics and Networking (EdgeSys'19; co-located with EuroSys'19), pp. 60--65,
Mar., 2019, ACM Press,
Bibtex
Uma Taxonomia de Sistemas de Tolerância a Falhas
em Ambientes de Computação em Nuvem Open Source,
Vinicius Santos Andrade, M.S. thesis,
Universidade Estadual Paulista "Júlio de Mesquita Filho",
Jan., 2019
Bibtex
Elastic Scheduling in HPC Resource Management Systems,
Feng Liu, PhD thesis,
University of Minnesota, Dec., 2018,
Bibtex
H-RADIC: A Fault Tolerance Framework for Virtual Clusters
on Multi-Cloud Environments,
Ambrosio Royo, Jorge Villamayor, Marcela Castro-León,
Dolores Rexachs, and Emilio Luque,
Journal of Computer Science and
Technology (JCS&T) 18(3),
pp. e24--e24, Dec., 2018, Springer,
Bibtex.
Autonomic Approach based on Semantics and Checkpointing
for IoT System Management,
François Aïssaoui, PhD thesis,
Toulouse 1, Nov., 2018,
Bibtex
Checkpoint and Restart: An Energy Consumption Characterization,
Marina Morán, Javier Balladini, Dolores Rexachs, and
Emilio Luque,
Argentine Congress of Computer Science (CACIC'18),
pp. 19--33, Oct., 2018, Springer,
(Also appearing in Spanish as:
Factores que afectan el consumo energético de operaciones
de checkpoint y restart en clusters (XIX Workshop Procesamiento
Distribuido y Paralelo (WPDP) of CACIC'18),
with pdf),
Bibtex
Programming and Testing Support for Drone Based Applications,
Manos Koutsoubelias, PhD thesis,
University of Thessaly (Greece),
Sept., 2018,
Bibtex.
Resource Management for Extreme Scale High Performance Computing
Systems in the Presence of Failures,
Daniel Dauwe, PhD thesis,
Colorado State University, Sept., 2018,
Bibtex
Transparent High-Speed Network Checkpoint/Restart in MPI,
Julien Adam, Jean-Baptiste Besnard, Allen D Malony, Sameer Shende,
Marc Pérache, Patrick Carribault, Julien Jaeger,
Proc. of 25th European MPI User Group Meeting,
ACM,
12 pages,
Sept., 2018,
Bibtex.
Fault Tolerance in Cloud Computing Environment:
A Systematic Survey,
Moin Hasan, and Major Singh Goraya,
Computers in Industry 99,
pp. 156--172, Aug., 2018, Elsevier,
Bibtex
Characterization of I/O Patterns generated by Fault Tolerance in HPC environments,
Betzabeth León, Daniel Franco, Dolores Rexachs, and
Emilio Luque,
Proc. of Int. Conf. on Parallel and Distributed Processing
Techniques and Applications (PDPTA'18),
pp. 28--34, Jul., 2018,
Bibtex.
Leveraging the Checkpoint-Restart Technique for Optimizing
CPU Efficiency of ATLAS Production Applications
on Opportunistic Platforms,
D. Cameron, J. Elmsheuser, L. Heinrich,
W. Lavrijsen, P. Nilsson, V. Tsulaia, M. Vogel
on behalf of the ATLAS Collaboration,
5 pages, 2018, IOPScience 1085 (2018)032028,
Bibtex.
(See preprint version, below, also
by Cameron et al.)
Checkpointing a Subsystem Remotely,
Gene Cooperman,
MVAPICH User Group (MUG'18),
Columbus, Ohio, Aug. 7, 2018;
MUG'18 program,
slides;
Bibtex.
Automatic Characterization of HPC Job Parallel Filesystem
I/O Patterns,
Joseph P White, Alexander D Kofke, Robert L DeLeon,
Martins Innus, Matthew D Jones, and Thomas R Furlani,
Proc. of the Practice and Experience on Advanced Research Computing
(PEARC'18),
pp. 1--8,
July, 2018,
Bibtex.
Shiraz: Exploiting System Reliability and Application Resilience
Characteristics to Improve Large Scale System Throughput,
Rohan Garg, Tirthak Patel, Gene Cooperman, and Devesh Tiwari,
48th Annual IEEE/IFIP Int. Conf. on Dependable Systems
and Networks (DSN'18),
IEEE,
pp. 83--94,
July, 2018,
Bibtex.
Fault-Tolerance Support for Mobile Robotic Applications,
Manos Koutsoubelias and Spyros Lalis,
13th Int. Symp. on Industrial Embedded Systems (SIES'18),
IEEE,
pp. 1--10,
June, 2018,
Bibtex.
RaaS: Resilience as a Service,
Jorge Villamayor, Dolores Rexachs, Emilio Luque, Diego Lugones,
Proc. 18th IEEE/ACM Int. Symp. on Cluster, Cloud
and Grid Computing (CCGRID'18),
IEEE,
pp. 356--359,
May, 2018,
Bibtex.
CDBB: an NVRAM-based Burst Buffer Coordination System
for Parallel File Systems,
Ziqi Fan, Fenggang Wu, Jim Diehl, David HC Du, and Doug Voigt,
Proc. of the High Performance Computing Symposium
(HPC'18),
pp. 1:1--1:12, Apr., 2018, ACM Press,
Society for Computer Simulation International,
Bibtex.
Transparently Checkpointing Software Test Benches to Improve Productivity of SoC Verification in an Emulation Environment,
Ankit Garg, Suresh Krishnamurthy, Gene Cooperman, Rohan Garg,
and Jeff Evans,
2018 Design and Verification Conference and Exhibition
(DVCON-US 2018),
San Jose, CA, Feb. 27, 2018;
DVCON-US 2018,
slides,
Bibtex.
(IMPORTANT: Browser must allow popups, to view paper
at DVCon site.)
Transparent checkpointing over RDMA-based networks,
Jiajun Cao, PhD thesis,
Northeastern U., Dec., 2017,
Bibtex
ITALC: Interactive Tool for Application-Level Checkpointing,
Ritu Arora and Trung Nguyen Ba,
Proc. of Fourth International Workshop
on HPC User Support Tools (HUST'17),
Nov., 2017;
(slides),
Bibtex.
E-HPC: a Library for Elastic Resource Management
in HPC Environments,
William Fox, Devarshi Ghoshal, Abel Souza, Gonzalo P Rodrigo,
and Lavanya Ramakrishnan,
Proc. of the 12th Workshop on Workflows in Support
of Large-Scale Science,
Nov., 2017;
Bibtex.
Constructing the Formal Grammar of System Calls,
Nikolay Efanov and Pavel Emelyanov,
Proc. 13th Central & Eastern European Software
Engineering Conference in Russia,
Oct., 2017;
Bibtex.
Selective Checkpointing for Minimizing Recovery Energy
and Efforts of Smartphone Apps,
Li Li, Yunhao Bai, Xiaorui Wang, Mai Zheng, and Feng Qin,
Eighth Int. Green and Sustainable Computing
Conference (IGSC'17),
IEEE,
pp. 1--8,
Oct., 2017;
Bibtex.
Leveraging the Checkpoint-restart Technique for Optimizing
CPU Efficiency of ATLAS Production Applications on Opportunistic
Platforms,
D. Cameron, J. Elmsheuser, L. Heinrich,
W. Lavrijsen, P. Nilsson, V. Tsulaia, M. Vogel
on behalf of the ATLAS Collaboration,
ATL-SOFT-PROC-2017-064,
Oct., 2017;
Bibtex.
When you have a hammer, everything is a nail: Checkpoint/Restart in Slurm,
Manuel Rodríguez-Pascual,
Jose Antonio Moríñigo, and
Rafael Mayo-García,
Slurm User Group Meeting — 2017,
Berkeley, CA, Sept. 26, 2017;
Slurm User Group Agenda (accessed Oct., 2017);
Bibtex.
DMTCP: Fixing the Single Point of Failure of the ROS Master,
Gene Cooperman and Twinkle Jain,
ROSCon 2017,
Vancouver, Canada, Sept. 21, 2017;
ROSCon'17 program,
slides;
video;
Bibtex.
Expedite any Simulation with DMTCP and Save Decades of Computation,
Balaji R (presenting), Sathish Kumar Sugumara, Gene Cooperman, Rohan Garg, and Jiajun Cao,
2017 Design and Verification Conference and Exhibition
(DVCON-India 2017),
Bengaluru, India, Sept. 15, 2017;
DVCON-India 2017,
Bibtex.
Intelligent Checkpointing Strategies
for IoT System Management,
Aïssaoui, François and Cooperman, Gene
and Monteil, Thierry and Tazi, Saïd,
Future Internet of Things and Cloud (FiCloud'17),
IEEE,
pp. 305--312,
Aug., 2017;
Bibtex.
Extending DMTCP Checkpointing for a Hybrid Software World,
Gene Cooperman,
MVAPICH User Group (MUG'17),
Columbus, Ohio, Aug. 16, 2017;
MUG'17 program,
slides;
video;
Bibtex.
A Methodology for Soft Errors Detection and Automatic Recovery,
Jorge Villamayor, Dolores Rexachs, Emilio Luque,
Diego Montezanti, A. De Giusti, and M. Naiouf,
Int. Conf. on High Performance Computing & Simulation,
July, 2017;
Bibtex.
When is the Right Time to Start the Fault Tolerance Protection?,
Jorge Villamayor, Dolores Rexachs, and Emilio Luque,
Int. Conf. on High Performance Computing & Simulation,
July, 2017;
Bibtex.
A Fault Tolerance Manager with Distributed Coordinated Checkpoints
for Automatic Recovery,
Jorge Villamayor, Dolores Rexachs, and Emilio Luque,
Int. Conf. on High Performance Computing
& Simulation (HPCS-17),
July, 2017;
Bibtex.
Performance of Android Cluster System Allowing Dynamic Node
Reconfiguration,
Yuki Sawada, Yusuke Arai, Kanemitsu Ootsu, Takashi Yokota,
and Takeshi Ohkawa,
Wireless Personal Communications, 93(4),
pp. 1067--1087, April, 2017, Springer,
Bibtex
Transition Watchpoints: Teaching Old Debuggers New Tricks,
Kapil Arya, Tyler Denniston, Ariel Rabkin, and Gene Cooperman,
The Art, Science, and Engineering of Programming 1(2),
28 pages,
Apr., 2017;
Bibtex.
Fault Tolerance and Message Passing Interface Programs,
Mohammad Miyan,
Int. J. of Advanced Research in Computer Science,
pp. 128--135,
Mar/Apr, 2017;
Bibtex.
A Reflexive Tactic for Polynomial Positivity using Numerical Solvers
and Floating-point Computations,
Érik Martin-Dorel and Pierre Roux,
ACM SIGPLAN Conference on Certified Programs and Proofs (CPP 2017),
pp. 90--99, Jan., 2017, ACM,
Bibtex
DMTCP: Deadline-aware Multipath TCP,
Huang, Chengyuan, Zhang, Jiao, Huang, Tao and Liu, Yunjie,
Proc. of Communications Workshops
(2017 IEEE Int. Conf. on ICC),
pp. 681--686, May, 2017, IEEE,
Bibtex
A Performance and Energy Comparison of Fault Tolerance Techniques
for Exascale Computing Systems,
D. Dauwe, S. Pasricha, A. A. Maciejewski, and H. J. Siegel,
IEEE Int. Conf. on Computer and Information Technology
(CIT'16),
pp. 436--443, Dec., 2016, IEEE,
Bibtex
Trace-free
Memory Data Structure Forensics via Past Inference
and Future Speculations,
Pengfei Sun, Rui Han, Mingbo Zhang and Saman Zonouz,
Proc. of 32nd Annual Conf. on Computer Security Applications,
pp. 570--582, Dec., 2016, ACM,
Bibtex
Smart
Scene Management for IoT-based Constrained Devices
using Checkpointing,
François Aïssaoui, Gene Cooperman, Thierry Monteil,
and Saïd Tazi,
15th IEEE Int. Symp. on Network Computing and Applications
(NCA'16),
Cambridge, MA, USA, Oct. 31 - Nov. 2, 2016,
pp. 170--174, IEEE Press, Nov., 2016,
Bibtex
Adaptive Fault Tolerance on ROS: A Component-Based
Approach,
Jean-Charles Fabre, Michaël Lauer, Matthieu Amy,
ROSCon 2016,
slides and
video only,
Oct., 2016,
Bibtex
Applying Future Exascale HPC Methodologies in the Energy Sector,
José J Camata, José M Cela, Danilo Costa,
Alvaro LGA Coutinho, Daniel Fernández-Galisteo,
Carmen Jiménez, Vadim Kourdioumov,
Marta Mattoso, Rafael Mayo-García, Thomas Miras,
and J.A. Moríñigo,
Proc. of
Russian Supercomputing Days 2016,
pp. 9--19, Sept., 2016,
UPCommons,
Bibtex
Deduplication Potential of HPC Applications' Checkpoints,
Jürgen Kaiser, Ramy Gad, Tim Süß,
Federico Padua, Lars Nagel and André Brinkmann,
Proc. of IEEE Int. Conf. on Cluster Computing (Cluster'16),
pp. 413--422,
Taipei, Taiwan,
IEEE Press, Sept., 2016,
Bibtex
Enhancing Energy Production with Exascale HPC Methods
(and
prior technical report),
Rafael Mayo-García, José J. Camata, José M. Cela,
Danilo Costa, Alvaro LGA Coutinho, Daniel Fernández-Galisteo,
Carmen Jiménez, Vadim Kourdioumov, Marta Mattoso,
Thomas Miras, José A. Moríñigo, Jorge Navarro,
Philippe O. A. Navaux, Daniel de Oliveira,
Manuel Rodríguez-Pascual, Vítor Silva, Renan Souza, and
Patrick Valduriez,
Latin American High Performance Computing Conference
(CARLA'16),
pp. 233--246, Aug., 2016, Springer (Communications in
Computer and Information Science book series, CCIS, vol. 697),
Bibtex
Scalable System-level Transparent Checkpointing for OpenSHMEM,
Rohan Garg, Jérôme Vienne and Gene Cooperman,
OpenSHMEM and Related Technologies. Enhancing OpenSHMEM
for Hybrid Environments --- Third Workshop,
OpenSHMEM 2016,
Baltimore, MD, USA, Aug. 2--4, 2016, Revised Selected Papers
(OpenSHMEM'16),
pp. 52--65,
Lecture Notes in Computer Science, Volume 10007,
Springer-Verlag, Aug., 2016,
Bibtex
Extended
Batch Sessions and Three-Phase Debugging: Using DMTCP
to Enhance the Batch Environment,
Rohan Garg, Jiajun Cao, Kapil Arya, Gene Cooperman
and Jérôme Vienne,
Proc. of the (XSEDE16) Conference on Diversity, Big Data,
and Science at Scale,
pp. 42:1--42:8, ACM Press, July, 2016,
(and slides),
Bibtex.
Computational Studies of De Novo Motif Discovery
in Aptamer Selections,
Kevin R. Shieh, PhD thesis,
Yeshiva University, 2016,
Bibtex
A Checkpointing Methodology for Android Smartphone,
Yunhao Bai,
M.S. thesis, The Ohio State University,
2016,
Bibtex
Checkpointing with DMTCP and MVAPICH2 for Supercomputing,
Kapil Arya,
MVAPICH User Group (MUG'16),
Columbus, Ohio, Aug. 17, 2016;
MUG'16 program,
slides;
Bibtex.
Simulation Infrastructure for the Study
of Performance/QOS/Energy,
Georgios Ioannis Kopanas,
Diploma Thesis, U. of Thessaly,
Feb., 2016,
Bibtex.
An Affinity-structure Database of Helix-turn-helix:
DNA Complexes with a Universal Coordinate System,
Mohammed AlQuraishi, Shengdong Tang and Xide Xia,
BMC Bioinformatics 16:390, 19 pages,
2015, BioMed Central,
Bibtex.
HOL (y) Hammer: Online ATP Service for HOL Light,
Cezary Kaliszyk and Josef Urban,
Mathematics in Computer Science 9(1),
pp. 5--22, 2015, Springer,
(first published online on Jun 28, 2014),
Bibtex.
Parallel Application Signature for Performance Analysis and Prediction
(or alt),
Alvaro Wong, Dolores Rexachs, Emilio Luque,
IEEE Trans. on Parallel
and Distributed Systems 26(7),
pp. 2009--2019, 2015, IEEE Press,
Bibtex.
Elastic Job Bundling: An Adaptive Resource Request Strategy
for Large-scale Parallel Applications
(or alt),
Feng Liu and Jon B. Weissman,
Proc. of the Int. Conf.
for High Performance Computing, Networking, Storage and Analysis
(SC'15), 12 pages, Nov., 2015, ACM,
Bibtex.
Performance Improvement in Automata Learning:
Speeding up LearnLib using Parallelization and Checkpointing,
Marco Henrix,
M.S. thesis,
Radboud University Nijmegen, Netherlands, Aug., 2015,
Bibtex.
An Android Cluster System Capable
of Dynamic Node Reconfiguration,
Yuki Sawada, Yusuke Arai, Kanemitsu Ootsu, Takashi Yokota
and Takeshi Ohkawa,
Proc. of 2015 Seventh Int. Conf. on Ubiquitous
and Future Networks (ICUFN),
pp. 689--694, IEEE Press, July, 2015,
Bibtex.
Enabling Sender-initiated Distributed Applications and Checkpointing
in Content Centric Networks,
Nitinder Mohan,
Master of Technology Thesis,
IIIT Delhi (Indraprastha Institute of Information Technology),
July, 2015,
Bibtex.
Optimizing Checkpoint Restart with Data Deduplication,
Zhengyu Chen, Jianhua Sun and Hao Chen,
Scientific Programming,
May, 2016, Hindawi Publishing Corporation,
Bibtex.
Transparent Checkpointing for Supercomputing,
Jiajun Cao and Rohan Garg,
MVAPICH User Group (MUG'15),
Columbus, Ohio, Aug. 20, 2015;
MUG'15 program,
slides, and
video;
Bibtex.
Transparent Checkpoint-Restart: Re-Thinking
the HPC Environment,
Gene Cooperman,
MVAPICH User Group (MUG'15),
Columbus, Ohio, Aug. 19, 2015;
MUG'15 program,
slides, and
video;
Bibtex.
Recent Trends towards Green Clouds
by using Fuzzy based Live Migration
(or alt),
Amrinder Kaur and Anil Kumar,
International Journal of Computer Applications
113(3) (0975--8887), pp. 17--22, Mar., 2015,
Bibtex.
Power-Check: An Energy-Efficient Checkpointing Framework
for HPC Clusters,
R.R. Chandrasekar, A. Venkatesh, K. Hamidouche and D.K. Panda,
Proc. of 15th IEEE/ACM Int. Symp.
on Cluster, Cloud and Grid Computing (CCGrid'15),
pp. 261--270,
IEEE Press, 2015, Bibtex.
Checkpointing as a Service in Heterogeneous Cloud Environments,
Jiajun Cao, Matthieu Simonin, Gene Cooperman and Christine Morin,
Proc. of 15th IEEE/ACM Int. Symp.
on Cluster, Cloud and Grid Computing (CCGrid'15),
pp. 61--70,
IEEE Press, 2015, Bibtex.
Energy Efficient Rescheduling Algorithm
for High Performance Computing,
Manisha Chauhan, Nazia Parveen, Sumit Kumar Saurav and G.L. Ganga Prasad,
Nat. Conf. on Parallel Computing Technologies
(PARCOMPTECH'15),
IEEE Press, 2015,
Bibtex.
CCNCheck: Enabling Checkpointed Distributed Applications
in Content Centric Networks,
Nitinder Mohan and Pushpendra Singh,
CCNxCon'15: Content Centric Networking
(technical talk abstract), 2 pages,
Bibtex.
DMTCP: Bringing Interactive Checkpoint-Restart to Python,
Kapil Arya and Gene Cooperman,
Computational Science & Discovery,
16 pages, 2015, IOPScience,
Bibtex.
Using Checkpointing and Virtualization for Fault Injection,
Cyrille Artho, Kuniyasu Suzaki, Masami Hagiya, Watcharin Leungwattanakit,
Richard Potter, Eric Platon, Yoshinori Tanabe, Franz Weitl
and Mitsuharu Yamamoto,
International Journal of Networking and Computing 5(2),
pp. 347--372, 2015,
Bibtex.
Using Checkpointing and Virtualization for Fault Injection,
Cyrille Artho, Masami Hagiya, Watcharin Leungwattanakit,
Eric Platon, Richard Potter, Kuniyasu Suzaki,
Yoshinori Tanabe, Franz Weitl and Mitsuharu Yamamoto,
Second Int. Symp. on Computing and Networking (CANDAR'14),
pp. 144--150, Dec., 2014, IEEE Press,
Bibtex.
Be Kind, Rewind --- Checkpoint & Restore Capability
for Improving Reliability of Large-scale
Semiconductor Design,
Igor Ljubuncic, Ravi Giri, Avikam Rozenfeld, and Andrew Goldis,
2014 IEEE High Performance Extreme Computing Conference
(HPEC-2014),
6 pages, IEEE Press, Sept., 2014,
Bibtex.
Performance Evaluation of Checkpoint/Restart Techniques: For MPI Applications on Amazon Cloud,
Basma Abdel Azeem and Manal Helal,
9th Int. Conf. on Informatics and Systems (INFOS'14),
pp. 49--57, Sep., 2014, IEEE Press,
Bibtex
DMTCP: System-Level Checkpoint-Restart in User-Space,
Kapil Arya and Gene Cooperman,
MVAPICH User Group (MUG'14),
Columbus, Ohio, Aug. 26, 2014;
MUG'14 program,
slides, and
video;
Bibtex.
Metodología para Predecir el Consumo Energético
de Checkpoints en Sistemas de HPC
(Methodology for Predicting the Energy Consumption
of Checkpoints in HPC Systems),
Javier Balladini, Marina Morán,
Dolores Rexachs and Emilio Luque,
XX Congreso Argentino de Ciencias de la Computación
(CACIC'14),
10 pages, Oct., 2014,
Bibtex.
Using SAGA and the Open Science Grid to Search for Aptamers,
Kevin Shieh, Pilib Ó Broin, David Rhee, Matthew Levy,
and Aaron Golden,
Proc. of 2014 Ann. Conf. on Extreme Science and
Engineering Discovery Environment (XSEDE'14),
Art. No. 27, Jul., 2014,
Bibtex.
Simulation Speedup of ns-3 using Checkpoint and Restore,
Kyle Harrigan and George Riley,
Proceedings of the 2014 Workshop on ns-3 (WNS3'14),
Art. No. 7, 2014,
Bibtex.
User-Space Process Virtualization in the Context of
Checkpoint-Restart and Virtual Machines,
Kapil Arya, PhD thesis, Northeastern University,
August, 2014,
Bibtex.
Use of Checkpoint-Restart for Complex HEP Software
on Traditional Architectures and Intel MIC,
Kapil Arya, Gene Cooperman, Andrea Dotti and Peter Elmer,
J. Physics: Conference Series 523,
Conference 1,
(from Proc. of 15th Int. Workshop on Advanced Computing and
Analysis Techniques in Physics Research (ACAT2013)),
IOPScience, 8 pages, 2014,
Bibtex.
GemFI: A Fault Injection Tool for Studying the Behavior of Applications on Unreliable Substrates,
K. Parasyris, S. Tziantzoulis, C.D. Antonopoulos,
and N. Bellas,
44th Ann. IEEE/IFIP Int. Conf. on Dependable Systems
and Networks (DSN),
pp. 622--629, IEEE Press, Jun., 2014,
Bibtex.
Алгоритмы отказоустойчивого управления ресурсами пространственно-распределённых вычислительных систем
(Algorithms for Failover Resource Management in Distributed
Computing Systems),
А.Ю. Поляков, О.В. Молдованова, А.А. Пазников,
М.Г. Курносов, С.Н. Мамойленко, А.В. Ефимов (A. Yu. Polyakov et al.),
Vestnik SibGUTIS 11(4), (УДК 004.382.2),
pp. 11--29, 2014,
Bibtex.
Optimization Tools of Parallel Simulation of Nanostructures
with Quantum Dots,
K. V. Pavskii, M. G. Kurnosov, and A. Yu. Polyakov,
Optoelectronics, Instrumentation and
Data Processing 50(3),
pp. 260--265,
May, 2014, Springer Press,
Bibtex.
(Original Russian Text at: K.V. Pavskii, M.G. Kurnosov,
A.Yu. Polyakov, 2014, published in Avtometriya, 2014, Vol. 50,
No. 3, pp. 56--61.)
Modular Software Model Checking for Distributed Systems,
W. Leungwattanakit, C. Artho, M. Hagiya, Y. Tanabe, M. Yamamoto,
and K. Takahashi,
IEEE Trans. on Software Engineering 40(5),
pp. 483--501, May, 2014, IEEE Press,
Bibtex
Designing Scalable and Efficient I/O Middleware for
Fault-Resilient High-Performance Computing Clusters,
Raghunath Raja Chandrasekar, PhD thesis, 2014,
The Ohio State University,
Bibtex
Improving the Efficiency of Fuzz Testing Using Checkpointing,
Ernst-Friedrich Zachow,
Master's Thesis, ETH-Zürich, April 1, 2014,
Bibtex.
Towards an Energy-Efficient Tool for Processing the Big Data,
Eric Renault and Selma Boumerdassi,
2nd International Conference on Future Internet of Things
and Cloud (FiCloud'14),
pp. 448--452, Aug., 2014, IEEE Press,
Bibtex
Abstraction Checkpointing Levels: Problems and Solutions,
Bakhta Meroufel and Ghalem Belalem,
International Journal of Computing 13(3),
pp. 158--169, 2014,
Bibtex.
Explorations of the Viability of ARM and Xeon Phi for Physics
Processing,
David Abdurachmanov, Kapil Arya, Josh Bendavid,
Tommaso Boccali, Gene Cooperman, Andrea Dotti, Peter Elmer,
Giulio Eulisse, Francesco Giacomini, Christopher D. Jones,
Matteo Manzali and Shahzad Muzaffar,
J. Physics: Conference Series 513,
Track 5,
(from Proc. of 20th Int. Conf. on Computing in High Energy
and Nuclear Physics (CHEP13)),
IOPScience, 7 pages, 2014,
Bibtex.
jmodeltest.org: Selection of Nucleotide Substitution Models
on the Cloud,
Jose Manuel Santorum, Diego Darriba, Guillermo L. Taboada,
and David Posada,
Bioinformatics 30(9),
pp. 1310--1311, Oxford Journals, Jan. 21, 2014,
Bibtex.
DMTCP: Bringing Checkpoint-Restart to Python,
Kapil Arya and Gene Cooperman,
Proc. of the 12th Python in Science Conf.
(SciPy 2013),
6 pages, 2013, Bibtex.
A Framework for an In-depth Comparison of Scale-up
and Scale-out,
Michael Sevilla, Ike Nassi, Kleoni Ioannidou, Scott Brandt,
and Carlos Maltzahn,
Proc. of 2013 Int. Workshop on Data-Intensive
Scalable Computing Systems (DISCS'13),
pp. 13--18, 2013,
Bibtex.
A Tool for Selecting the Right Target Machine
for Parallel Scientific Applications,
Javier Panadero, Alvaro Wong, Dolores Rexachs, and Emilio Luque,
Procedia Computer Science 18,
pp. 1824--1833, Elsevier, 2013,
Bibtex.
Formal Mathematics on Display: A Wiki for Flyspeck,
Carst Tankink, Cezary Kaliszyk, Josef Urban, and Herman Geuvers,
Intelligent Computer Mathematics,
Lecture Notes in Computer Science, vol. 7961,
pp. 152--167, Springer, 2013,
Bibtex.
Towards Computing as a Utility via Adaptive Middleware: An Experiment
in Cross-paradigm Execution,
Jaroslaw Slawinski and Vaidy Sunderam,
Parallel Processing Letters 23(2),
18 pages,
World Scientific, June, 2013,
Bibtex.
Calculation of the Subgroups of a Trivial-Fitting Group,
Alexander J. Hulpke,
Proc. of 38th Int. Symp. on Symbolic
and Algebraic Computation, pp. 205--210, 2013,
ACM Press,
Bibtex.
Semi-Automated Debugging via Binary
Search through a Process Lifetime,
Kapil Arya, Tyler Denniston, Ana-Maria Visan, and Gene Cooperman,
Proc. of 7th Workshop on Programming Languages
and Operating Systems (PLOS)
(part of Proc. of 24th ACM Symp. on Operating System
Principles (SOSP)), 2013,
ACM Press, Oct., 2013, Bibtex.
Shorten Device Boot Time for Automotive IVI and Navigation
Systems (slides),
Jim Huang and Shi-Wu Lo (developers, 0xlab),
Automotive Linux Summit (ALS2013), May 28, 2013.
(See "Part II: Userspace solution: Checkpointing";
begins at slide 66)
SweeD: Likelihood-Based Detection of Selective Sweeps in Thousands
of Genomes,
P. Pavlidis, D. Živković, A. Stamatakis and N. Alachiotis,
Heidelberg Institute for Theoretical Studies,
Technical report Exelixis-RRDR-2013-1, February, 2013
A Survey of Fault Tolerance Mechanisms and Checkpoint/Restart
Implementations for High Performance Computing Systems,
I.P. Egwutuoha, D. Levy, B. Selic and S. Chen,
The Journal of Supercomputing, Feb., 2013, Springer
Proposal of Incremental Software Simulation for Reduction
of Evaluation Time,
Atsushi Shina, Kanemitsu Ootsu, Takeshi Ohkawa, Takashi Yokota
and Takanobu Baba,
Third Int. Conf. on Networking and Computing (ICNC),
pp. 311--315, IEEE Press, Dec., 2012,
Bibtex.
Implement Checkpointing for Android (to speed up boot time
and development process) (slides),
Jim Huang and Kito Cheng (developers, 0xlab),
Embedded Linux Conference Europe (ELCE2012),
Barcelona, Spain; Nov. 5--7, 2012.
Bibtex.
Towards Fault-tolerant Energy-efficient High Performance Computing in the Cloud,
Kurt L. Keville, Rohan Garg, David J. Yates,
Kapil Arya and Gene Cooperman,
Proc. of 2012 IEEE Computer Society International
Conference on Cluster Computing,
pp. 622--626, 2012,
Bibtex.
Adapting MPI to MapReduce PaaS Clouds: An Experiment
in Cross-Paradigm Execution,
Jaroslaw Slawinski and Vaidy Sunderam,
Proc. of 2012 IEEE/ACM Fifth Int. Conf. on Utility and
Cloud Computing (UCC '12), pp. 199--203, 2012,
Bibtex.
Creating and Improving Multi-Threaded Geant4,
Xin Dong, Gene Cooperman, John Apostolakis, Sverre Jarp, Andrzej Nowak, Makoto Asai and Daniel Brandt,
Journal of Physics: Conference Series,
Volume 396, Part 5, 2012
Temporal Meta-Programming: Treating Time as a Spatial Dimension,
Ana-Maria Visan, PhD thesis, Northeastern University,
April, 2012, Bibtex.
Verification of Embedded Control Systems by Simulation and Program
Execution Control,
Stefan Resmerita and Wolfgang Pree,
American Control Conference (ACC), pp. 3581--3586,
June, 2012, IEEE Press,
Bibtex
Checkpointing in Distributed Heterogeneous Environments,
Michael Schöttner and John Mehnert-Spahn,
Technical Report, Heinrich Heine University,
Duesseldorf, Germany, 26 pages, March, 2012,
(from Universität Düsseldorf: Publications),
Bibtex.
Source-Level Transformation of Legacy Sequential Program into
Scalable Thread-Parallel Code,
Xin Dong, PhD thesis, Northeastern University,
Dec., 2011, Bibtex.
Model Checking Distributed Systems by Combining Caching
and Process Checkpointing,
Watcharin Leungwattanakit, Cyrille Artho,
Masami Hagiya, Yoshinori Tanabe, and Mitsuharu Yamamoto,
26th IEEE/ACM International Conference on Automated Software
Engineering (ASE), pp. 103--112,
IEEE Press, Dec., 2011.
Bibtex.
Including the Workload Effect in the Parallel Program Signature,
J.M. Canillas, A. Wong, D. Rexachs, and E. Luque,
Proc. of 13th Int. Conf. on High Performance Computing and
Communications (HPCC), pp. 304--311,
IEEE Computer Society, Sept., 2011.
Bibtex.
Predicting Parallel Applications Performance Using Signatures: the Workload Effect,
J.M. Canillas, A. Wong, D. Rexachs, and E. Luque,
9th IEEE/ACS International Conference on Computer Systems
and Applications (AICCSA), pp. 299--300,
IEEE Computer Society, Dec., 2011.
Bibtex.
URDB: A Universal Reversible Debugger Based on
Decomposing Debugging Histories,
Ana-Maria Visan, Kapil Arya, Gene Cooperman, and Tyler Denniston,
Proc. of 6th Workshop on Programming Languages
and Operating Systems (PLOS)
(part of Proc. of 23rd ACM Symp. on Operating System
Principles (SOSP)), 2011,
ACM Press, Oct., 2011. Bibtex.
Direct Inference of Protein--DNA Interactions using Compressed
Sensing Methods,
Mohammed AlQuraishi and Harley H. McAdams,
Proc. of National Academy of Sciences
(PNAS) 108(36), pp. 14819--14824,
Sept. 6, 2011.
Full Text (html),
Full Text (pdf),
Bibtex.
CheCL: Transparent Checkpointing and Process Migration of OpenCL Applications,
Hiroyuki Takizawa, Kentaro Koyama, Katsuto Sato,
Kazuhiko Komatsu and Hiroaki Kobayashi,
Proc. of 2011 IEEE International Parallel and Distributed
Processing Symposium, pp. 864--876,
IEEE Computer Society, May, 2011.
Bibtex.
Distributed Speculative Parallelization using Checkpoint Restart,
Devarshi Ghoshal, Sreesudhan R. Ramkumar, and Arun Chauhan,
Procedia Computer Science, 4,
pp. 422--431,
May, 2011,
Slides,
Bibtex.
Unibus: Aspects of Heterogeneity and Fault Tolerance in Cloud Computing,
M. Slawińska, J. Slawinski, and V. Sunderam,
Proc. of IEEE Int. Symp. on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), pp. 1--10,
Apr., 2010,
Bibtex.
|