DMTCP Publications

Citing DMTCP (please cite this publication):

DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop.
Jason Ansel, Kapil Arya and Gene Cooperman.
23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS'09).
12 pages, Rome, Italy. May, 2009.
• Slides.
• Bibtex.

A. DMTCP Publications (an annotated history of enhancements to the DMTCP distribution; reverse chronological order):

Enabling Practical Transparent Checkpointing for {MPI}: A Topological Sort Approach,
(and earlier technical report of same name at https://arxiv.org/abs/2408.02218)
Yao Xu and Gene Cooperman,
Proc. of 2024 IEEE International Conference on Cluster Computing (Cluster'24), Sept., 2024, IEEE Computer Society,
Bibtex.
(The MANA paper introduced "split processes" to checkpoint transparently, independently of the underlying network software. But that algorithm interposed an MPI barrier in front of each collective communication, in order to ensure a safe synchronization point (and hence a consistent snapshot for checkpointing). This incurred high runtime overhead for collective communication. The new sequence number algorithm requires zero additional inter-process communication, and so has almost no runtime overhead.)
CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM,
(and earlier technical report of same name at https://arxiv.org/abs/2008.10596)
Twinkle Jain and Gene Cooperman,
Proc. of the Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC'20), pp. 1083--1097, Nov., 2020, IEEE Computer Society,
Bibtex.
(Extending "split-process" approach to efficient checkpointing for CUDA: This typically involves less than 1% runtime overhead, unlike the earlier approach of CRUM, which required a separate proxy process.)
MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing,
Rohan Garg, Gregory Price, and Gene Cooperman, Proc. of 28th Int. Symp. on High Performance Parallel and Distributed Computing (HPDC'19),
Phoenix, AZ, USA, ACM, pp. 49--60, June, 2019, (and tech. report at https://arxiv.org/abs/1904.12595),
• Slides (pdf), Slides (odp), Slides (pptx). Bibtex.
(Novel "split-process" approach to isolating and checkpointing just the MPI application at the original cluster: On restart (on the same cluster or on a different one), the underlying network, the ranks-per-node, and even the MPI library (e.g., MPICH or Open MPI) can each be changed.)
CRUM: Checkpoint-Restart Support for CUDA's Unified Memory,
Rohan Garg, Apoorve Mohan, Michael Sullivan, and Gene Cooperman, Proc. of IEEE Int. Conf. on Cluster Computing (Cluster'18),
Belfast, United Kingdom, IEEE, pp. 302--313, Sept., 2018, (and tech. report at https://arxiv.org/abs/1808.00117), Bibtex.
(Transparent checkpointing of CUDA, including UVM (Unified Virtual Memory), using a proxy approach)
System-level Scalable Checkpoint-Restart for Petascale Computing.
Jiajun Cao, Kapil Arya, Rohan Garg, Shawn Matott, Dhabaleswar K. Panda, Hari Subramoni, Jérôme Vienne and Gene Cooperman,
Proc. of 22nd IEEE Int. Conf. on Parallel and Distributed Systems (ICPADS'16),
Wuhan, China, IEEE Press, pp. 932--941, Dec., 2016, Bibtex.
(Scalable transparent checkpointing: MPI-based HPCG using 32,752 CPU cores, and MPI-based NAMD using 16,368 CPU cores)
Design and Implementation for Checkpointing of Distributed Resources using Process-level Virtualization.
Kapil Arya, Rohan Garg, Artem Y. Polyakov and Gene Cooperman,
Proc. of IEEE Int. Conf. on Cluster Computing (Cluster'16),
pp. 402--412, Taipei, Taiwan, IEEE Press, Sept., 2016. Bibtex.
(Extensible, adaptable, transparent checkpointing via a plugin implementation of process virtualization)
Transparent Checkpoint-Restart over InfiniBand.
Jiajun Cao, Gregory Kerr, Kapil Arya and Gene Cooperman,
Proc. of ACM Symposium on High Performance Parallel and Distributed Computing (HPDC'14),
12 pages, Vancouver, Canada, June, 2014. Bibtex.
(Transparent checkpointing of MPI over InfiniBand; no need to tear down network, and reconstruct during restart)
Checkpoint-Restart for a Network of Virtual Machines.
Rohan Garg, Komal Sodha, Zhengping Jin and Gene Cooperman,
Proc. of 2013 IEEE Computer Society International Conference on Cluster Computing.
8 pages, Indianapolis, USA. Sept., 2013. Slides. Bibtex.
(Transparent checkpointing of network of Qemu's over KVM)
DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop.
Jason Ansel, Kapil Arya and Gene Cooperman.
Proc. of 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS'09).
12 pages, Rome, Italy. May, 2009. Slides. Bibtex.
(Transparent checkpointing of distributed processes)
Transparent Adaptive Library-Based Checkpointing for Master-Worker Style Parallelism,
Gene Cooperman, Jason Ansel and Xiaoqin Ma,
Proc. of 6th IEEE Int. Symp. on Cluster Computing and the Grid (CCGrid06),
pp. 283--291, IEEE Press, 2006, Bibtex.
(Master-worker style checkpointing experiment in distributed checkpointing, prior to DMTCP itself)
Transparent User-Level Checkpointing for the Native POSIX Thread Library for Linux.
Michael Rieker, Jason Ansel, and Gene Cooperman.
Proc. of 2006 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'06).
pp. 492--498, Las Vegas, NV. Jun., CSREA Press, 2006. Bibtex.
(The original MTCP: Transparent checkpointing of multiple threads within a single process)

B. Publications and Conference Presentations mostly by other teams, and using DMTCP in their work (not simply citing DMTCP) (in reverse chronological order):

Live Migration of Multi-Container Kubernetes Pods in Multi-Cluster Serverless Edge Systems,
Poggiani, Leonardo and Puliafito, Carlo and Virdis, Antonio and Mingozzi, Enzo,
Proceedings of the 1st Workshop on Serverless at the Edge (SEATED'24) pp. 9--16, Sept., 2024,
Bibtex
AdapCK: Optimizing I/O for Checkpointing on Large-Scale High Performance Computing Systems,
Jia, Jie and Liu, Yi and Liu, Yanke and Chen, Yifan and Lin, Fang,
European Conference on Parallel Processing (Euro-Par'24) Lecture Notes in Computer Science 14803, pp. 342--355, Aug., 2024, Springer Nature,
Bibtex
Optimizing Checkpoint-Restart Mechanisms for HPC with DMTCP in Containers at NERSC,
Timalsina, Madan and Gerhardt, Lisa and Tyler, Nicholas and Blaschke, Johannes P and Arndt, William,
arXiv preprint arXiv:2407.19117, 9 pages, Jul., 2024,
Bibtex
Limitless FaaS: Overcoming serverless Functions Execution Time Limits with Invoke Driven Architecture and Memory Checkpoints,
Rodrigo Landa Andraca and Mahdi Zaree,
10th International Conference on Web Research (ICWR'24), pp. 210--215, May, 2024,
Bibtex
Application-Transparent Strategies to Optimize Limited Resources in HPC and Big Data,
Twinkle Jain, PhD thesis, Northeastern University, May, 2024,
Bibtex
DCU-CHK: Checkpointing for Large-scale CPU-DCU Heterogeneous Computing Systems,
Jia, Jie and Lin, Xinyuan and Lin, Fang and Liu, Yi,
CCF Transactions on High Performance Computing pp. 171--187, Jan., 2024, Springer,
Bibtex
Digital Twin Migration using the {OKD} Platform: A Use-Case for Emergency Vehicles,
Ribeiro, Bruno and Gonçalves, Pedro and Bartolomeu, Paulo C,
33rd International Telecommunication Networks and Applications Conference (ATNAC'24), pp. 63--69, Dec., 2023, IEEE,
Bibtex.
HPC@Cloud: a Provider-Agnostic Toolkit to Enable the Execution of {HPC} Applications on Public Clouds,
Pereira Filho, Vanderlei Munhoz,
Masters Thesis, Universidade Federal de Santa Catarina, Nov. 4, 2023,
Bibtex.
Implementation-Oblivious Transparent Checkpoint-Restart for MPI,
Xu, Yao and Belyaev, Leonid and Jain, Twinkle and Schafer, Derek and Skjellum, Anthony and Cooperman, Gene,
Proceedings of the SC'23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W'23; SuperCheck'23), pp. 1738--1747, Nov., 2023,
Bibtex
Migration Transparente de Conteneurs,
Hamdi Abdelaziz, Daniele Miorandi, and Guillaume Pierre,
Masters Thesis, École nationale Supérieure de l'Informatique; ex. INI (Institut National de formation en Informatique, Alger, Algérie), Sept., 2023,
Bibtex
Gestión del Almacenamiento para Tolerancia a Fallos en Computación de Altas Prestaciones,
Betzabeth León, PhD thesis, Universitat Autónoma de Barcelona, Mar., 2023,
Bibtex
Debugging MPI Implementations via Reduction-to-Primitives,
Cooperman, Gene and Li, Dahong and Zhao, Zhengji,
IEEE/ACM Third International Symposium on Checkpointing for Supercomputing (SuperCheck'22), pp. 10--18, Nov., 2022,
Bibtex
Good Shepherds Care For Their Cattle: Seamless Pod Migration in Geo-Distributed Kubernetes,
Paulo Souza Junior, Daniele Miorandi, and Guillaume Pierre,
6th IEEE Int. Conf. on Fog and Edge Computing (ICFEC'22), 9 pages, May, 2022,
Bibtex
A Model of Checkpoint Behavior for Applications that have I/O,
León, Betzabeth and Méndez, Sandra and Franco, Daniel and Rexachs, Dolores and Luque, Emilio,
The Journal of Supercomputing 78, pp. 15404–15436, Apr., 2022, Springer,
Bibtex
Provisioning Strategies for Centralized Bare-Metal Clusters,
Apoorve Mohan, PhD thesis, Northeastern University, Dec., 2021,
Bibtex
MANA-2.0: A Future-Proof Design for Transparent Checkpointing of MPI at Scale,
Yao Xu, Zhengji Zhao, Rohan Garg, Harsh Khetawat, Rebecca Hartman-Baker, Gene Cooperman,
Int. Symp. on Checkpointing for Supercomputing (SuperCheck'SC-21), (SC Workshops Supplementary Proceedings (SCWS)) pp. 68--78, Nov., 2021,
Bibtex
Assessing the Use Cases of Persistent Memory in High-Performance Scientific Computing,
Yehonatan Fridman, Yaniv Snir, Matan Rusanovsky, Kfir Zvi, Harel Levin, Danny Hendler, Hagit Attiya, and Gal Oren,
2021 IEEE/ACM 11th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS); workshop at SC21, pp. 11--20, IEEE, Oct., 2021
Bibtex
Service Migration in a Distributed Virtualization System,
Pablo Andrés Pessolani, Luis Santiago Re, Tomás Andrés Fleitas,
Journal of Computer Science & Technology 21(2), pp. 171--187, Oct., 2021, Springer,
Bibtex
Performance and Energy Task Migration Model for Heterogeneous Clusters,
Esteban Stafford and José Luis Bosque,
The Journal of Supercomputing 77(9), pp. 10053--10064, 2021, Springer,
Bibtex
MigrOS: Transparent Live-Migration Support for Containerised RDMA Applications,
Maksym Planeta, Jan Bierbaum, Leo Sahaya Daphne Antony, Torsten Hoefler, and Hermann Härtig,
2021 USENIX Annual Technical Conference (USENIX ATC 21), pp. 47--63, July, 2021, (also as an arXiv technical report from Oct., 2020)
Bibtex
Verifiable Application-level Checkpoint and Restart Framework for Parallel Computing,
I. Gankevich and I. Petriakov and A. Gavrikov and D. Tereshchenko and G. Mozhaiskii,
Proc. of 9th Int. Conf. on Distributed Computing and Grid Technologies in Science and Education (GRID'2021), pp. 47--63, July, 2021,
Bibtex
An Innovative Approach for Cloud-Based Web Dev App Migration,
C Sunil, KS Raghunandan, KN Ranjit, HK Chethan, and G Hemantha Kumar,
ICT with Intelligent Applications, pp. 807--817, December, 2021, Springer
Bibtex
Determination of Suitable Resource Discovery Tool and Methodology for High-Volume Internet of Things (IoT),
Mohd Tamizan Abu Bakar and Azrul Amri Jamal,
Journal of Physics: Conference Series: 1st Int. Recent Trends in Engineering, Advanced Computing and Technology Conference (RETREAT), Volume 1874, Number 1, pp. 012046 (11 pages), June, 2021, IOP Publishing,
Bibtex
ROS Rescue: Fault Tolerance System for Robot Operating System,
Pushyami Kaveti and Hanumant Singh,
Robot Operating System (ROS), pp. 381--397, 2021, Springer, (and an arXiv technical report from Oct., 2019)
Bibtex
Analysis of Parallel Application Checkpoint Storage for System Configuration,
Betzabeth León, Daniel Franco, Dolores Rexachs, and Emilio Luque,
The Journal of Supercomputing 77(5), pp. 4582--4617, May, 2021, Springer,
Bibtex
Transparent Checkpointing for OpenGL Applications on GPUs,
David Hou and Jun Gan and Yue Li and Younes El Idrissi Yazami and Twinkle Jain,
First Int. Symp. on Checkpointing for Supercomputing (SuperCheck21), 3 pages, Feb., 2021, (with conf. program, slides, and video)
Bibtex
Checkpointing SPAdes for Metagenome Assembly: Transparency versus Performance in Production,
Twinkle Jain and Jie Wang,
First Int. Symp. on Checkpointing for Supercomputing (SuperCheck21), 4 pages, Feb., 2021, (with conf. program, slides, and video)
Bibtex
Optimized Memoryless Fair-Share HPC Resources Scheduling Using Transparent Checkpoint-Restart Preemption,
Kfir Zvi and Gal Oren,
First Int. Symp. on Checkpointing for Supercomputing (SuperCheck21), 4 pages, Feb., 2021, (with conf. program, slides, and video)
Bibtex
Improving Scalability and Reliability of MPI-agnostic Transparent Checkpointing for Production Workloads at NERSC,
Prashant Singh Chouhan, Harsh Khetawat, Neil Resnik, Jain Twinkle, Rohan Garg, Gene Cooperman, Rebecca Hartman-Baker, Zhengji Zhao,
First Int. Symp. on Checkpointing for Supercomputing (SuperCheck21), 4 pages, Feb., 2021, (with conf. program, slides, and video)
Bibtex
So Why Can't I Checkpoint That? (keynote talk),
Gene Cooperman,
First Int. Symp. on Checkpointing for Supercomputing (SuperCheck21), Feb., 2021, (with conf. program, slides, and video)
Bibtex
Fault-Tolerant Computing with Heterogeneous Hardening Modes,
Florian Kriebel, Faiq Khalid, Bharath Srinivas Prabakaran, Semeen Rehman, and Muhammad Shafique,
Dependable Embedded Systems (Embedded Systems series, open access),
eds. Jörg Henkel and Nikil Dutt, pp. 161--180, Jan., 2021, Springer
Bibtex.
Direct Heap Snapshotting in the Java HotSpot VM: a Prototype,
Ludvig Janiuk,
M.S. thesis, KTH Royal Institute of Technology,
Dec., 2020,
Bibtex.
Soft Errors Detection and Automatic Recovery Based on Replication Combined with Different Levels of Checkpointing,
Diego Montezanti, Enzo Rucci, Armando De Giusti, Marcelo Naiouf, Dolores Rexachs, and Emilio Luque,
Future Generation Computer Systems 113, pp. 240--254, Dec, 2020, Elsevier,
Bibtex.
Déploiement Efficace d'Applications Cloud dans les Infrastructures Fog Distribuées,
Arif Ahmed, PhD thesis, U. Rennes 1, Dec., 2020,
Bibtex
Deploying Checkpoint/Restart for Production Workloads at NERSC,
Zhengji Zhao, Rebecca Hartman-Baker, Gene Cooperman,
Proc. of the Int. Conf. for High Performance Computing, Networking, Storage and Analysis (State of the Practice) (SC'20),
3 pages, Nov., 2020, IEEE Computer Society,
Bibtex
Containers Runtimes War: A Comparative Study,
Ramzi Debab and Walid Khaled Hidouci,
Proc. of Future Technologies Conference (FTC'20), Volume 2, pp. 135--161, Nov., 2020, (Advances in Intelligent Systems and Computing book series, volume 1289, Springer)
Bibtex.
Profiles of Upcoming HPC Applications and Their Impact on Reservation Strategies,
Ana Gainaru, Brice Goglin, Valentin Honoré, and Guillaume Pallez,
IEEE Trans. on Parallel and Distributed Systems 32(5), pp. 1178--1190, Nov, 2020, IEEE Press,
Bibtex.
Improving Utilization of Heterogeneous Clusters,
Stafford, Esteban and Bosque, José Luis,
The Journal of Supercomputing 76, pp. 8787--8800, Jan., 2020, Springer
Bibtex.
Privaros: A Framework for Privacy-Compliant Delivery Drones,
Rakesh Rajan Beck, Abhishek Vijeev, and Vinod Ganapathy,
Proc. of ACM SIGSAC Conference on Computer and Communications Security (CCS'20), pp. 181--194, Oct., 2020, ACM Press
(and arXiv preprint arXiv:2002.06512v3)
Bibtex.
System-Level vs. Application-Level Checkpointing,
Jonas Posner,
Proc. of IEEE Int. Conf. on Cluster Computing (CLUSTER'20), pp. 404--405, Sept., 2020, IEEE Computer Society,
Bibtex.
Standard Compliant Snapshotting for SystemC Virtual Platforms,
Bastian Farkas, PhD thesis, Technische Universität Braunschweig,
Sept., 2020,
Bibtex.
Implementation of Resilience as a Service for Parallel Computing,
Revina Awalia Putri, Idris Winamo, Wiratmoko Yuwono, and Agus Priyo Utomo,
Proc. of International Electronics Symposium (IES'20), pp. 626--630, Sept, 2020,
Bibtex.
Docker Container Deployment in Distributed Infrastructures with Checkpoint/Restart Fog,
Arif Ahmed, Apoorve Mohan, Gene Cooperman, and Guillaume Pierre,
Proc. of 8th IEEE Int. Conf. on Mobile Cloud Computing, Services, and Engineering (MobileCloud'20), pp. 55--62, Aug., 2020, IEEE Press
(and HAL Technical Report)
Bibtex.
Analysis of Checkpoint I/O Behavior,
Betzabeth León, Pilar Gomez-Sanchez, Daniel Franco, Dolores Rexachs, and Emilio Luque,
International Conference on Computational Science (ICCS'20), Lecture Notes in Computer Science 12137, pp. 191--205, Jun., 2020, Springer
(and ICCS Camera Ready Version)
Bibtex.
Determinación de la Eficiencia en el Procesamiento sobre Arquitecturas Multiprocesador y Estrategias de Tolerancia a Fallos en HPC,
Jorge Rafael Osio, Diego Miguel Montezanti, Marcelo Angel Cappelletti, Eduardo Kunysz, and Martín Morales,
XXII Workshop de Investigadores en Ciencias de la Computación (WICC'20), El Calafate, Santa Cruz (Argentina), May, 2020,
Bibtex.
Resumen de Tesis: SEDAR: Detección y Recuperación Automática de Fallos Transitorios en Sistemas de Cómputo de Altas Prestaciones,
(Thesis summary: SEDAR: Detection and Automatic Recovery from Transient Faults in High Performance Computing Systems)
Diego Montezanti,
XXII Workshop de Investigadores en Ciencias de la Computación (WICC'20), El Calafate, Santa Cruz (Argentina), May, 2020
Bibtex
Middleware to Manage Fault Tolerance Using Semi-Coordinated Checkpoints,
Alvaro Wong, Elisa Heymann, Dolores Rexachs and Emilio Luque,
IEEE Trans. on Parallel and Distributed Systems 32(2), pp. 254--268, May, 2020, IEEE Press,
Bibtex.
Elastic Execution of Checkpointed MPI Applications,
Sumeet Gajjar and Saurabh Vaidya,
6 pages, May, 2020, arXiv preprint arXiv:2005.07543,
Bibtex.
Systèmes Résilients pour l'Automobile : d'une Approche à Composants à une Approche à Objets de la Tolérance aux Fautes Adaptative sur ROS,
Matthieu Amy, PhD thesis,
Institut National Polytechnique de Toulouse, May, 2020
Bibtex
SEDAR: Detección y Recuperación Automática de Fallos Transitorios en Sistemas de Cómputo de Altas Prestaciones,
(SEDAR: Detection and Automatic Recovery from Transient Faults in High Performance Computing Systems),
Diego Miguel Montezanti, PhD thesis,
Universidad Nacional de la Plata, Mar., 2020
Bibtex
GIFT: A Coupon Based Throttle-and-Reward Mechanism for Fair and Efficient I/O Bandwidth Management on Parallel Storage Systems,
Tirthak Patel, Rohan Garg, and Devesh Tiwari, 18th USENIX Conf. on File and Storage Technologies (FAST'20), Feb., 2020,
Bibtex.
IaaS Cloud as a Virtual Environment for Experimentation in Checkpoint Analysis,
Betzabeth León, Pilar Gomez-Sanchez, Daniel Franco, Dolores Rexachs, and Emilio Luque,
Journal of Computer Science and Technology (JCS&T) 19(2), pp. 110--122, Oct., 2019,
Bibtex.
SEDAR: Detectando y Recuperando Fallos Transitorios en Aplicaciones de HPC,
Diego Miguel Montezanti, Enzo Rucci, Dolores Rexachs del Rosario, Emilio Luque Fadón, Marcelo Naiouf, and Armando Eduardo De Giusti,
XXV Congreso Argentino de Ciencias de la Computación (CACIC'19),
(Universidad Nacional de Río Cuarto Córdoba, Oct., 2019)
Bibtex
Process Migration-based Computational Offloading Framework for IoT-supported Mobile Edge/Cloud Computing,
Abdullah Yousafzai, Ibrar Yaqoob, Muhammad Imran, Abdullah Gani, and Rafidah Md Noor,
Internet of Things Journal 7(5),
pp. 4171--4182, Sept., 2019, IEEE Press, Bibtex
Checkpointing the Un-checkpointable: MANA and the Split-Process Approach,
Gene Cooperman,
MVAPICH User Group (MUG'19),
Columbus, Ohio, Aug. 20, 2019; MUG'19 program, slides;
video (from https://insidehpc.com/2019/09/checkpointing-the-un-checkpointable-mana-and-the-split-process-approach/);
Bibtex.
Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors,
Bharath Srinivas Prabakaran, Mihika Dave, Florian Kriebel, Semeen Rehman, and Muhammad Shafique,
IEEE Access 7,
pp. 145324--145339, Jul., 2019, IEEE Press,
(Also as arXiv tech. report)
Bibtex.
Checkpoint/Restart Approaches for a Thread-based MPI Runtime,
Julien Adam, Maxime Kermarquer, Jean-Baptiste Besnard, Leonardo Bautista-Gomez, Marc Pérache, Patrick Carribault, Julien Jaeger, Allen D Malony, and Sameer Shende,
Parallel Computing 85, pp. 204--219, Jul., 2019, Elsevier, Bibtex.
Resilience in high-level parallel programming languages,
Sara S. Hamouda, PhD thesis, The Australian National University, Jun., 2019,
Bibtex
Extending the Domain of Transparent Checkpoint-Restart for Large-scale HPC,
Rohan Garg, PhD thesis, Northeastern U., May, 2019,
Bibtex
Exploring Semantic Reverse Engineering for Software Binary Protection,
Pengfei Sun, PhD thesis,
Rutgers, The State University of New Jersey, May, 2019
Bibtex
Active Replication for Centrally Coordinated Teams of Autonomous Vehicles,
Nasos Grigoropoulos, Manos Koutsoubelias, and Spyros Lalis,
15th Int. Conf. on Distributed Computing in Sensor Systems (DCOSS'19),
pp. 114--122, May, 2019, IEEE Press, Bibtex
Prediction of Energy Consumption by Checkpoint/Restart in HPC,
Marina Morán, Javier Balladini, Dolores Rexachs, and Emilio Luque,
IEEE Access 7,
pp. 71791--71803, May, 2019, IEEE Press,
Bibtex.
Determinación de la eficiencia y estrategias de tolerancia a fallos en arquitecturas multiprocesador para aplicaciones de procesamiento de datos,
Jorge Rafael Osio, Diego Miguel Montezanti, Eduardo Kunysz, and Daniel Martin Morales,
XXI Workshop de Investigadores en Ciencias de la Computación (WICC'19), Apr., 2019, Universidad Nacional de San Juan (Argentina), Bibtex
Job Migration in HPC Clusters by Means of Checkpoint/Restart,
Manuel Rodríguez-Pascual, Jiajun Cao, José A. Moríñigo, Gene Cooperman, Rafael Mayo-García,
The Journal of Supercomputing 75(10), pp. 6517--6541, online Apr., 2019, Springer, Bibtex
Resource Manager for Scalable Performance in ROS Distributed Environments,
Daisuke Fukutomi, Takuya Azumi, Shinpei Kato, and Nobuhiko Nishio,
Design, Automation & Test in Europe Conference & Exhibition (DATE'19),
pp. 1088--1093, Mar., 2019, IEEE Press, Bibtex
Checkpointing and Migration of IoT Edge Functions,
Pekka Karhula, Jan Janak, and Henning Schulzrinne,
Proc. of 2nd Int. Workshop on Edge Systems, Analytics and Networking (EdgeSys'19; co-located with EuroSys'19), pp. 60--65, Mar., 2019, ACM Press, Bibtex
Uma Taxonomia de Sistemas de Tolerância a Falhas em Ambientes de Computação em Nuvem Open Source,
Vinicius Santos Andrade, M.S. thesis, Universidade Estadual Paulista "Júlio de Mesquita Filho", Jan., 2019
Bibtex
Elastic Scheduling in HPC Resource Management Systems,
Feng Liu, PhD thesis, University of Minnesota, Dec., 2018,
Bibtex
H-RADIC: A Fault Tolerance Framework for Virtual Clusters on Multi-Cloud Environments,
Ambrosio Royo, Jorge Villamayor, Marcela Castro-León, Dolores Rexachs, and Emilio Luque,
Journal of Computer Science and Technology (JCS&T) 18(3), pp. e24--e24, Dec., 2018, Springer, Bibtex.
Autonomic Approach based on Semantics and Checkpointing for IoT System Management,
François Aïssaoui, PhD thesis, Toulouse 1, Nov., 2018,
Bibtex
Checkpoint and Restart: An Energy Consumption Characterization,
Marina Morán, Javier Balladini, Dolores Rexachs, and Emilio Luque,
Argentine Congress of Computer Science (CACIC'18),
pp. 19--33, Oct., 2018, Springer,
(Also appearing in Spanish as: Factores que afectan el consumo energético de operaciones de checkpoint y restart en clusters (XIX Workshop Procesamiento Distribuido y Paralelo (WPDP) of CACIC'18), with pdf), Bibtex
Programming and Testing Support for Drone Based Applications,
Manos Koutsoubelias, PhD thesis, University of Thessaly (Greece), Sept., 2018,
Bibtex.
Resource Management for Extreme Scale High Performance Computing Systems in the Presence of Failures,
Daniel Dauwe, PhD thesis, Colorado State University, Sept., 2018,
Bibtex
Transparent High-Speed Network Checkpoint/Restart in MPI,
Julien Adam, Jean-Baptiste Besnard, Allen D Malony, Sameer Shende, Marc Pérache, Patrick Carribault, Julien Jaeger,
Proc. of 25th European MPI User Group Meeting,
ACM, 12 pages, Sept., 2018, Bibtex.
Fault Tolerance in Cloud Computing Environment: A Systematic Survey,
Moin Hasan and Major Singh Goraya,
Computers in Industry 99, pp. 156--172, Aug., 2018, Elsevier, Bibtex
Characterization of I/O Patterns generated by Fault Tolerance in HPC environments,
Betzabeth León, Daniel Franco, Dolores Rexachs, and Emilio Luque,
Proc. of Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA'18), pp. 28--34, Jul., 2018
Bibtex.
Leveraging the Checkpoint-Restart Technique for Optimizing CPU Efficiency of ATLAS Production Applications on Opportunistic Platforms,
D. Cameron, J. Elmsheuser, L. Heinrich, W. Lavrijsen, P. Nilsson, V. Tsulaia, M. Vogel on behalf of the ATLAS Collaboration,
5 pages, 2018, IOPScience 1085 (2018) 032028,
Bibtex. (See preprint version, below, also by Cameron et al.)
Checkpointing a Subsystem Remotely,
Gene Cooperman,
MVAPICH User Group (MUG'18),
Columbus, Ohio, Aug. 7, 2018; MUG'18 program, slides;
Bibtex.
Automatic Characterization of HPC Job Parallel Filesystem I/O Patterns,
Joseph P White, Alexander D Kofke, Robert L DeLeon, Martins Innus, Matthew D Jones, and Thomas R Furlani,
Proc. of the Practice and Experience on Advanced Research Computing (PEARC'18),
pp. 1--8, July, 2018, Bibtex.
Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput,
Rohan Garg, Tirthak Patel, Gene Cooperman, and Devesh Tiwari,
48th Annual IEEE/IFIP Int. Conf. on Dependable Systems and Networks (DSN'18),
IEEE, pp. 83--94, July, 2018, Bibtex.
Fault-Tolerance Support for Mobile Robotic Applications,
Manos Koutsoubelias and Spyros Lalis,
13th Int. Symp. on Industrial Embedded Systems (SIES'18),
IEEE, pp. 1--10, June, 2018, Bibtex.
RaaS: Resilience as a Service,
Jorge Villamayor, Dolores Rexachs, Emilio Luque, Diego Lugones,
Proc. 18th IEEE/ACM Int. Symp. on Cluster, Cloud and Grid Computing (CCGRID'18),
IEEE, pp. 356--359, May, 2018, Bibtex.
CDBB: an NVRAM-based Burst Buffer Coordination System for Parallel File Systems,
Ziqi Fan, Fenggang Wu, Jim Diehl, David HC Du, and Doug Voigt,
Proc. of the High Performance Computing Symposium (HPC'18),
pp. 1:1--1:12, Apr., 2018, ACM Press,
Society for Computer Simulation International, Bibtex.
Transparently Checkpointing Software Test Benches to Improve Productivity of SoC Verification in an Emulation Environment,
Ankit Garg, Suresh Krishnamurthy, Gene Cooperman, Rohan Garg, and Jeff Evans,
2018 Design and Verification Conference and Exhibition (DVCON-US 2018),
San Jose, CA, Feb. 27, 2018; DVCON-US 2018,
slides, Bibtex.
(IMPORTANT: Browser must allow popups, to view paper at DVCon site.)
Transparent checkpointing over RDMA-based networks,
Jiajun Cao, PhD thesis, Northeastern U., Dec., 2017,
Bibtex
ITALC: Interactive Tool for Application-Level Checkpointing,
Ritu Arora and Trung Nguyen Ba,
Proc. of Fourth International Workshop on HPC User Support Tools (HUST'17),
Nov., 2017; (slides), Bibtex.
E-HPC: a Library for Elastic Resource Management in HPC Environments,
William Fox, Devarshi Ghoshal, Abel Souza, Gonzalo P Rodrigo, and Lavanya Ramakrishnan,
Proc. of the 12th Workshop on Workflows in Support of Large-Scale Science,
Nov., 2017; Bibtex.
Constructing the Formal Grammar of System Calls,
Nikolay Efanov and Pavel Emelyanov,
Proc. 13th Central & Eastern European Software Engineering Conference in Russia,
Oct., 2017; Bibtex.
Selective Checkpointing for Minimizing Recovery Energy and Efforts of Smartphone Apps,
Li Li, Yunhao Bai, Xiaorui Wang, Mai Zheng, and Feng Qin,
Eighth Int. Green and Sustainable Computing Conference (IGSC'17),
IEEE, pp. 1--8, Oct., 2017; Bibtex.
Leveraging the Checkpoint-restart Technique for Optimizing CPU Efficiency of ATLAS Production Applications on Opportunistic Platforms,
D. Cameron, J. Elmsheuser, L. Heinrich, W. Lavrijsen, P. Nilsson, V. Tsulaia, M. Vogel on behalf of the ATLAS Collaboration,
ATL-SOFT-PROC-2017-064,
Oct., 2017; Bibtex.
When you have a hammer, everything is a nail: Checkpoint/Restart in Slurm,
Manuel Rodríguez-Pascual, Jose Antonio Moríñigo, and Rafael Mayo-García,
Slurm User Group Meeting — 2017,
Berkeley, CA, Sept. 26, 2017; Slurm User Group Agenda (accessed Oct., 2017);
Bibtex.
DMTCP: Fixing the Single Point of Failure of the ROS Master,
Gene Cooperman and Twinkle Jain,
ROSCon 2017,
Vancouver, Canada, Sept. 21, 2017; ROSCon'17 program, slides;
video;
Bibtex.
Expedite any Simulation with DMTCP and Save Decades of Computation,
Balaji R (presenting), Sathish Kumar Sugumara, Gene Cooperman, Rohan Garg, and Jiajun Cao,
2017 Design and Verification Conference and Exhibition (DVCON-India 2017),
Bengaluru, India, Sept. 15, 2017; DVCON-India 2017,
Bibtex.
Intelligent Checkpointing Strategies for IoT System Management,
Aïssaoui, François and Cooperman, Gene and Monteil, Thierry and Tazi, Saïd,
Future Internet of Things and Cloud (FiCloud'17),
IEEE, pp. 305--312, Aug., 2017; Bibtex.
Extending DMTCP Checkpointing for a Hybrid Software World,
Gene Cooperman,
MVAPICH User Group (MUG'17),
Columbus, Ohio, Aug. 16, 2017; MUG'17 program, slides;
video;
Bibtex.
A Methodology for Soft Errors Detection and Automatic Recovery,
Jorge Villamayor, Dolores Rexachs, Emilio Luque, Diego Montezanti, A. De Giusti, and M. Naiouf,
Int. Conf. on High Performance Computing & Simulation,
July, 2017; Bibtex.
When is the Right Time to Start the Fault Tolerance Protection?,
Jorge Villamayor, Dolores Rexachs, and Emilio Luque,
Int. Conf. on High Performance Computing & Simulation,
July, 2017; Bibtex.
A Fault Tolerance Manager with Distributed Coordinated Checkpoints for Automatic Recovery,
Jorge Villamayor, Dolores Rexachs, and Emilio Luque,
Int. Conf. on High Performance Computing & Simulation (HPCS-17),
July, 2017; Bibtex.
Performance of Android Cluster System Allowing Dynamic Node Reconfiguration,
Yuki Sawada, Yusuke Arai, Kanemitsu Ootsu, Takashi Yokota, and Takesh Ohkawa,
Wireless Personal Communications, 93(4), pp. 1067--1087, April, 2017, Springer,
Bibtex
Transition Watchpoints: Teaching Old Debuggers New Tricks,
Kapil Arya, Tyler Denniston, Ariel Rabkin, and Gene Cooperman,
The Art, Science, and Engineering of Programming 1(2),
28 pages, Apr., 2017; Bibtex.
Fault Tolerance and Message Passing Interface Programs,
Mohammad Miyan,
Int. J. of Advanced Research in Computer Science,
pp. 128--135, Mar/Apr, 2017; Bibtex.
A Reflexive Tactic for Polynomial Positivity using Numerical Solvers and Floating-point Computations,
Érik Martin-Dorel and Pierre Roux,
ACM SIGPLAN Conference on Certified Programs and Proofs (CPP 2017), pp. 90--99, Jan., 2017, ACM,
Bibtex
DMTCP: Deadline-aware Multipath TCP,
Huang, Chengyuan, Zhang, Jiao, Huang, Tao and Liu, Yunjie,
Proc. of Communications Workshops (2017 IEEE Int. Conf. on ICC), pp. 681--686, May, 2017, IEEE,
Bibtex
A Performance and Energy Comparison of Fault Tolerance Techniques for Exascale Computing Systems,
D. Dauwe, S. Pasricha, A. A. Maciejewski, and H. J. Siegel,
IEEE Int. Conf. on Computer and Information Technology (CIT'16),
pp. 436--443, Dec., 2016, IEEE,
Bibtex
Trace-free Memory Data Structure Forensics via Past Inference and Future Speculations,
Pengfei Sun, Rui Han, Mingbo Zhang and Saman Zonouz,
Proc. of 32nd Annual Conf. on Computer Security Applications, pp. 570--582, Dec., 2016, ACM,
Bibtex
Smart Scene Management for IoT-based Constrained Devices using Checkpointing,
François Aïssaoui, Gene Cooperman, Thierry Monteil, and Saïd Tazi,
15th IEEE Int. Symp. on Network Computing and Applications (NCA'16),
Cambridge, MA, USA, Oct. 31 - Nov. 2, 2016, pp. 170--174, IEEE Press, Nov., 2016,
Bibtex
Adaptive Fault Tolerance on ROS: A Component-Based Approach,
Jean-Charles Fabre, Michaël Lauer, Matthieu Amy,
ROSCon 2016, slides and video only, Oct., 2016,
Bibtex
Applying Future Exascale HPC Methodologies in the Energy Sector,
José J Camata, José M Cela, Danilo Costa, Alvaro LGA Coutinho, Daniel Fernández-Galisteo, Carmen Jiménez, Vadim Kourdioumov, Marta Mattoso, Rafael Mayo-García, Thomas Miras, and J.A. Moríñigo,
Proc. of Russian Supercomputing Days 2016, pp. 9--19, Sept., 2016, UPCommons,
Bibtex
Deduplication Potential of HPC Applications' Checkpoints,
Jürgen Kaiser, Ramy Gad, Tim Süß, Federico Padua, Lars Nagel and André Brinkmann,
Proc. of IEEE Int. Conf. on Cluster Computing (Cluster'16),
pp. 413--422, Taipei, Taiwan, IEEE Press, Sept., 2016,
Bibtex
Enhancing Energy Production with Exascale HPC Methods (and prior technical report),
Rafael Mayo-García, José J. Camata, José M. Cela, Danilo Costa, Alvaro LGA Coutinho, Daniel Fernández-Galisteo, Carmen Jiménez, Vadim Kourdioumov, Marta Mattoso, Thomas Miras, José A. Moríñigo, Jorge Navarro, Philippe O. A. Navaux, Daniel de Oliveira, Manuel Rodríguez-Pascual, Vítor Silva, Renan Souza, and Patrick Valduriez,
Latin American High Performance Computing Conference (CARLA'16), pp. 233--246, Aug., 2016, Springer (Communications in Computer and Information Science book series, CCIS, vol. 697),
Bibtex
Scalable System-level Transparent Checkpointing for OpenSHMEM,
Rohan Garg, Jérôme Vienne and Gene Cooperman,
OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments --- Third Workshop,
OpenSHMEM 2016, Baltimore, MD, USA, Aug. 2--4, 2016, Revised Selected Papers (OpenSHMEM'16),
pp. 52--65, Lecture Notes in Computer Science, Volume 10007, Springer-Verlag, Aug., 2016,
Bibtex
Extended Batch Sessions and Three-Phase Debugging: Using DMTCP to Enhance the Batch Environment,
Rohan Garg, Jiajun Cao, Kapil Arya, Gene Cooperman and Jérôme Vienne,
Proc. of the (XSEDE16) Conference on Diversity, Big Data, and Science at Scale,
pp. 42:1--42:8, ACM Press, July, 2016, (and slides),
Bibtex.
Computational Studies of De Novo Motif Discovery in Aptamer Selections,
Kevin R. Shieh, PhD thesis, Yeshiva University, 2016,
Bibtex
A Checkpointing Methodology for Android Smartphone,
Yunhao Bai, M.S. thesis, The Ohio State University, 2016,
Bibtex
Checkpointing with DMTCP and MVAPICH2 for Supercomputing,
Kapil Arya,
MVAPICH User Group (MUG'16),
Columbus, Ohio, Aug. 17, 2016; MUG'16 program, slides;
Bibtex.
Simulation Infrastructure for the Study of Performance/QOS/Energy,
Georgios Ioannis Kopanas,
Diploma Thesis, U. of Thessaly,
Feb., 2016, Bibtex.
An Affinity-structure Database of Helix-turn-helix: DNA Complexes with a Universal Coordinate System,
Mohammed AlQuraishi, Shengdong Tang and Xide Xia,
BMC Bioinformatics 16:390, 19 pages, 2015, BioMed Central,
Bibtex.
HOL(y)Hammer: Online ATP Service for HOL Light,
Cezary Kaliszyk and Josef Urban,
Mathematics in Computer Science 9(1), pp. 5--22, 2015, Springer,
(first published online on Jun 28, 2014)
Bibtex.
Parallel Application Signature for Performance Analysis and Prediction (or alt),
Alvaro Wong, Dolores Rexachs, Emilio Luque,
IEEE Trans. on Parallel and Distributed Systems 26(7), pp. 2009--2019, 2015, IEEE Press,
Bibtex.
Elastic Job Bundling: An Adaptive Resource Request Strategy for Large-scale Parallel Applications (or alt),
Feng Liu and Jon B. Weissman,
Proc. of the Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC'15), 12 pages, Nov., 2015, ACM,
Bibtex.
Performance Improvement in Automata Learning: Speeding up LearnLib using Parallelization and Checkpointing,
Marco Henrix,
M.S. thesis, Radboud University Nijmegen, Netherlands, Aug., 2015,
Bibtex.
An Android Cluster System Capable of Dynamic Node Reconfiguration,
Yuki Sawada, Yusuke Arai, Kanemitsu Ootsu, Takashi Yokota and Takeshi Ohkawa,
Proc. of 2015 Seventh Int. Conf. on Ubiquitous and Future Networks (ICUFN), pp. 689--694, IEEE Press, July, 2015,
Bibtex.
Enabling Sender-initiated Distributed Applications and Checkpointing in Content Centric Networks,
Nitinder Mohan,
Master of Technology Thesis, IIIT Delhi (Indraprastha Institute of Information Technology), July, 2015,
Bibtex.
Optimizing Checkpoint Restart with Data Deduplication,
Zhengyu Chen, Jianhua Sun and Hao Chen,
Scientific Programming, May, 2016, Hindawi Publishing Corporation,
Bibtex.
Transparent Checkpointing for Supercomputing,
Jiajun Cao and Rohan Garg,
MVAPICH User Group (MUG'15),
Columbus, Ohio, Aug. 20, 2015; MUG'15 program, slides, and video;
Bibtex.
Transparent Checkpoint-Restart: Re-Thinking the HPC Environment,
Gene Cooperman,
MVAPICH User Group (MUG'15),
Columbus, Ohio, Aug. 19, 2015; MUG'15 program, slides, and video;
Bibtex.
Recent Trends towards Green Clouds by using Fuzzy based Live Migration (or alt),
Amrinder Kaur and Anil Kumar,
International Journal of Computer Applications 113(3) (0975--8887), pp. 17--22, Mar., 2015,
Bibtex.
Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters,
R.R. Chandrasekar, A. Venkatesh, K. Hamidouche and D.K. Panda,
Proc. of 15th IEEE/ACM Int. Symp. on Cluster, Cloud and Grid Computing (CCGrid'15),
pp. 261--270, IEEE Press, 2015, Bibtex.
Checkpointing as a Service in Heterogeneous Cloud Environments,
Jiajun Cao, Matthieu Simonin, Gene Cooperman and Christine Morin,
Proc. of 15th IEEE/ACM Int. Symp. on Cluster, Cloud and Grid Computing (CCGrid'15),
pp. 61--70, IEEE Press, 2015, Bibtex.
Energy Efficient Rescheduling Algorithm for High Performance Computing,
Manisha Chauhan, Nazia Parveen, Sumit Kumar Saurav and GL Ganga Prasad,
Nat. Conf. on Parallel Computing Technologies (PARCOMPTECH'15), IEEE Press, 2015,
Bibtex.
CCNCheck: Enabling Checkpointed Distributed Applications in Content Centric Networks,
Nitinder Mohan and Pushpendra Singh,
CCNxCon'15: Content Centric Networking (technical talk abstract), 2 pages,
Bibtex.
DMTCP: Bringing Interactive Checkpoint-Restart to Python,
Kapil Arya and Gene Cooperman,
Computational Science & Discovery, 16 pages, 2015, IOPScience,
Bibtex.
Using Checkpointing and Virtualization for Fault Injection,
Cyrille Artho, Kuniyasu Suzaki, Masami Hagiya, Watcharin Leungwattanakit, Richard Potter, Eric Platon, Yoshinori Tanabe, Franz Weitl and Mitsuharu Yamamoto,
International Journal of Networking and Computing 5(2), pp. 347--372, 2015,
Bibtex.
Using Checkpointing and Virtualization for Fault Injection,
Cyrille Artho, Masami Hagiya, Watcharin Leungwattanakit, Eric Platon, Richard Potter, Kuniyasu Suzaki, Yoshinori Tanabe, Franz Weitl and Mitsuharu Yamamoto,
Second Int. Symp. on Computing and Networking (CANDAR'14), pp. 144--150, Dec., 2014, IEEE Press,
Bibtex.
Be Kind, Rewind --- Checkpoint & Restore Capability for Improving Reliability of Large-scale Semiconductor Design,
Igor Ljubuncic, Ravi Giri, Avikam Rozenfeld, and Andrew Goldis,
2014 IEEE High Performance Extreme Computing Conference (HPEC-2014),
6 pages, IEEE Press, Sept., 2014,
Bibtex.
Performance Evaluation of Checkpoint/Restart Techniques: For MPI Applications on Amazon Cloud,
Basma Abdel Azeem and Manal Helal,
9th Int. Conf. on Informatics and Systems (INFOS'14), pp. 49--57, Sept., 2014, IEEE Press,
Bibtex
DMTCP: System-Level Checkpoint-Restart in User-Space,
Kapil Arya and Gene Cooperman,
MVAPICH User Group (MUG'14),
Columbus, Ohio, Aug. 26, 2014; MUG'14 program, slides, and video;
Bibtex.
Metodología para Predecir el Consumo Energético de Checkpoints en Sistemas de HPC
(Methodology for Predicting the Energy Consumption of Checkpoints in HPC Systems),
Javier Balladini, Marina Morán, Dolores Rexachs and Emilio Luque,
XX Congreso Argentino de Ciencias de la Computación (CACIC'14),
10 pages, Oct., 2014, Bibtex.
Using SAGA and the Open Science Grid to Search for Aptamers,
Kevin Shieh, Pilib Ó Broin, David Rhee, Matthew Levy, and Aaron Golden,
Proc. of 2014 Ann. Conf. on Extreme Science and Engineering Discovery Environment (XSEDE'14), Art. No. 27, Jul., 2014,
Bibtex.
Simulation Speedup of ns-3 using Checkpoint and Restore (WNS3'14),
Kyle Harrigan and George Riley,
Proceedings of the 2014 Workshop on ns-3 (WNS3'14), Art. No. 7, 2014,
Bibtex.
User-Space Process Virtualization in the Context of Checkpoint-Restart and Virtual Machines,
Kapil Arya, PhD thesis, Northeastern University, August, 2014,
Bibtex.
Use of Checkpoint-Restart for Complex HEP Software on Traditional Architectures and Intel MIC,
Kapil Arya, Gene Cooperman, Andrea Dotti and Peter Elmer,
J. Physics: Conference Series 523, Conference 1,
(from Proc. of 15th Int. Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT2013)),
IOPScience, 8 pages, 2014, Bibtex.
GemFI: A Fault Injection Tool for Studying the Behavior of Applications on Unreliable Substrates,
K. Parasyris, S. Tziantzoulis, C.D. Antonopoulos, and N. Bellas,
44th Ann. IEEE/IFIP Int. Conf. on Dependable Systems and Networks (DSN), pp. 622--629, IEEE Press, Jun., 2014,
Bibtex.
Алгоритмы отказоустойчивого управления ресурсами пространственно-распределённых вычислительных систем
(Algorithms for Failover Resource Management in Distributed Computing Systems),
А.Ю. Поляков, О.В. Молдованова, А.А. Пазников, М.Г. Курносов, С.Н. Мамойленко, А.В. Ефимов, (A. Yu. Polyakov et al.),
Vestnik SibGUTIS 11(4), (УДК 004.382.2) pp. 11--29, 2014,
Bibtex.
Optimization Tools of Parallel Simulation of Nanostructures with Quantum Dots,
K. V. Pavskii, M. G. Kurnosov, and A. Yu. Polyakov,
Optoelectronics, Instrumentation and Data Processing 50(3), pp. 260--265,
May, 2014, Springer Press,
Bibtex.
(Original Russian Text at: K.V. Pavskii, M.G. Kurnosov, A.Yu. Polyakov, 2014, published in Avtometriya, 2014, Vol. 50, No. 3, pp. 56--61.)
Modular Software Model Checking for Distributed Systems,
W. Leungwattanakit, C. Artho, M. Hagiya, Y. Tanabe, M. Yamamoto, and K. Takahashi,
IEEE Trans. on Software Engineering 40(5), pp. 483--501, May, 2014, IEEE Press,
Bibtex
Designing Scalable and Efficient I/O Middleware for Fault-Resilient High-Performance Computing Clusters,
Raghunath Raja Chandrasekar, PhD thesis, 2014, The Ohio State University,
Bibtex
Improving the Efficiency of Fuzz Testing Using Checkpointing,
Ernst-Friedrich Zachow,
Masters Thesis, ETH-Zürich, April 1, 2014,
Bibtex.
Towards an Energy-Efficient Tool for Processing the Big Data,
Eric Renault and Selma Boumerdassi,
2nd International Conference on Future Internet of Things and Cloud (FiCloud'14), pp. 448--452, Aug., 2014, IEEE Press,
Bibtex
Abstraction Checkpointing Levels: Problems and Solutions,
Bakhta Meroufel and Ghalem Belalem,
International Journal of Computing 13(3), pp. 158--169, 2014,
Bibtex.
Explorations of the Viability of ARM and Xeon Phi for Physics Processing,
David Abdurachmanov, Kapil Arya, Josh Bendavid, Tommaso Boccali, Gene Cooperman, Andrea Dotti, Peter Elmer, Giulio Eulisse, Francesco Giacomini, Christopher D. Jones, Matteo Manzali and Shahzad Muzaffar,
J. Physics: Conference Series 513, Track 5,
(from Proc. of 20th Int. Conf. on Computing in High Energy and Nuclear Physics (CHEP13)),
IOPScience, 7 pages, 2014, Bibtex.
jmodeltest.org: Selection of Nucleotide Substitution Models on the Cloud,
Jose Manuel Santorum, Diego Darriba, Guillermo L. Taboada, and David Posada,
Bioinformatics 30(9),
pp. 1310--1311, Oxford Journals, Jan. 21, 2014,
Bibtex.
DMTCP: Bringing Checkpoint-Restart to Python,
Kapil Arya and Gene Cooperman, Proc. of the 12th Python in Science Conf. (SciPy 2013),
6 pages, 2013, Bibtex.
A Framework for an In-depth Comparison of Scale-up and Scale-out,
Michael Sevilla, Ike Nassi, Kleoni Ioannidou, Scott Brandt, and Carlos Maltzahn,
Proc. of 2013 Int. Workshop on Data-Intensive Scalable Computing Systems (DISCS'13), pp. 13--18, 2013,
Bibtex.
A Tool for Selecting the Right Target Machine for Parallel Scientific Applications,
Javier Panadero, Alvaro Wong, Dolores Rexachs, and Emilio Luque,
Procedia Computer Science 18, pp. 1824--1833, Elsevier, 2013,
Bibtex.
Formal Mathematics on Display: A Wiki for Flyspeck,
Carst Tankink, Cezary Kaliszyk, Josef Urban, and Herman Geuvers,
Intelligent Computer Mathematics,
Lecture Notes in Computer Science, vol. 7961, pp. 152--167, Springer, 2013,
Bibtex.
Towards Computing as a Utility via Adaptive Middleware: An Experiment in Cross-paradigm Execution,
Jaroslaw Slawinski and Vaidy Sunderam,
Parallel Processing Letters 23(2), 18 pages,
World Scientific, June, 2013,
Bibtex.
Calculation of the Subgroups of a Trivial-Fitting Group,
Alexander J. Hulpke,
Proc. of 38th Int. Symp. on Symbolic and Algebraic Computation, pp. 205--210, 2013, ACM Press,
Bibtex.
Semi-Automated Debugging via Binary Search through a Process Lifetime,
Kapil Arya, Tyler Denniston, Ana-Maria Visan, and Gene Cooperman,
Proc. of 7th Workshop on Programming Languages and Operating Systems (PLOS) (part of Proc. of 24th ACM Symp. on Operating System Principles (SOSP)), 2013,
ACM Press, Oct., 2013, Bibtex.
Shorten Device Boot Time for Automotive IVI and Navigation Systems (slides),
Jim Huang and Shi-Wu Lo (developers, 0xlab),
Automotive Linux Summit (ALS2013), May 28, 2013.
(See "Part II: Userspace solution: Checkpointing"; begins at slide 66)
SweeD: Likelihood-Based Detection of Selective Sweeps in Thousands of Genomes,
P. Pavlidis, D. Živkovic, A. Stamatakis and N. Alachiotis,
Heidelberg Institute for Theoretical Studies, Technical report Exelixis-RRDR-2013-1, February, 2013
A Survey of Fault Tolerance Mechanisms and Checkpoint/Restart Implementations for High Performance Computing Systems,
I.P. Egwutuoha, D. Levy, B. Selic and S. Chen,
The Journal of Supercomputing, Feb., 2013, Springer
Proposal of Incremental Software Simulation for Reduction of Evaluation Time,
Atsushi Shina, Kanemitsu Ootsu, Takeshi Ohkawa, Takashi Yokota and Takanobu Baba,
Third Int. Conf. on Networking and Computing (ICNC), pp. 311--315, IEEE Press, Dec., 2012, Bibtex.
Implement Checkpointing for Android (to speed up boot time and development process) (slides),
Jim Huang and Kito Cheng (developers, 0xlab),
Embedded Linux Conference Europe (ELCE2012),
Barcelona, Spain; Nov. 5--7, 2012. Bibtex.
Towards Fault-tolerant Energy-efficient High Performance Computing in the Cloud,
Kurt L. Keville, Rohan Garg, David J. Yates, Kapil Arya and Gene Cooperman,
Proc. of 2012 IEEE Computer Society International Conference on Cluster Computing, pp. 622--626, 2012, Bibtex.
Adapting MPI to MapReduce PaaS Clouds: An Experiment in Cross-Paradigm Execution,
Jaroslaw Slawinski and Vaidy Sunderam,
Proc. of 2012 IEEE/ACM Fifth Int. Conf. on Utility and Cloud Computing (UCC '12), pp. 199--203, 2012, Bibtex.
Creating and Improving Multi-Threaded Geant4,
Xin Dong, Gene Cooperman, John Apostolakis, Sverre Jarp, Andrzej Nowak, Makoto Asai and Daniel Brandt,
Journal of Physics: Conference Series, Volume 396, Part 5, 2012
Temporal Meta-Programming: Treating Time as a Spatial Dimension,
Ana-Maria Visan, PhD thesis, Northeastern University, April, 2012, Bibtex.
Verification of Embedded Control Systems by Simulation and Program Execution Control,
Stefan Resmerita and Wolfgang Pree,
American Control Conference (ACC), pp. 3581--3586, June, 2012, IEEE Press, Bibtex
Checkpointing in Distributed Heterogeneous Environments,
Michael Schöttner and John Mehnert-Spahn,
Technical Report, Heinrich Heine University, Duesseldorf, Germany, 26 pages, March, 2012,
(from Universität Düsseldorf: Publications),
Bibtex.
Source-Level Transformation of Legacy Sequential Program into Scalable Thread-Parallel Code,
Xin Dong, PhD thesis, Northeastern University, Dec., 2011, Bibtex.
Model Checking Distributed Systems by Combining Caching and Process Checkpointing,
Watcharin Leungwattanakit, Cyrille Artho, Masami Hagiya, Yoshinori Tanabe, and Mitsuharu Yamamoto,
26th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 103--112,
IEEE Press, Dec., 2011. Bibtex.
Including the Workload Effect in the Parallel Program Signature,
J.M. Canillas, A. Wong, D. Rexachs, and E. Luque,
Proc. of 13th Int. Conf. on High Performance Computing and Communications (HPCC), pp. 304--311,
IEEE Computer Society, Sept., 2011. Bibtex.
Predicting Parallel Applications Performance Using Signatures: the Workload Effect,
J.M. Canillas, A. Wong, D. Rexachs, and E. Luque,
9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), pp. 299--300,
IEEE Computer Society, Dec., 2011. Bibtex.
URDB: A Universal Reversible Debugger Based on Decomposing Debugging Histories,
Ana-Maria Visan, Kapil Arya, Gene Cooperman, and Tyler Denniston,
Proc. of 6th Workshop on Programming Languages and Operating Systems (PLOS) (part of Proc. of 23rd ACM Symp. on Operating System Principles (SOSP)), 2011,
ACM Press, Oct., 2011. Bibtex.
Direct Inference of Protein--DNA Interactions using Compressed Sensing Methods,
Mohammed AlQuraishi and Harley H. McAdams,
Proc. of National Academy of Sciences (PNAS) 108(36), pp. 14819--14824,
Sept. 6, 2011. Full Text (html), Full Text (pdf), Bibtex.
CheCL: Transparent Checkpointing and Process Migration of OpenCL Applications,
Hiroyuki Takizawa, Kentaro Koyama, Katsuto Sato, Kazuhiko Komatsu and Hiroaki Kobayashi,
Proc. of 2011 IEEE International Parallel and Distributed Processing Symposium, pp. 864--876,
IEEE Computer Society, May, 2011. Bibtex.
Distributed Speculative Parallelization using Checkpoint Restart,
Devarshi Ghoshal, Sreesudhan R. Ramkumar, and Arun Chauhan,
Procedia Computer Science, 4, pp. 422--431,
May, 2011, Slides, Bibtex.
Unibus: Aspects of Heterogeneity and Fault Tolerance in Cloud Computing,
M. Slawiñska, J. Slawinski, and V. Sunderam,
Proc. of IEEE Int. Symp. on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), pp. 1--10,
Apr., 2010, Bibtex.
