Job monitoring and steering in D-Grid’s High Energy Physics Community Grid☆
Introduction
Grid Computing in High Energy Physics (HEP) is driven by the needs of the Large Hadron Collider (LHC) at CERN. Among the many goals of the experiments located at CERN, the most prominent is finding the Higgs particle, which is responsible for the masses of particles. The LHC experiments produce large amounts of data, on the order of several PB per year, which must be processed and distributed to researchers all over the world. A Grid is the best solution to perform these tasks efficiently; thus, the LHC Computing Grid (LCG) [21] was deployed.
In the HEP community, many users will perform computations on the Grid. Many monitoring tools that provide information about the infrastructure and the resources have been created so far, but support for users who want to manage, trace and control the execution of their own jobs is rare. In the HEPCG project [17], [18] of Germany’s D-Grid initiative [9], a set of tools is being developed that supports the user with job-specific monitoring and steering [27].
In a typical application scenario, one user can submit hundreds or thousands of jobs to the Grid. The job status information provided by the gLite middleware [13] reveals little about the execution and environment of the jobs. In particular, if a job fails, it is very hard or even impossible to find the reason for the failure. Section 2 describes the tool AMon, which supports users by collecting information about their jobs and presenting it in a Web portal. The Job Execution Monitor (JEM) (see Section 3) provides information about the job’s environment and the execution of the job’s start-up scripts. Finally, once a job is running, the user can interactively control its execution through the computational steering tool RMOST (see Section 4). All three tools were developed in the HEPCG project.
AMon — A user centric job monitoring for large job sets
When running hundreds or thousands of jobs, a user needs help to keep an overview of the execution of the jobs. What is their status? Are jobs crashing? Are any problems arising? What is the resource usage of the jobs?
The job monitoring system AMon was designed to fulfill these needs. It provides the user with sufficient information on his/her jobs and their resource usage. Additionally, the monitoring data are pre-analyzed to give the user hints on possible problems. The data are presented in
Monitoring the execution of jobs — the job execution monitor
The main goals of the Python-based “Job Execution Monitor” are to monitor the execution of script files within user jobs in a Grid environment and to classify job failures.
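To make the two goals concrete, the following is a minimal sketch of script-level monitoring with a coarse failure classification. It is not JEM's actual implementation: the names `StepRecord`, `classify` and `monitor_script` are hypothetical, the failure categories are an assumption based on common POSIX shell exit-code conventions, and a POSIX shell is assumed to be available.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class StepRecord:
    """One monitored command of a job script (hypothetical structure)."""
    command: str
    exit_code: int
    stderr: str
    category: str


def classify(exit_code: int) -> str:
    """Map an exit code to a coarse failure class (illustrative assumption)."""
    if exit_code == 0:
        return "ok"
    if exit_code == 127:
        return "command-not-found"
    if exit_code == 126:
        return "not-executable"
    if exit_code < 0 or exit_code > 128:
        return "killed-by-signal"
    return "nonzero-exit"


def monitor_script(commands):
    """Run the commands in order, record each outcome, stop at the first failure."""
    records = []
    for command in commands:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True
        )
        record = StepRecord(
            command=command,
            exit_code=result.returncode,
            stderr=result.stderr.strip(),
            category=classify(result.returncode),
        )
        records.append(record)
        if record.exit_code != 0:
            break  # later steps would only add follow-up errors
    return records


if __name__ == "__main__":
    for step in monitor_script(["echo setup", "exit 3", "echo never reached"]):
        print(step.command, step.exit_code, step.category)
```

Recording every step, rather than only the final job status, is what lets a user see which command of a start-up script failed and in what way.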
It has been developed because existing job monitoring tools in the LCG/gLite environment only provide very limited functionality. They are either command line tools that deliver simple text strings for every job, or they provide only very limited information (e.g., status only). However, there are tools like
Online Steering in HEPCG
While a Grid job runs, the computational steering system RMOST (Result Monitoring and Online Steering Tool) [23], [30] supplies the user with information about the execution of the job. In contrast to the Job Execution Monitor, RMOST provides interactive access to the binary executable. Online Steering establishes an interactive connection to the job at runtime, visualizes intermediate results or runtime data, and allows job parameters to be modified. Thus, the cycle of preparing a job,
Related Work
Resource monitoring tools like GridICE [1] and the LCG Real Time Monitor [22] focus on service availability and resource usage from the resource’s point of view. For example, the numbers of running/waiting/aborted/finished jobs are shown for each resource. In this they differ from the basic approach of job monitoring, which focuses on a user’s jobs.
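The difference between the two viewpoints can be illustrated with a small sketch. The job records, user names and site names below are invented for illustration and do not reflect the data model of GridICE, the LCG Real Time Monitor or any other tool; the helper `status_counts` is likewise hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical job records: (owner, resource, status)
JOBS = [
    ("alice", "site-a", "running"),
    ("alice", "site-b", "aborted"),
    ("bob", "site-a", "waiting"),
    ("bob", "site-a", "finished"),
    ("alice", "site-a", "finished"),
]


def status_counts(jobs, key_index):
    """Count job statuses grouped by the field at key_index (0=owner, 1=resource)."""
    counts = defaultdict(Counter)
    for job in jobs:
        counts[job[key_index]][job[2]] += 1
    return {key: dict(counter) for key, counter in counts.items()}


# Resource-centric view: what is happening on each site, regardless of owner.
by_resource = status_counts(JOBS, 1)

# User-centric view: what is happening to each user's jobs, regardless of site.
by_user = status_counts(JOBS, 0)

if __name__ == "__main__":
    print(by_resource)
    print(by_user)
```

The same records feed both views; only the grouping key changes, which is why a resource-oriented display cannot directly answer a question about one user's jobs.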
OCM-G [4] is a job monitoring tool for the Grid. It focuses on parallel jobs which are distributed across multiple sites. It contains an infrastructure
Conclusions
The HEPCG project identified the user community’s need for monitoring and execution control of large job sets. An overall concept with three information levels was designed: an easily accessible, graphical overview of job status and resource usage, a detailed job execution monitoring environment with failure classification, and interactive online steering of Grid applications. Appropriate tools, which provide the necessary support to the user for monitoring and steering, were
References (33)
- GridICE: A monitoring service for Grid systems, Future Generation Computer Systems (2005)
- ATLAS Collaboration, ATLAS TDR 14 detector and physics performance, volume 1. Technical Report LHCC 99-14, CERN,...
- ATLAS Collaboration, ATLAS TDR 15 detector and physics performance, volume 2. Technical Report LHCC 99-15, CERN,...
- Monitoring Grid applications with Grid-enabled OMIS-Monitor
- R. Brun, F. Rademakers, ROOT — an object oriented data analysis framework, in: Proceedings of AIHENP’96 Workshop,...
- J.D. Brunner, et al. VASE: the visualization and application steering environment, in: Proc. of the 1993 ACM/IEEE conf....
- Fault tolerance in the R-GMA information and monitoring system
- CERN - European Laboratory for Particle Physics, Athena Developer Guide....
- D-Grid webpage:...
- et al., An object-based infrastructure for program monitoring and steering
- A time-coherent model for the steering of parallel simulations
- CUMULVS: Providing fault-tolerance, visualization and steering of parallel applications, International Journal of High Performance Computing Applications
- Falcon: On-line monitoring for steering parallel programs, Concurrency: Practice and Experience
D. Lorenz is a Ph.D. student at the Institute of Operating Systems and Distributed Systems at the Universität Siegen in Germany. In 2005, he received a Dipl.-Inf. degree in computer science at the Rheinische Friedrich-Wilhelms-Universität Bonn, Germany. Since December 2005 he has worked as a scientific associate at the University of Siegen for the HEPCG project of the D-Grid Initiative. Currently, his research focus is online steering in computational Grids.
P. Buchholz is a full professor for Experimental Particle Physics at the University of Siegen, Germany. He received his Ph.D. from the University of Dortmund, Germany, in 1986. He has over 20 years of research experience in developing, running and analysing experiments in the fields of particle and astroparticle physics. His main interests are the development of detectors and the corresponding fast digital trigger and readout electronics, the physics of heavy quarks and the origin and composition of ultra-high energy cosmic rays.
T. Harenberg holds a Ph.D. and is a research assistant at the Institute of Mathematics and Sciences at the Bergische Universität Wuppertal in Germany. In 1999, he received a Dipl.-Phys. degree in physics at the Bergische Universität Wuppertal for his studies on pixel detectors at the ATLAS detector. Afterwards, he worked on integrating the DZERO and the AMANDA experiments into the European Data Grid project and received his Ph.D. in 2005. Since then, he has been the leader of the Grid development group of the University of Wuppertal.
R. Müller-Pfefferkorn is a researcher at the Center for Information Services and High Performance Computing (ZIH) at the Technische Universität Dresden, Germany. In 2001 he received a Ph.D. in particle physics from the TU Dresden. After a research position at the Institute for Nuclear and Particle Physics in Dresden, he became a research assistant at ZIH in 2002. His research interests are in the fields of monitoring in Grid computing, Grid data management, code optimization and performance analysis of parallel programs.
R. Neumann graduated as an Economy Engineer at the Engineering School of Leipzig in 1968 and received his diploma in Physics from the Technical University of Dresden in 1974. He was responsible for software development in an industrial company. Since 2000 he has worked in the Center for Information Services and High Performance Computing (ZIH), where he designs tools for performance analysis of parallel programs. Currently his research work focuses on performance visualisation of heterogeneous grid applications in the scope of the D-Grid project funded by the Federal Ministry of Education and Research.
W. Walkowiak is a scientific staff member in the Experimental Particle Physics group at the University of Siegen, Germany. He received his Ph.D. from the University of Mainz, Germany, in 1999. He has over 10 years of research experience in developing, running and analysing experiments in the field of particle physics. His main interests are the development of detectors for particle physics experiments and the study of the physics of heavy quarks. His interests also include computing as needed for the new generation of particle physics experiments.
R. Wismüller is a full professor for Operating Systems and Distributed Systems at the University of Siegen, Germany. He received a Ph.D. and a State doctorate in Computer Science from Technische Universität München, Germany, in 1994 and 2001, respectively. He has over 15 years of research experience in the field of tools for on-line monitoring and control of parallel and distributed programs, including debugging, performance analysis, and monitoring techniques. His interests also include parallel and distributed programming, Grid computing, optimizing compiler techniques, computer architecture, networking, and security.
☆ This work is partly funded by the Bundesministerium für Bildung und Forschung (BMBF) as part of the German e-Science Initiative (contract 01AK802E, HEP-CG).