Job monitoring and steering in D-Grid’s High Energy Physics Community Grid

https://doi.org/10.1016/j.future.2008.05.009

Abstract

In the High Energy Physics Community Grid (HEPCG) of Germany’s D-Grid initiative, a suite of tools was developed to support users in monitoring their jobs. In the HEP community, many users submit large numbers of jobs, and a considerable fraction of these jobs fail for various reasons. Until now, it has been hard or even impossible for the user to find the reason for a job failure. The AMon tool gives the user a graphical, web-based overview of the status and resource usage of his jobs. The script wrapper JEM (Job Execution Monitor) monitors a job’s environment and gives detailed information about the job execution. Finally, once the job itself is running, a computational steering tool allows the user to interact with the job at runtime, to visualize intermediate results, and to modify job parameters.

Introduction

Grid Computing in High Energy Physics (HEP) is driven by the needs of the Large Hadron Collider (LHC) at CERN. Finding the Higgs particle, which is responsible for the masses of particles, is the most prominent of the many goals of the experiments located at CERN. The LHC experiments produce data on the order of several PB per year. This vast amount of data must be processed and distributed to researchers all over the world. A Grid is the best solution for performing these tasks efficiently; thus, the LHC Computing Grid (LCG) [21] was deployed.

In the HEP community, many users will perform computations on the Grid. Many monitoring tools that provide information about the infrastructure and the resources have been created, but support for users to manage, trace and control the execution of their own jobs is rare. In the HEPCG project [17], [18] of Germany’s D-Grid initiative [9], a set of tools has been developed that supports the user with job-specific monitoring and steering [27].

In a typical application scenario, one user can submit hundreds or thousands of jobs to the Grid. The job status information provided by the gLite middleware [13] reveals little about the execution and environment of the jobs. In particular, if a job fails, it is very hard or even impossible to find the reason for its failure. Section 2 describes the tool AMon, which supports users by collecting information about their jobs and presenting it in a Web Portal. The Job Execution Monitor (JEM) (see Section 3) provides information about the job’s environment and the execution of the job’s start-up scripts. Finally, once a job is running, the user is able to interactively control the execution of the job through the computational steering tool RMOST (see Section 4). All three tools were developed in the HEPCG project.

AMon — A user centric job monitoring for large job sets

When running hundreds or thousands of jobs, a user needs help to keep an overview of the execution of the jobs. What is their status? Are jobs crashing? Are any problems arising? What is the resource usage of the jobs?
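The questions above amount to condensing many per-job records into one compact overview. The following is a minimal Python sketch of that idea and not AMon's actual implementation; the record fields (`status`, `cpu_seconds`) and the status names are assumptions made purely for illustration.

```python
from collections import Counter


def summarize_jobs(jobs):
    """Aggregate hypothetical per-job records into a small overview.

    Each record is assumed to be a dict with a "status" string and an
    optional "cpu_seconds" number; neither field comes from the source.
    """
    counts = Counter(job["status"] for job in jobs)
    total = len(jobs)
    failed = counts.get("failed", 0)
    return {
        "total": total,
        "by_status": dict(counts),
        # Guard against an empty job set to avoid division by zero.
        "failed_fraction": failed / total if total else 0.0,
        "cpu_seconds": sum(job.get("cpu_seconds", 0.0) for job in jobs),
    }


if __name__ == "__main__":
    example_jobs = [
        {"status": "running", "cpu_seconds": 10.0},
        {"status": "done", "cpu_seconds": 20.0},
        {"status": "failed", "cpu_seconds": 5.0},
        {"status": "done", "cpu_seconds": 15.0},
    ]
    print(summarize_jobs(example_jobs))
```

A summary of this kind answers "what is their status" and "are jobs crashing" at a glance; a real system would also need to collect the records from the Grid sites, which this sketch leaves out.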

The job monitoring system AMon was designed to fulfill these needs. It provides the user with sufficient information on his/her jobs and their resource usage. Additionally, the monitoring data are pre-analyzed to give the user hints on possible problems. The data are presented in

Monitoring the execution of jobs — the job execution monitor

The main goals of the Python-based “Job Execution Monitor” are to monitor the execution of script files within user jobs in a Grid environment and to classify job failures.
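The two goals, observing a script's execution and classifying a failure, can be illustrated with a minimal Python sketch. This is not JEM's implementation: the function names, the failure categories, and the stderr patterns they match are all hypothetical and chosen only to make the idea concrete.

```python
import subprocess
import sys


def classify_failure(returncode, stderr_text):
    """Map an exit status and stderr output to a coarse failure class.

    The categories below are illustrative assumptions, not JEM's own.
    """
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # subprocess reports termination by signal N as -N on POSIX.
        return "killed_by_signal"
    if returncode == 127 or "command not found" in stderr_text:
        return "missing_command"
    if "No such file or directory" in stderr_text:
        return "missing_file"
    if "Permission denied" in stderr_text:
        return "permission_denied"
    return "unknown_failure"


def run_monitored(command):
    """Run one command, capture its output, and return a monitoring record."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return {
        "command": command,
        "returncode": proc.returncode,
        "failure_class": classify_failure(proc.returncode, proc.stderr),
        # Keep only the last lines of stderr as a short diagnostic.
        "stderr_tail": proc.stderr.strip().splitlines()[-3:],
    }


if __name__ == "__main__":
    # Stand-in for a job step: a child interpreter that exits with status 3.
    record = run_monitored([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(record["returncode"], record["failure_class"])
```

Matching on exit codes and stderr text is deliberately simple here; it shows how a wrapper can turn a bare "job failed" status into a more specific hint for the user.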

It has been developed because existing job monitoring tools in the LCG/gLite environment provide only very limited functionality. They are either command line tools that deliver simple text strings for every job, or they provide very limited information (e.g., status only). However, there are tools like

Online Steering in HEPCG

When a Grid job runs, the user is supported with information about the execution of the job through the computational steering system RMOST (Result Monitoring and Online Steering Tool) [23], [30]. In contrast to the Job Execution Monitor, RMOST provides interactive access to the binary executable. Online Steering establishes an interactive connection to the job at runtime, visualizes intermediate results or runtime data, and allows job parameters to be modified. Thus, the cycle of preparing a job,

Related Work

Resource monitoring tools like GridICE [1] and the LCG Real Time Monitor [22] focus on service availability and resource usage from the resource’s point of view. For example, the numbers of running, waiting, aborted and finished jobs are shown for each resource. In this they differ from the basic approach of job monitoring, which focuses on a user’s jobs.

OCM-G [4] is a job monitoring tool for the Grid. It focuses on parallel jobs which are distributed across multiple sites. It contains an infrastructure

Conclusions

The HEPCG project identified the user community’s need for monitoring and execution control of large job sets. An overall concept with three information levels was designed: an easily accessible, graphical overview of job status and resource usage, a detailed job execution monitoring environment with failure classification, and interactive online steering of Grid applications. Appropriate tools, which provide the necessary support to the user for monitoring and steering, were

References (33)

  • S. Andreozzi, GridICE: A monitoring service for Grid systems, Future Generation Computer Systems (2005)
  • ATLAS Collaboration, ATLAS TDR 14 detector and physics performance, volume 1. Technical Report LHCC 99-14, CERN,...
  • ATLAS Collaboration, ATLAS TDR 15 detector and physics performance, volume 2. Technical Report LHCC 99-15, CERN,...
  • B. Balis, Monitoring Grid applications with Grid-enabled OMIS-Monitor
  • R. Brun, F. Rademakers, ROOT — an object oriented data analysis framework, in: Proceedings of AIHENP’96 Workshop,...
  • J.D. Brunner, et al. VASE: the visualization and application steering environment, in: Proc. of the 1993 ACM/IEEE conf....
  • R. Byrom, Fault tolerance in the R-GMA information and monitoring system
  • CERN - European Laboratory for Particle Physics, Athena Developer Guide....
  • D-Grid webpage:...
  • G.S. Eisenhauer et al., An object-based infrastructure for program monitoring and steering
  • A. Esnard et al., A time-coherent model for the steering of parallel simulations
  • G.A. Geist et al., CUMULVS: Providing fault-tolerance, visualization and steering of parallel applications, International Journal of High Performance Computing Applications (1997)
  • gLite webpage:...
  • RFC 2078: GSSAPI version...
  • W. Gu, Falcon: On-line monitoring for steering parallel programs, Concurrency: Practice and Experience (1998)
  • T. Harenberg

    D. Lorenz is a Ph.D. student at the Institute of Operating Systems and Distributed Systems at the Universität Siegen in Germany. In 2005, he received a Dipl.-Inf. degree in computer science at the Rheinische Friedrich-Wilhelms-Universität Bonn, Germany. Since December 2005 he has worked as a scientific associate at the University of Siegen for the HEPCG project of the D-Grid Initiative. Currently, his research focus is online steering in computational Grids.

    P. Buchholz is a full professor for Experimental Particle Physics at the University of Siegen, Germany. He received his Ph.D. from the University of Dortmund, Germany, in 1986. He has over 20 years of research experience in developing, running and analysing experiments in the fields of particle and astroparticle physics. His main interests are the development of detectors and the corresponding fast digital trigger and readout electronics, the physics of heavy quarks and the origin and composition of ultra-high energy cosmic rays.

    T. Harenberg holds a Ph.D. and is a research assistant at the Institute of Mathematics and Sciences at the Bergische Universität Wuppertal in Germany. In 1999, he received a Dipl.-Phys. degree in physics at the Bergische Universität Wuppertal for his studies on pixel detectors at the ATLAS detector. Afterwards, he worked on integrating the DZERO and the AMANDA experiments into the European Data Grid project and received his Ph.D. in 2005. Since then, he has been the leader of the Grid development group of the University of Wuppertal.

    R. Müller-Pfefferkorn is a researcher at the Center for Information Services and High Performance Computing (ZIH) at the Technische Universität Dresden, Germany. In 2001 he received a Ph.D. in particle physics from the TU Dresden. After a research position at the Institute for Nuclear and Particle Physics in Dresden, he became a research assistant at ZIH in 2002. His research interests are in the fields of monitoring in Grid computing, Grid data management, code optimization and performance analysis of parallel programs.

    R. Neumann graduated as an Economy Engineer at the Engineering School of Leipzig in 1968 and got his diploma in Physics from the Technical University of Dresden in 1974. He was responsible for the software development in an industrial company. Since 2000 he has worked at the Center for Information Services and High Performance Computing (ZIH), where he designs tools for performance analysis of parallel programs. Currently his research work focuses on performance visualisation of heterogeneous grid applications in the scope of the D-Grid project funded by the Federal Ministry of Education and Research.

    W. Walkowiak is a scientific staff member in the Experimental Particle Physics group at the University of Siegen, Germany. He received his Ph.D. from the University of Mainz, Germany, in 1999. He has over 10 years of research experience in developing, running and analysing experiments in the field of particle physics. His main interests are the development of detectors for particle physics experiments and the study of the physics of heavy quarks. His interests also include computing as needed for the new generation of particle physics experiments.

    R. Wismüller is a full professor for Operating Systems and Distributed Systems at the University of Siegen, Germany. He received a Ph.D. and a State doctorate in Computer Science from Technische Universität München, Germany, in 1994 and 2001, respectively. He has over 15 years of research experience in the field of tools for on-line monitoring and control of parallel and distributed programs, including debugging, performance analysis, and monitoring techniques. His interests also include parallel and distributed programming, Grid computing, optimizing compiler techniques, computer architecture, networking, and security.

    This work is partly funded by the Bundesministerium für Bildung und Forschung (BMBF) as part of the German e-Science Initiative (contract 01AK802E, HEP-CG).
