The following article is Open access

PanDA Pilot Submission using Condor-G: Experience and Improvements

Published under licence by IOP Publishing Ltd
Citation: Xin Zhao et al 2011 J. Phys.: Conf. Ser. 331 072069, DOI 10.1088/1742-6596/331/7/072069

Abstract

PanDA (Production and Distributed Analysis) is the workload management system of the ATLAS experiment, used to run managed production and user analysis jobs on the grid. Because PanDA is a late-binding, pilot-based system, maintaining a smooth and steady stream of pilot jobs to all grid sites is critical to its operation. The ATLAS Computing Facility (ACF) at BNL, the ATLAS Tier1 center in the US, operates the pilot submission systems for the US. It does so using the PanDA "AutoPilot" scheduler component, which submits pilot jobs via Condor-G, a grid job scheduling system developed at the University of Wisconsin-Madison. In this paper, we discuss the operation and performance of Condor-G pilot submission at BNL, with emphasis on the challenges and issues encountered in the real grid production environment. Through close collaboration between the Condor and PanDA teams, the scalability and stability of the overall system have been greatly improved over the last year. We review the improvements made to Condor-G as a result of this collaboration: isolation of site-based issues by running a separate Gridmanager for each remote site; introduction of the 'Nonessential' job attribute, which allows Condor to optimize its behavior for the specific character of pilot jobs; and better understanding and handling of the Gridmonitor process. We also review improved scheduling in the PanDA pilot scheduler component and cover the monitoring of the health of the system.
