P Nilsson

Overview of ATLAS PanDA Workload Management

T. Maeno, K. De, T. Wenaus et al.|Journal of Physics Conference Series|2011

Cited by 76|Open Access

The Production and Distributed Analysis System (PanDA) plays a key role in the ATLAS distributed computing infrastructure. All ATLAS Monte-Carlo simulation and data reprocessing jobs pass through the PanDA system. We will describe how PanDA manages job execution on the grid using dynamic resource estimation and data replication together with intelligent brokerage in order to meet the scaling and automation requirements of ATLAS distributed computing. PanDA is also the primary ATLAS system for processing user and group analysis jobs, bringing further requirements for quick, flexible adaptation to the rapidly evolving analysis use cases of the early data-taking phase, in addition to the high reliability, robustness and usability needed to provide efficient and transparent utilization of the grid for analysis users. We will describe how PanDA meets ATLAS requirements, the evolution of the system in light of operational experience, how the system has performed during the first LHC data-taking phase, and plans for the future.

Evolution of the ATLAS PanDA workload management system for exascale computational science

T. Maeno, K. De, A. Klimentov et al.|Journal of Physics Conference Series|2014

Cited by 39|Open Access

An important foundation underlying the impressive success of data processing and analysis in the ATLAS experiment [1] at the LHC [2] is the Production and Distributed Analysis (PanDA) workload management system [3]. PanDA was designed specifically for ATLAS and proved to be highly successful in meeting all the distributed computing needs of the experiment. However, the core design of PanDA is not experiment specific. The PanDA workload management system is capable of meeting the needs of other data intensive scientific applications. Alpha-Magnetic Spectrometer [4], an astro-particle experiment on the International Space Station, and the Compact Muon Solenoid [5], an LHC experiment, have successfully evaluated PanDA and are pursuing its adoption. In this paper, a description of the new program of work to develop a generic version of PanDA will be given, as well as the progress in extending PanDA's capabilities to support supercomputers and clouds and to leverage intelligent networking. PanDA has demonstrated at a very large scale the value of automated dynamic brokering of diverse workloads across distributed computing resources. The next generation of PanDA will allow other data-intensive sciences and a wider exascale community employing a variety of computing platforms to benefit from ATLAS' experience and proven tools.

On efficient max-min fair routing algorithms

Michał Pióro, P Nilsson, Eligijus Kubilinskas et al.|Unknown|2004

Cited by 33

In the paper, we consider the problem of routing and bandwidth allocation in networks that support elastic traffic. We assume that the bandwidth demand between each source-destination (S-D) pair is specified in terms of a minimum and maximum value, and a set of flows between each S-D pair is allowed to realize these demands. (We say that a set of flows realizes the demand associated with an S-D pair if the sum of the bandwidths allocated to these flows is greater than the minimum value assumed for the demand of that S-D pair.) In this setting, we show that routing and bandwidth allocation can be formulated as an optimization problem, where network utilization is to be maximized under capacity and the widely used max-min fairness constraints. We describe three different algorithms to solve variants of this problem. The most important one, an efficient, original algorithm assuming multipath routing, is studied in detail and illustrated with a numerical example.

Experience from a pilot based system for ATLAS

P Nilsson|Journal of Physics Conference Series|2008

Cited by 32|Open Access

The PanDA software provides a highly performing distributed production and distributed analysis system. It is the first system in the ATLAS experiment [1] to use a pilot-based late job delivery technique. This paper describes the architecture of the pilot system used in PanDA. Unique features have been implemented for high-reliability automation in a distributed environment. Performance of PanDA is analyzed from one and a half years of experience of performing distributed computing on the Open Science Grid (OSG) infrastructure. Experience with the pilot delivery mechanism using Condor-G [2] and a glide-in factory developed under OSG will also be described.

The ATLAS PanDA Pilot in Operation

P Nilsson, José Manuel Rodríguez Caballero, K. De et al.|Journal of Physics Conference Series|2011

Cited by 24|Open Access

The ATLAS Production and Distributed Analysis system (PanDA) was designed to meet ATLAS requirements for a data-driven workload management system capable of operating at LHC data processing scale. Submitted jobs are executed on worker nodes by pilot jobs sent to the grid sites by pilot factories. This poster provides an overview of the PanDA pilot system and presents major features added in light of recent operational experience, including multi-job processing, advanced job recovery for jobs with output storage failures, gLExec based identity switching from the generic pilot to the actual user, and other security measures. The PanDA system serves all ATLAS distributed processing and is the primary system for distributed analysis; it is currently used at over 100 sites world-wide. We analyze the performance of the pilot system in processing real LHC data on the OSG, EGI and Nordugrid infrastructures used by ATLAS, and describe plans for its evolution.

Top publications by citations