Milind Chabbi

Integrating Asynchronous Task Parallelism with MPI

Sanjay Chatterjee, Sağnak Taşırlar, Zoran Budimlić et al.|Unknown|2013

Cited by 83

Effective combination of inter-node and intra-node parallelism is recognized to be a major challenge for future extreme-scale systems. Many researchers have demonstrated the potential benefits of combining both levels of parallelism, including increased communication-computation overlap, improved memory utilization, and effective use of accelerators. However, current “hybrid programming” approaches often require significant rewrites of application code and assume a high level of programmer expertise. Dynamic task parallelism has been widely regarded as a programming model that combines the best of performance and programmability for shared-memory programs. For distributed-memory programs, most users rely on efficient implementations of MPI. In this paper, we propose HCMPI (Habanero-C MPI), an integration of the Habanero-C dynamic task-parallel programming model with the widely used MPI message-passing interface. All MPI calls are treated as asynchronous tasks in this model, thereby enabling unified handling of messages and tasking constructs. For programmers unfamiliar with MPI, we introduce distributed data-driven futures (DDDFs), a new data-flow programming model that seamlessly integrates intra-node and inter-node data-flow parallelism without requiring any knowledge of MPI. Our novel runtime design for HCMPI and DDDFs uses a combination of dedicated communication and computation specific worker threads. We evaluate our approach on a set of micro-benchmarks as well as larger applications and demonstrate better scalability compared to the most efficient MPI implementations, while offering a unified programming model to integrate asynchronous task parallelism with distributed-memory parallelism.

High performance locks for multi-level NUMA systems

Milind Chabbi, Michael Fagan, John Mellor‐Crummey|Unknown|2015

Cited by 55Open Access

Efficient locking mechanisms are critically important for high performance computers. On highly-threaded systems with a deep memory hierarchy, the throughput of traditional queueing locks, e.g., MCS locks, falls off due to NUMA effects. Two-level cohort locks perform better on NUMA systems, but fail to deliver top performance for deep NUMA hierarchies. In this paper, we describe a hierarchical variant of the MCS lock that adapts the principles of cohort locking for architectures with deep NUMA hierarchies. We describe analytical models for throughput and fairness of Cohort-MCS (C-MCS) and Hierarchical MCS (HMCS) locks that enable us to tailor these locks for high performance on any target platform without empirical tuning. Using these models, one can select parameters such that an HMCS lock will deliver better fairness than a C-MCS lock for a given throughput, or deliver better throughput for a given fairness. Our experiments show that, under high contention, a three-level HMCS lock delivers up to 7.6x higher lock throughput than a C-MCS lock on a 128-thread IBM Power 755 and a five-level HMCS lock delivers up to 72x higher lock throughput on a 4096-thread SGI UV 1000. On the K-means clustering code from the MineBench suit, a three-level HMCS lock reduces the running time by up to 55% compared to the C-MCS lock on a IBM Power 755.

REDSPY

Shasha Wen, Milind Chabbi, Xu Liu|Unknown|2017

Cited by 39

Complex code bases with several layers of abstractions have abundant inefficiencies that affect the execution time. Value redundancy is a kind of inefficiency where the same values are repeatedly computed, stored, or retrieved over the course of execution. Not all redundancies can be easily detected or eliminated with compiler optimization passes due to the inherent limitations of the static analysis.

DeadSpy

Milind Chabbi, John Mellor‐Crummey|Unknown|2012

Cited by 37

Software systems often suffer from various kinds of performance inefficiencies resulting from data structure choice, lack of design for performance, and ineffective compiler optimization. Avoiding unnecessary operations, and in particular memory accesses, is desirable. In this paper, we describe DeadSpy --- a tool that dynamically detects every dead write to memory in a given execution and provides actionable feedback to the programmer. This tool provides a methodical way to identify dead writes, which is a common symptom of performance inefficiencies. Our analysis of the SPEC CPU2006 benchmarks showed that the fraction of dead writes is surprisingly high. In fact, we observed that the SPEC CPU2006 gcc benchmark has 61% dead writes on average across its reference inputs. DeadSpy pinpoints source lines contributing to such inefficiencies. In several case studies with high dead writes, simple code restructuring to eliminate dead writes improved their performance significantly. For gcc, avoiding dead writes improved its running time by as much as 28% for some inputs and 14% on average. We recommend dead write elimination as an important step in performance tuning.

Contention-conscious, locality-preserving locks

Milind Chabbi, John Mellor‐Crummey|Unknown|2016

Cited by 35

Over the last decade, the growing use of cache-coherent NUMA architectures has spurred the development of numerous locality-preserving mutual exclusion algorithms. NUMA-aware locks such as HCLH, HMCS, and cohort locks exploit locality of reference among nearby threads to deliver high lock throughput under high contention. However, the hierarchical nature of these locality-aware locks increases latency, which reduces the throughput of uncontended or lightly-contended critical sections. To date, no lock design for NUMA systems has delivered both low latency under low contention and high throughput under high contention.

Is this you? Claim your profile.

Top publicationsby citations