H

Hong Wang

Shandong Xiehe University

ORCID: 0009-0008-2874-2168

Publishes on Parallel Computing and Optimization Techniques, Embedded Systems Design Techniques, Interconnection Networks and Systems. 75 papers and 1.5k citations.

75Publications
1.5kTotal Citations

Is this you? Claim your profile.

Add your photo, update your bio, and get notified when your ranking changes.

Top publicationsby citations

Speculative precomputation
Cited by 321

This paper explores Speculative Precomputation, a technique that uses idle thread context in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future memory accesses in available thread contexts, and prefetching these data. This technique is evaluated by simulating the performance of a research processor based on the Itanium™ ISA supporting Simultaneous Multithreading. Two primary forms of Speculative Precomputation are evaluated. If only the non-speculative thread spawns speculative threads, performance gains of up to 30% are achieved when assuming ideal hardware. However, this speedup drops considerably with more realistic hardware assumptions. Permitting speculative threads to directly spawn additional speculative threads reduces the overhead associated with spawning threads and enables significantly more aggressive speculation, overcoming this limitation. Even with realistic costs for spawning threads, speedups as high as 169% are achieved, with an average speedup of 76%.

EXOCHI
Cited by 147

Future mainstream microprocessors will likely integrate specialized accelerators, such as GPUs, onto a single die to achieve better performance and power efficiency. However, it remains a keen challenge to program such a heterogeneous multicore platform, since these specialized accelerators feature ISAs and functionality that are significantly different from the general purpose CPU cores. In this paper, we present EXOCHI: (1) Exoskeleton Sequencer(EXO), an architecture to represent heterogeneous acceleratorsas ISA-based MIMD architecture resources, and a shared virtual memory heterogeneous multithreaded program execution model that tightly couples specialized accelerator cores with generalpurpose CPU cores, and (2) C for Heterogeneous Integration(CHI), an integrated C/C++ programming environment that supports accelerator-specific inline assembly and domain-specific languages. The CHI compiler extends the OpenMP pragma for heterogeneous multithreading programming, and produces a single fat binary with code sections corresponding to different instruction sets. The runtime can judiciously spread parallel computation across the heterogeneous cores to optimize performance and power.

Speculative precomputation: long-range prefetching of delinquent loads
Cited by 121

This paper explores Speculative Precomputation, a technique that uses idle thread contexts in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future memory accesses in available thread contexts, and prefetching these data. This technique is evaluated by simulating the performance of a research processor based on the Itanium/sup TM/ ISA supporting Simultaneous Multithreading. Two primary forms of Speculative Precomputation are evaluated. If only the non-speculative thread spawns speculative threads, performance gains of up to 30% are achieved when assuming ideal hardware. However, this speedup drops considerably with more realistic hardware assumptions. Permitting speculative threads to directly spawn additional speculative threads reduces the overhead associated with spawning threads and enables significantly more aggressive speculation, overcoming this limitation. Even with realistic costs for spawning threads, speedups as high as 169% are achieved, with an average speedup of 76%.

Post-pass binary adaptation for software-based speculative precomputation
Cited by 105

Recently, a number of thread-based prefetching techniques have been proposed. These techniques aim at improving the latency of single-threaded applications by leveraging multithreading resources to perform memory prefetching via speculative prefetch threads. Software-based speculative precomputation (SSP) is one such technique, proposed for multithreaded Itanium models. SSP does not require expensive hardware support-instead it relies on the compiler to adapt binaries to perform prefetching on otherwise idle hardware thread contexts at run time. This paper presents a post-pass compilation tool for generating SSP-enhanced binaries. The tool is able to: (1) analyze a single-threaded application to generate prefetch threads; (2) identify and embed trigger points in the original binary; and (3) produce a new binary that has the prefetch threads attached. The execution of the new binary spawns the speculative prefetch threads, which are executed concurrently with the main thread. Our results indicate that for a set of pointer-intensive benchmarks, the prefetching performed by the speculative threads achieves an average of 87% speedup on an in-order processor and 5% speedup on an out-of-order processor.

Intel® atom™ processor core made FPGA-synthesizable
Cited by 69

We present an FPGA-synthesizable version of the Intel Atom processor core, synthesized to a Virtex-5 based FPGA emulation system. To make the production Atom design in SystemVerilog synthesizable through industry standard EDA tool flow, we transformed and mapped latches in the design, converted clock gating, and replaced nonsynthesizable constructs with FPGA-synthesizable counterparts. Additionally, as the target FPGA emulator is hosted on a PC platform with the Pentium-based CPU socket that supports a significantly different front side bus (FSB) protocol from that of the Atom processor, we replaced the existing bus control logic in the Atom core with an alternate FSB protocol to communicate with the rest of the PC platform. With these efforts, we succeeded in synthesizing the entire Atom processor core to fit within a single Virtex-5 LX330 FPGA. The synthesizable Atom core runs at 50Mhz on the Pentium PC motherboard with fully functional I/O peripherals. It is capable of booting off-the-shelf MS-DOS, Windows XP and Linux operating systems, and executing standard x86 workloads.