R* Optimizer Validation and Performance Evaluation for Distributed Queries

Lothar F. Mackert(IBM Research - Almaden), Guy M. Lohman(IBM Research - Almaden)
Very Large Data Bases
July 1, 1994
Cited by 237

Abstract

Few database query optimizer models have been validated against actual performance. This paper extends an earlier optimizer validation and performance evaluation of R’ to di.rfribu& queries, i.e. single SQL statements having tables at multiple sites. Actual R* message, I/O, and CPU resources consumed and the corresponding costs estimated by the optimizer were written to database tables using new SQL commands, permitting automated control from application programs for collecting, reducing, and comparing test data. A number of tests were rnn over a wide variety of dynamically-created test databases, SQL queries, and system parameters. Both high-speed networks (comparable to a local area network) and medium-speed long-haul networks (for linking geographically dispersed hosts) were evaluated. The tests confirmed the accuracy of R*‘s message cost model and the significant contribution of local (CPU and I/O) costs, even for a medium-speed network. Although distributed queries consume more resources overall, the response time for some execution strategies improves disproportionately by exploiting both concurrency and reduced contention for buffers. For distributed joins in which a copy of the inner table must be transferred to the join site, shipping the whole inner table dominated the strategy of fetching only those inner tuples that matched each outer-table value, even though the former strategy may require additional I/O. Bloomjoins (hashed semijoins) consistently performed better than semijoins and the best R* strategies. Few of the distributed optimizer models proposed over the last decade CAPER 83, BERN 81~. CHAN 82, CHU 82, EPST 78, HEVN 79. KERS 82, ONUE 83, PERR 84, WONG 83, YAO 79, YU 831 have been validated by comparison with actual performance. The only known validations, for Distributed INGRES [STON 821 and the Crystal multicomputer [LU 853, have assumed only a high-speed local-area network linking the distributed systems. Also, the Distributed INGRES study focused primarily on reducing response time by exploiting parallelism using table partitioning and broadcast messages. In contrast, R* seeks to minimize total resonrces consumed, has not implemented table partitionings, and does not presume a network broadcast capability. There are many important questions that a thorough validation should answer: . Under what circumstances (regions of the parameter space) does the optimizer choose a suboptimal plan, or, worse, a particularly bad plan? . To which parameters is the actual performance most sensitive? . Are these parameters being modeled accurately by the optimizer? . What is the impact of variations from the optimizer’s simplifying as. Is it possible to simplify the optimizer’s model (by using heuristics, for


Related Papers

No related papers found

Powered by citation graph analysis