R* Optimizer Validation and Performance Evaluation for Distributed QueriesFew database query optimizer models have been validated against actual performance. This paper extends an earlier optimizer validation and performance evaluation of R’ to di.rfribu& queries, i.e. single SQL statements having tables at multiple sites. Actual R* message, I/O, and CPU resources consumed and the corresponding costs estimated by the optimizer were written to database tables using new SQL commands, permitting automated control from application programs for collecting, reducing, and comparing test data. A number of tests were rnn over a wide variety of dynamically-created test databases, SQL queries, and system parameters. Both high-speed networks (comparable to a local area network) and medium-speed long-haul networks (for linking geographically dispersed hosts) were evaluated. The tests confirmed the accuracy of R*‘s message cost model and the significant contribution of local (CPU and I/O) costs, even for a medium-speed network. Although distributed queries consume more resources overall, the response time for some execution strategies improves disproportionately by exploiting both concurrency and reduced contention for buffers. For distributed joins in which a copy of the inner table must be transferred to the join site, shipping the whole inner table dominated the strategy of fetching only those inner tuples that matched each outer-table value, even though the former strategy may require additional I/O. Bloomjoins (hashed semijoins) consistently performed better than semijoins and the best R* strategies. Few of the distributed optimizer models proposed over the last decade CAPER 83, BERN 81~. CHAN 82, CHU 82, EPST 78, HEVN 79. KERS 82, ONUE 83, PERR 84, WONG 83, YAO 79, YU 831 have been validated by comparison with actual performance. The only known validations, for Distributed INGRES [STON 821 and the Crystal multicomputer [LU 853, have assumed only a high-speed local-area network linking the distributed systems. Also, the Distributed INGRES study focused primarily on reducing response time by exploiting parallelism using table partitioning and broadcast messages. In contrast, R* seeks to minimize total resonrces consumed, has not implemented table partitionings, and does not presume a network broadcast capability. There are many important questions that a thorough validation should answer: . Under what circumstances (regions of the parameter space) does the optimizer choose a suboptimal plan, or, worse, a particularly bad plan? . To which parameters is the actual performance most sensitive? . Are these parameters being modeled accurately by the optimizer? . What is the impact of variations from the optimizer’s simplifying as. Is it possible to simplify the optimizer’s model (by using heuristics, for
R* optimizer validation and performance evaluation for local queriesFew database query optimizer models have been validated against actual performance. This paper presents the methodology and results of a thorough validation of the optimizer and evaluation of the performance of the experimental distributed relational database management system R * , which inherited and extended to a distributed environment the optimization algorithms of System R. Optimizer estimated costs and actual R * resources consumed were written to database tables using new SQL commands, permitting automated control from SQL application programs of test data collection and reduction. A number of tests were run over a wide variety of dynamically-created test databases, SQL queries, and system parameters. The results for single-table access, sorting, and local 2-table joins are reported here. The tests confirmed the accuracy of the majority of the I/O cost model, the significant contribution of CPU cost to total cost, and the need to model CPU cost in more detail than was done in System R. The R * optimizer now retains cost components separately and estimates the number of CPU instructions, including those for applying different kinds of predicates. The sensitivity of I/O cost to buffer space motivated the development of more detailed models of buffer utilization unclustered index scans and nested-loop joins often benefit from pages remaining in the buffers, whereas concurrent scans of the data pages and the index pages for multiple tables during joins compete for buffer share. Without an index on the join column of the inner table, the optimizer correctly avoids the nested-loop join, confirming the need for merge-scan joins. When the join column of the inner is indexed, the optimizer overestimates the cost of the nested-loop join, whose actual performance is very sensitive to three parameters that are extremely difficult to estimate (1) the join (result) cardinality, (2) the outer table's cardinality, and (3) the number of buffer pages available to store the inner table. Suggestions are given for improved database statistics, prefetch and page replacement strategies for the buffer manager, and the use of temporary indexes and Bloom filters (hashed semijoins) to reduce access of unneeded data.
R* optimizer validation and performance evaluation for local queriesFew database query optimizer models have been validated against actual performance. This paper presents the methodology and results of a thorough validation of the optimizer and evaluation of the performance of the experimental distributed relational database management system R*, which inherited and extended to a distributed environment the optimization algorithms of System R. Optimizer estimated costs and actual R* resources consumed were written to database tables using new SQL commands, permitting automated control from SQL application programs of test data collection and reduction. A number of tests were run over a wide variety of dynamically-created test databases, SQL queries, and system parameters. The results for single-table access, sorting, and local 2-table joins are reported here. The tests confirmed the accuracy of the majority of the I/O cost model, the significant contribution of CPU cost to total cost, and the need to model CPU cost in more detail than was done in System R. The R* optimizer now retains cost components separately and estimates the number of CPU instructions, including those for applying different kinds of predicates. The sensitivity of I/O cost to buffer space motivated the development of more detailed models of buffer utilization unclustered index scans and nested-loop joins often benefit from pages remaining in the buffers, whereas concurrent scans of the data pages and the index pages for multiple tables during joins compete for buffer share. Without an index on the join column of the inner table, the optimizer correctly avoids the nested-loop join, confirming the need for merge-scan joins. When the join column of the inner is indexed, the optimizer overestimates the cost of the nested-loop join, whose actual performance is very sensitive to three parameters that are extremely difficult to estimate (1) the join (result) cardinality, (2) the outer table's cardinality, and (3) the number of buffer pages available to store the inner table. Suggestions are given for improved database statistics, prefetch and page replacement strategies for the buffer manager, and the use of temporary indexes and Bloom filters (hashed semijoins) to reduce access of unneeded data.
Index scans using a finite LRU buffer: a validated I/O modelLothar F. Mackert, Guy M. Lohman|ACM Transactions on Database Systems|1989 Indexes are commonly employed to retrieve a portion of a file or to retrieve its records in a particular order. An accurate performance model of indexes is essential to the design, analysis, and tuning of file management and database systems, and particularly to database query optimization. Many previous studies have addressed the problem of estimating the number of disk page fetches when randomly accessing k records out of N given records stored on T disk pages. This paper generalizes these results, relaxing two assumptions that usually do not hold in practice: unlimited buffer and unique records for each key value. Experiments show that the performance of an index scan is very sensitive to buffer size limitations and multiple records per key value. A model for these more practical situations is presented and a formula derived for estimating the performance of an index scan. We also give a closed-form approximation that is easy to compute. The theoretical results are validated using the R * distributed relational database system. Although we use database terminology throughout the paper, the model is more generally applicable whenever random accesses are made using keys.
Communicating Rule Systems