Factors affecting the performance of multiuser database management systems

  • Authors:
  • Shahram Ghandeharizadeh; David J. DeWitt

  • Affiliations:
  • Computer Sciences Department, University of Wisconsin, Madison, WI; Computer Sciences Department, University of Wisconsin, Madison, WI

  • Venue:
  • SIGMETRICS '90: Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems
  • Year:
  • 1990

Abstract

While database management systems (DBMS) have become a critical component of almost all organizations over the past 20 years, their behavior in a multiuser environment has, surprisingly, not been studied carefully. In order to help us understand the multiuser performance of the multiprocessor Gamma database machine [DEWI90], we began by studying the performance of a single-processor version of this system. In this paper, we describe some of the factors that affect the performance of a DBMS in a multiuser environment. We refer the interested reader to [GHAN90] for more details.

For these experiments, the Gamma software was run on a VAX 11/750 with 2 megabytes of memory and a 330 megabyte Fujitsu disk drive. An 8 Kbyte disk page was used and the buffer pool was set at 61 pages. A second processor was used to simulate users submitting queries.

In a DBMS, queries can be classified according to their pattern of disk accesses. Those that either sequentially scan all the pages of a relation or use a clustered index to retrieve only those pages containing tuples that satisfy a selection predicate access the disk sequentially. Queries that use a non-clustered index tend to access disk pages randomly.

For queries that access the disk sequentially, it is very important to avoid random disk accesses in the presence of multiple, concurrently executing queries. Consider a query that selects 1 tuple from a 12,500-tuple relation (each tuple is 208 bytes long) by sequentially scanning the relation. As shown in Figure 1, as the multiprogramming level (MPL) is increased from 1 to 2, the throughput of the system actually decreases. With a high degree of data sharing, the two concurrently executing queries will generally access the same relation (out of a set of 10 identical relations). However, this does not necessarily mean that the queries are sufficiently synchronized to share pages in the buffer pool. As a result, the disk ends up performing a series of random disk requests instead of the series of sequential requests it would have performed had each query been submitted consecutively, and the random requests incur a higher average seek time. As shown in Figure 1, the drop in throughput is largest for the low degree of data sharing, since the two concurrently executing queries may then access any relation in the database. On average, the disk head must therefore travel a longer distance on each access, and because the average seek time increases as a function of the square root of the distance traveled by the head, the average service time of the disk is higher.
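The seek-time argument can be made concrete with a small simulation. The sketch below is illustrative only: the square-root seek model, its constants, the cylinder layout of the relations, and the strict interleaving of the two scans are assumptions, not parameters measured in the paper. It simply shows how interleaving two unsynchronized sequential scans turns short seeks into long ones, with a penalty that grows as the square root of the distance between the relations being scanned.

```python
def seek_time(distance):
    """Toy seek model: a fixed settle time plus a term that grows with the
    square root of the head movement (constants are made up)."""
    return 0.0 if distance == 0 else 2.0 + 0.1 * distance ** 0.5  # milliseconds

def avg_seek(cylinders):
    """Average modeled seek time to visit the given cylinder sequence in order."""
    head, total = cylinders[0], 0.0
    for cyl in cylinders[1:]:
        total += seek_time(abs(cyl - head))
        head = cyl
    return total / (len(cylinders) - 1)

def interleave(a, b):
    """Strictly alternate the page requests of two concurrent scans."""
    out = []
    for x, y in zip(a, b):
        out.extend((x, y))
    return out

# Assumed layout: relation A occupies cylinders 0-99, relation B cylinders 500-599.
scan_a        = list(range(0, 100))                        # a single sequential scan of A
scan_a_offset = list(range(50, 100)) + list(range(0, 50))  # unsynchronized second scan of A
scan_b        = list(range(500, 600))                      # a scan of the distant relation B

print("one scan, MPL 1          :", avg_seek(scan_a))
print("MPL 2, high data sharing :", avg_seek(interleave(scan_a, scan_a_offset)))
print("MPL 2, low data sharing  :", avg_seek(interleave(scan_a, scan_b)))
```

With these made-up constants, the modeled average seek time is lowest for the single scan, higher when two unsynchronized scans of the same relation are interleaved, and highest when the two scans target distant relations, mirroring the ordering described above.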
To further illustrate the complex behavior that a database system can exhibit, consider a range selection query that uses a non-clustered index to select 15 tuples out of a 12,500-tuple relation. Since, with a non-clustered index, the order of the index records is not the same as the order of the tuples in the indexed relation, each tuple retrieved results in a random disk I/O. As shown in Figure 2, the throughput of the system is highest for the high degree of data sharing because when a query commits and its corresponding terminal submits a new query, the new query will generally access the same relation as the previous one, so the required index pages will generally still be resident in the buffer pool. On the other hand, the probability that the newly submitted query will access the same relation as the previous query is much lower with the low and medium degrees of data sharing. Furthermore, since each query processes a large number of pages, the execution of one query tends to flush the buffer pool of pages from some previously accessed relation, resulting in a very low percentage of buffer pool hits for subsequent queries, as illustrated in Figure 3.

For each of the degrees of data sharing, the throughput of the system increases from a multiprogramming level of one to twelve, even though, as Figure 4 shows, the disk becomes 100% utilized at an MPL of four. The reason the throughput continues to increase from an MPL of 4 to 12 is that the disk controller uses an elevator algorithm when more than two disk requests are pending, which imposes some locality of reference on the random disk accesses; as a result, the average seek time decreases.

At MPLs higher than twelve, the throughput of the system begins to decrease for each of the degrees of data sharing due to the drop in the percentage of buffer pool hits (see Figure 3). Recall that all the disk requests made by this query type are random and that the buffer pool uses an LRU replacement policy for all pages (index and data). At MPLs higher than twelve, the data pages begin to compete with the index pages for the buffer pool, reducing the percentage of buffer pool hits. This, in turn, increases the load on the disk and reduces the load on the CPU, resulting in a drop in CPU utilization as shown in Figure 5.

Other factors that affect the performance of a DBMS include the use of a software read-ahead mechanism and the availability of a hardware disk cache. We have observed up to a 30% improvement in throughput with a software read-ahead mechanism; its benefits, however, diminish once the disk becomes 100% utilized. While a track-size hardware disk cache is extremely beneficial for sequential scan queries executing by themselves, such a mechanism provides only very marginal benefits in a multiuser environment.
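The elevator effect described above can also be illustrated with a short sketch. The toy disk geometry, the uniformly random request batches, and the single-sweep SCAN ordering below are assumptions for illustration, not the actual controller's policy. The point is simply that sorting a larger batch of pending random requests into one sweep lowers the average head movement per request, which is consistent with throughput continuing to rise between MPLs of 4 and 12 even though the disk is already saturated.

```python
import random

CYLINDERS = 1000  # assumed disk geometry, for illustration only

def total_movement(order, start=0):
    """Total head movement (in cylinders) to service requests in the given order."""
    head, total = start, 0
    for cyl in order:
        total += abs(cyl - head)
        head = cyl
    return total

def fcfs(pending):
    """First-come, first-served: service requests in arrival order."""
    return list(pending)

def elevator(pending, head=0):
    """Elevator (SCAN) order: sweep upward past the head, then back down."""
    up = sorted(c for c in pending if c >= head)
    down = sorted((c for c in pending if c < head), reverse=True)
    return up + down

def per_request(schedule, batches):
    """Average head movement per request under the given scheduling policy."""
    requests = sum(len(b) for b in batches)
    return sum(total_movement(schedule(b)) for b in batches) / requests

random.seed(42)
for pending_count in (2, 4, 12, 24):
    batches = [[random.randrange(CYLINDERS) for _ in range(pending_count)]
               for _ in range(1000)]
    print(f"{pending_count:2d} pending requests: "
          f"FCFS {per_request(fcfs, batches):6.1f}   "
          f"elevator {per_request(elevator, batches):6.1f} cylinders/request")
```

Under these assumptions the FCFS figure stays roughly constant while the elevator figure shrinks as the queue grows, because a longer queue gives each sweep more requests to pick up along the way.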