Measuring the heterogeneity of cross-company dataset

Authors:
Jia Chen; Ye Yang; Wen Zhang; Gregory Gay
Affiliations:
Chinese Academy of Sciences, Beijing, China; Chinese Academy of Sciences, Beijing, China; Chinese Academy of Sciences, Beijing, China; West Virginia University, Morgantown, WV
Venue:
Proceedings of the 11th International Conference on Product Focused Software
Year:
2010

Abstract

As a standard practice, general effort estimation models are calibrated from large cross-company datasets. However, many of the records in such datasets come from companies that have calibrated the model to match their own local practices. Locally calibrated models are a double-edged sword: they often improve estimation accuracy for that particular organization, but they also encourage the growth of local biases. Such biases remain present when projects from that organization are included in a new cross-company dataset. Over time, these biases compound, and the resulting increase in heterogeneity affects the reliability and accuracy of a general model derived from the data. In this paper, we propose a statistical measure of the exact level of heterogeneity of a cross-company dataset. In experimental tests, we measure the heterogeneity of two COCOMO-based datasets and demonstrate that one is more homogeneous than the other. Such a measure has potentially important implications for both model maintainers and model users. Furthermore, a heterogeneity measure can be used to inform users of the appropriate data handling techniques.