An empirical analysis of data preprocessing for machine learning-based software cost estimation
<b>Background</b>: It is widely recognized that software effort estimation is a regression problem. Model Tree (MT) is a machine learning-based regression technique that is useful for software effort estimation but, like other machine learning algorithms, it has a large configuration space and requires careful parameter setting. The choice of these parameters is dataset dependent, so no general guideline can govern the process, which motivates this work. <b>Aims</b>: This study investigates the effect of using a recent optimization algorithm, the Bees algorithm, to find the MT parameters that best fit a specific dataset and thereby improve prediction accuracy. <b>Method</b>: We used MT with the optimal parameters identified by the Bees algorithm to construct a software effort estimation model. The model was validated on eight datasets from two main sources: PROMISE and ISBSG. We used 3-fold cross-validation to empirically assess the prediction accuracy of the different estimation models. As a benchmark, the results were compared with those obtained with Stepwise Regression, Case-Based Reasoning and Multi-Layer Perceptron. <b>Results</b>: The results obtained from the combination of MT and the Bees algorithm are encouraging and outperform the other well-known estimation methods on the employed datasets. They also suggest that MT is an effective technique among those suitable for effort estimation. <b>Conclusions</b>: The Bees algorithm enabled us to automatically find the optimal MT parameters required to construct an accurate effort estimation model for each individual dataset.
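To illustrate the kind of search described under Method, the following is a minimal sketch of a basic Bees algorithm in Python, not the authors' implementation. Everything named here is an assumption made for illustration: the function bees_search, its default settings, the two parameter names (a minimum leaf size and a pruning factor), and in particular toy_cv_error, which is a made-up stand-in for the 3-fold cross-validation error of a model tree trained with the given parameters.

```python
import random


def bees_search(objective, bounds, n_scouts=20, n_best=5, n_recruits=10,
                patch=0.1, n_iter=30, seed=0):
    """Minimise `objective` over the box `bounds` with a basic Bees algorithm.

    Returns a (best_score, best_point) tuple. All settings are illustrative.
    """
    rng = random.Random(seed)

    def random_point():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def neighbour(point):
        # Sample inside a patch around `point`, clipped to the bounds.
        return [min(hi, max(lo, x + rng.uniform(-patch, patch) * (hi - lo)))
                for x, (lo, hi) in zip(point, bounds)]

    scored = sorted(((objective(p), p)
                     for p in (random_point() for _ in range(n_scouts))),
                    key=lambda t: t[0])
    for _ in range(n_iter):
        next_gen = []
        # Local search: recruit bees around each of the best sites and
        # keep the fittest bee per site (the site itself is a candidate).
        for score, site in scored[:n_best]:
            candidates = [(score, site)]
            for _ in range(n_recruits):
                q = neighbour(site)
                candidates.append((objective(q), q))
            next_gen.append(min(candidates, key=lambda t: t[0]))
        # Global search: the remaining scouts explore at random.
        for _ in range(n_scouts - n_best):
            p = random_point()
            next_gen.append((objective(p), p))
        scored = sorted(next_gen, key=lambda t: t[0])
    return scored[0]


def toy_cv_error(params):
    # Hypothetical placeholder: a real study would train a model tree with
    # these parameters and return its 3-fold cross-validation error.
    min_leaf, prune = params
    return (min_leaf - 8.0) ** 2 / 100.0 + (prune - 2.0) ** 2


# Assumed search ranges for the two hypothetical parameters.
bounds = [(2.0, 50.0), (0.0, 5.0)]
best_error, best_params = bees_search(toy_cv_error, bounds)
print(best_error, best_params)
```

In an actual experiment the objective would be replaced by a function that fits the model tree on each training fold and averages the error over the held-out folds, so the search is repeated independently for each dataset.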