OATAO - Open Archive Toulouse Archive Ouverte

Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset

Ratolojanahary, Romy and Houé Ngouna, Raymond and Medjaher, Kamal and Junca-Bourié, Jean and Dauriac, Fabien and Sebilo, Mathieu. Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset. (2019) Expert Systems with Applications (131). 299-307. ISSN 0957-4174

(Document in English)

PDF (Author's version, 2MB) - Depositor and staff only until 20 October 2019

Official URL: https://doi.org/10.1016/j.eswa.2019.04.049

Abstract

In the current era of “information everywhere”, extracting knowledge from large amounts of data is increasingly acknowledged as a promising way to provide relevant insights to decision makers. One key issue is the poor quality of the raw data, particularly a high rate of missingness, which may affect the quality and relevance of the interpretation of the results. Automating the exploration of the underlying data with powerful methods that handle missingness and then perform a learning process to discover relevant knowledge can therefore be considered a successful strategy for system monitoring. Within the context of water quality analysis, the aim of the present study is to propose a robust method for selecting the best algorithm to combine with MICE (Multivariate Imputations by Chained Equations) in order to handle multiple relationships among a large number of features of interest (more than 200) affected by a high rate of missingness (more than 80%). The main contribution is to improve MICE by taking advantage of the ability of Machine Learning algorithms to address complex relationships among a large number of parameters. The competing methods implemented are Random Forest (RF), Boosted Regression Trees (BRT), K-Nearest Neighbors (KNN) and Support Vector Regression (SVR). The results show that the hybridization of MICE with SVR, KNN, RF and BRT performs better than the original MICE taken alone. Furthermore, MICE-SVR gives a good trade-off between performance and computing time.
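The idea of plugging a Machine Learning regressor into a chained-equations loop can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `knn_regressor` and `chained_imputation` are hypothetical, the data are synthetic, and the loop produces a single deterministic imputation, whereas MICE proper draws several imputed datasets. KNN is used only because it is the simplest of the four learners named above to write in plain Python.

```python
def knn_regressor(k=3):
    """Return fit(X, y) -> predict(x), a tiny K-Nearest Neighbors regressor."""
    def fit(X, y):
        def predict(x):
            # Rank training points by squared Euclidean distance to x.
            ranked = sorted(
                zip(X, y),
                key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
            )
            nearest = ranked[:k]
            return sum(target for _, target in nearest) / len(nearest)
        return predict
    return fit


def without(row, j):
    """Return the row with column j removed (the predictors for column j)."""
    return row[:j] + row[j + 1:]


def chained_imputation(rows, fit, n_iter=5):
    """Impute None entries column by column, cycling n_iter times.

    rows: list of equal-length lists, with None marking a missing value.
    fit:  callable (X, y) -> predict, e.g. knn_regressor(k=3).
    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    missing = [[v is None for v in row] for row in rows]

    # Step 1: initial fill with the mean of the observed values per column.
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    data = [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

    # Step 2: repeatedly regress each column on all the others.
    for _ in range(n_iter):
        for j in range(n_cols):
            if not any(m[j] for m in missing):
                continue  # nothing to impute in this column
            # Train only on rows where column j was actually observed.
            X = [without(r, j) for r, m in zip(data, missing) if not m[j]]
            y = [r[j] for r, m in zip(data, missing) if not m[j]]
            predict = fit(X, y)
            # Overwrite only the entries that were originally missing.
            for r, m in zip(data, missing):
                if m[j]:
                    r[j] = predict(without(r, j))
    return data


if __name__ == "__main__":
    # Synthetic, correlated columns: x, 2x, x + 1 (illustrative only).
    full = [[float(i), 2.0 * i, i + 1.0] for i in range(12)]
    masked = [[None if (i + 2 * j) % 4 == 0 else v for j, v in enumerate(r)]
              for i, r in enumerate(full)]
    for row in chained_imputation(masked, knn_regressor(k=2), n_iter=3):
        print(row)
```

Passing the regressor as the `fit` argument is what makes the learners interchangeable: swapping in a different model only requires another callable with the same `(X, y) -> predict` shape, which is the kind of comparison the abstract describes.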

Item Type: Article
HAL Id: hal-02134695
Audience (journal): International peer-reviewed journal
Uncontrolled Keywords:
Institution: French research institutions > Centre National de la Recherche Scientifique - CNRS (FRANCE)
Université de Toulouse > Institut National Polytechnique de Toulouse - INPT (FRANCE)
French research institutions > Institut National de la Recherche Agronomique - INRA (FRANCE)
French research institutions > Institut de Recherche pour le Développement - IRD (FRANCE)
Other partners > Université de Paris Diderot - Paris 7 (FRANCE)
Other partners > Sorbonne Université (FRANCE)
Other partners > Université Paris Est Créteil Val de Marne - UPEC (FRANCE)
Other partners > Agence de l’Eau Adour-Garonne (FRANCE)
Other partners > Chambre d'Agriculture des Hautes-Pyrénées - CA65 (FRANCE)
Laboratory name:
Deposited By: Kamal Medjaher
Deposited On: 07 May 2019 08:08
