Developer-Friendly and Computationally Efficient Predictive Modeling without Information Leakage: The emil Package for R

Christofer L. Bäcklin, Mats G. Gustafsson

Main Article Content

Abstract

Data driven machine learning for predictive modeling problems (classification, regression, or survival analysis) typically involves a number of steps beginning with data preprocessing and ending with performance evaluation. A large number of packages providing tools for the individual steps are available for R, but there is a lack of tools for facilitating rigorous performance evaluation of the complete procedures assembled from them by means of cross-validation, bootstrap, or similar methods. Such a tool should strictly prevent test set observations from influencing model training and meta-parameter tuning, so-called information leakage, in order to not produce overly optimistic performance estimates. Here we present a new package for R denoted emil (evaluation of modeling without information leakage) that offers this form of performance evaluation. It provides a transparent and highly customizable framework for facilitating the assembly, execution, performance evaluation, and interpretation of complete procedures for classification, regression, and survival analysis. The components of package emil have been designed to be as modular and general as possible to allow users to combine, replace, and extend them if needed. Package emil was also developed with scalability in mind and has a small computational overhead, which is a key requirement for analyzing the very big data sets now available in fields like medicine, physics, and finance. First package emil's functionality and usage is explained. Then three specific application examples are presented to show its potential in terms of parallelization, customization for survival analysis, and development of ensemble models. Finally a brief comparison to similar software is provided.

Article Details

Article Sidebar