Statistics and Its Interface

Volume 9 (2016)

Number 4

Special Issue on Statistical and Computational Theory and Methodology for Big Data

Guest Editors: Ming-Hui Chen (University of Connecticut); Radu V. Craiu (University of Toronto); Faming Liang (University of Florida); and Chuanhai Liu (Purdue University)

Statistical methods and computing for big data

Pages: 399–414

DOI: https://dx.doi.org/10.4310/SII.2016.v9.n4.a1

Authors

Chun Wang (University of Connecticut, U.S.A.)

Ming-Hui Chen (University of Connecticut, U.S.A.)

Elizabeth Schifano (University of Connecticut, U.S.A.)

Jing Wu (University of Connecticut, U.S.A.)

Jun Yan (University of Connecticut, U.S.A.)

Abstract

Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard analytic tools. They present opportunities as well as challenges to statisticians. The role of computational statisticians in scientific discovery from big data analyses has been under-recognized, even by peer statisticians. This article summarizes recent methodological and software developments in statistics that address the big data challenges. Methodologies are grouped into three classes: subsampling-based, divide and conquer, and online updating for stream data. As a new contribution, the online updating approach is extended to variable selection with commonly used criteria, and their performances are assessed in a simulation study with stream data. Software packages are summarized with a focus on the open source R and R packages, covering recent tools that help break the barriers of computer memory and computing power. Some of the tools are illustrated in a case study with a logistic regression for the chance of airline delay.

Keywords

bootstrap, divide and conquer, external memory algorithm, high performance computing, online update, sampling, software

2010 Mathematics Subject Classification

Primary 62-02, 62-04. Secondary 62-07.

Published 14 September 2016