
Identifying the Big Shots—A Quantile-Matching Way in the Big Data Context

Published: 10 March 2022


Abstract

The prevalence of big data has raised significant epistemological concerns in information systems research. This study addresses two of them—the deflated p-value problem and the role of explanation and prediction. To address the deflated p-value problem, we propose a multivariate effect size method that uses the log-likelihood ratio test. This method measures the joint effect of all variables used to operationalize one factor, thus overcoming the drawback of the traditional effect size method (θ), which can only be applied at the single-variable level. However, because factors can be operationalized as different numbers of variables, direct comparison of multivariate effect sizes is not possible. A quantile-matching method is proposed to address this issue. This method provides comparison results consistent with the classic quantile method but is more flexible and can be applied to scenarios where the quantile method fails. Furthermore, an absolute multivariate effect size statistic is developed to facilitate drawing conclusions without comparison. We have tested our method using three different datasets and have found that it can effectively differentiate factors with various effect sizes. We have also compared it with prediction analysis and found consistent results: explanatorily influential factors are usually also predictively influential in a large sample scenario.


1 INTRODUCTION

Increasingly large volumes of data can now be stored and processed. These large, granular datasets enable service innovation [38], create opportunities for managers to realize strategic business value [12, 24, 34, 42] and for information systems (IS) researchers to investigate emergent phenomena or revisit established phenomena more broadly and deeply [2], and significantly influence how people derive knowledge and make decisions [1]. However, they also raise epistemological concerns [1]; it is necessary to investigate how traditional research methods should be adapted for big data environments, particularly to address the "deflated p-value" problem [39] and the role of prediction versus explanation. These are the issues we explore in this study.

The deflated p-value problem refers to the phenomenon in which the p-value quickly approaches zero as the sample size increases, so that even a tiny deviation from a given value can be detected [39]. Consequently, virtually all variables become significant when the data sample is massive, which leaves p-values with little practical utility [39]. A group of over 800 scientists [6] has thus called for abandoning statistical significance as a standard (the 0.05 "dichotomania"). Similarly, the American Statistical Association commented in an editorial [54, 55] that the categorical drawing of conclusions based on p-values should be avoided.

We advocate focusing on the effect size rather than the p-value of a focal factor. The effect size measures the magnitude of an effect [44] and provides a more generally interpretable, quantitative description of the size of an observed effect, independent of the possibly misleading influence of sample size [18]. Although researchers have begun to report one type of effect size—the regression coefficient \( \theta \), which measures the sensitivity of a dependent variable to changes in an independent variable—along with the p-value, this univariate effect size method has several limitations. First, it cannot be generalized to a multiple-variable scenario. For example, socioeconomic status is determined by the three variables of income, education, and occupation; a person's personality is measured along five dimensions by many items; and categorical factors are usually operationalized as a set of dummy variables. In these situations, the effect size at the factor level is of more interest, but it cannot be obtained from the univariate effect sizes in a summative way (e.g., \( {\theta _1} + {\theta _2} + {\theta _3} \)), as the individual effects usually correlate with each other and cannot be treated independently [47]. Second, because the regression coefficient is unstandardized, its use hinders comparison across different variables and different studies, especially when various scales are used.

To deal with this problem, we make use of the likelihood ratio test (LRT) statistic to measure the joint effect size at the factor level, which we name the multivariate effect size. We have chosen the LRT statistic as our basic measure because of its close connection to the regression coefficient (\( \theta \)) and its ease of calculation and interpretation. The LRT statistic asymptotically follows the \( {\chi ^2} \) distribution, but one factor can be operationalized as several variables, and more independent variables (IVs) provide more freedom in estimating the variability of the dependent variable [20, 44], thus creating bias in the estimation of multivariate effect sizes and making it challenging to compare across factors. Previously developed methods to solve this issue include the cumulative probability method and methods that match representative characteristics of the distribution, e.g., location, spread, skewness, and peakedness. The cumulative probability method transforms the test statistic to the corresponding cumulative probability (ranging from 0 to 1) based on the cumulative distribution function. The method allows us to compare different types of studies and provides a concise and universal summary of different statistical tests [37]. But its applicability is significantly constrained—the corresponding cumulative probability cannot be explicitly calculated for test statistics obtained in a large sample context. For example, in RStudio (version 1.1.383), pchisq()—the function that calculates the cumulative probability based on the value of the \( {\chi ^2} \) test statistic—produces a probability value of 1 for a \( {\chi ^2} \) test statistic larger than 1,590 with fewer than 20 degrees of freedom, although the function can display probabilities up to a maximum of \( 1 - {10^{ - 320}} \), which is smaller than but close to one. That is, the cumulative probability method cannot differentiate the effect sizes of factors with \( {\chi ^2} \) test statistics larger than 1,590, a situation frequently encountered in large sample research. In the example that we will discuss in Section 4 of this article, the LRT statistics for all factors exceeded 1,590 with a 25% (n = 1,565,720) sample. Thus, in handling very large data samples, the cumulative probability method is no longer an effective comparison tool. The other method, which matches representative characteristics of the distribution—e.g., mean and SD—does not produce results consistent with the cumulative probability method and can lead to wrong conclusions. For example, assume that two factors, \( {F_1} \) and \( {F_2} \), are operationalized as 3 and 5 variables, with LRT statistics \( LR{T_1} = 21.11 \) and \( LR{T_2} = 25.75 \) following the \( {\chi ^2} \) distribution. The transformed statistics \( {Z_1} \) and \( {Z_2} \) based on the mean and SD would be \( \frac{{21.11 - 3}}{{\sqrt {2 \times 3} }} = 7.39 \) and \( \frac{{25.75 - 5}}{{\sqrt {2 \times 5} }} = 6.56 \), respectively. Therefore, we would claim that the first factor \( {F_1} \) has a larger effect size. However, the corresponding cumulative probabilities for \( LR{T_1} \) and \( LR{T_2} \) are both 0.9999, indicating that these two factors have the same effect size.
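To make the failure modes concrete, the following R snippet (a minimal sketch using the numbers from the example above; R's pchisq() computes the \( {\chi ^2} \) cumulative probability) reproduces both problems:

```r
# Cumulative probabilities of the two example factors: both round to ~0.9999,
# so the cumulative probability method sees F1 and F2 as equally influential.
pchisq(21.11, df = 3)        # LRT_1 ~ chi^2(3)
pchisq(25.75, df = 5)        # LRT_2 ~ chi^2(5)

# The mean/SD transformation ranks them differently (7.39 vs. 6.56).
(21.11 - 3) / sqrt(2 * 3)
(25.75 - 5) / sqrt(2 * 5)

# Saturation in the large-sample regime: the cumulative probability is
# computed as exactly 1, so such factors can no longer be differentiated.
pchisq(1591, df = 19)
```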

A quantile-matching-based transformation method is proposed to address this problem. This method minimizes the distance between different distributions across cumulative probabilities to remove the impact of differing degrees of freedom. The method has three advantages: (1) Feasibility—it has a one-to-one corresponding relationship with cumulative probability under perfect matching, and consequently inherits the merits of cumulative probability in comparisons, but it can also be applied to scenarios where the cumulative probability method fails because the sample is extremely large. (2) Flexibility—the quantile-matching transformation can be done on the full distribution or on a specific interval, providing tremendous flexibility in real large sample applications. Most cumulative probabilities approach 1 in the large sample context, so matching on the right tail of the distribution is more helpful there. (3) Fast calculation—it is easy to implement and calculate.

To demonstrate the flexibility of the quantile-matching method, we have applied it to three different intervals of the \( {\chi ^2} \) distributions—(1) the full distribution, (2) the [0.99, 0.9999999] interval, and (3) the extreme-scenario interval \( [1 - {10^{ - 320}}, 1 - {10^{ - 322}}] \). Because the corresponding \( {\chi ^2} \) statistics are not calculable in \( [1 - {10^{ - 320}}, 1 - {10^{ - 322}}] \) in RStudio, we have used estimates obtained from quantile mechanics (QM) and natural spline interpolation for matching [45]. Furthermore, we have developed an absolute multivariate effect size method to help researchers decide between small, medium, and large effect sizes without needing to compare with other factors, and to facilitate the comparison of the same factor across different studies.

In addition, we have measured the focal factor's effect size in terms of prediction. We have compared both sets of results and found that explanatorily influential factors derived from the LRT analysis in a large sample context are predictively influential as well. This helps answer our second question concerning the role of prediction versus explanation in large sample research.

This study makes three main contributions to the methodology literature of IS research: (1) We developed a multivariate effect size method by making use of the LRT statistic to address the deflated p-value problem. This method enables the measurement of a joint effect size across the variables used to operationalize the focal factor. (2) We have introduced the quantile-matching method to deal with the impact of different numbers of operationalized variables on the multivariate effect size estimation of the focal factors. This method produces results consistent with those of the cumulative probability method under perfect matching and can effectively handle large samples. (3) In large sample applications, we found that explanatorily influential factors are usually also predictively influential.

The remainder of the article is organized as follows. First, we review the history of the p-value and the concerns raised about its misuse and misinterpretation, and discuss existing methods to deal with the deflated p-value problem. Next, we examine the relative and absolute multivariate effect sizes for different factors using the LRT statistic and adjust the bias caused by different numbers of operationalized variables across factors using the quantile-matching method, demonstrating its unique advantages. Then, we present our empirical analysis using an example taken from an IS application on e-mail marketing (n = 6,230,253) and compare the results from our method with the ROC analysis results. We have also replicated our analysis on the US accident dataset and the Airbnb listings dataset. Finally, we discuss our results and their implications.


2 THEORETICAL BACKGROUND

2.1 The P-value and Big Data

The p-value was first introduced by the UK statistician Ronald Fisher [17] in the 1920s to determine whether the observed data complied with the proposed hypotheses by calculating the difference between the predicted and the observed data series. Fisher regarded the p-value as an informal way of checking whether the evidence of a treatment effect was worth a second look. However, in the late 1920s, during the movement to make evidence-based decision-making more rigorous and objective, many non-statistician authors combined Fisher's easy-to-calculate p-value with Neyman and Pearson's reassuringly rigorous rule-based approach to create a hybrid system, and thus a p-value of 0.05 became enshrined as “statistically significant” [43].

The abuse and misinterpretation of the p-value were first noted in the late 20th century. Researchers were mainly concerned about two aspects. First, the p-value does not establish the probability that the investigator's hypothesis is correct; it only represents the false-positive rate given the observations [15, 21, 22]. Second, using the p-value to divide results into "significant" and "insignificant" categories is arbitrary [16]. More extensive concerns about the p-value have recently been voiced and are presented in Table 1. Among these concerns, the deflated p-value problem has received considerable attention. This problem is not a new phenomenon. In 1993, Kass and Raftery [30] pointed out that "frequently tests tend to reject null hypotheses almost systematically in very large samples", an observation that has been further elaborated by Lin et al. [39], Kim and Ji [33], and Kim et al. [32]. Because the p-value measures the distance between the data and the null hypothesis in units of standard errors, and the standard error approaches 0 as n approaches infinity, even a tiny deviation from a given value can be detected.

Paper | Journal | Concerns of p-value | Proposed Solution
Lin et al. [2013] | ISR | Deflated p-value problem in the large sample context. | Report effect size and confidence interval.
Halsey et al. [2015] | Nature Methods | P-value varies highly across samples. | Increasing statistical power (sample size) could mitigate the variation across samples. Report effect size estimates and their precision (95% confidence intervals).
Kim and Ji [2015] | Journal of Empirical Finance | Deflated p-value problem in the large sample context. | Select a different level of significance by taking account of sample size.
Lazzeroni et al. [2016] | Nature Methods | P-value varies highly across samples. | Propose the p-value prediction interval and the p-value confidence intervals to capture the uncertainty in a sole p-value.
Goodman [2016] | Science | The p-value neither measures nor is part of a formula that provides the credibility of the conclusions. | P-values are unlikely to disappear, and the ASA did not recommend their elimination, but rather a change in how they are interpreted and used. The way statistical inference is taught to scientists should contain a variety of named, competing approaches, each with strengths and weaknesses.
Kyriacou [2016] | JAMA | The concept of the p-value is frequently misunderstood and misused. | The automatic application of dichotomized hypothesis testing based on prearranged levels of statistical significance (0.05 p-value) should be replaced by a more complex process using effect estimates, confidence intervals, and even p-values, thereby permitting scientists, statisticians, and clinicians to use their own inferential capabilities to assign scientific significance.
Altman and Krzywinski [2017] | Nature Methods | A p-value is a probability statement about the observed sample in the context of a hypothesis, not about the hypotheses being tested. | Three main ideas for using, interpreting, and reporting p-values have emerged: (1) the use of more stringent p-value cutoffs supported by Bayesian analysis, (2) the use of the p-value to estimate the false discovery rate (FDR), and (3) the combination of p-values and effect sizes to create more informative confidence intervals.
Altman and Krzywinski [2016] | Nature Methods | Even when the null hypothesis is true, if we have done many tests, we will have a high chance of obtaining a significant p-value, and the confidence interval does not mitigate the problem either. | N/A
Kim et al. [2018] | ABACUS | Deflated p-value problem in the large sample context. | Report effect size and confidence interval.
Benjamin et al. [2018] | Nature Human Behavior | The corresponding Bayes Factor of p = 0.05 is only 2.5 to 3.4, providing 'weak' or 'very weak' support for the alternative hypothesis. | Change the threshold of significance to 0.005, as the corresponding Bayes Factor would be 14 to 26, indicating 'substantial' or 'very strong' support for the alternative hypothesis, and the minimum false positive rate will decrease to 5%.

Table 1. Summary of Related Research on P-value

The influence of the deflated p-value problem has grown as large datasets have become extensively used in research. Lin et al. [39] surveyed articles in MIS Quarterly and Information Systems Research (ISR) from 2004 to 2010, along with abstracts from the Workshop on IS and Economics and symposia on Statistical Challenges in Electronic Commerce Research. They found that over 50% of the papers used extremely large samples (>10,000) and relied solely on low p-values and the sign of the coefficient. Kim and Ji [33] also reported that 42% of the articles published during 2012 in four finance journals exploited extremely large samples and used only the p-value null hypothesis testing method. Similarly, Kim et al. [32] found that 39% of the studies published during 2014 in eight accounting journals used extremely large samples; each study used the p-value, and only one reported a confidence interval. Additionally, we have reviewed empirical papers published in MIS Quarterly and ISR from 2014 to 2018 and found that IS researchers were more cautious with large samples: among papers exploiting extremely large samples, 58% also reported the effect size (\( \theta \)) or confidence interval along with the p-value, and a few reported predictive power [3, 49]. Furthermore, although the percentage of papers dealing with extremely large samples decreased slightly (to 48%), the average sample size increased to 4.8 million. These findings highlight the urgent need for another test statistic to overcome the deflated p-value problem in the large sample context.

2.2 Effect Size

Several methods have been proposed to deal with the deflated p-value problem, such as changing the significance threshold [9, 33], reporting the confidence interval of the estimated coefficient [4, 32, 36, 39], and reporting the effect size [32, 39]. Lowering the p-value threshold is easy to implement, but it cannot fully address the deflated p-value concern because the sample size can easily be increased further, particularly if p-hacking behavior is considered [14]. A confidence interval describes the range into which the true value of the unknown parameter would fall with a certain probability, and the width of the interval decreases with the sample size; however, it can only be applied at the variable level. Effect size focuses on the size of an observed effect rather than whether the effect exists, providing a more quantitative description of the observed effect and a different mode of statistical inference. Because in very large data samples variables will be significant unless their effects are strictly equal to zero, it is natural to use the effect size to draw statistical inferences in large sample research.

The existing effect size measures can be divided into two types: those specific to comparing two conditions (Cohen's d, Hedges' g, Glass's Δ, etc.) and those describing the proportion of variability explained (\( {\eta ^2} \), \( \eta _p^2 \), \( {R^2} \), adjusted \( {R^2} \), etc.). The most frequently reported effect size measure is the regression coefficient (\( \theta \))—among the 58% of IS papers published between 2014 and 2018 that reported effect size, nearly all used the regression coefficient (\( \theta \)). The regression coefficient (\( \theta \)) describes the sensitivity of the dependent variable to a unit change in the independent variable; in that sense, it belongs to the first type of effect size measure. Although the use of the regression coefficient (\( \theta \)) alleviates the deflated p-value concern to some extent, this method has limitations. First, the regression coefficient (\( \theta \)) is an unstandardized effect size measure whose value is influenced by the choice of scale. This hinders the comparison of effect size across different variables and different studies (a standardized coefficient can help with this issue, but it still faces the second limitation noted below). Second, the regression coefficient (\( \theta \)) method can only be applied at the variable level, not the factor level. In many cases, we are interested in the implications of the underlying factors rather than the variables. For instance, if we want to study the relationship between people's personalities and their job performance, where personality is measured along five dimensions with many items, we are interested in the effect size of the Big Five personality factors in explaining people's job performance, rather than the effect size of each item. In this case, the traditional effect size (\( \theta \)) cannot provide much insight. Similarly, in our e-mail marketing case in Section 4, the effect size for the Seasonal/Festive variable indicates that it increases the e-mail opening odds by 27%, but effect sizes for Membership Tier and Message Type cannot be obtained because they are categorical factors operationalized as several variables.

In the current study, we have used the LRT statistic to enable the application of effect size in multivariate situations. This test statistic directly measures the joint effect size of all the variables corresponding to the focal factor by calculating the unique amount of explanatory power provided by the focal factor to the dependent variable. Furthermore, we have developed a quantile matching method and an absolute multivariate effect size test statistic to enable the comparison of effect size across different factors and different studies to overcome the scaling effect.


3 MULTIVARIATE EFFECT SIZE

Our multivariate effect size measure is developed based on the LRT statistic due to the following: (1) the LRT statistic has a close connection to the regression coefficient (\( \theta \)), (2) it is easy to calculate, and (3) it is easy to interpret.

3.1 Log-likelihood Ratio Test Statistic and Multivariate Effect Size

When comparing the relative tenability of two competing nested models, the most common method is the log likelihood ratio test, where \( \begin{equation*} {\rm{\Lambda }} = \frac{{{\rm{max}}[{L_0}(Null{\rm{\ }}Model|Data)]}}{{{\rm{max}}[{L_1}(Alternative{\rm{\ }}Model|Data)]}}. \end{equation*} \)Then \( - 2log{\rm{\Lambda }} \) has an asymptotic \( {\chi ^2} \) distribution with q degrees of freedom, where q is the difference in the number of free parameters between the general and restricted hypotheses [29]. By testing whether \( -2log{\rm{\Lambda }} \) is significantly different from 0, we can tell whether the alternative model is preferred or not. Similarly, we apply this method to identify the important factors by constructing the LRT statistic \( LR{T_i} \): (1) \( \begin{equation} LR{T_i} = - 2( {{\cal L}( {{\boldsymbol{\tilde{\theta }}}}) - {\cal L}( {{\boldsymbol{\hat{\theta }}}})}), \end{equation} \)where \( {\cal L}( {\boldsymbol{\theta }} ) \) is the log likelihood of \( {\boldsymbol{\theta }} \), \( {\boldsymbol{\hat{\theta }}} = ( {{{\hat{\theta }}_1},{{\hat{\theta }}_2}, \ldots {{\hat{\theta }}_v},{{\hat{\theta }}_{v + 1}}, \ldots {{\hat{\theta }}_w}} ) \) is the maximum likelihood (ML) estimator for \( {\boldsymbol{\theta }} \) over the full parameter set, and \( {\boldsymbol{\tilde{\theta }}} = ( {{{\tilde{\theta }}_1},{{\tilde{\theta }}_2}, \ldots {{\tilde{\theta }}_v},0, \ldots 0} ) \) is the ML estimator under the null hypothesis \( {H_0} \): the exclusion of \( {X_i} \) has no impact on the model's goodness of fit, i.e., \( \ {\theta _{v + 1}} = \cdots = {\theta _w} = 0 \). Then \( LR{T_i} \) measures the impact of the exclusion of \( {X_i} \) from the model on the model's goodness of fit, which can be used as an indicator of the effect size of \( {X_i} \).

Based on Equation (1), the LRT statistic can be rewritten as: \( \begin{equation*} LR{T_i} = - 2( {{\cal L}( {{\boldsymbol{\tilde{\theta }}}} ) - {\cal L}( {{\boldsymbol{\hat{\theta }}}})}) \approx - \mathop \sum \limits_{i = v + 1}^w \frac{{{\partial ^2}{\cal L}}}{{\partial {{\hat{\theta }}_i}\partial {{\hat{\theta }}_i}}}\hat{\theta }_i^2 = - \mathop \sum \limits_{i = v + 1}^w {\lambda _i}p_{ii}^2\hat{\theta }_i^2. \end{equation*} \)(The proof is presented in Appendix A.) Therefore, the LRT statistic measures the overall effect size of the focal factor by summing the square of the coefficient \( \theta \) for each variable, weighted by the second derivative of the log likelihood with respect to \( \theta \); this weight can be interpreted as the amount of information the variable carries about \( \theta \), or the credibility of \( \theta \). Note that the second derivative also equals \( {\lambda _i}p_{ii}^2 \), in which \( {\lambda _i} \) is the eigenvalue and \( {p_{ii}} \) is the corresponding entry of the eigenvector matrix of the Hessian matrix. Because the eigenvectors give the directions in which \( \theta \) increases or decreases the most and the eigenvalues give the magnitudes of those changes, the LRT statistic captures a joint effect built from the individual \( \hat{\theta }_i^2 \). Through these eigenvalue–eigenvector pairs, the LRT statistic addresses the issue of aggregating multiple univariate effect sizes, and we have therefore used it as the basis for our test statistic. When \( w - v = 1 \) (the factor is operationalized as one variable), the LRT statistic gives the univariate effect size by calculating the square of the corresponding coefficient weighted by its credibility. That is, the LRT statistic is closely related to the traditional effect size (\( \theta \)) in the univariate setting, but it can easily be extended to a multivariate setting. We also provide numerical evidence for this close relationship in the univariate setting in Table 9, Section 4—the correlation between the LRT statistic and the square of the coefficient (\( \theta \)) is extremely close to 1. By applying the LRT method in our e-mail marketing case, we can easily obtain the multivariate effect sizes for Membership Tier and Message Type—44,733.9 and 39,527.7, respectively. Each can be regarded as a joint effect from all the variables corresponding to the factor.
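In practice, the factor-level LRT statistic of Equation (1) can be obtained by fitting two nested models. The R sketch below is our illustration, not the authors' code; it assumes a hypothetical data frame d with a binary outcome open, a multi-level focal factor tier, and controls msg_type and prev_open:

```r
# Full model vs. the restricted model that excludes the focal factor.
full    <- glm(open ~ tier + msg_type + prev_open, data = d, family = binomial)
reduced <- glm(open ~        msg_type + prev_open, data = d, family = binomial)

# LRT_i = -2 * (logLik(restricted) - logLik(full)); its degrees of freedom
# equal the number of variables (here, dummies for `tier`) that were dropped.
lrt_i <- as.numeric(-2 * (logLik(reduced) - logLik(full)))
df_i  <- attr(logLik(full), "df") - attr(logLik(reduced), "df")
c(LRT = lrt_i, df = df_i)
```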

Although the LRT statistic can quantify the joint effect of multiple variables, its expected value and uncertainty are determined by its degrees of freedom, so two factors cannot be compared directly unless they have the same degrees of freedom. In the next section, we propose a quantile-matching method to facilitate the comparison of factors with different degrees of freedom and provide theoretical justification for quantile matching.

3.2 Quantile-Matching Transformation Method for Multivariate Effect Size

As discussed, we cannot compare the effect sizes of factors directly from the LRT statistic because more variables in a factor lead to a higher expected explanatory power and greater variance. To resolve this problem, we must adjust for the effect of the number of variables underlying the factor of interest. Alternative methods to achieve this have been proposed: (1) using cumulative probability, which provides a concise and universal summary of different statistical tests that no other statistic can achieve [37]; and (2) matching representative characteristics of the distribution. Four characteristics are typically used to describe a distribution: location, spread, skewness, and peakedness. The location can be measured by the mean or median, and the SD or interquartile range can be used to describe the spread. The classic method of matching both the mean and SD (location and spread) of the distributions of the LRT statistic is: \( \begin{equation*} {Z_i} = \frac{{{{\widehat {LRT}}_i} - E\left( {{{\widehat {LRT}}_i}} \right)}}{{\sqrt {Var\left( {{{\widehat {LRT}}_i}} \right)} }} = \frac{{{{\widehat {LRT}}_i} - d}}{{\sqrt {2d} }}, \end{equation*} \)where d is the degrees of freedom of \( {\widehat {LRT}_i} \), so that \( {Z_i} \) has mean zero and variance 1.

However, both of the above methods have major limitations, particularly in the large sample context. Although cumulative probability is a powerful tool for comparisons across contexts, the prevalence of large samples significantly limits its applicability. The pchisq() function in RStudio (1.1.383) can display a maximum cumulative probability of \( 1 - {10^{ - 323}} \); however, the cumulative probability for a \( {\chi ^2} \) test statistic larger than 1,590 with fewer than 20 degrees of freedom is computed as exactly 1. Such a large test statistic is quite common with a large sample: in the example in Section 4, with n = 1,565,720, the LRT statistics for all factors exceeded 1,590. The method of matching several representative characteristics of the distribution uses only aggregated information about the distribution (e.g., location and spread), leaving other information unused. This leads to inconsistency with the cumulative probability method. In the example presented in Section 1, for the two test statistics \( LR{T_1} = 21.11\sim{\chi ^2}( 3 ) \) and \( LR{T_2} = 25.75\sim{\chi ^2}( 5 ) \), the corresponding cumulative probabilities for \( LR{T_1} \) and \( LR{T_2} \) are both 0.9999, whereas the transformed mean/SD statistics \( {Z_1} \) and \( {Z_2} \) are 7.39 and 6.56, respectively, leading us to wrongly claim that the first factor \( {F_1} \) has a larger effect size.

In this study, we will propose a method of adjusting the impact of the number of variables underlying the factor of interest based on matching across quantiles of distributions. This quantile matching transformation method can achieve the three Fs: (1) Feasibility—it has a one-to-one corresponding relationship with cumulative probabilities under a perfect matching situation, but it can also be applied to scenarios where the cumulative probability method fails due to the use of extremely large samples; (2) Flexibility—the quantile-matching transformation may be done on the full distribution or a specific interval, providing tremendous flexibility in real applications; and (3) Fast calculation—it is easy to implement and calculate.

The quantile-matching method is as follows: let M be a random variable and \( {\boldsymbol{N}} = ( {{N_1}, \ldots ,{N_p}} ) \) be a collection of p random variables. The goal of quantile matching is to find a linear combination \( {\boldsymbol{\beta 'N}} \) where \( {\boldsymbol{\beta }} \) minimizes the integrated squared difference between the quantile functions of M and \( {\boldsymbol{\beta 'N}} \) across [0, 1] [50]. The quantile function is the inverse of the cumulative distribution function: if \( {Q_\xi }( \alpha ) \) denotes the \( \alpha \)th quantile of the random variable \( \xi \), then \( {\rm{P}}\{ {\xi \le {Q_\xi }( \alpha )} \} = \alpha ,{\rm{\ for\ }}\alpha \in [ {0,1} ] \). In addition, if only a certain quantile interval \( [ {{\alpha _1},{\alpha _2}} ] \) is of interest, the integration interval can be changed from [0, 1] to \( [ {{\alpha _1},{\alpha _2}} ] \).
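In R, for instance, the \( {\chi ^2} \) quantile function is qchisq(), the inverse of pchisq():

```r
q <- qchisq(0.975, df = 3)   # Q_{chi^2_(3)}(0.975)
pchisq(q, df = 3)            # inverts back to 0.975
```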

Note that the two LRT statistics \( {\widehat {LRT}_1} \) and \( {\widehat {LRT}_2} \) can be rewritten as the quantile functions \( {Q_{{k_1}}}( {{q_1}} ) \) and \( {Q_{{k_2}}}( {{q_2}} ) \), where \( {Q_k}( \alpha ) \) denotes the \( \alpha \)th quantile of the random variable, \( {q_1} \) and \( {q_2} \) are the corresponding cumulative probabilities for \( {\widehat {LRT}_1} \) and \( {\widehat{LRT}_2} \), and \( {k_1} \) and \( {k_2} \) are the corresponding degrees of freedom of \( {\widehat {LRT}_1} \) and \( {\widehat {LRT}_2} \). In this article, we propose the transformed LRT statistics, \( Z_1^q \) and \( Z_2^q \), based on the linear quantile-matching transformation to adjust the effect of the number of variables or the effect of the degrees of freedom in different factors. We define \( Z_1^q \) and \( Z_2^q \) as the functions of the cumulative probability, q, as \( \frac{{{Q_{{k_1}}}( q ) - {a_{{k_1}}}}}{{{b_{{k_1}}}}} \) and \( \frac{{{Q_{{k_2}}}( q ) - {a_{{k_2}}}}}{{{b_{{k_2}}}}} \), where in practice, \( {a_{{k_1}}} \), \( {b_{{k_1}}} \), \( {a_{{k_2}}} \), \( {b_{{k_2}}} \) are the estimated values obtained by minimizing the integrated squared difference between the quantile functions. In an ideal situation where we can conduct perfect matching at \( {q_1} \) and \( {q_2} \) for two LRT statistics—\( {\widehat {LRT}_1} \) and \( {\widehat {LRT}_2} \)—with degrees of freedom \( {k_1} \) and \( {k_2} \), we have an exact match in \( Z_1^q \) and \( Z_2^q \) at the two cumulative probabilities \( {q_1} \) and \( {q_2} \): (2) \( \begin{equation} \frac{{{Q_{{k_1}}}\left( {{q_1}} \right) - {a_{{k_1}}}}}{{{b_{{k_1}}}}} = \frac{{{Q_{{k_2}}}\left( {{q_1}} \right) - {a_{{k_2}}}}}{{{b_{{k_2}}}}}, \end{equation} \)and: (3) \( \begin{equation} \frac{{{Q_{{k_1}}}\left( {{q_2}} \right) - {a_{{k_1}}}}}{{{b_{{k_1}}}}} = \frac{{{Q_{{k_2}}}\left( {{q_2}} \right) - {a_{{k_2}}}}}{{{b_{{k_2}}}}}. \end{equation} \)

We will demonstrate in the following that under an ideal situation that leads to Equations (2) and (3), the transformed test statistic \( Z_i^q \) will have a one-to-one corresponding relationship with the cumulative probability. Specifically, the following relationship will hold: \( \begin{equation*} If{\rm{\ }}{q_1} > {q_2},{\rm{\ }}then{\rm{\ }}Z_1^q > Z_2^q,{\rm{\ }}and{\rm{\ }}vice{\rm{\ }}versa. \end{equation*} \)

If \( {q_1} > {q_2} \), then \( {Q_{{k_1}}}( {{q_1}} ) > {Q_{{k_1}}}( {{q_2}} ) \), and according to Equation (3) we have: \( \begin{equation*} \frac{{{Q_{{k_1}}}\left( {{q_1}} \right) - {a_{{k_1}}}}}{{{b_{{k_1}}}}} > \ \ \frac{{{Q_{{k_1}}}\left( {{q_2}} \right) - {a_{{k_1}}}}}{{{b_{{k_1}}}}} = \frac{{{Q_{{k_2}}}\left( {{q_2}} \right) - {a_{{k_2}}}}}{{{b_{{k_2}}}}} \end{equation*} \)that is, \( Z_1^q > Z_2^q \).

If \( Z_1^q > Z_2^q \), then: \( \begin{equation*} \frac{{{Q_{{k_1}}}\left( {{q_1}} \right) - {a_{{k_1}}}}}{{{b_{{k_1}}}}} > \ \ \frac{{{Q_{{k_2}}}\left( {{q_2}} \right) - {a_{{k_2}}}}}{{{b_{{k_2}}}}} \end{equation*} \)and according to Equation (3), we have: \( \begin{equation*} \frac{{{Q_{{k_1}}}\left( {{q_1}} \right) - {a_{{k_1}}}}}{{{b_{{k_1}}}}} > \ \ \frac{{{Q_{{k_1}}}\left( {{q_2}} \right) - {a_{{k_1}}}}}{{{b_{{k_1}}}}} \end{equation*} \)therefore, \( {q_1} > {q_2} \). Therefore, the comparison based on \( Z_1^q \) and \( Z_2^q \) will be identical to the comparison based on \( {q_1} \) and \( {q_2} \). We recognize that in a real application, Equations (2) and (3) will not always hold perfectly since (1) we have matched the quantiles across distributions numerically (shown in Section 3.3), so the quantile matching results may be less accurate when a large step size is used; and (2) the quantile cannot be calculated explicitly when the sample size is very large, and therefore an explicit match on those points cannot be achieved. Although the one-to-one corresponding relationship between the cumulative probability and the quantile-matching-based transformed statistic can be violated to a small extent, the relationship roughly holds.

3.3 The Implementation of Quantile-Matching Transformation for LRT Statistics

The above quantile-matching method can be applied to any type of distribution. In our study, we are specifically interested in \( {\chi ^2} \) distributions because our LRT statistic \( {\widehat {LRT}_i} \) follows an asymptotic \( {\chi ^2} \) distribution. We seek a linear relationship between the \( {\chi ^2} \) distributions with degrees of freedom from 2 to 20 and \( \chi _{( 1 )}^2 \), according to the following matching formula: (4) \( \begin{equation} \chi _{\left( n \right)}^2 \approx a + b\chi _{\left( 1 \right)}^2, \end{equation} \)where a and b are obtained by minimizing the following integrated squared difference of the two quantile functions: \( \begin{equation*} \mathop \int \limits_0^1 {\left\{ {{Q_{\chi _{\left( n \right)}^2}}\left( \alpha \right) - {Q_{a + b\chi _{\left( 1 \right)}^2}}\left( \alpha \right)} \right\}^2}d\alpha . \end{equation*} \)

We have chosen a linear model specification for its simplicity, and our numerical estimation has shown that it also achieves a reasonable model fit (the adjusted \( {R^2} \) is above 0.8 for all analyses). We have used the numerical method proposed by Sgouropoulos et al. [50] to estimate \( a \) and \( b \), which provides a better fit at the tails of the distributions and works as follows: \( \begin{equation*} ( {\hat{a},\hat{b}} ) = \arg \mathop {\min }\limits_{a,b} \mathop \sum \limits_{j = v}^{mv} {\left\{ {{Q_{\chi _{\left( n \right)}^2}}\left( j \right) - {Q_{a + b\chi _{\left( 1 \right)}^2}}\left( j \right)} \right\}^2}, \end{equation*} \)where \( v \) denotes the stepwise increment used to divide the quantile interval, so the sum runs over the grid of quantile levels \( j = v, 2v, \ldots, mv \).

In our study, we have chosen a quantile grid over [0.000001, 0.999999] with a step size of 1E-6. We have excluded 0 and 1 from the range because \( {Q_{\chi _{( n )}^2}}( 0 ) = 0 \) and \( {Q_{\chi _{( n )}^2}}( 1 ) = \infty \) for all n, and the inclusion of infinity would cause problems in the integration.
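Because the quantile function of \( a + b\chi _{( 1 )}^2 \) is \( a + b{Q_{\chi _{( 1 )}^2}}( \alpha ) \) for \( b > 0 \), minimizing the summed squared quantile differences over the grid is a linear least-squares problem. The R sketch below is our plain least-squares illustration of the procedure (it may differ slightly from the estimator of Sgouropoulos et al. [50], which refines the fit at the tails) and should recover values close to Table 2:

```r
# Estimate (a_n, b_n) of Equation (4) by matching quantiles on a grid.
alpha <- seq(1e-6, 1 - 1e-6, by = 1e-6)   # 0 and 1 excluded, as discussed above
qm_coefs <- function(n) {
  q1 <- qchisq(alpha, df = 1)             # quantile function of chi^2_(1)
  qn <- qchisq(alpha, df = n)             # quantile function of chi^2_(n)
  fit <- lm(qn ~ q1)                      # least squares over the grid
  c(a = unname(coef(fit)[1]), b = unname(coef(fit)[2]))
}
qm_coefs(5)   # should be close to the (a, b) reported for n = 5 in Table 2
```

For interval matching, the same code applies with alpha restricted to the interval of interest, e.g., seq(0.99, 0.9999999, by = 1e-7).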

Then, if \( n \ge 2 \), \( {\widehat {LRT}_i} \) can be transformed in the following way: (5) \( \begin{equation} Z_i^q = \frac{{{{\widehat {LRT}}_i} - {{\hat{a}}_n}}}{{{{\hat{b}}_n}}} \end{equation} \)

The estimated \( {\hat{a}_n} \) and \( {\hat{b}_n} \) for each n are summarized in Table 2:

Degrees of Freedom, n | \( {\hat{a}_n} \) (Full Distribution) | \( {\hat{b}_n} \) (Full Distribution)
2 | 0.602056 | 1.397977
3 | 1.312148 | 1.687911
4 | 2.073774 | 1.926307
5 | 2.86678 | 2.13332
6 | 3.681391 | 2.318726
7 | 4.512021 | 2.488112
8 | 5.355134 | 2.645015
9 | 6.208324 | 2.791839
10 | 7.069872 | 2.930306
11 | 7.939758 | 3.060265
12 | 8.814545 | 3.185479
13 | 9.694645 | 3.305381
14 | 10.579434 | 3.420594
15 | 11.468401 | 3.531628
16 | 12.361122 | 3.638908
17 | 13.257240 | 3.742791
18 | 14.156453 | 3.843580
19 | 15.058498 | 3.941536
20 | 15.963149 | 4.036886

Table 2. Corresponding a and b for Full Distribution Matching

We have compared the transformed test statistics from the mean/SD method and the QM transformation method at the 0.95 quantile of the \( {\chi ^2} \) distribution with degrees of freedom from 1 to 10. The results in Table 3 show that the QM transformation method produces results more consistent with the cumulative probability method than the mean/SD method does: for the QM transformation, the difference from \( Z_1^q \) remains below 0.5%, whereas for the mean/SD method the difference from \( Z_1^{mean/SD} \) quickly surpasses 7%.

Degrees of Freedom, n | \( LR{T_n} \) | \( Z_n^{M/SD} \) | Difference with \( Z_1^{M/SD} \) (%) | \( Z_n^Q \) | Difference with \( Z_1^Q \) (%)
1 | 3.841 | 2.009 | – | 3.841 | –
2 | 5.991 | 1.996 | 0.7% | 3.855 | −0.4%
3 | 7.815 | 1.966 | 2.1% | 3.853 | −0.3%
4 | 9.488 | 1.940 | 3.4% | 3.850 | −0.2%
5 | 11.071 | 1.920 | 4.4% | 3.847 | −0.2%
6 | 12.592 | 1.903 | 5.3% | 3.844 | −0.1%
7 | 14.067 | 1.889 | 6.0% | 3.841 | 0.0%
8 | 15.507 | 1.877 | 6.6% | 3.839 | 0.0%
9 | 16.919 | 1.867 | 7.1% | 3.838 | 0.1%
10 | 18.307 | 1.858 | 7.5% | 3.836 | 0.1%

  • Notes: \( {\rm{\ }}LR{T_n} \) is the corresponding value of \( \chi _{( n )}^2 \) at the 0.95 quantile; \( Z_n^{M/SD} \) is the mean/SD transformed \( LR{T_n} \); \( Z_n^Q \) is the quantile-matching transformed \( LR{T_n} \); Difference refers to the difference compared to \( {Z_1} \).

Table 3. Mean/SD and QM Transformation (Full Distribution) at q = 0.95

To address the concern that the p-value deflates quickly to zero in the large sample context and to achieve high matching accuracy, we have repeated our analysis in the following two intervals according to Equation (4): [0.99, 0.9999999] with step size 1E-7 and the extreme scenario \( [1 - {10^{ - 320}}, 1 - {10^{ - 322}}] \) with step size \( {10^{ - 320}} \). The estimated \( {\hat{a}_n} \) and \( {\hat{b}_n} \) for each n based on [0.99, 0.9999999] are summarized in Table 4. For the 0.9999 quantile of the \( {\chi ^2} \) distribution with degrees of freedom from 1 to 10, we have compared the transformed test statistics from the mean/SD method, the QM transformation method (full distribution), and the QM transformation method ([0.99, 0.9999999]). Results are presented in Table 5. The QM transformation method using the [0.99, 0.9999999] interval produces the results most consistent with the cumulative probability method. The reason is that the \( {\chi ^2} \) test statistic changes much more quickly at the right tail for the same unit change in the cumulative probability; full distribution matching underestimates this change and achieves lower accuracy because other parts of the distribution are also taken into consideration, whereas interval matching does not have this problem. This demonstrates an advantage of quantile-matching-based transformation—\( {\hat{a}_n} \) and \( {\hat{b}_n} \) can be calculated for different integration intervals, which provides more flexibility for different scenarios.

Degrees of Freedom, n | \( {\hat{a}_n} \) ([0.99, 0.9999999]) | \( {\hat{b}_n} \) ([0.99, 0.9999999])
2 | 2.024 | 1.087
3 | 3.728 | 1.155
4 | 5.291 | 1.213
5 | 6.771565 | 1.264396
6 | 8.196447 | 1.311452
7 | 9.581034 | 1.355043
8 | 10.93 | 1.396
9 | 12.26 | 1.434
10 | 13.57 | 1.471
11 | 14.86 | 1.506
12 | 16.14 | 1.539
13 | 17.41 | 1.572
14 | 18.66 | 1.603
15 | 19.90 | 1.633
16 | 21.14 | 1.662
17 | 22.37 | 1.690
18 | 23.59 | 1.718
19 | 24.80 | 1.744
20 | 26.01 | 1.770

Table 4. Corresponding a and b for [0.99, 0.9999999] Distribution Matching

Degrees of Freedom, n | \( LR{T_n} \) | \( Z_n^{M/SD} \) | Difference (%) | \( Z_n^{Q\_Full} \) | Difference (%) | \( Z_n^{Q\_[ {0.99,0.9999999} ]} \) | Difference (%)
1 | 15.137 | 10.00 | – | 15.14 | – | 15.14 | –
2 | 18.421 | 8.21 | 17.9% | 12.75 | 15.8% | 15.08 | 0.4%
3 | 21.108 | 7.39 | 26.0% | 11.73 | 22.5% | 15.05 | 0.6%
4 | 23.513 | 6.90 | 31.0% | 11.13 | 26.4% | 15.02 | 0.8%
5 | 25.745 | 6.56 | 34.4% | 10.72 | 29.1% | 15.01 | 0.9%
6 | 27.856 | 6.31 | 36.9% | 10.43 | 31.1% | 14.99 | 1.0%
7 | 29.878 | 6.11 | 38.8% | 10.20 | 32.6% | 14.98 | 1.1%
8 | 31.828 | 5.96 | 40.4% | 10.01 | 33.9% | 14.97 | 1.1%
9 | 33.720 | 5.83 | 41.7% | 9.85 | 34.9% | 14.97 | 1.2%
10 | 35.564 | 5.72 | 42.8% | 9.72 | 35.7% | 14.95 | 1.2%

  • Notes: \( LR{T_n} \) is the corresponding value of \( \chi _{( n )}^2 \) at the 0.9999 quantile; \( Z_n^{M/SD} \) is the mean/SD transformed \( LR{T_n} \); \( Z_n^{Q\_Full} \) is the quantile-matching transformed \( LR{T_n} \) based on the full distribution; \( Z_n^{Q\_[ {0.99,0.9999999} ]} \) is the quantile-matching transformed \( LR{T_n} \) based on [0.99, 0.9999999]; Difference refers to the difference compared to \( {Z_1} \).

Table 5. Mean/SD and QM Transformation (Full Distribution and [0.99, 0.9999999]) at q = 0.9999

Then, we have repeated our analysis according to Equation (4) in the extreme interval \( [1 - {10^{ - 320}}, 1 - {10^{ - 322}}] \). However, as the corresponding \( {\chi ^2} \) statistics are not calculable for quantiles larger than \( 1 - {10^{ - 320}} \) in RStudio, direct matching is not possible. To address this issue, we have used QM and natural spline interpolation to obtain estimates of the \( {\chi ^2} \) statistics at those quantiles through the following steps [45]; details are presented in Appendix B:

  • We transform the probability density function (PDF) of the \( {\chi ^2} \) distribution into a second-order ordinary differential equation (ODE) using the quantile mechanics approach.

  • A power series approach is used to estimate the first few terms of the ODE solution, which serve as the initial approximations.

  • Natural spline interpolation is used to force the initial approximations to converge to a solution close to the values computed by R.

  • The obtained spline function is then used to estimate the \( {\chi ^2} \) statistics in the extreme interval (a minimal sketch of this interpolation step follows this list).
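The sketch below illustrates only the interpolation step, under stated assumptions: it parameterizes quantiles by \( s = - {\log _{10}} \)(upper-tail probability), seeds a grid with qchisq(..., lower.tail = FALSE) where R can still compute it, and interpolates with a natural spline. The paper instead seeds the grid with the power-series approximates from the QM approach (Appendix B):

```r
# Estimate extreme chi^2_(11) quantiles at upper-tail probabilities 10^(-s).
s_grid <- seq(290, 300, by = 0.5)
q_grid <- qchisq(10^(-s_grid), df = 11, lower.tail = FALSE)
q_spline <- splinefun(s_grid, q_grid, method = "natural")   # natural spline
q_spline(295.25)   # estimated chi^2_(11) quantile at 1 - 10^(-295.25)
```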

Then, the estimated \( {\chi ^2} \) statistics are used for quantile matching via Equation (4), and the resulting \( {\hat{a}_n} \) and \( {\hat{b}_n} \) are presented in Table 6. In practice, we suggest that IS researchers use the following criteria to decide how the transformation should be done (a small helper encoding these criteria follows the list):

  • If the LRT statistics exceed 1,500, Table 6 should be used.

  • If the corresponding quantile of the LRT statistic is larger than 0.99, Table 4 should be used; otherwise, Table 2 is sufficient.
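A small helper encoding these criteria might look as follows (the 1,500 cutoff and table choices are taken directly from the list above; the function only indicates which table's constants to use):

```r
which_table <- function(lrt, df) {
  if (lrt > 1500) {
    "Table 6 (extreme scenario)"
  } else if (pchisq(lrt, df = df) > 0.99) {
    "Table 4 ([0.99, 0.9999999] interval)"
  } else {
    "Table 2 (full distribution)"
  }
}
which_table(24522.13, df = 11)   # "Table 6 (extreme scenario)"
```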

Degrees of Freedom, n | \( {\hat{a}_n} \) (Extreme Scenario) | \( {\hat{b}_n} \) (Extreme Scenario)
2 | 9.2342 | 0.9999
3 | 15.5549 | 1.0004
4 | 19.4130 | 1.0021
5 | 25.4821 | 1.0021
6 | 29.8352 | 1.0031
7 | 35.7912 | 1.0028
8 | 40.2895 | 1.0035
9 | 42.9536 | 1.0053
10 | 47.2076 | 1.0060
11 | 51.6892 | 1.0064
12 | 56.7695 | 1.0064
13 | 61.0236 | 1.0068
14 | 64.9559 | 1.0074
15 | 69.4390 | 1.0076
16 | 72.3630 | 1.0089
17 | 76.5250 | 1.0092
18 | 80.0481 | 1.0099
19 | 84.3088 | 1.0101
20 | 87.9563 | 1.0107

Table 6. Corresponding a and b for Extreme Scenario Matching (\( [1 - 10^{-320}, 1 - 10^{-322}] \))

This transformation step is essential: it ensures comparison accuracy and flexibility for multivariate effect sizes across factors with different numbers of operationalized variables. As shown in Table 18, in the context of the factors that influence accident severity, the LRT statistics for Location and Time are 24,522.13 and 8,699.50. Because they are operationalized as 11 variables and one variable, respectively, their LRT statistics theoretically cannot be compared directly, and their cumulative probabilities are computed as 1 in RStudio. If we use the mean/SD-standardized LRT statistics for comparison, the transformed statistic for Location is \( \frac{{24,522.13 - 11}}{{\sqrt {22} }} = 5,\!225.79 \), while the transformed statistic for Time is \( \frac{{8,699.50 - 1}}{{\sqrt 2 }} = 6,\!150.77 \), and we would conclude that Time is more influential than Location in explaining accident severity. However, if we apply the quantile-matching transformation method, the transformed statistic for Location is \( \frac{{24,522.13 - 51.6892}}{{1.0064}} = 24,\!314.83 \), which is much larger than 8,699.50. That is, Location is in fact more influential than Time in explaining accident severity.
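The transformation itself is a one-line lookup; below is a sketch of the Location/Time comparison above, with the constants for df = 11 taken from Table 6:

```r
qm_transform <- function(lrt, a, b) (lrt - a) / b              # Equation (5)
z_location <- qm_transform(24522.13, a = 51.6892, b = 1.0064)  # df = 11
z_time     <- 8699.50                          # df = 1: no adjustment needed
c(Location = z_location, Time = z_time)        # 24,314.83 vs. 8,699.50
```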

3.4 The Development of Absolute Multivariate Effect Size

The method outlined above enables us to identify the influential factors within one model. However, it is sometimes more meaningful to conclude whether a factor is influential based on an absolute threshold. To do so, we use the proportion of explanatory power provided by the focal factor, similar to \( {R^2} \): (6) \( \begin{equation} {p_i} = \frac{{LR{T_i}}}{{LR{T_{\boldsymbol{X}}}}}, \end{equation} \)where \( LR{T_i} \) is the explanatory power provided by \( {X_i} \), and \( LR{T_{\boldsymbol{X}}} \) is the total explanatory power provided by all Xs.

However, because one factor can be operationalized as several variables and a greater number of variables provides more freedom for the model to estimate the variance in the dependent variable [20, 44], a factor operationalized as more variables tends to achieve a higher absolute multivariate effect size under Equation (6). To adjust for the bias caused by the number of operationalized variables, we use the quantile-matching transformed statistic to calculate the absolute multivariate effect size, in a manner similar to the adjusted \( {R^2} \):

(7) \( \begin{equation} Adjusted\ {p_i} = \frac{{Z_i^q}}{{Z_{\boldsymbol{X}}^q}}, \end{equation} \)where \( Z_i^q \) is the quantile-matching transformed multivariate effect size of \( {X_i} \), \( Z_{\boldsymbol{X}}^q \) is the quantile-matching transformed multivariate effect size of all Xs, and the transformation can be based on the full distribution, the [0.99, 0.9999999] interval, or the extreme scenario. We will demonstrate the use of this method in Section 4.

Moreover, we can provide thresholds for researchers to refer to by building a connection with Cohen's \( {f^2} \), defined as: \( \begin{equation*} {f^2} = \frac{{{R^2}}}{{1 - {R^2}}}, \end{equation*} \)the ratio between the variance explained by the IVs and the variance explained by the residuals. According to Cohen's [13] guidelines, \( {f^2} \ge 0.02,\ {f^2} \ge 0.15 \), and \( {f^2} \ge 0.35 \) represent small, medium, and large effect sizes, respectively, with corresponding \( {R^2} \) values of 0.196, 0.36, and 0.51. Because \( {R^2} \) is analogous to \( {p_i} \) (one measures the proportion of variance explained, the other the proportion of explanatory power provided by IVs), we have adopted the same thresholds for \( {p_i} \). Furthermore, \( Adjusted\ {p_i} \) can be rewritten as \( \begin{equation*} Adjusted\ {p_i} = \frac{{Z_i^q}}{{Z_{\boldsymbol{X}}^q}} = \left\{ \begin{array}{@{}*{1}{c}@{}} {{{\hat{b}}_N} \cdot \frac{{{{\widehat {LRT}}_i}}}{{{{\widehat {LRT}}_{\boldsymbol{X}}} - {{\hat{a}}_N}}},\ \ n = 1}\\[6pt] {{{\hat{b}}_N} \cdot \frac{{\frac{{{{\widehat {LRT}}_i} - {{\hat{a}}_n}}}{{{{\hat{b}}_n}}}}}{{{{\widehat {LRT}}_{\boldsymbol{X}}} - {{\hat{a}}_N}}},\ \ n > 1} \end{array}\right., \end{equation*} \)where n is the number of variables by which the focal factor \( {X_i} \) is operationalized and N is the number of IVs in the full model. Because \( {\hat{a}_N} \) is small relative to \( {\widehat {LRT}_{\boldsymbol{X}}} \), \( Adjusted\ {p_i} \) is approximately equal to \( {\hat{b}_N} \cdot {p_i} \) when \( n = 1 \), and to \( {\hat{b}_N} \cdot {\tilde{p}_i} \) when \( n > 1 \), where \( {\tilde{p}_i} = \) \( \frac{{\frac{{{{\widehat {LRT}}_i} - {{\hat{a}}_n}}}{{{{\hat{b}}_n}}}}}{{{{\widehat {LRT}}_{\boldsymbol{X}}}}} \) is the proportion of explanatory power provided by the focal factor after adjusting for its number of operationalized variables. The thresholds for \( Adjusted\ {p_i} \) are then obtained by multiplying the \( {p_i} \) thresholds above by the corresponding \( {\hat{b}_N} \) from Table 2, 4, or 6, with degrees of freedom equal to the number of IVs in the full model.
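As a worked sketch of Equations (6) and (7), the following computes the absolute multivariate effect size of Membership Tier in the 100% sample, with the LRT values taken from Table 8 and the extreme-scenario constants from Table 6 (6 variables for the factor, 13 IVs in the full model):

```r
lrt_tier <- 44733.9; lrt_full <- 1070726   # Table 8, 100% sample
a_6  <- 29.8352; b_6  <- 1.0031            # Table 6, df = 6
a_13 <- 61.0236; b_13 <- 1.0068            # Table 6, df = 13

p_i <- lrt_tier / lrt_full                                             # Equation (6)
adjusted_p_i <- ((lrt_tier - a_6) / b_6) / ((lrt_full - a_13) / b_13)  # Equation (7)
c(p = p_i, adjusted = adjusted_p_i)        # adjusted ~ 0.042, matching Table 11
```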


4 EXAMPLE: E-MAIL COMMUNICATION EFFECTIVENESS

To demonstrate the use of the transformed statistic, we have used a large dataset from the e-mail archive of an international coffee house operating in China. The dataset includes 6,230,253 e-mails sent to 2,206,652 members covering three membership tiers: Bottom Tier, Middle Tier, and Top Tier. E-mails are classified into three categories: New Product, Promotion/Discount, and Upgrade Incentive/Reminder. We have also obtained information about consumers' previous e-mail opening decisions, their membership duration, whether the e-mail is seasonal, and whether the company name is mentioned in the title. E-mail marketing has had a positive impact on sales [58]; e-mail communications not only serve as price discrimination tools but are also a form of "advertising" for the firm's products [48]. Therefore, we have investigated what factors influence consumers' e-mail opening decisions by building the following logit model: \( \begin{equation*} {P_{Open\ an\ email}} = logit({\theta _0} + {\theta _1}\left( {Membership\ Tier} \right) + {\theta _2}\left( {Message\ Type} \right) + {\theta _3}\left( {Membership\ Tier \times Message\ Type} \right) + {\theta _4}\left( {Control\ Variables} \right) + \varepsilon ) \end{equation*} \)

We chose Membership Tier and Message Type as our main focus because prior purchase is commonly used as an indicator of familiarity with the firm [52], and familiarity is associated with increased levels of trust [19]. Following this line of reasoning, a higher membership tier indicates a higher level of trust, which could (1) reduce suspicion that the received e-mail is spam and (2) improve receptiveness toward the firm's communications. In addition, the design of the message has been shown to significantly influence the effectiveness of communication [7, 57], and different message types are associated with different benefits—customers may associate new product introduction e-mails with hedonic benefits, since these e-mails fulfill members' desire to learn about new products; Promotion/Discount e-mails communicate potential monetary gains and provide utilitarian benefits; and Upgrade Incentive/Reminder e-mails offer symbolic benefits. Hence, it is reasonable to believe that Message Type will play an important role in consumers' e-mail opening decisions.

The interactions between membership tiers and incentive e-mails are dropped due to high correlations with other variables (>0.85). Table 7 presents the results. Clear evidence of the deflated p-value problem is observed: the p-values for most variables are \( < 2E{-}16 \), lower than the minimum value that can be shown by the logit package of RStudio 1.1.383. The traditional effect size method (\( \theta \)) suggests that Previous E-mail Opened has the largest effect on consumers' current opening decisions.

Variable | 5% | 10% | 15% | 20% | 25% | 50% | 100%
Intercept | −1.667*** | −1.666*** | −1.677*** | −1.673*** | −1.689*** | −1.676*** | −1.675***
New Product | −0.426*** | −0.456*** | −0.411*** | −0.407*** | −0.381*** | −0.414*** | −0.411***
Upgrade Incentive/Reminder | 0.505*** | 0.512*** | 0.530*** | 0.538*** | 0.521*** | 0.523*** | 0.530***
Other | −0.479*** | −0.473*** | −0.474*** | −0.475*** | −0.468*** | −0.475*** | −0.475***
Previous E-mail Opened | 1.918*** | 1.921*** | 1.906*** | 1.914*** | 1.912*** | 1.918*** | 1.916***
Middle Tier | 0.064** | 0.061*** | 0.073*** | 0.072*** | 0.088*** | 0.075*** | 0.073***
Top Tier | 0.284*** | 0.268*** | 0.286*** | 0.273*** | 0.293*** | 0.294*** | 0.285***
Membership Duration | −0.115*** | −0.106*** | −0.112*** | −0.109*** | −0.109*** | −0.110*** | −0.110***
Company Name | 0.326*** | 0.334*** | 0.311*** | 0.314*** | 0.301*** | 0.306*** | 0.308***
Seasonal/Festive | 0.239*** | 0.240*** | 0.240*** | 0.239*** | 0.243*** | 0.242*** | 0.239***
New Product X Middle Tier | −0.059* | −0.010 | −0.039* | −0.065*** | −0.078*** | −0.044*** | −0.045***
New Product X Top Tier | 0.290*** | 0.342*** | 0.311*** | 0.284*** | 0.288*** | 0.308*** | 0.302***
Other X Middle Tier | 0.389*** | 0.390*** | 0.394*** | 0.395*** | 0.374*** | 0.385*** | 0.391***
Other X Top Tier | 0.321*** | 0.324*** | 0.333*** | 0.343*** | 0.322*** | 0.318*** | 0.330***

Table 7. Coefficient Estimates for Each Subsample Using the Simple Logit Model

However, if we attempt to compare the overall effects of Membership Tier and Message Type, we encounter a problem, particularly if interaction terms are also considered. Based on the coefficients alone, we might assume Message Type is more influential because the coefficients of its variables are larger.

First, we have applied the LRT method to obtain the multivariate effect size at the factor level; the results are presented in Table 8. Then, we have calculated the correlation between the obtained LRT statistics and the squared coefficients \( \theta \) for the univariate factors (Membership Duration, Previous E-mail Opened, Seasonal/Festive, and Company Name) to demonstrate the close relationship between the LRT method and the traditional effect size method (\( \theta \)). All the correlations are very close to 1 (Table 9), indicating that the LRT method inherits the merits of the traditional effect size method.

Factor | 5% | 10% | 15% | 20% | 25% | 50% | 100% | Df | Rank
Membership Duration | 462.34 | 954.99 | 1399.80 | 1880.77 | 2330.16 | 4712.92 | 9371.38 | 1 | 5
Membership Tier | 2219.15 | 4421.59 | 6740.06 | 8937.09 | 11101.8 | 22344.7 | 44733.9 | 6 | 2
Message Type | 1940.02 | 4005.05 | 5993.20 | 7922.20 | 9941.82 | 19707.2 | 39527.7 | 7 | 3
Previous E-mail Opened | 43650 | 87032 | 131354 | 174273 | 218218 | 435832 | 871345 | 1 | 1
Seasonal/Festive | 330.73 | 646.89 | 977.96 | 1271.32 | 1594.58 | 3254.57 | 6453.83 | 1 | 6
Company Name | 475.11 | 975.02 | 1489.92 | 1955.73 | 2473.37 | 4916.54 | 9751.75 | 1 | 4
Full Model | 53364.2 | 107006.6 | 161096.8 | 214501.2 | 268587 | 536716 | 1070726 | 13 |

Table 8. Original Multivariate Effect Size

Pearson Correlation | 5% | 10% | 15% | 20% | 25% | 50% | 100%
LRT Statistic and Square of Coefficient | 0.9998 | 0.9997 | 0.9998 | 0.9998 | 0.9998 | 0.9998 | 0.9998

Table 9. Pearson Correlation between LRT Statistic and Square of Coefficient in Univariate Situation

After that, we have applied the quantile-matching transformation to the LRT statistics in Table 8 using the constants in Table 6. The most influential factor in explaining consumers' decisions to open e-mails is their past decision, followed by Membership Tier and Message Type (Table 10). We have also calculated the absolute multivariate effect size for each factor according to Equation (7) and present the results in Table 11. The thresholds for small, medium, and large effect sizes are 0.197, 0.362, and 0.513. The absolute multivariate effect size of Previous E-mail Opened is around 0.82. Adopting statistical reasoning similar to that of the adjusted \( {R^2} \), we can interpret this as Previous E-mail Opened providing around 82% of the explanatory power of the whole model. Compared with the effect size thresholds, Previous E-mail Opened has a large absolute effect size, whereas the other factors fall below the small-effect threshold.

Factor | 5% | 10% | 15% | 20% | 25% | 50% | 100% | Df | Rank
Membership Duration | 462.34 | 954.99 | 1399.80 | 1880.77 | 2330.16 | 4712.92 | 9371.38 | 1 | 5
Membership Tier | 2182.55 | 4378.18 | 6689.49 | 8879.73 | 11037.75 | 22245.90 | 44565.91 | 6 | 2
Message Type | 1898.91 | 3958.18 | 5940.77 | 7864.39 | 9878.37 | 19616.48 | 39381.64 | 7 | 3
Previous E-mail Opened | 43651 | 87032 | 131354 | 174273 | 218219 | 435833 | 871345 | 1 | 1
Seasonal/Festive | 330.73 | 646.89 | 977.96 | 1271.32 | 1594.58 | 3254.57 | 6453.83 | 1 | 6
Company Name | 475.11 | 975.02 | 1489.92 | 1955.73 | 2473.37 | 4916.54 | 9751.75 | 1 | 4
Full Model | 52943.16 | 106223.26 | 159948.13 | 212991.83 | 266712.33 | 533030.37 | 1063433.63 | 13

Table 10. Relative Multivariate Effect Size (Quantile-Matching Transformed LRT Statistic)

Factor | 5% | 10% | 15% | 20% | 25% | 50% | 100% | Df
Membership Duration | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 1
Membership Tier | 0.041 | 0.041 | 0.042 | 0.042 | 0.041 | 0.042 | 0.042 | 6
Message Type | 0.036 | 0.037 | 0.037 | 0.037 | 0.037 | 0.037 | 0.037 | 7
Previous E-mail Opened | 0.824 | 0.819 | 0.821 | 0.818 | 0.818 | 0.818 | 0.819 | 1
Seasonal/Festive | 0.006 | 0.006 | 0.006 | 0.006 | 0.006 | 0.006 | 0.006 | 1
Company Name | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 1

Table 11. Absolute Multivariate Effect Size

  • Notes: Factors with large effect sizes are highlighted in bold.

4.1 Comparison with Predictively Influential Factors

Although explanation is central to theory development, methods for assessing the predictive claims of a model have received increasing attention in recent years. As a result, we have also compared the influential factors identified by our method with those identified by the predictive measurement of the area under the ROC curve (AUROC).

First, we will demonstrate below how we have identified influential factors based on their predictive accuracy by applying ROC analysis. After fitting a logistic regression model: (8) \( \begin{equation} \log \left( {\frac{p}{{1 - p}}} \right) = {\theta _0} + {\theta _1}{X_1} + {\theta _2}{X_2} + \cdots + {\theta _k}{X_k}, \end{equation} \)where p = Pr(Y = 1), we can estimate the value of p, denoted by \( \hat{p} \), representing a score for prediction. Predicted values of Y can be derived by setting a “threshold value” \( v \). An observation is predicted as “Y = 1” if \( \hat{p} \ge v \), where \( \hat{p} \) is calculated by substituting \( {\theta _i} \) with the ML estimate \( {\hat{\theta }_i} \).

A ROC curve is commonly used to assess the predictive performance of a binary regression; it is a plot of sensitivity (the true positive rate) against 1 − specificity (the false positive rate). By varying the threshold value \( v \) and calculating the AUROC, we can measure a model's predictive performance and the probability of making a correct binary classification [8, 26]. In practice, the calculation of AUROC is based on sample misclassification errors. To obtain a reliable estimate of AUROC, we have used N-fold cross-validation [27]. For example, with N = 10, we have randomly separated the data set into 10 non-overlapping blocks. To predict the category for observations in block j, j = 1, …, 10, we have obtained \( {\hat{\theta }_i} \) to determine the formula for the score \( \hat{p} \) using the other nine blocks and computed AUROC(j). The AUROC is then estimated from the sample mean \( \mathop \sum \nolimits_{j = 1}^{10} AUROC( j )/10 \).

Using cross-validation, we have obtained \( AURO{C_{full}} \) for the full model. To assess the effect size of \( {X_i} \), we have refitted the logistic regression model in (8) without \( {X_i} \) in each block j and recalculated \( AURO{C_i} \). The level of influence is the difference in AUROC: \( \begin{equation*} DAURO{C_i} = AURO{C_{full}} - AURO{C_i}. \end{equation*} \)
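A compact sketch of this cross-validated DAUROC procedure is given below, assuming scikit-learn and a numeric design matrix X whose columns for the focal factor are known; the names and defaults are illustrative.

```python
# Sketch: 10-fold cross-validated AUROC, and DAUROC as the drop in AUROC
# when one factor's columns are removed (column indices are illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

def cv_auroc(X, y, n_folds=10):
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    aucs = []
    for train, test in folds.split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        aucs.append(roc_auc_score(y[test], model.predict_proba(X[test])[:, 1]))
    return np.mean(aucs)

def dauroc(X, y, factor_cols):
    keep = [c for c in range(X.shape[1]) if c not in factor_cols]
    return cv_auroc(X, y) - cv_auroc(X[:, keep], y)
```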

In our example, we have found Previous E-mail Opened is the most influential factor—even from the perspective of prediction—followed by Message Type and Membership Tier. Seasonal/Festive and Company Name contributed less predictive power. We have also replicated the ROC analysis with various sample sizes and have reported the results in Tables 12 and 13; the main results remain unchanged. However, we have noted that the influence ranking of factors is not stable when the sample size is too small (below 10%).

Factor | 5% | 10% | 15% | 20% | 25% | 50% | 100% | Df
Membership Duration | 0.00306 | 0.00311 | 0.00300 | 0.00308 | 0.00304 | 0.00309 | 0.00304 | 1
Membership Tier | 0.00699 | 0.00680 | 0.00687 | 0.00688 | 0.00688 | 0.00688 | 0.00695 | 6
Message Type | 0.00831 | 0.00870 | 0.00853 | 0.00848 | 0.00852 | 0.00850 | 0.00853 | 7
Previous E-mail Opened | 0.14508 | 0.14513 | 0.14539 | 0.14494 | 0.14525 | 0.14497 | 0.14496 | 1
Seasonal/Festive | 0.00093 | 0.00084 | 0.00089 | 0.00090 | 0.00085 | 0.00088 | 0.00088 | 1
Company Name | 0.00089 | 0.00094 | 0.00095 | 0.00090 | 0.00097 | 0.00093 | 0.00092 | 1

Table 12. Average DAUROC Results for Each Subsample

Factor | 5% | 10% | 15% | 20% | 25% | 50% | 100%
Membership Duration | 4 | 4 | 4 | 4 | 4 | 4 | 4
Membership Tier | 3 | 3 | 3 | 3 | 3 | 3 | 3
Message Type | 2 | 2 | 2 | 2 | 2 | 2 | 2
Previous E-mail Opened | 1 | 1 | 1 | 1 | 1 | 1 | 1
Seasonal/Festive | 5 | 6 | 6 | 6 | 6 | 6 | 6
Company Name | 6 | 5 | 5 | 5 | 5 | 5 | 5

Table 13. Average DAUROC Rank Results for Each Subsample

Next, we compare the results obtained from the LRT and the AUROC analyses; both sets of results are highly correlated. First, we have calculated the Pearson correlation of the test statistic values produced by these two methods (presented in Table 14); it is above 0.999 across all sample sizes. We have then calculated the Spearman correlation of the two sets of rankings. This correlation is lower, as the AUROC analysis identifies Message Type as the second most influential factor, whereas the LRT analysis indicates that Membership Tier is the second most influential. However, the correlation is still above 0.8 once the AUROC analysis results become stable (as shown in Table 15).

Pearson Correlation | 5% | 10% | 15% | 20% | 25% | 50% | 100%
DAUROC/QM (Full Distribution) | 0.9991 | 0.9991 | 0.9991 | 0.9991 | 0.9991 | 0.9991 | 0.9991
DAUROC/QM (0.95 to 0.99999) | 0.9997 | 0.9996 | 0.9997 | 0.9997 | 0.9997 | 0.9996 | 0.9996

Table 14. Pearson Correlation Results

Spearman Correlation | 5% | 10% | 15% | 20% | 25% | 50% | 100%
DAUROC/QM (Full Distribution) | 0.77143 | 0.88571 | 0.88571 | 0.88571 | 0.88571 | 0.88571 | 0.88571
DAUROC/QM (0.95 to 0.99999) | 0.77143 | 0.88571 | 0.88571 | 0.88571 | 0.88571 | 0.88571 | 0.88571

Table 15. Spearman Correlation Results

Initially, our finding may seem surprising, as explaining and predicting are different tasks: the type of uncertainty associated with explanation is of a different nature than that associated with prediction [28]. According to Shmueli [51], measurable data are not accurate representations of their underlying constructs; the operationalization of theories and constructs into statistical models and measurable data therefore creates a disparity between the ability to explain phenomena at the conceptual level and the ability to generate predictions at the measurable level. That is, the results of explanation usually differ from those of prediction. However, Konishi and Kitagawa [35] have pointed out that there may be no significant difference between inferring the true structure and making a prediction if an infinitely large quantity of data is available or if the data are noiseless. Our results provide empirical support for this view, and the demonstration is particularly relevant to big data research, where predictive performance is the main objective when building statistical models. Furthermore, LRT analysis substantially outperforms ROC analysis in terms of computation time (shown in Table 16): on average, ROC analysis is around 100 times more time-consuming than our method.

Analysis | 5% | 10% | 15% | 20% | 25% | 50% | 100%
LRT Analysis | 2.7 | 5.0 | 8.5 | 11.9 | 14.1 | 29.7 | 62.0
ROC Analysis | 281.5 | 602.9 | 995.0 | 1339.3 | 1798.8 | 3661.9 | 7240.1

Table 16. Average Computation Time (s) of Standardized LRT Analysis and ROC Analysis

4.2 Replication on the US Accidents Dataset

We have replicated our analysis on a public dataset—the US accidents dataset. This is a countrywide car accident dataset that covers 49 states of the USA and was collected by Moosavi et al. [41]. The accident data were collected from February 2016 to December 2020 and include 47 variables. After excluding the records that contain missing values, 1,464,180 records remain.

Traffic accidents are a major public health problem, and predicting their causes and occurrences has been a topic of interest in the field of machine learning. Several studies have used Moosavi's accident dataset to study this issue. Moosavi et al. [40] used a deep neural network model to predict real-time traffic accidents based on traffic events, weather data, time, and points of interest; they evaluated their model with the F1-score, defined as \( \frac{{2{\rm{*}}precision{\rm{*}}recall}}{{precision + recall}} \), and demonstrated a significant improvement for the accident class. Kebede [31] used the image data in this dataset to predict locations where car accidents are likely to happen, applying transfer learning with a CNN. Parra et al. [46] evaluated different explainable machine learning models (e.g., random forests and decision trees) on this dataset in predicting road traffic crashes. Brodeur et al. [10] examined the impact of COVID-19 safer-at-home policies on car crashes with this dataset and found a 20% reduction in vehicular collisions.

In our example, we have tried to understand how the severity of an accident is influenced by three factors: weather, location, and time. The original accident severity in the dataset has four levels. For demonstration purposes, we have transformed it into a binary variable by considering levels 1 and 2 as less severe and levels 3 and 4 as very severe, so that we could apply a logit model. Weather is described by temperature, wind_chill, humidity, pressure, visibility, wind_speed, and precipitation; Location is described by bump, crossing, give_way, junction, no_exit, railway, roundabout, station, stop, traffic_calming, traffic_signal, and turning_loop; Time is indicated by the variable sunrise_sunset (whether it is day or night). A simple logit model is built, and the results are presented in Table 17. Again, clear evidence of the deflated p-value problem is observed, as the p-values for most variables are \( < 2E - 16 \).
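As a rough illustration (not the authors' exact preprocessing), the binarization and logit fit might look as follows, assuming a cleaned DataFrame `acc` whose columns follow the dataset's published names and are already numeric or boolean.

```python
# Sketch: binarize the 4-level severity and fit the logit model of Table 17.
# `acc` and its column names are assumptions based on the public dataset.
import statsmodels.formula.api as smf

acc["severe"] = (acc["Severity"] >= 3).astype(int)  # levels 3-4 = very severe
formula = (
    "severe ~ Temperature + Wind_Chill + Humidity + Pressure + Visibility"
    " + Wind_Speed + Precipitation + Bump + Crossing + Give_Way + Junction"
    " + No_Exit + Railway + Roundabout + Station + Stop + Traffic_Calming"
    " + Traffic_Signal + Sunrise_Sunset"
)
model = smf.logit(formula, data=acc).fit(disp=0)
print(model.summary())
```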

Variable | Estimate | Std. Error | z value | Pr(>|z|)
Intercept | 0.5806 | 0.0554 | 10.4705 | <2.22e-16***
Temperature | 0.0830 | 0.0012 | 69.9780 | <2.22e-16***
Wind_Chill | −0.0676 | 0.0010 | −64.6209 | <2.22e-16***
Humidity | 0.0069 | 0.0001 | 59.6018 | <2.22e-16***
Pressure | −0.1147 | 0.0019 | −60.5012 | <2.22e-16***
Visibility | −0.0012 | 0.0008 | −1.4735 | 0.14062299
Wind_Speed | 0.0139 | 0.0004 | 31.0866 | <2.22e-16***
Precipitation | 0.8354 | 0.0484 | 17.2714 | <2.22e-16***
Bump | −0.9891 | 0.1762 | −5.6147 | 1.97E-08***
Crossing | −0.6695 | 0.0127 | −52.9124 | <2.22e-16***
Give_Way | 0.2831 | 0.0456 | 6.2021 | 5.57E-10***
Junction | 0.2278 | 0.0071 | 31.8630 | <2.22e-16***
No_Exit | 0.2542 | 0.0614 | 4.1410 | 3.46E-05***
Railway | 0.3592 | 0.0265 | 13.5350 | <2.22e-16***
Roundabout | −2.5100 | 0.7162 | −3.5044 | 0.00045771***
Station | −0.4722 | 0.0196 | −24.1444 | <2.22e-16***
Stop | −1.0027 | 0.0250 | −40.0614 | <2.22e-16***
Traffic_Calming | 1.3308 | 0.1209 | 11.0087 | <2.22e-16***
Traffic_Signal | −0.6657 | 0.0081 | −82.0339 | <2.22e-16***
Sunrise_Sunset | −0.4671 | 0.0051 | −92.1781 | <2.22e-16***

Table 17. Coefficient Estimates for U.S. Accidents Using the Simple Logit Model

First, we have used the LRT method to obtain the raw multivariate effect size at the factor level and then applied the quantile-matching transformation using Table 6; the results are presented in Table 18. We have found that the most influential factor in explaining accident severity is Location, followed by Weather. We have also calculated the absolute multivariate effect size for each factor according to Equation (7) and have presented the results in Table 18: Location provides about 47% of the whole model's explanatory power, and Weather provides about 31%. As the thresholds for small, medium, and large effect sizes are 0.198, 0.364, and 0.515, we can conclude Location has a medium absolute effect size, Weather has a small effect size, and Time has a lower than small effect size.

Factor | LRT Statistic | Standardized LRT Statistic | Relative Multivariate Effect Size | Absolute Multivariate Effect Size | Average DAUROC | Df
Weather | 16085.99 | 4297.29 | 16005.39 | 0.3095 | 0.0323 | 7
Location | 24522.13 | 5225.79 | 24314.83 | 0.4702 | 0.0385 | 11
Time | 8699.50 | 6150.77 | 8699.50 | 0.17 | 0.0105 | 1

Table 18. Multivariate Effect Size and DAUROC Results of U.S. Accident Dataset

Furthermore, we have performed the ROC analysis with 10-fold cross-validation and calculated the DAUROC for each factor—the results are presented in Table 18. We have found Weather and Location are more influential in predicting accident severity, which is consistent with the results of the multivariate effect size method; the Pearson correlation is 0.9401, and the Spearman correlation is 1.

Notice that if we used the standardized LRT statistics to compare the influence of factors, we would conclude that Time is the most influential factor in explaining accident severity, followed by Location and Weather. However, the AUROC method suggests Location is the most influential factor in predicting accident severity, followed by Weather and Time, which is consistent with the results generated by the transformed multivariate effect size method. This highlights the importance of the quantile-matching transformation technique in the large sample context. In addition, we have observed that the direct use of raw LRT statistics can generate results similar to those of the quantile-matching transformation; however, it lacks theoretical grounding. We therefore recommend the quantile-matching transformation method in general for easy and fast calculation.

4.3 Replication on the Airbnb Listing Dataset

We have used the logit regression model in the previous two examples. In the following example, we have replicated our analysis on the Airbnb listing dataset1 with a linear regression model. This dataset contains 250,000+ listings in ten major cities, including information about hosts, pricing, location, room type, and review scores.

Price determinants of Airbnb listings have always been of interest to researchers, given the uniqueness of Airbnb's listings and the heterogeneity of its landlords [56]. Wang and Nicolau [53] explored the impacts of five factors—host attributes, site and property attributes, amenities and services, rental rules, and online review ratings—on prices using 180,533 accommodation rental offers in 33 cities. Cai et al. [11] examined the impacts of five groups of explanatory variables on Airbnb prices in Hong Kong: listing attributes, host attributes, rental policies, listing reputation, and listing location. Wu and Qiu [56] focused on the influence of nine factors (external factors, landlord characteristics, location characteristics, listing characteristics, room facilities, rental rules, trust, sociality, and tenant characteristics) on listing prices using 51,874 listings in 36 cities in China. Similarly, in this example, we have investigated the impact of host attributes, property attributes, location, and online review ratings on listing prices (the prices are converted to Euros). Host attributes include the duration the host has been on Airbnb, whether the host's identity is verified, whether the host is a “Superhost”, and the total number of listings the host has on Airbnb. The property attribute describes the property as an entire place, a private room, or a shared room. Location refers to the city the property is in. Online review ratings include the overall rating and the respective ratings of cleanliness, check-in experience, communication experience with the host, location within the city, and the listing's value relative to its price. There are 161,447 listings remaining after cleaning the data. A simple linear model is built, and the results are presented in Table 19. The p-values for quite a few variables are <2e-16, again indicating the deflated p-value problem.
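Under the same caveats as before (illustrative column names, not the paper's code), the linear model and a factor-level LRT for, say, Online Review Ratings could be sketched as follows.

```python
# Sketch: OLS price model and factor-level LRT for Online Review Ratings
# (its five rating variables). `listings` and the names are assumptions.
import statsmodels.formula.api as smf

full = smf.ols(
    "price_eur ~ host_duration + verified + superhost + n_host_listings"
    " + C(room_type) + C(city) + rating_location + rating_cleanliness"
    " + rating_checkin + rating_communication + rating_overall",
    data=listings,
).fit()
reduced = smf.ols(
    "price_eur ~ host_duration + verified + superhost + n_host_listings"
    " + C(room_type) + C(city)",
    data=listings,
).fit()
lrt_reviews = 2 * (full.llf - reduced.llf)  # joint effect of the five ratings
```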

Variable | Estimate | Std. Error | z value | Pr(>|z|)
Intercept | 53.2693 | 8.5367 | 6.24 | 4.38E-10***
Host Duration | 0.1416 | 0.0213 | 6.644 | 3.07E-11***
Verified | 7.6594 | 1.4648 | 5.229 | 1.71E-07***
Superhost | −3.5095 | 1.4316 | −2.451 | 1.42E-02*
No. of Total Lists | 0.0771 | 0.0102 | 7.583 | 3.40E-14***
Private room | −68.3483 | 5.2341 | −13.058 | <2e-16***
Shared room | 56.9390 | 3.3996 | 16.749 | <2e-16***
Cape Town | 53.0135 | 5.5498 | 9.552 | <2e-16***
Hong Kong | −1.0756 | 3.5631 | −0.302 | 0.762742
Istanbul | −2.6542 | 3.3221 | −0.799 | 0.424317
Mexico City | 72.8116 | 3.0563 | 23.823 | <2e-16***
New York | 52.7662 | 2.9360 | 17.972 | <2e-16***
Paris | 18.6970 | 3.2837 | 5.694 | 1.24E-08***
Rio de Janeiro | 38.8591 | 3.1323 | 12.406 | <2e-16***
Rome | 75.6454 | 3.1264 | 24.196 | <2e-16***
Sydney | 2.7485 | 0.7903 | 3.478 | 5.06E-04***
Review_Score_Location | −3.2523 | 1.0918 | −2.979 | 0.002895**
Review_Score_Cleanliness | 6.3694 | 0.9057 | 7.033 | 2.04E-12***
Review_Score_Checkin | 3.7115 | 1.1128 | 3.335 | 8.53E-04***
Review_Score_Communication | 53.2693 | 8.5367 | 6.24 | 4.38E-10***
Overall Rating | 0.1416 | 0.0213 | 6.644 | 3.07E-11***

Table 19. Coefficient Estimates for Airbnb Listing Prices Using the Linear Regression Model

The raw multivariate effect size of each factor is first obtained through the LRT method; the quantile-matching transformation is then applied using Tables 4 and 6 (Table 4 is used for Host Attributes and Online Review Ratings, and Table 6 is used for Property Attributes and Location), and the results are presented in Table 20. We find that the most influential factors in explaining Airbnb listing prices are Property Attributes and Location. The absolute multivariate effect size for each factor is also calculated according to Equation (7) and is presented in Table 20. Property Attributes and Location have medium absolute effect sizes (each provides about 42% of the whole model's explanatory power), while the effect sizes of Host Attributes and Online Review Ratings are lower than small.

Factor | LRT Statistic | Standardized LRT Statistic | Relative Multivariate Effect Size | Absolute Multivariate Effect Size | Average DRMSE | Df
Host Attributes | 150.89 | 51.93 | 120.03 | 0.0256 | 0.1008 | 4
Property Attributes | 1981.99 | 89.95 | 1972.86 | 0.4206 | 1.5396 | 2
Location | 2040.34 | 78.78 | 1986.82 | 0.4236 | 1.5484 | 9
Online Review Ratings | 158.19 | 48.44 | 119.76 | 0.0255 | 0.1116 | 5

Table 20. Multivariate Effect Size and DRMSE Results of Airbnb Listing Dataset

Furthermore, we have performed the root-mean-square error (RMSE) analysis with 10-fold cross-validation and calculated the difference in RMSE (DRMSE) for each factor; the results are presented in Table 20. RMSE measures the differences between the values predicted by a model and the observed values and is commonly used to assess the predictive performance of linear regression models. It is calculated as follows: \( \begin{equation*} RMSE = \sqrt {\mathop \sum \limits_{i = 1}^n \frac{{{{\left( {{{\hat{y}}_i} - {y_i}} \right)}^2}}}{n}} . \end{equation*} \)
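The DRMSE computation mirrors the DAUROC sketch above: refit the linear model without the factor's columns and take the increase in cross-validated RMSE. A minimal sketch, under the same illustrative assumptions:

```python
# Sketch: 10-fold cross-validated RMSE, and DRMSE as the increase in error
# when one factor's columns are dropped (column indices are illustrative).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_rmse(X, y, n_folds=10):
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    errs = []
    for train, test in folds.split(X):
        pred = LinearRegression().fit(X[train], y[train]).predict(X[test])
        errs.append(np.sqrt(np.mean((pred - y[test]) ** 2)))
    return np.mean(errs)

def drmse(X, y, factor_cols):
    keep = [c for c in range(X.shape[1]) if c not in factor_cols]
    return cv_rmse(X[:, keep], y) - cv_rmse(X, y)
```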

We have found Property Attributes and Location are more influential in predicting listing prices. This is consistent with the results of the multivariate effect size method, and the Pearson correlation is 0.9999.

Again, if we used the standardized LRT statistics to compare the influence of factors, we would conclude Property Attributes is the most influential factor in explaining listing prices. However, as shown by the RMSE method, Property Attributes and Location are similarly influential in predicting listing prices, which is consistent with the results generated by the transformed multivariate effect size method. This underlines the usefulness of the quantile-matching transformation technique in the large sample context.


5 CONCLUSIONS

The deflated p-value problem caused by large samples can lead to severe issues in IS research. In this study, we have proposed a multivariate effect size method to address this problem by making use of the LRT statistic. The multivariate effect size measures the joint effect of the variables by which the focal factor is operationalized and is shown to be closely related to the traditional effect size (\( \theta \)); yet it extends naturally to multivariate situations, thus overcoming the limitation that the traditional effect size (\( \theta \)) can only be applied at the variable level. This statistic asymptotically follows the \( {\chi ^2} \) distribution. However, because one factor can be operationalized as several variables, comparisons among factors with different numbers of variables may be troublesome. Previous transformation methods face limitations in the large sample context; for example, the mean/SD transformation can give results inconsistent with cumulative probabilities when the difference in cumulative probabilities is too small to be recorded. For this reason, we have applied the quantile-matching transformation method. This method achieves the three Fs: (1) Feasibility—it is feasible in scenarios where the cumulative probability method fails due to the use of extremely large samples, and it maintains a one-to-one correspondence with cumulative probability under a perfect matching situation; (2) Flexibility—the matching can be done on the full distribution or on a specific interval, providing tremendous flexibility in real applications; and (3) Fast calculation—it is easy to implement and calculate. We have also proposed using the adjusted proportion of explanatory power provided by the focal factor as the absolute effect size measure and have built connections with the classic Cohen's \( {f^2} \) to determine the thresholds for small, medium, and large effect sizes. We have applied the methods to three datasets through the following steps:

  • Obtain the estimated multivariate effect size (LRT statistic) for the factors of interest.

  • Transform the multivariate effect size using Tables 2, 4, or 6:

    • If the multivariate effect size exceeds 1,500, Table 6 should be used.

    • If the corresponding quantile of the multivariate effect size is larger than 0.99, Table 4 should be used; otherwise, Table 2 is sufficient.

  • Compare the transformed multivariate effect size or calculate the absolute effect size to identify the influential factors.

As demonstrated, our method can be used in the large sample context to identify influential factors with greater accuracy, and we have shown that explanatorily influential factors are often also predictively influential in large sample scenarios.

Our study has made the following contributions. First, we have used the LRT statistic to measure multivariate effect size at the factor level, thus addressing the deflated p-value issue. Second, we have introduced the quantile-matching transformation method to deal with the impact of the differing degrees of freedom of the \( {\chi ^2} \) distributions. This transformation provides great accuracy and flexibility, can be applied to any distribution or part of a distribution, and enables comparison regardless of the context. Third, we have shown explanatorily influential factors are also predictively influential in the large sample context, thus providing insights into the roles of explanation and prediction in the large sample context.

We must also acknowledge the limitations of our study. First, our study does not address the concern that the p-value cannot directly measure the probability of the null hypothesis given the observed data. Second, our method only applies to scenarios where the models are nested—we cannot compare the influence of factors from non-nested models. Further studies can extend the analysis to a more general context.

APPENDICES

A APPENDIX

Denote the LRT statistic of factor \( {X_i} \) by \( LR{T_i} \), which can be written as: \( \begin{equation*} LR{T_i} = - 2( {{\cal L}( {{\boldsymbol{\tilde{\theta }}}}) - {\cal L}( {{\boldsymbol{\hat{\theta }}}})}), \end{equation*} \)where \( {\cal L}( {\boldsymbol{\theta }} ) \) is the log-likelihood of \( {\boldsymbol{\theta }} \), \( {\boldsymbol{\hat{\theta }}} = ( {{{\hat{\theta }}_1},{{\hat{\theta }}_2}, \ldots {{\hat{\theta }}_v},{{\hat{\theta }}_{v + 1}}, \ldots {{\hat{\theta }}_w}} ) \) is the ML estimator for \( {\boldsymbol{\theta }} \) over the full parameter set, and \( {\boldsymbol{\tilde{\theta }}} = ( {{{\tilde{\theta }}_1},{{\tilde{\theta }}_2}, \ldots {{\tilde{\theta }}_v},0, \ldots 0} ) \) is the ML estimator under the null hypothesis \( {H_0} \): the exclusion of \( {X_i} \) has no impact on the model's goodness of fit. \( LR{T_i} \) thus measures the impact of excluding \( {X_i} \) on the model's goodness of fit and can be used as an indicator of the effect size of \( {X_i} \).

Note that if we do a Taylor expansion around \( {\boldsymbol{\hat{\theta }}} \) for \( {\cal L}( {{\boldsymbol{\tilde{\theta }}}} ) \), we will obtain: \( \begin{equation*} {\cal L}( {{\boldsymbol{\tilde{\theta }}}}) = {\cal L}( {{\boldsymbol{\hat{\theta }}}}) + \mathop \sum \limits_{i = 1}^w \left( {{{\hat{\theta }}_i} - {{\tilde{\theta }}_i}} \right)\left(\frac{{\partial {\cal L}}}{{\partial {{\hat{\theta }}_i}}}\Big|{\boldsymbol{\hat{\theta }}}\right) + \frac{1}{2}\mathop \sum \limits_{i = 1}^w \mathop \sum \limits_{j = 1}^w \left( {{{\hat{\theta }}_i} - {{\tilde{\theta }}_i}} \right)\left(\frac{{{\partial ^2}{\cal L}}}{{\partial {{\hat{\theta }}_i}\partial {{\hat{\theta }}_j}}}\Big|{\boldsymbol{\hat{\theta }}}\right)\left( {{{\hat{\theta }}_j} - {{\tilde{\theta }}_j}} \right) \end{equation*} \)The second term is zero because \( \frac{{\partial {\cal L}}}{{\partial {{\hat{\theta }}_i}}} = 0 \) at the ML estimate; therefore, the expansion can be rewritten as: \( \begin{equation*} {\cal L}( {{\boldsymbol{\tilde{\theta }}}}) = {\cal L}( {{\boldsymbol{\hat{\theta }}}}) + \frac{1}{2}( {{\boldsymbol{\hat{\theta }}} - {\boldsymbol{\tilde{\theta }}}})H{( {{\boldsymbol{\hat{\theta }}} - {\boldsymbol{\tilde{\theta }}}})^T}, \end{equation*} \)where H is the Hessian matrix and is equal to \( PD{P^T} \), D is the diagonal matrix containing the eigenvalues \( {\lambda _i} \) of H, and P is the eigenvector matrix of H. Then we have: \( \begin{equation*} - 2( {{\cal L}( {{\boldsymbol{\tilde{\theta }}}}) - {\cal L}( {{\boldsymbol{\hat{\theta }}}} )}) = - \mathop \sum \limits_{i = 1}^w {\lambda _i}p_{ii}^2{( {{{\hat{\theta }}_i} - {{\tilde{\theta }}_i}} )^2} = - \mathop \sum \limits_{i = 1}^v {\lambda _i}p_{ii}^2{( {{{\hat{\theta }}_i} - {{\tilde{\theta }}_i}})^2} - \mathop \sum \limits_{i = v + 1}^w {\lambda _i}p_{ii}^2\hat{\theta }_i^2, \end{equation*} \)where \( {p_{ii}} \) is the value on the diagonal of the eigenvector matrix P, so that \( {\lambda _i}p_{ii}^2 \) corresponds to the diagonal value of the Hessian matrix \( \frac{{{\partial ^2}{\cal L}}}{{\partial {{\hat{\theta }}_i}\partial {{\hat{\theta }}_i}}} \). Note that the first summation vanishes asymptotically: because \( {\hat{\theta }_i} \) and \( {\tilde{\theta }_i} \) are strongly correlated, their difference is \( O( {1/n} ) \) and its square is \( O( {1/{n^2}} ) \); since \( \frac{{{\partial ^2}{\cal L}}}{{\partial {{\hat{\theta }}_i}\partial {{\hat{\theta }}_i}}} \) is only \( O( n ) \), each such term is \( O( {1/n} ) \) and drops out of the summation. Therefore, the LRT statistic for \( {X_i} \) can be rewritten as: \( \begin{equation*} LR{T_i} = - 2( {{\cal L}( {{\boldsymbol{\tilde{\theta }}}}) - {\cal L}( {{\boldsymbol{\hat{\theta }}}})}) \approx - \mathop \sum \limits_{i = v + 1}^w \frac{{{\partial ^2}{\cal L}}}{{\partial {{\hat{\theta }}_i}\partial {{\hat{\theta }}_i}}}\hat{\theta }_i^2 = - \mathop \sum \limits_{i = v + 1}^w {\lambda _i}p_{ii}^2\hat{\theta }_i^2. \end{equation*} \)

B APPENDIX

The PDF and CDF of the \( {\chi ^2} \) distribution are given by: \( \begin{equation*} f\left( {x,m} \right) = \frac{1}{{{2^{\frac{m}{2}}}{\rm{\Gamma }}\left( {m/2} \right)}}{x^{\frac{m}{2} - 1}}{e^{ - \frac{x}{2}}},{\rm{\ }}m > 0,x \in \left[ {0, + \infty } \right) \end{equation*} \) \( \begin{equation*} F\left( {x,m} \right) = \frac{{\gamma \left( {\frac{m}{2},\frac{x}{2}} \right)}}{{{\rm{\Gamma }}\left( {\frac{m}{2}} \right)}} = P\left( {\frac{m}{2},\frac{x}{2}} \right), \end{equation*} \)where \( \gamma ( {.,.} ) \) is the lower incomplete gamma function and \( P( {.,.} ) \) is the regularized incomplete gamma function.

We have used the quantile mechanics (QM) approach to obtain a second-order nonlinear differential equation for the quantile function of the \( {\chi ^2} \) distribution. Suppose: (1) \( \begin{equation} Q\left( p \right) = {F^{ - 1}}\left( p \right),\ \end{equation} \)where the function \( {F^{ - 1}}( p ) \) is the compositional inverse of the CDF; then the first-order quantile equation can be obtained by differentiating Equation (1): (2) \( \begin{equation} Q'\left( p \right) = \frac{1}{{f\left( {Q\left( p \right)} \right)}} = {2^{\frac{m}{2}}}\left( {{\rm{\Gamma }}\left( {m/2} \right)} \right)Q{\left( p \right)^{1 - \frac{m}{2}}}{e^{\frac{{Q\left( p \right)}}{2}}}{\rm{\ \ }} \end{equation} \)Differentiating Equation (2) gives: \( \begin{equation*} Q''( p ) = {2^{\frac{m}{2}}}\left( {{\rm{\Gamma }}\left( {\frac{m}{2}} \right)} \right)\left[ {Q{{( p )}^{1 - \frac{m}{2}}}{e^{\frac{{Q( p )}}{2}}}\frac{1}{2}Q'( p ) + \left( {1 - \frac{m}{2}} \right)Q{{( p )}^{ - \frac{m}{2}}}{e^{\frac{{Q( p )}}{2}}}Q'( p )} \right] \end{equation*} \)

After applying factorization, we can obtain: (3) \( \begin{equation} Q''\left( p \right) = \frac{1}{2}{\left( {Q'\left( p \right)} \right)^2} + \frac{{2 - m}}{{2Q\left( p \right)}}{\left( {Q'\left( p \right)} \right)^2}{\rm{\ \ \ }} \end{equation} \)with the boundary conditions: \( Q( 0 ) = 0,\ Q'( 0 ) = 1. \)

We apply the power series approach to solve Equation (3), and the solution will be: (4) \( \begin{equation} Q\left( p \right) = {d_0} + {d_1}p + {d_2}{p^2} + {d_3}{p^3} + {d_4}{p^4} + {d_5}{p^5} + \cdots = \mathop \sum \limits_{n = 0}^\infty {d_n}{p^n}{\rm{\ \ \ \ }} \end{equation} \)The coefficients are \( {d_0},\ {d_1},{d_2},{d_3},{d_4}, \ldots ,{d_n}. \) Differentiate Equation (4): (5) \( \begin{equation} Q'\left( p \right) = {d_1} + 2{d_2}p + 3{d_3}{p^2} + 4{d_4}{p^3} + 5{d_5}{p^4} + 6{d_6}{p^5} + \cdots = \mathop \sum \limits_{n = 1}^\infty n{d_n}{p^{n - 1}}{\rm{\ \ \ \ \ }} \end{equation} \)Differentiate Equation (5): (6) \( \begin{equation} Q''\left( p \right) = 2{d_2} + 6{d_3}p + 12{d_4}{p^2} + 20{d_5}{p^3} + 30{d_6}{p^4} + 42{d_7}{p^5} + \cdots = \mathop \sum \limits_{n = 2}^\infty n\left( {n - 1} \right){d_n}{p^{n - 2}}{\rm{\ \ }} \end{equation} \)

Substituting Equations (4), (5), and (6) into Equation (3) and collecting like terms (constant, \( p \), \( {p^2} \), and \( {p^3} \)), we obtain:

When m = 1, \( Q( p ) = p + {p^2} + \frac{5}{6}{p^3} \),

When m \( \ne 1 \), \( Q( p ) = p + \frac{1}{{4( {m - 1} )}}{p^2} + \frac{1}{{6m( {m - 1} )}}{p^3} \).

We then adopt natural spline interpolation to reduce the errors between the RStudio software values and the initial approximation from the QM method.

As we focus on extreme intervals, e.g., \( [1 - {10^{ - 318}}, 1 - {10^{ - 320}}] \), the resulting \( Q( p ) \) values are very similar to each other, which creates difficulty in obtaining meaningful interpolation results. We therefore transform \( Q( p ) \) by subtracting a constant and taking the base-10 logarithm before interpolation: \( \begin{equation*} Q{( p )_{transformed}} = {\log _{10}}( {Q( p ) - c} ) + 320, \end{equation*} \)where c is a constant obtained by rounding up \( Q( {1 - {{10}^{ - 320}}} ) \).

The final closed-form expressions for the quantile functions of the \( {\chi ^2} \) distribution at degrees of freedom from 1 to 20 are:

When m = 1, \( Q{( p )_{final}} = 1465.9113 - 7.2213Q( p ) - 8.1206Q{( p )^2} - 7.5275Q{( p )^3} - 12.9783Q{( p )^4}\\ - 6.4765Q{( p )^5} \)

When m = 2, \( Q{( p )_{final}} = 1473.6545 - 7.2263Q( p ) - 8.1261Q{( p )^2} - 7.5326Q{( p )^3} - 12.9872Q{( p )^4}\\ - 6.4809Q{( p )^5} \)

When m = 3, \( Q{( p )_{final}} = 1480.5043 - 7.2310Q( p ) - 8.1315Q{( p )^2} - 7.5376Q{( p )^3} - 12.9958Q{( p )^4}\\ - 6.4852Q{( p )^5} \)

When m = 4, \( Q{( p )_{final}} = 1486.8796 - 7.2359Q( p ) - 8.1369Q{( p )^2} - 7.5426Q{( p )^3} - 13.0045Q{( p )^4}\\ - 6.4895Q{( p )^5} \)

When m = 5, \( Q{( p )_{final}} = 1492.9350 - 7.2407Q( p ) - 8.1423Q{( p )^2} - 7.5476Q{( p )^3} - 13.0132Q{( p )^4}\\ - 6.4938Q{( p )^5} \)

When m = 6, \( Q{( p )_{final}} = 1498.7505 - 7.2457Q( p ) - 8.1479Q{( p )^2} - 7.5528Q{( p )^3} - 13.0219Q{( p )^4}\\ - 6.4983Q{( p )^5} \)

When m = 7, \( Q{( p )_{final}} = 1504.3741 - 7.2504Q( p ) - 8.1532Q{( p )^2} - 7.5577Q{( p )^3} - 13.0306Q{( p )^4}\\ - 6.5025Q{( p )^5} \)

When m = 8, \( Q{( p )_{final}} = 1509.8384 - 7.2550Q( p ) - 8.1584Q{( p )^2} - 7.5625Q{( p )^3} - 13.0389Q{( p )^4}\\ - 6.5067Q{( p )^5} \)

When m = 9, \( Q{( p )_{final}} = 1515.1670 - 7.2597Q( p ) - 8.1637Q{( p )^2} - 7.5674Q{( p )^3} - 13.0474Q{( p )^4}\\ - 6.5109Q{( p )^5} \)

When m = 10, \( Q{( p )_{final}}\! = \!1520.3773 - 7.2643Q( p ) - 8.1690Q{( p )^2} - 7.5723Q{( p )^3} - 13.0557Q{( p )^4}\\ - 6.5151Q{( p )^5} \)

When m = 11, \( Q{( p )_{final}}\! = \!1525.4831 - 7.2694Q( p ) - 8.1746Q{( p )^2} - 7.5776Q{( p )^3} - 13.0645Q{( p )^4}\\ - 6.5197Q{( p )^5} \)

When m = 12, \( Q{( p )_{final}}\! = \!1530.4944 - 7.2737Q( p ) - 8.1795Q{( p )^2} - 7.5821Q{( p )^3} - 13.0725Q{( p )^4}\\ - 6.5235Q{( p )^5} \)

When m = 13, \( Q{( p )_{final}}\! = \!1535.4212 - 7.2784Q( p ) - 8.1847Q{( p )^2} - 7.5870Q{( p )^3} - 13.0810Q{( p )^4}\\ - 6.5277Q{( p )^5} \)

When m = 14, \( Q{( p )_{final}}\! = \!1540.2706 - 7.2833Q( p ) - 8.1902Q{( p )^2} - 7.5921Q{( p )^3} - 13.0894Q{( p )^4}\\ - 6.5321Q{( p )^5} \)

When m = 15, \( Q{( p )_{final}}\! = \!1545.0482 - 7.2877Q( p ) - 8.1951Q{( p )^2} - 7.5966Q{( p )^3} - 13.0974Q{( p )^4}\\ - 6.5361Q{( p )^5} \)

When m = 16, \( Q{( p )_{final}}\! = \!1549.7600 - 7.2920Q( p ) - 8.2001Q{( p )^2} - 7.6012Q{( p )^3} - 13.1055Q{( p )^4}\\ - 6.5399Q{( p )^5} \)

When m = 17, \( Q{( p )_{final}}\! = \!1554.4109 - 7.2968Q( p ) - 8.2054Q{( p )^2} - 7.6062Q{( p )^3} - 13.1139Q{( p )^4}\\ - 6.5442Q{( p )^5} \)

When m = 18, \( Q{( p )_{final}}\! = \!1559.0044 - 7.3013Q( p ) - 8.2105Q{( p )^2} - 7.6108Q{( p )^3} - 13.1221Q{( p )^4}\\ - 6.5482Q{( p )^5} \)

When m = 19, \( Q{( p )_{final}}\! = \!1563.5443 - 7.3059Q( p ) - 8.2156Q{( p )^2} - 7.6156Q{( p )^3} - 13.1301Q{( p )^4}\\ - 6.5524Q{( p )^5} \)

When m = 20, \( Q{( p )_{final}}\! = \!1568.0337 - 7.3103Q( p ) - 8.2206Q{( p )^2} - 7.6202Q{( p )^3} - 13.1383Q{( p )^4}\\ - 6.5563Q{( p )^5} \)

The closed-form expressions can provide excellent approximations of the RStudio software values within the interval \( [1 - {10^{ - 318}}, 1 - {10^{ - 320}}] \), and we have presented the plots for \( {\chi ^2}( 1 ) \), \( {\chi ^2}( 5 ) \), \( {\chi ^2}( {10} ) \), and \( {\chi ^2}( {20} ) \) below.


REFERENCES

[1] Abbasi A., Sarker S., and Chiang R. H. 2016. Big data research in information systems: Toward an inclusive research agenda. Journal of the Association for Information Systems 17, 2 (2016), 132.
[2] Agarwal R. and Dhar V. 2014. Big data, data science, and analytics: The opportunity and challenge for IS research. Information Systems Research 25, 3 (2014), 443–448.
[3] Agarwal A., Hosanagar K., and Smith M. D. 2015. Do organic results help or hurt sponsored search performance? Information Systems Research 26, 4 (2015), 695–713.
[4] Altman N. and Krzywinski M. 2016. P values and the search for significance. Nature Methods 14, 1 (2016), 3–4.
[5] Altman N. and Krzywinski M. 2017. Interpreting p values. Nature Methods 14, 3 (2017), 213–215.
[6] Amrhein V., Greenland S., and McShane B. 2019. Scientists rise up against statistical significance. Nature 567, 7748 (2019), 305–307.
[7] Ansari A. and Mela C. F. 2003. E-customization. Journal of Marketing Research 40, 2 (2003), 131–145.
[8] Bamber D. 1975. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology 12, 4 (1975), 387–415.
[9] Benjamin D. J., Berger J. O., Johannesson M., Nosek B. A., Wagenmakers E., Berk R., and Camerer C. 2018. Redefine statistical significance. Nature Human Behaviour 2, 1 (2018), 6.
[10] Brodeur A., Cook N., and Wright T. 2021. On the effects of COVID-19 safer-at-home policies on social distancing, car crashes and pollution. Journal of Environmental Economics and Management 106 (2021), 102427.
[11] Cai Y., Zhou Y., and Scott N. 2019. Price determinants of Airbnb listings: Evidence from Hong Kong. Tourism Analysis 24, 2 (2019), 227–242.
[12] Chen D. Q., Preston D. S., and Swink M. 2015. How the use of big data analytics affects value creation in supply chain management. Journal of Management Information Systems 32, 4 (2015), 4–39.
[13] Cohen J. E. 1988. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, Hillsdale, NJ.
[14] Crane H. 2017. Why ‘redefining statistical significance’ will not improve reproducibility and could make the replication crisis worse. SSRN.
[15] Diamond G. A. and Forrester J. S. 1983. Metadiagnosis: An epistemologic model of clinical judgment. The American Journal of Medicine 75, 1 (1983), 129–137.
[16] Feinstein A. R. 1977. Clinical biostatistics. Clinical Pharmacology & Therapeutics 22, 4 (1977), 485–498.
[17] Fisher R. A. 1925. Statistical Methods for Research Workers. Genesis Publishing Pvt Ltd.
[18] Fritz C. O., Morris P. E., and Richler J. J. 2012. Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General 141, 1 (2012), 2.
[19] Gefen D. 2000. E-commerce: The role of familiarity and trust. Omega 28, 6 (2000), 725–737.
[20] Good I. J. 1967. Contributions to the discussion of a paper by F. J. Anscombe. Journal of the Royal Statistical Society, Series B 29, 1 (1967), 39–42.
[21] Goodman S. N. 1999a. Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine 130, 12 (1999), 995–1004.
[22] Goodman S. N. 1999b. Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine 130, 12 (1999), 1005–1013.
[23] Goodman S. N. 2016. Aligning statistical and scientific reasoning. Science 352, 6290 (2016), 1180–1181.
[24] Grover V., Chiang R. H., Liang T. P., and Zhang D. 2018. Creating strategic business value from big data analytics: A research framework. Journal of Management Information Systems 35, 2 (2018), 388–423.
[25] Halsey L. G., Curran-Everett D., Vowler S. L., and Drummond G. B. 2015. The fickle P value generates irreproducible results. Nature Methods 12, 3 (2015), 179.
[26] Hanley J. A. and McNeil B. J. 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 1 (1982), 29–36.
[27] Hastie T., Tibshirani R., and Friedman J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.
[28] Helmer O. and Rescher N. 1959. On the epistemology of the inexact sciences. Management Science 6, 1 (1959), 25–52.
[29] Huelsenbeck J. P. and Crandall K. A. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annual Review of Ecology and Systematics 28, 1 (1997), 437–466.
[30] Kass R. E. and Raftery A. E. 1993. Bayes Factors and Model Uncertainty. Technical Report TR-254. University of Washington, Seattle, WA.
[31] Kebede Y. A. 2020. A mixed-method proposal for traffic hotspots mapping in African cities using raw satellite imagery. International Journal of Engineering Research & Technology 9, 10 (2020), 806–811.
[32] Kim J. H., Ahmed K., and Ji P. I. 2018. Significance testing in accounting research: A critical evaluation based on evidence. Abacus 54, 4 (2018), 524–546.
[33] Kim J. H. and Ji P. I. 2015. Significance testing in empirical finance: A critical review and assessment. Journal of Empirical Finance 34, December (2015), 1–14.
[34] Kitchens B., Dobolyi D., Li J., and Abbasi A. 2018. Advanced customer analytics: Strategic value through integration of relationship-oriented big data. Journal of Management Information Systems 35, 2 (2018), 540–574.
[35] Konishi S. and Kitagawa G. 2008. Information Criteria and Statistical Modelling. Springer Science & Business Media.
[36] Kyriacou D. N. 2016. The enduring evolution of the P value. JAMA 315, 11 (2016), 1113–1115.
[37] Lazzeroni L. C., Lu Y., and Belitskaya-Lévy I. 2016. Solutions for quantifying P-value uncertainty and replication power. Nature Methods 13, 2 (2016), 107.
[38] Lehrer C., Wieneke A., Vom Brocke J., Jung R., and Seidel S. 2018. How big data analytics enables service innovation: Materiality, affordance, and the individualization of service. Journal of Management Information Systems 35, 2 (2018), 424–460.
[39] Lin M., Lucas H. C., and Shmueli G. 2013. Research commentary—too big to fail: Large samples and the p-value problem. Information Systems Research 24, 4 (2013), 906–917.
[40] Moosavi S., Samavatian M. H., Parthasarathy S., Teodorescu R., and Ramnath R. 2019. Accident risk prediction based on heterogeneous sparse data: New dataset and insights. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 33–42.
[41] Moosavi S., Samavatian M. H., Parthasarathy S., and Ramnath R. 2019. A countrywide traffic accident dataset. arXiv:1906.05409. Retrieved from https://arxiv.org/abs/1906.05409.
[42] Müller O., Fay M., and vom Brocke J. 2018. The effect of big data and analytics on firm performance: An econometric analysis considering industry characteristics. Journal of Management Information Systems 35, 2 (2018), 488–509.
[43] Nuzzo R. 2014. Scientific method: Statistical errors. Nature News 506, 7487 (2014), 150.
[44] Nakagawa S. and Cuthill I. C. 2007. Effect size, confidence interval and statistical significance: A practical guide for biologists. Biological Reviews 82, 4 (2007), 591–605.
[45] Okagbue H. I., Adamu M. O., and Anake T. A. 2020. Closed-form expressions for the quantile function of the chi square distribution using the hybrid of quantile mechanics and spline interpolation. Wireless Personal Communications 115, 3 (2020), 2093–2112.
[46] Parra C., Ponce C., and Rodrigo S. F. 2020. Evaluating the performance of explainable machine learning models in traffic accidents prediction in California. In Proceedings of the 2020 39th International Conference of the Chilean Computer Science Society. IEEE, 1–8.
[47] Raudenbush S. W., Becker B. J., and Kalaian H. 1988. Modeling multivariate effect sizes. Psychological Bulletin 103, 1 (1988), 111.
[48] Sahni N. S., Zou D., and Chintagunta P. K. 2016. Do targeted discount offers serve as advertising? Evidence from 70 field experiments. Management Science 63, 8 (2016), 2688–2705.
[49] Singh P. V., Sahoo N., and Mukhopadhyay T. 2014. How to attract and retain readers in enterprise blogging? Information Systems Research 25, 1 (2014), 35–52.
[50] Sgouropoulos N., Yao Q., and Yastremiz C. 2015. Matching a distribution by matching quantiles estimation. Journal of the American Statistical Association 110, 510 (2015), 742–759.
[51] Shmueli G. 2010. To explain or to predict? Statistical Science 25, 3 (2010), 289–310.
[52] Söderlund M. 2002. Customer familiarity and its effects on satisfaction and behavioral intentions. Psychology & Marketing 19, 10 (2002), 861–879.
[53] Wang D. and Nicolau J. L. 2017. Price determinants of sharing economy-based accommodation rental: A study of listings from 33 cities on Airbnb.com. International Journal of Hospitality Management 62, April (2017), 120–131.
[54] Wasserstein R. L. and Lazar N. A. 2016. The ASA's statement on p-values: Context, process, and purpose. The American Statistician 70, 2 (2016), 129–133.
[55] Wasserstein R. L., Schirm A. L., and Lazar N. A. 2019. Moving to a world beyond ‘p < 0.05’. The American Statistician 73, sup1 (2019), 1–19.
[56] Wu X. and Qiu J. 2019. A study of Airbnb listing price determinants: Based on data from 36 cities in China. Tourism Tribune 34, 4 (2019), 13–28.
[57] Xiao B. and Benbasat I. 2015. Designing warning messages for detecting biased online product recommendations: An empirical investigation. Information Systems Research 26, 4 (2015), 793–811.
[58] Zantedeschi D., Feit E. M., and Bradlow E. T. 2017. Measuring multichannel advertising response. Management Science 63, 8 (2017), 2706–2728.
