Setting priorities for data accuracy improvements in satisficing decision-making scenarios: A guiding theory

https://doi.org/10.1016/j.dss.2009.11.001

Abstract

This study introduces a mathematical–statistical theory that illustrates the effect of input errors on the accuracy of dichotomous decisions which are implemented through logical conjunction and disjunction of selected criteria. Decision-making instances in this category are often labeled “satisficing.” Mainly, our theory provides criteria for ranking the effect of errors in different inputs on decision accuracy. This ranking can be used to improve the efficiency and effectiveness of resource allocation decisions in data quality management settings. All other things being equal, inputs in which errors exhibit a higher negative effect on the output would naturally earn higher priority.

Introduction

The relationship between data accuracy and the resulting information accuracy is of great interest in numerous problem domains. This relationship has been investigated in many research fields, assuming various information-processing models and data and error characteristics, as well as an assortment of accuracy measures. Some of these research fields are statistics, computer science, the physical sciences, political science, decision sciences, econometric forecasting, accounting, and information systems (e.g., [2], [3], [4], [5], [7], [8], [9], [11], [12], [13], [14], [15], [16], [17], [18], [23], [24], [25], [29], [30], [31], [32], [36], [37], [38], [43], [44], [49], [52], [55]). An understanding of the relationship between input accuracy and output accuracy can improve the efficiency of data management and increase the accuracy and utility of information in problem-solving settings. Nonetheless, our understanding of that relationship is still partial.

This work adds to the literature about the association between input accuracy and output accuracy. An underlying observation that motivates this study originates in models that track error propagation. Such models have been introduced in the statistics literature and elsewhere (e.g., [14], [24]) and have proven to be useful in many areas. In Management Information Systems (MIS), in particular, there are various frameworks that have been developed for tracking data errors and other data quality determinants through an information system (e.g., [7], [11], [49]). These models imply that, for any given application, the effect of input errors on output accuracy varies in magnitude, depending on the choice of the specific input. While errors in one input may have a dramatic effect on output accuracy, a comparable, or even higher, error rate in another input can have a negligible effect. It has been shown that the severity of the effect of errors is influenced by the nature of the manipulations that the data undergo [8]. Notably, characterization of this variation can be useful since it can guide resource allocation decisions in data quality management settings. All other things being equal, inputs in which errors exhibit a higher negative effect on the output would naturally earn higher priority.

Accordingly, this analytical inquiry sheds light on the variation in the effect of input errors on output accuracy in a popular class of applications. These applications consist of dichotomous decisions which are implemented through logical conjunction and disjunction of selected criteria. Decision-making instances in this category are often labeled “satisficing.” The term “satisficing” was coined by Herbert Simon to denote problem-solving and decision-making that aims at satisfying a chosen aspiration level instead of an optimal solution [50]. Research indicates that conjunctive and disjunctive rules agree with human choices and inferences in diverse situations involving complex problems, severe time constraints, or lack of information. Evidence in this direction has been found in consumer choice settings, medical diagnosis, job preference decisions, university admission decisions, residential rental searches, political leaders' decision-making, and in many other domains (e.g., [19], [20], [21], [34], [35], [41], [45], [46]).

Consider, for example, a company decision regarding the rental of a new office. Suppose that a satisficing decision strategy is employed either throughout the entire selection process or, owing to the high number of alternatives, only for the initial screening of alternatives (e.g., [34], [45]). Suppose that five of the major decision variables are rental rate, square footage, age of the office building, availability of parking spaces, and distance from a public transportation hub [40], [51]. Specifically, the decision rule that combines these variables is the following: the age of the office building must not exceed ten years, and the desired office space should be in the range of 3500–5000 ft², and the monthly rental rate should be $10,000 at most, and, in addition, the office building should have its own covered parking or the building must be located within half a mile of a public transportation hub (see Fig. 1). This decision rule can be implemented as follows: the value of each variable is tested against the matching criterion, namely, a subset of the value domain of the variable (i.e., the rental rate of each of the properties in the local commercial rentals database is tested against $10,000, the age of each property is tested against 10, and so on). These tests determine the values of corresponding dichotomous variables (e.g., a test of a rental rate of $7,500 would produce the value "true" and a test of a rate of $12,500 would produce the value "false"). Next, the values of the different dichotomous variables which are determined in this way are combined using a suitable sequence of conjunction and disjunction operations to produce the outcome of the decision. The outcome can be either positive or negative (e.g., a rental property that satisfies the entire decision rule would result in a positive decision).
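To make the rule concrete, the following short Python sketch (written for this discussion, not taken from the paper; the attribute names and the sample property record are illustrative assumptions) expresses the decision rule as a conjunction and disjunction of dichotomous criteria:

# Illustrative sketch: the office-rental satisficing rule expressed as a
# conjunction/disjunction of dichotomous (true/false) criteria.

def rental_decision(prop: dict) -> bool:
    """Return True (positive decision) if the property satisfies the rule."""
    # Dichotomous variables: each test maps a raw value to true/false.
    age_ok = prop["age_years"] <= 10
    size_ok = 3500 <= prop["square_feet"] <= 5000
    rate_ok = prop["monthly_rent"] <= 10_000
    parking_ok = prop["covered_parking"]
    transit_ok = prop["miles_to_transit"] <= 0.5

    # Conjunction of the first three criteria, combined with a disjunction
    # of the parking and transit criteria.
    return age_ok and size_ok and rate_ok and (parking_ok or transit_ok)

# Example: a property that satisfies every criterion except covered parking
# still yields a positive decision because it is close to a transit hub.
example = {"age_years": 7, "square_feet": 4200, "monthly_rent": 9500,
           "covered_parking": False, "miles_to_transit": 0.3}
print(rental_decision(example))  # True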

Our inquiry centers on multi-criteria, satisficing decisions like this office rental decision. Mainly, we examine how errors in classifying data values as fulfilling or not fulfilling a relevant decision criterion affect the accuracy of the decision. In other words, we study the effects of errors in the individual dichotomous variables portrayed above on the accuracy of the decision. A decision error is registered whenever a decision based on the available inputs deviates from the outcome of the same decision based on error-free inputs. A decision error can be a Type 1 error (false positive), e.g., when an office that does not satisfy the criteria is included in the short list of suitable properties, or a Type 2 error (false negative), e.g., when an office that has the desired attributes is excluded from that list. Under this interpretation of the notion of a decision error, decision accuracy is measured by decision error probability. Specifically, this work offers tools for ranking the damage that the above-described input classification errors inflict on decision accuracy. Our study defines the notion of damage and produces guidelines for ranking it. These guidelines can be used to improve the efficiency and effectiveness of resource allocation decisions in data quality management settings. The guidelines are most useful when the major application that processes the data is equivalent to a set of satisficing decision rules. For instance, the company that plans to rent a new office can benefit from our theory to the extent that management has influence over the accuracy of the data it uses. Alternatively, the providers of those data can benefit from information about the average damage that errors in each input induce over the entire collection of decisions that use those data. All other things being equal, inputs (e.g., database attributes) in which errors inflict greater damage would naturally earn higher priority.
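The notion of decision error probability can be illustrated with a small Monte Carlo sketch (an assumption-laden illustration, not the paper's analytical model): for a conjunctive rule with two inputs, hypothetical criterion-satisfaction probabilities p_u and p_v, and hypothetical classification error rates e_u and e_v, it estimates the Type 1 and Type 2 decision error rates.

# Illustrative sketch (assumed probabilities, not from the paper): estimating the
# decision error probability of a simple two-input conjunctive rule by simulation.
import random

def simulate(n=200_000, p_u=0.6, p_v=0.4, e_u=0.05, e_v=0.10, seed=1):
    """p_u, p_v: probabilities that the inputs truly satisfy their criteria;
    e_u, e_v: probabilities that each classification is recorded incorrectly."""
    rng = random.Random(seed)
    type1 = type2 = 0  # false positives / false negatives of the decision
    for _ in range(n):
        u, v = rng.random() < p_u, rng.random() < p_v      # correct classifications
        u_a = u if rng.random() >= e_u else not u          # observed (possibly erroneous)
        v_a = v if rng.random() >= e_v else not v
        correct, observed = (u and v), (u_a and v_a)       # conjunctive decision
        if observed and not correct:
            type1 += 1
        elif correct and not observed:
            type2 += 1
    return type1 / n, type2 / n

p1, p2 = simulate()
print(f"Type 1 (false positive) rate: {p1:.4f}")
print(f"Type 2 (false negative) rate: {p2:.4f}")
print(f"Total decision error probability: {p1 + p2:.4f}")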

This paper is organized as follows. A review of related literature is given in Section 2. It is followed in Section 3 by a description of the basic research models, which designate a single binary conjunction operation and a single binary disjunction operation and formulate the damage for these operations. We then define the notion of error dominance and apply it to characterize conditions in which the damage of errors in one input is greater than the damage caused by errors in a second input (Section 4). An illustration of these conditions, which draws on the example of the office rental decision, is provided in Section 5. An extension of the theory to a sequence of binary operations is presented in Section 6. The paper concludes with a discussion of the implications of the theory for data quality management, limitations of this research, and future research directions.

Section snippets

Literature review

Research on the relationship between input accuracy and output accuracy has a long history. In particular, the classic Jury Theorem [16], which is viewed as theoretical support for group decision-making and democracy, dates back to the late eighteenth century. Condorcet's Jury Theorem asserts that if group members are independent, and if each member judges correctly between a pair of alternatives with probability p > 0.5, and if members' judgments are combined using a majority vote rule, then the

Model

Our investigation of the effect of errors on the outcomes of disjunctive and conjunctive decisions exploits statistical properties of random variables, primarily their expected value. We first examine single binary logical OR and AND operations — each of these operations employs two inputs. This section formulates the damage that is attributed to errors in a given input. The variables in use by our model are listed and defined below:

  • U, V: The correct input classifications as fulfilling or not

Error dominance

Consider a decision rule that is implemented by a single Boolean binary OR or AND operation, where ISA holds true. By applying Eqs. (7) and (8) or Eqs. (12) and (13) we can calculate the damage of errors in each of the observed inputs, i.e., Ua and Va. Subsequently, these equations can be used for ranking the observed inputs of the decision rule according to the damage that they inflict on the output of the decision. In the following sections we will employ our model to discover general

Conditions of dominance under logical conjunction

Consider a company decision regarding the rental of a new office as described in the introduction section. We will illustrate the theoretical results pertaining to a conjunctive decision rule and, later on, the results pertaining to a disjunctive decision rule, using this scenario.

Suppose that the rental decision utilizes the following simple conjunctive rule: the age of the office building must not exceed ten years and the office space has to be in the range of 3500–5000 ft². Assume a selection

Extension of the error dominance theory: multiple operations

The conclusions of this study can be extended to decision rules involving N ≥ 2 binary operations. Given N + 1 inputs, suppose that two of the inputs have the characteristic that one dominates the other under a binary operation that combines them. This section proves that, for the most part, such dominance is preserved throughout successive applications of any mixture of binary operations. Proposition 3 specifies a variety of sufficient conditions in which dominance in a single operation is

Implications for data quality management

Studies indicate that the use of satisficing choice strategies is widespread, regardless of whether such a strategy is applied for screening out inferior choices or for singling out the best choice [45]. An important lesson of this study is that when a decision is obtained through a satisficing rule, the inputs that exhibit the highest error rates are often not the inputs that are most damaging to the outcome of the decision.
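To see why, consider a purely illustrative conjunctive example (the probabilities below are assumptions made for this discussion, not results from the paper). Suppose the observed classification of input U is wrong with probability 0.10 while that of input V is wrong with probability 0.03, and suppose that P(V = 1) = 0.05 and P(U = 1) = 0.60. If only one input is misclassified at a time and the inputs are independent, an error in U changes the conjunctive outcome only when V = 1, contributing roughly 0.10 × 0.05 = 0.005 to the decision error probability, whereas an error in V changes the outcome only when U = 1, contributing roughly 0.03 × 0.60 = 0.018. The less error-prone input, V, is thus the more damaging one and would earn the higher data quality priority.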

Our error dominance theory implies a small set of simple guidelines for

Limitations and future research directions

The findings of this research imply that errors should not all be treated equally. Of course, if resources were unlimited, then ranking the effect of errors would be immaterial. Since resources are indeed often limited, the ability to set priorities while taking into account the intended use of the data can be valuable.

A significant limitation of our model is its statistical independence assumption (SIA) and the related assumptions of Proposition 3. Regrettably, statistical dependencies may be

Irit Askira Gelman is a visiting scholar at the University of Arizona. She received her B.Sc. in Mathematics from Tel Aviv University, Israel, M.Sc. in Information Systems from Tel Aviv University, and Ph.D. in Management Information Systems from the University of Arizona. She is a member of IEEE and AIS. Her research interests include data and information quality, knowledge discovery and data mining, and model management systems.

References (55)

  • D.J. Aigner, Regression with a binary independent variable subject to errors of observation, Journal of Econometrics (1973)
  • I. Askira Gelman, GIGO or not GIGO: The Accuracy of Multi-Criteria, Satisficing Decisions. ACM Journal of Data and...
  • I. Askira Gelman, A model of error propagation in satisficing decisions and its application to database quality management
  • I. Askira Gelman, Simulations of error propagation for prioritizing data accuracy improvements in multi-criteria satisficing decision making scenarios
  • A. Avenali et al., Brokering infrastructure for minimum cost data procurement based on quality–quantity models, Decision Support Systems (2008)
  • D. Avison et al., Information systems development: methodologies, techniques and tools (2008)
  • D.P. Ballou et al., Modeling data and process quality in multi-input, multi-output information systems, Management Science (1985)
  • D.P. Ballou et al., Implications of data quality for spreadsheet analysis, DATA BASE (1987)
  • D.P. Ballou et al., A framework for the analysis of error in conjunctive, multi-criteria, satisficing decision processes, Decision Sciences (1990)
  • D.P. Ballou et al., Methodology for allocating resources for data quality enhancement, Communications of the ACM (1989)
  • D.P. Ballou et al., Modeling information manufacturing systems to determine information product quality, Management Science (1998)
  • D.P. Ballou et al., Sample-based quality estimation of query results in relational database environments, IEEE Transactions on Knowledge and Data Engineering (2006)
  • T.L. Barabash, On properties of symbol recognition, Engineering Cybernetics (Sept./Oct. 1965)
  • P.R. Bevington, Data Reduction and Error Analysis for the Physical Sciences, Ch. 4 (1969)
  • R.T. Clemen et al., Limits for the precision and value of information from dependent sources, Operations Research (1985)
  • Nicolas Caritat de Condorcet, Essai sur l'application de l'analyse a la probabilité des décision rendues à la pluralité...
  • T. Cover, The best two independent measurements are not the two best, IEEE Transactions on Systems, Man and Cybernetics (1974)
  • B.E. Cushing, A mathematical approach to the analysis and design of internal control systems, Accounting Review (1974)
  • H.J. Einhorn, The use of nonlinear, noncompensatory models in decision making, Psychological Bulletin (1970)
  • H.J. Einhorn, The use of nonlinear, noncompensatory models as a function of task and amount of information, Organizational Behavior and Human Performance (1971)
  • H.J. Einhorn, Expert measurement and mechanical combination, Organizational Behavior and Human Performance (1972)
  • A. Even et al., Economics driven data management: an application to the design of tabular data sets, IEEE Transactions on Knowledge and Data Engineering (2007)
  • A.G. Frantsuz, Influence of correlations between attributes on their informativeness for pattern recognition, Engineering Cybernetics (July–Aug. 1967)
  • L.A. Goodman, On the exact variance of products, Journal of the American Statistical Association (1960)
  • B. Grofman et al., Thirteen theorems in search of the truth, Theory and Decision (1983)
  • J. Hipp et al., Data quality mining: making a virtue of necessity
  • M. Janson, Data quality: the Achilles heel of end-user computing, Omega: International Journal of Management Science (1988)