Elsevier

Information Systems

Volume 35, Issue 3, May 2010, Pages 352-374

Empirical evidence for the usefulness of Armstrong relations in the acquisition of meaningful functional dependencies

https://doi.org/10.1016/j.is.2009.11.002

Abstract

Armstrong relations satisfy precisely those data dependencies that are implied by a given set of data dependencies. A common perception is that Armstrong relations are useful in the acquisition of data semantics, in particular since errors during the requirements elicitation have the most expensive consequences.

We report on some first empirical evidence for this perception regarding the class of functional dependencies (FDs). For this purpose, we investigate the usefulness of Armstrong relations with respect to various measures. Soundness measures how many of the FDs perceived as meaningful are actually meaningful. Completeness measures how many of the actually meaningful FDs are also perceived as meaningful.

Our experiment determines what and how much design teams learn about the application domain in addition to what they know prior to using Armstrong relations. The data analysis suggests that using Armstrong relations does not make it more likely to recognize meaningless FDs that are incorrectly perceived as meaningful, but does make it more likely to recognize meaningful FDs that are incorrectly perceived as meaningless.

Our measures assess the quality of an FD set with respect to a target FD set, and therefore qualify naturally for the use in automated assessment tools, e.g. for database course exams or assignments.

Introduction

Armstrong relations are of interest in database theory and practice. Let Σ ∪ {φ} denote a set of functional dependencies (FDs). We say that Σ implies φ if every relation that satisfies every FD in Σ also satisfies φ. That is, there is no counterexample relation that satisfies all FDs in Σ and violates φ. We write Σ ⊨ φ to denote that Σ implies φ (and Σ ⊭ φ to denote that Σ does not imply φ). For a set Σ of FDs, let Σ* denote the set of all FDs implied by Σ. For every FD φ that is not in Σ*, there is a counterexample relation r_φ that satisfies all FDs in Σ and violates φ. As a consequence of a result by Armstrong [1], there is a single counterexample relation that satisfies all FDs in Σ* and violates all FDs not in Σ*. Following common terminology we call such a relation an Armstrong relation for Σ. The following example illustrates the potential benefits of utilizing Armstrong relations for the acquisition of meaningful FDs.
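The implication test defined above can be decided with the standard attribute-closure algorithm: Σ implies X → Y exactly when Y is contained in the closure of X under Σ. The following is a minimal sketch, not code from the paper; the helper names `fd`, `closure` and `implies` are illustrative assumptions, and the sample FDs are the three that the design team identifies in the ORDER example of this introduction.

```python
from typing import FrozenSet, Iterable, Tuple

# An FD is a pair (left-hand side, right-hand side) of attribute sets.
FD = Tuple[FrozenSet[str], FrozenSet[str]]


def fd(lhs: str, rhs: str) -> FD:
    """Build an FD from comma-separated attribute names (hypothetical helper)."""
    def split(s: str) -> FrozenSet[str]:
        return frozenset(a.strip() for a in s.split(",") if a.strip())
    return split(lhs), split(rhs)


def closure(attrs: Iterable[str], sigma: Iterable[FD]) -> FrozenSet[str]:
    """Return every attribute functionally determined by attrs under sigma."""
    result = set(attrs)
    fds = list(sigma)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD whose left-hand side is already covered.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def implies(sigma: Iterable[FD], phi: FD) -> bool:
    """Decide whether sigma implies phi via the closure of phi's left-hand side."""
    lhs, rhs = phi
    return rhs <= closure(lhs, sigma)


# The three FDs from the ORDER example.
sigma = [
    fd("Order#, Product#", "Qty"),
    fd("Order#, Product#", "Total"),
    fd("Product#", "Description"),
]

key_fd = fd("Order#, Product#", "Description, Qty, Total")
print(implies(sigma, key_fd))                # the candidate-key FD is implied
print(implies(sigma, fd("Order#", "Qty")))   # Order# alone does not determine Qty
```

The closure-based test avoids enumerating Σ* explicitly, which can be exponentially large in the number of attributes.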

Let us assume that in developing an information system for some manufacturer of electrical goods we identify the processing of orders by retail sellers as a domain of interest. In particular, we define the relation schema ORDER that consists of the attributes Order#, Product#, Description, Qty and Total. These show for an order (identified by its order number Order#), a product in that order (identified by its unique product number Product#), a description Description of that product, the quantity Qty of that product in that order, and the total value Total (in some fixed currency) of that product in that order.

Suppose the designers of our information system have not been able yet to identify any meaningful FDs for the schema ORDER, i.e., Σ = ∅. Therefore, they decide to inspect a relation that faithfully represents the initial design draft of an empty FD set. The relation they decide to examine is the one in Table 1. This relation is Armstrong for the empty FD set Σ.

By inspecting the Armstrong relation the designers simply notice that the Oven with Product# 521 is associated with the different quantities of 10 and 20 in the order with Order# 00724. This observation causes the design team to specify the FD Order#, Product# → Qty, which states that the schema ORDER records a unique quantity for the same product in the same order. A similar observation causes the design team to specify the FD Order#, Product# → Total, which states that the order number and the product number together uniquely determine the total of the product in the order. Moreover, the design team observes that the product with Product# 521 has two different descriptions, Microwave and Oven. This observation causes the design team to ask the domain experts whether different descriptions can be given to any product. Since the experts agree that this cannot be the case, the design team responds by specifying the FD Product# → Description, which states that the description of a product is uniquely determined by the product number. We can see that, by inspecting the Armstrong relation above, the designers have successfully identified three meaningful FDs for the application domain. Furthermore, these three FDs together imply the FD Order#, Product# → Description, Qty, Total. Therefore, the design team recommends the attribute set {Order#, Product#} as a candidate key for the schema ORDER.
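The inspection described in this example amounts to spotting a pair of tuples that agree on the left-hand side of an FD but differ on its right-hand side. A minimal sketch of that check follows. The rows are hypothetical and only mimic the observations described in the text; they are not the contents of Table 1 (in particular the Total values and the second order number are invented), and the function name `violations` is an assumption made for illustration.

```python
from itertools import combinations


def violations(relation, lhs, rhs):
    """Return all pairs of tuples that agree on lhs but differ on some rhs attribute."""
    return [
        (t, u)
        for t, u in combinations(relation, 2)
        if all(t[a] == u[a] for a in lhs) and any(t[a] != u[a] for a in rhs)
    ]


# Hypothetical ORDER rows (NOT Table 1 of the paper).
ORDER = [
    {"Order#": "00724", "Product#": "521", "Description": "Oven", "Qty": 10, "Total": 4000},
    {"Order#": "00724", "Product#": "521", "Description": "Oven", "Qty": 20, "Total": 8000},
    {"Order#": "00725", "Product#": "521", "Description": "Microwave", "Qty": 10, "Total": 4000},
]

# Each witness pair is a concrete question for the domain experts.
for lhs, rhs in [
    (["Order#", "Product#"], ["Qty"]),
    (["Order#", "Product#"], ["Total"]),
    (["Product#"], ["Description"]),
]:
    witnesses = violations(ORDER, lhs, rhs)
    print(f"{', '.join(lhs)} -> {', '.join(rhs)}: {len(witnesses)} violating pair(s)")
```

A single witness pair is enough to show that an FD is violated, which is why a violation is comparatively easy for a human reader to notice.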

In general, a relation that satisfies an FD set Σ but which is not Armstrong for Σ will satisfy some FD that is not in Σ*. Therefore, relations that are not Armstrong for a given FD set may not be able to reveal problems with the current design. For example, the relation in Table 2 is not Armstrong for the empty FD set Σ. While this relation satisfies Σ (as every other relation does in this case), it gives the false impression that the current design, i.e. Σ = ∅, is acceptable. Specifically, the relation is not a faithful representation of the FD set Σ. For example, the relation does not violate the FD Order#, Product# → Qty, nor the FD Order#, Product# → Total, nor does it violate the FD Product# → Description, even though they are not in Σ*. Intuitively, an inspection of the relation in Table 2 seems neither to encourage a design team to specify the FDs Order#, Product# → Qty and Order#, Product# → Total, nor to encourage the team to ask the domain experts whether different descriptions can be associated with the same product number.
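Whether a given relation is Armstrong for Σ can be checked directly from the definition on small schemas: for every attribute set X and attribute A, the relation must satisfy X → A exactly when Σ implies X → A. The sketch below is a brute-force illustration under that reading, exponential in the number of attributes and not an algorithm taken from the paper. The names `closure`, `satisfies` and `is_armstrong` are assumptions, as is the two-attribute toy schema {A, B}.

```python
from itertools import combinations


def closure(attrs, sigma):
    """Attribute closure of attrs under sigma, a list of (lhs, rhs) attribute tuples."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in sigma:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def satisfies(relation, lhs, attr):
    """True iff all tuple pairs that agree on lhs also agree on attr."""
    return all(
        t[attr] == u[attr]
        for t, u in combinations(relation, 2)
        if all(t[a] == u[a] for a in lhs)
    )


def is_armstrong(relation, schema, sigma):
    """True iff relation satisfies X -> A exactly when sigma implies X -> A."""
    for size in range(len(schema) + 1):
        for lhs in combinations(schema, size):
            for attr in schema:
                implied = attr in closure(lhs, sigma)
                if satisfies(relation, lhs, attr) != implied:
                    return False
    return True


schema = ("A", "B")
faithful = [{"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 0}]
misleading = [{"A": 0, "B": 0}, {"A": 1, "B": 1}]

print(is_armstrong(faithful, schema, []))    # violates every non-trivial FD
print(is_armstrong(misleading, schema, []))  # accidentally satisfies A -> B and B -> A
```

Checking single-attribute right-hand sides suffices, because a relation satisfies X → Y exactly when it satisfies X → A for every A in Y.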

This simple example illustrates the potential benefit of using Armstrong relations in the process of identifying the complete set of FDs that are meaningful for the underlying application domain. Failure to identify such a complete set means that the output of the requirements analysis is afflicted with errors.

Empirical studies show that more than half the errors which occur during systems development are requirements errors [2], [3], [4]. Requirements errors are also the most common cause of failure in systems development projects [2], [5], [6]. The cost of errors increases exponentially over the development life cycle: it is more than 100 times more costly to correct a defect post-implementation than it is to correct it during requirements analysis [7]. This suggests that it would be more effective to concentrate quality assurance efforts in the requirements analysis stage, in order to catch requirements errors as soon as they occur, or to prevent them from occurring altogether [8]. Hence, Armstrong relations appear to be a valuable tool for the requirements analysis of the target database. However, the question remains in what precise sense they are valuable.

Research gap and research questions: In previous work, Armstrong relations were called “user-friendly representations” of sets of data dependencies [9], and it was stated that they are “useful for database design” [9], [10]. However, the phrase “useful for database design” was exclusively justified in terms of the structural and algorithmic properties of Armstrong relations. For instance, this may refer to the fact that FDs enjoy Armstrong relations, i.e., for every set Σ of FDs there is an Armstrong relation for Σ. Note that it is anything but self-evident that a given class of data dependencies enjoys Armstrong relations [11]. Other interpretations of “useful” may refer to either the size of an Armstrong relation for an FD set Σ, e.g. the minimal number of tuples required for a relation to be Armstrong for Σ, or the existence/efficiency of algorithms to compute such an Armstrong relation. These interpretations of “useful” have received considerable interest from the research community, e.g. [9], [12], [13], [14].

The authors are unaware of any research that provides evidence for the perception of usefulness of Armstrong relations in the requirements elicitation phase, as observed in the examples above. Bisbal and Grimson [15, p. 451] state that Armstrong relations are “expected to expose missing or undesirable functional dependencies”. So far, however, there is no empirical evidence that Armstrong relations really do help design teams decide whether a functional dependency is meaningful or meaningless for the underlying application domain. In this paper, we address this research gap and seek answers to the following research questions:

  1. In what precise sense can Armstrong relations be “useful” for the acquisition of meaningful functional dependencies for a given application domain?

  2. Given a fixed and precise interpretation of the term “useful”, how “useful” are Armstrong relations for the acquisition of meaningful functional dependencies for a given application domain?

Note that question two subsumes the following question: given a fixed and precise interpretation of the term “useful”, are Armstrong relations “useful” for the acquisition of meaningful FDs for a given application domain? Throughout the paper, an FD is considered to be meaningful if every relation that represents a real-world instance over the given schema will satisfy the FD. This is different from an accidental FD [11], which is satisfied by some real-world instance (it is accidentally satisfied by this instance), but is violated by some other real-world instance. Therefore, the real problem for a database design team is the identification of meaningful FDs, and not the identification of accidental FDs. Note that an FD is perceived as meaningful for a relation schema by a design team that specified an FD set Σ if and only if the FD is satisfied by any Armstrong relation for Σ. Intuitively, this points to the “usefulness” of Armstrong relations for identifying meaningful FDs.

Research contributions: In order to address our research questions we ask how the use of Armstrong relations can potentially contribute to the quality of the set of functional dependencies that a design team perceives as meaningful. For our analysis we measure this quality relative to a target set Σt of functional dependencies that forms a cover of all functional dependencies established as meaningful after consulting a group of domain experts.

In order to say something about the usefulness of Armstrong relations we measure what and how much design teams learn about the application domain in addition to what they already know prior to using Armstrong relations. Therefore, we first measure the quality without the use of an Armstrong relation, and then measure the quality after the use of Armstrong relations. If the quality increases by using Armstrong relations, then the Armstrong relations were indeed useful. Moreover, to measure the increase in quality (a negative increase is a decrease) we only compare the results from the same design team. Note that throughout the experiments the domain experts were present to answer potential questions from the design teams. For phase 1, we asked each design team to specify the set Σ1 of FDs that they perceived as meaningful for the fixed underlying application domain, and then measured the quality of Σ1 relative to the target set Σt. For phase 2, we provided the same design team with an Armstrong relation for Σ1, and asked them to revise Σ1. After a number of repetitions of phase 2, the design team finalized their revised set Σ2, and we then measured the quality of Σ2 relative to the target set Σt. If, on average, the quality of Σ2 relative to Σt was better than the quality of Σ1 relative to Σt, then we concluded that Armstrong relations are useful for the acquisition of meaningful FDs with respect to the quality measure that we apply.

For our research questions we provide different measures of quality: proximity and minimality. Informally, proximity captures how close a set Σ is to the target set Σt, and minimality captures the level of non-redundancy in the representation of Σ. We further divide the measure of proximity into soundness and completeness. Soundness measures how many of the perceived meaningful FDs are actually meaningful, while completeness measures how many of the meaningful FDs were actually perceived as meaningful. Measures are defined with respect to the closure of attribute sets under implication. This guarantees that our measures are independent of the representation of the FD sets provided by the design teams. The usefulness of Armstrong relations is defined in terms of each of the measures and in two variations. For the first variation, we quantify usefulness as the arithmetic mean over the differences in quality (quality of Σ2i minus quality of Σ1i). Therefore, the first variation of usefulness measures the average gain in quality towards the target FD set. For the second variation, we quantify usefulness as the geometric mean over the ratios in quality (quality of Σ2i divided by the quality of Σ1i). Therefore, the second variation of usefulness measures the average growth in quality after using Armstrong relations.
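The two variations of usefulness described above reduce to simple aggregate formulas over per-team quality scores. The following minimal sketch assumes each team's quality before and after using Armstrong relations is available as a strictly positive number; the scores shown are hypothetical and are not data from the experiment, and the function names are illustrative assumptions.

```python
import math


def average_gain(before, after):
    """Arithmetic mean of the per-team differences (quality after minus quality before)."""
    diffs = [a - b for b, a in zip(before, after)]
    return sum(diffs) / len(diffs)


def average_growth(before, after):
    """Geometric mean of the per-team ratios (quality after divided by quality before).

    Assumes every score is strictly positive; a ratio is undefined otherwise.
    """
    ratios = [a / b for b, a in zip(before, after)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical completeness scores for three design teams.
quality_phase1 = [0.50, 0.80, 0.40]
quality_phase2 = [0.75, 0.80, 0.60]

print(f"average gain:   {average_gain(quality_phase1, quality_phase2):.3f}")
print(f"average growth: {average_growth(quality_phase1, quality_phase2):.3f}")
```

Under this reading, a gain is an absolute change in the score, whereas a growth factor above 1 indicates a relative improvement, so a team that starts from a low score can show a large growth with only a modest gain.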

According to our data analysis, we can report an average gain of 5% and an average growth of 7% in minimality after using Armstrong relations. The impact of Armstrong relations on the soundness of the FD sets is close to 0% in both gains and growths. By using Armstrong relations we gain on average 14% in terms of completeness and proximity on the target FD set, and the quality of the FD sets grows on average by 20% in completeness and proximity. A summary of the individual and average gains for each of the 20 design teams that participated in our project is illustrated in Fig. 1. Fig. 2 illustrates the individual and average growths in quality for the design teams.

Our first main result suggests that Armstrong relations help design teams to identify additional meaningful functional dependencies that were previously overlooked. In other words, by using an Armstrong relation it is more likely that meaningful FDs are recognized which were previously perceived as meaningless. The reason is that Armstrong relations violate FDs that are perceived as meaningless, and violations of actually meaningful FDs are likely to be recognized. In this sense, our first main result empirically confirms Bisbal and Grimson's expectation that Armstrong relations “expose missing functional dependencies”.

Our second main result suggests that Armstrong relations do not have an impact on the soundness of the FD sets (on average). In other words, by using an Armstrong relation it is not more likely to recognize an FD which is meaningless but which is perceived as meaningful. The reason is that Armstrong relations satisfy FDs that are perceived as meaningful, and to recognize the satisfaction of actually meaningless FDs appears to be too complex. In this sense, our second main result empirically refutes Bisbal and Grimson's expectation that Armstrong relations “expose undesirable functional dependencies”.

We believe that both of our results are intuitive. Indeed, it can be a complex process for a group of humans to recognize the satisfaction of a meaningless FD (all pairs of distinct tuples need to be examined), but it is a simpler process to recognize the violation of a meaningful FD (only the right pair of distinct tuples needs to be recognized).

Therefore, both results provide new insight into the usefulness of Armstrong relations in the requirements analysis phase of database design. This may help practitioners to use Armstrong relations more effectively in the requirements elicitation process, and motivate further research on this subject. Finally, we are also confident that the new measures of soundness, completeness and proximity can be helpful for the automated assessment and automated feedback of homework or exam questions for undergraduate database courses [16].

Organization: We summarize some of the relevant previous work in Section 2. This provides further motivation for our study. Preliminary definitions necessary to report on our findings are introduced in Section 3. The design and limitations of our experiment are described in Section 4. In Section 5 we introduce the measures to assess the quality of FD sets. A sample run that illustrates the process of our experiment and the application of our measures is given in Section 6. Using the sample run we also indicate how to use our measures to automate the assessment of assignment questions in database courses. The quantitative and qualitative data analyses are given in Sections 7 and 8, respectively. We conclude in Section 9, and discuss some future work in Section 10.

Related work and motivation

In this section we further motivate the significance of our research questions by highlighting the importance of functional dependencies, and summarizing previous work on Armstrong relations.

Functional dependencies constitute one of the most important classes of data dependencies. According to [17] they make up around two-thirds of all uni-relational data dependencies (dependencies defined over a single relation schema) in practice. They are essential in database modeling [18], [19], design and

Preliminary definitions

We use this section to summarize the basic notions required for our treatment of functional dependencies and Armstrong relations.

Design of the experiment

In this section we describe the stages of our experiment, the general process for measuring the different notions of quality, the application domain and target FD set we utilize, the design teams and domain experts, and the limitations of our experiment. The specific quality measures are introduced in Section 5.

Quality measures

In this section we introduce the formal definitions of our four quality measures. These include the soundness and completeness of an FD set relative to a target FD set as well as the proximity of two FD sets. Finally, we use the minimality of an FD set to measure the quality of the representation of an FD set. The measures of soundness, completeness and proximity have natural complements: unsoundness, incompleteness and distance, respectively. In particular, as we will see, distance defines a

Data collection—a sample

In this section we illustrate a sample run of our experiment based on the solutions that design team 3 provided.

Quantitative data analysis

In this section, we present and analyze the different ratios that the design teams achieved during our experiment.

Qualitative data analysis

In this section, we present and analyze the different quality measures that the design teams achieved during our experiment. We use the following abbreviations for the sake of clearer presentation: C for C_ID, L for L_Name, T for Time, and R for Room.

Conclusions

We have conducted a first empirical investigation about the usefulness of Armstrong relations for the acquisition of meaningful functional dependencies. Specifically, we introduced the three measures of soundness, completeness, and proximity to study the additional insights that one can obtain from the inspection of Armstrong relations. The first main result indicates that Armstrong relations are not useful in terms of soundness, i.e., in using Armstrong relations it is not more likely to

Future directions

There are various avenues that should be explored in future work.

One avenue should address the limitations that we mentioned previously. In particular, empirical evidence should be collected in a range of different application domains, from real database designers, under more realistic time constraints, in the absence or partial availability of domain experts and with differences in the experts’ opinions. Our study can provide exact means to conduct such experiments.

Another direction is the

Acknowledgments

This research is supported by the Marsden Fund Council from Government funding, administered by the Royal Society of New Zealand.

We would like to thank Tiong Goh, Sven Hartmann, Pavle Mogin and Dion Peszynski for their kind assistance with the data collection for this project as well as their suggestions to improve the clarity of our presentation. We are also grateful to Dennis Shasha for his suggestions and comments that resulted in an improvement of the motivation and presentation of our

References (87)

  • M. Sözat et al.

    A complete axiomatization for fuzzy functional and multivalued dependencies in fuzzy database relations

    ACM Fuzzy Sets Syst.

    (2001)
  • P. Buneman et al.

    Keys for XML

    Comput. Networks

    (2002)
  • S. Kolahi

    Dependency-preserving normalization of relational and XML data

    J. Comput. Syst. Sci.

    (2007)
  • S. Davidson et al.

    Propagating XML constraints to relations

    J. Comput. Syst. Sci.

    (2007)
  • R. Fagin et al.

    Data exchange: semantics and query answering

    Theor. Comput. Sci.

    (2005)
  • R. Fagin et al.

    Armstrong databases for functional and inclusion dependencies

    Inf. Process. Lett.

    (1983)
  • J. Demetrovics et al.

    Partial dependencies in relational databases and their realization

    Discrete Appl. Math.

    (1992)
  • J. Demetrovics et al.

    The characterization of branching dependencies

    Discrete Appl. Math.

    (1992)
  • J. Bisbal et al.

    Database sampling with functional dependencies

    Inf. Software Technol.

    (2001)
  • J. Bisbal et al.

    A formal framework for database sampling

    Inf. Software Technol.

    (2005)
  • H. Noble

    The automatic generation of test data for a relational database

    Inf. Syst.

    (1983)
  • S. Hartmann et al.

    Constraint acquisition for Entity-Relationship models

    Data Knowl. Eng.

    (2009)
  • W.W. Armstrong

    Dependency structures of database relationships

    Inf. Process.

    (1974)
  • A. Endres et al.

    A Handbook of Software and Systems Engineering; Empirical Observations, Laws and Theories

    (2003)
  • S. Lauesen et al.

    Preventing requirement defects: an experiment in process improvement

    Requir. Eng.

    (2001)
  • J. Martin

    Information Engineering

    (1989)
  • Standish Group, The CHAOS Report, The Standish Group International, available on-line at...
  • Standish Group, Unfinished voyages, The Standish Group International, available on-line at...
  • B. Boehm

    Software Engineering Economics

    (1981)
  • R. Zultner, The Deming way: total quality management for software, in: Proceedings of Total Quality Management for...
  • J. Baixeries, J. Balcázar, Characterization and Armstrong relations for degenerate multivalued dependencies using...
  • R. Fagin

    Horn clauses and database dependencies

    J. ACM

    (1982)
  • C. Beeri et al.

    On the structure of Armstrong relations for functional dependencies

    J. ACM

    (1984)
  • S. Hartmann

    On the implication problem for cardinality constraints and functional dependencies

    Ann. Math. Artif. Intell.

    (2001)
  • F. De Marchi et al.

    Semantic sampling of existing databases through informative Armstrong databases

    Inf. Syst.

    (2007)
  • J. Bisbal et al.

    Consistent database sampling as a database prototyping approach

    J. Software Maintenance

    (2002)
  • J. Ullman, Improving the efficiency of database-system teaching, in: SIGMOD Conference, 2003, pp....
  • C. Delobel et al.

    Relational Database Systems

    (1985)
  • C. Batini et al.

    Conceptual Database Design: An Entity-Relationship Approach

    (1992)
  • B. Thalheim

    Entity-Relationship Modeling: Foundations of Database Technology

    (2000)
  • S. Abiteboul et al.

    Foundations of Databases

    (1995)
  • C. Beeri et al.

    Computational problems related to the design of normal form relational schemata

    ACM Trans. Database Syst.

    (1979)
  • P. Bernstein

    Synthesizing third normal form relations from functional dependencies

    ACM Trans. Database Syst.

    (1976)