Conceptual modeling from natural language functional specifications

https://doi.org/10.1016/S0954-1810(01)00017-6

Abstract

In this paper we describe a structured method for developing a conceptual data model starting from a functional model expressed in a natural language. We use the Conceptual Dependency theory to map natural language descriptions to conceptual dependency diagrams. We have developed algorithms to convert these conceptual dependency diagrams into unit conceptual dependency tables, which are then merged to represent the whole context of the application. We also show how transactional requirements can be incorporated into the unit conceptual dependency table, and how the table is subsequently converted into a corresponding conceptual model. We have developed an augmented transition network (ATN) parser to derive conceptual dependency diagrams from natural language descriptions. A prototype system has been implemented using Oracle8i and developer platforms.

Introduction

Many real-life applications, including declarative Website designs [35], data warehousing [34], and traditional database applications, require the ability to specify user requirements at one level of abstraction, frequently called the 'conceptual level', and to develop detailed specifications at other levels. These levels must be connected through metadata, both to support communication among system designers, implementers, and end-users, and to enable seamless design changes by decoupling the levels. In traditional database applications, this process has been divided into conceptual, logical, and physical design phases [14]. The conceptual design phase consists of two steps: view modeling and view integration. In the view-modeling step, each user group analyzes its data requirements and expresses them in terms of a local conceptual schema. In the view-integration step, these independently developed views are integrated into an overall enterprise-wide schema.

There are many methods for determining user requirements: interviewing users, studying existing environments, and analyzing functional specifications of the system [26]. In all of these cases, descriptions in natural language are common and abundant. Interviews are held in natural languages and transcribed into natural language reports, existing environments can be described in natural language, and functional specifications contain natural language definitions and glossaries of terms. Developing the conceptual model involves identifying the various data elements from natural language descriptions and determining their inter-relationships [9], [11], [12], [13]. Many of these descriptions are not geared towards database design, and thus may be incomplete in their specifications and may contain redundancies. Hence it is the responsibility of the database designer to elicit the missing pieces of information that are implicit in these 'informal' descriptions. No systematic method has been suggested in the literature to identify missing pieces of information in the description of user requirements.

Redundancies in natural language descriptions may include repetition of the same concept under different names (synonyms) or description of different concepts using the same or similar names (homonyms). Typically, the homonym/synonym problem (described by the term interschema multiname anomaly in [4], [5]) is dealt with at the view integration stage [4], [9], [25], with the implicit assumption that the problem does not arise within a single view. However, if views are developed from natural language descriptions, then this problem may need to be addressed at the view modeling stage. Traditional view integration techniques take multiple (possibly independently developed) views v1, v2,…,vn as input and merge them into a single view, V. Several issues emerge in this context.

  • Integrating two views v1 and v2 may require restructuring [3], [4], [21], [23], [25] during or prior to integration. Examples of such restructuring include transforming an attribute into an entity (and vice versa) and introducing subcategories of entities and relationships [33]. Most traditional view integration methods are binary [4], [5], [7], [8], [9], [15], [16], [19], [24]: at any particular point in the process of integration, two views are merged into one. As a result, integrating N views may involve N−1 restructuring steps in the worst case.

  • A property of the binary integration method is that it is non-associative. In other words, the order in which the integration is carried out will determine the outcome of integration [6], [29].

  • We have previously mentioned that the goal of view integration is to obtain an integrated schema that is the least upper bound of the individual schemas. It can be shown [4] that view integration can introduce redundant cycles. In fact, redundant cycles may also exist in the views prior to integration. Identifying redundancy is a non-trivial problem [10], and no systematic method has been suggested to address this issue.

  • User transaction requirements have been addressed only to a limited degree in the view integration literature. In a review of database schema and view integration methodologies [6], only two works [30], [31], [32] incorporate queries as processing requirements in the integration process, both of which use the functional data model. Another work that considers processing requirements (to a limited extent) is Ref. [33], in which queries are mapped to the integrated schema rather than used to determine the result of integration. Hence, that approach does not ensure that the integrated schema will be complete in terms of supporting user transaction requirements.
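The redundant-cycle problem noted above can be made concrete with a small sketch. The check below is our illustration, not the paper's algorithm: it treats a schema as a set of (entity, relationship, entity) triples and flags a relationship as derivable when its endpoints are also connected by a two-step path through the remaining relationships. A real test would also have to consider cardinalities and, as discussed later, the user transaction requirements.

```python
# Illustrative only: a naive derivability test over an ER-like schema
# represented as (entity, relationship, entity) triples.

Schema = set[tuple[str, str, str]]

def derivable(schema: Schema, rel: tuple[str, str, str]) -> bool:
    """True if rel's endpoints are also connected by a two-step path
    through the other relationships, i.e. rel closes a cycle and is a
    candidate for redundancy."""
    a, _, b = rel
    others = schema - {rel}
    # Entities reachable from a in one step through other relationships.
    middles = {t for (s, _, t) in others if s == a}
    return any(s in middles and t == b for (s, _, t) in others)

schema = {
    ("student", "enrolls_in", "course"),
    ("course", "offered_by", "department"),
    ("student", "studies_in", "department"),  # closes a cycle
}
print(derivable(schema, ("student", "studies_in", "department")))  # True
print(derivable(schema, ("student", "enrolls_in", "course")))      # False
```

Whether a derivable relationship should actually be removed is a separate decision: as the paper argues, it must also not be needed to support any user transaction.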

In this paper we develop a methodology for designing the conceptual model starting from a functional model expressed in natural language sentences. While we describe the methodology using a traditional database application, it can be extended to other real-world design problems, such as Web engineering and data warehousing. We adapt the following criteria for the design process from [17], [18], [20], [22]:

  • 1.

    Completeness. The data model must include all data elements that are referenced in the functional specifications of the system. Thus, a database user can request, modify, and store information about any data item that may be required to carry out the functions depicted in the functional model.

  • 2.

    Minimality. The database designer has to ascertain that no data item is represented more than once in the data model, thus avoiding data redundancy, which leads to inconsistencies and wasted storage space.

  • 3.

    Sensitivity. In incremental development of the system, it would be necessary to determine if a new data item is already incorporated into the data model, in order to ensure the completeness and minimality criteria discussed above. Thus, there is a need to go from the data model to the functional model. On the other hand, if an existing system is being modified, the functional model itself may undergo changes, which would require the data model to be updated. Thus, it may also require us to traverse from the functional model to the data model. In either case, a linkage between the two models is necessary. In our proposed method this linkage is established while developing the data model, thereby eliminating the need for any further effort once the two models are created.

  • 4.

    We propose a simultaneous n-ary integration method, in which any number of views can be integrated simultaneously. This reduces the number of restructuring steps to one, as opposed to n−1, as in traditional view integration methods.

  • 5.

    We propose a method for identification of redundant relationships in a conceptual schema. Redundant relationships are those that can be derived from others present in the model, and are not required to support user transaction requirements.

  • 6.

    We propose a method for comparing the conceptual schema with user transaction requirements, and making suitable modifications. This could include adding additional data elements and eliminating redundant ones. This step ensures that the schema is complete and minimal with respect to the user transaction requirements.
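Criteria 4 and 6 above can be sketched in a few lines. This is a simplification, not the proposed method: a view is reduced to a set of data-element names, n-ary integration becomes a single union over all views, and the transaction check reports elements a transaction needs but the schema lacks (completeness) alongside elements no transaction touches (candidates for elimination). All names are illustrative.

```python
# Sketch of simultaneous n-ary integration and the transaction check,
# under the simplifying assumption that a view is a set of element names.

def integrate_nary(views: list[set[str]]) -> set[str]:
    """All n views are merged in one pass, so restructuring decisions
    are made once rather than n-1 times as in binary integration."""
    return set().union(*views)

def check_transactions(schema: set[str], transactions: list[set[str]]):
    """Compare the schema against transaction requirements.
    Returns (missing, unused): elements required by some transaction but
    absent from the schema, and schema elements no transaction uses."""
    needed = set().union(*transactions) if transactions else set()
    return needed - schema, schema - needed

views = [{"customer", "order"}, {"order", "product"}, {"product", "supplier"}]
schema = integrate_nary(views)
missing, unused = check_transactions(schema, [{"customer", "order", "invoice"}])
print(missing)  # {'invoice'}
```

The order in which `views` is listed does not affect the result, which is exactly the non-associativity problem of binary integration that simultaneous integration avoids.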

We applied the theory of Conceptual Dependencies, developed by Schank [27], [28], to interpret natural language expressions. The theory provides a method for developing conceptual representations of natural language sentences and has primarily been used in natural language understanding and generation systems. The methodology we describe identifies data elements from the functional model expressed as natural language descriptions, locates missing pieces of information, combines the individual data elements into an overall conceptual schema, and establishes object granularity in the data model.
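To illustrate the parsing style the paper builds on, the toy below walks a hand-written transition network over a three-word sentence and fills registers for a conceptual-dependency-like structure. The grammar, the registers, and the mapping of verbs to Schank's ATRANS primitive are all drastically simplified assumptions on our part; a real ATN carries recursive subnetworks and far richer tests and actions.

```python
# Toy ATN: arcs map a state to (test, register, next_state) triples;
# registers accumulate the pieces of the conceptual dependency.
ARCS = {
    "S": [(lambda w: w.isalpha() and w[0].isupper(), "actor", "V")],
    "V": [(lambda w: w in {"sends", "gives", "transfers"}, "act", "O")],
    "O": [(lambda w: w.islower(), "object", "END")],
}

def parse(sentence: str):
    """Walk the network word by word, filling registers; returns a
    minimal conceptual-dependency-like dict, or None on failure."""
    state, registers = "S", {}
    for word in sentence.split():
        for test, register, nxt in ARCS.get(state, []):
            if test(word):
                registers[register] = word
                state = nxt
                break
        else:
            return None  # no arc accepts this word
    # The verbs above all denote transfer of possession: Schank's ATRANS.
    return {"primitive": "ATRANS", **registers} if state == "END" else None

print(parse("Alice sends payment"))
# → {'primitive': 'ATRANS', 'actor': 'Alice', 'act': 'sends', 'object': 'payment'}
```

The resulting structure corresponds, very loosely, to one row of a unit conceptual dependency table in the pipeline the abstract describes.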

The rest of the paper is organized as follows: In Section 2 we describe our methodologies and a prototype implementation, and in Section 3 we present our conclusions.

Methodology

In this section we describe the methodology for developing the conceptual schema from a natural language specification of the functional requirements. We first describe the target conceptual schema, then the theory of conceptual dependencies, then the steps necessary to develop the conceptual schema from the conceptual dependency diagrams, and finally the implementation of a prototype system.

Conclusions

In this paper we have described a methodology for developing a conceptual data model from functional specifications expressed in natural languages. The importance of developing such a methodology has been emphasized in the literature time and again. The methodology utilizes the theory of conceptual dependencies that has been applied to multiple natural language understanding systems. In addition to identifying the various data items from natural language specifications, the methodology

Aryya Gangopadhyay is an Assistant Professor of Information Systems at the University of Maryland Baltimore County (USA). He has a B.Tech. from the Indian Institute of Technology, an M.S. in Computer Science from the New Jersey Institute of Technology, and a PhD in Computer Information Systems from Rutgers University. His research interests include electronic commerce, multimedia databases, data warehousing and mining, geographic information systems, and database security. He has authored and co-authored two books, many book chapters, and numerous papers in journals such as Decision Support Systems, Quarterly Journal of Electronic Commerce, IEEE Computer, IEEE Transactions on Knowledge and Data Engineering, Journal of Management Information Systems, Journal of Global Information Management, Electronic Markets: The International Journal of Electronic Commerce and Business Media, AI in Engineering, and Topics in Health Information Management, and has presented papers at many national and international conferences.

References (37)

  • R.C. Schank. Conceptual dependency theory.
  • N. Adam et al. A form-based natural language front-end to a CIM database. IEEE Trans Knowledge Data Engng (1997).
  • N. Adam et al. A form-based approach to natural language processing. J Management Inform Syst (1994).
  • Adam N, Gangopadhyay A. An N-ary view integration method using conceptual dependencies. In: Proceedings of the Hawaii...
  • C. Batini et al. A methodology for data schema integration in the entity relationship model. IEEE Trans Software Engng (1984).
  • C. Batini et al. A comparative analysis of methodologies for database schema integration. ACM Comput Surv (1987).
  • Bouzeghoub M, Gardarin G, Metais E. Database design tools: an expert system approach. In: Proceedings of the...
  • M. Bouzeghoub et al. View integration by semantic unification and transformation of data structures.
  • P. Buneman et al. Theoretical aspects of schema merging (1992).
  • Carswell JL, Navathe SB. SA-ER: A methodology that links structured analysis and entity-relationship modeling for...
  • P. Chen. The entity-relationship model: toward a unified view of data. ACM Trans Database Syst (1974).
  • P. Chen. From ancient Egyptian language to future conceptual modeling.
  • P. Chen. English sentence structure and entity-relationship diagrams. Inform Sci (1983).
  • P.P. Chen. English, Chinese, and ER diagram. Data Knowledge Engng (1997).
  • R. Elmasri et al. Fundamentals of database systems (2000).
  • J. Feng et al. The notion of classes of a path in ER schema.
  • Geller J, Mehta A, Perl Y, Neuhold E, Sheth A. Algorithms for structural schema integration. In: Ng P, Ramamoorthy CV,...
  • S. Hatchman. Practitioner perceptions on the use of some semantic concepts in the entity-relationship model. Eur J Inform Syst (1995).