Conceptual modeling from natural language functional specifications

https://doi.org/10.1016/S0954-1810(01)00017-6

Abstract

In this paper we describe a structured method for developing a conceptual data model starting from a functional model expressed in a natural language. We use the Conceptual Dependency theory to map natural language descriptions to conceptual dependency diagrams. We have developed algorithms to convert these conceptual dependency diagrams into unit conceptual dependency tables, which are then merged to represent the whole context of the application. We also show how transactional requirements can be incorporated into the unit conceptual dependency table, and how the table is subsequently converted into a corresponding conceptual model. We have developed an augmented transition network (ATN) parser to derive conceptual dependency diagrams from natural language descriptions. A prototype system has been implemented using Oracle8i and developer platforms.

Introduction

Many real-life applications, including declarative Website designs [35], data warehousing [34], and traditional database applications, require the ability to specify user requirements at one level of abstraction, frequently called the 'conceptual level', and to develop detailed specifications at other levels. These levels must be connected through metadata, both to support communication among system designers, implementers, and end-users, and to enable seamless design changes by decoupling the levels. In traditional database applications, this process has been divided into conceptual, logical, and physical design phases [14]. The conceptual design phase consists of two steps: view modeling and view integration. In the view-modeling step, each user group analyzes its data requirements and expresses them in terms of a local conceptual schema. In the view-integration step, these independently developed views are integrated into an overall enterprise-wide schema.

There are many methods for determining user requirements: interviewing users, studying existing environments, and analyzing functional specifications of the system [26]. In all of these cases, descriptions in natural language are common and abundant. Interviews are held in natural languages and transcribed into natural language reports, existing environments can be described in natural language, and functional specifications contain natural language definitions and glossaries of terms. Developing the conceptual model involves identifying the various data elements from natural language descriptions and determining their inter-relationships [9], [11], [12], [13]. Many of these descriptions are not geared towards database design, and thus may be incomplete in their specifications and may contain redundancies. Hence it is the responsibility of the database designer to elicit the missing pieces of information that are implicit in these 'informal' descriptions. No systematic method has been suggested in the literature to identify missing pieces of information in the description of user requirements.

Redundancies in natural language descriptions may include repetition of the same concept under different names (synonyms) or description of different concepts using the same or similar names (homonyms). Typically, the homonym/synonym problem (described by the term interschema multiname anomaly in [4], [5]) is dealt with at the view integration stage [4], [9], [25], with the implicit assumption that the problem does not arise within a single view. However, if views are developed from natural language descriptions, then this problem may need to be addressed at the view modeling stage. Traditional view integration techniques take multiple (possibly independently developed) views v1, v2,…,vn as input and merge them into a single view, V. Several issues emerge in this context.

  • Integrating two views v1 and v2 may require restructuring [3], [4], [21], [23], [25] during or prior to integration. Examples of such restructuring include transforming an attribute into an entity (and vice versa) and introducing subcategories of entities and relationships [33]. Most traditional view integration methods are binary [4], [5], [7], [8], [9], [15], [16], [19], [24]: at any particular point in the process of integration, two views are merged into one. As a result, integrating N views may involve N−1 restructuring steps in the worst case.

  • A property of the binary integration method is that it is non-associative. In other words, the order in which the integration is carried out will determine the outcome of integration [6], [29].

  • We have previously mentioned that the goal of view integration is to obtain an integrated schema that is the least upper bound of the individual schemas. It can be shown [4] that view integration can introduce redundant cycles. In fact, redundant cycles may also exist in the views prior to integration. Identifying redundancy is a non-trivial problem [10], and no systematic method has been suggested to address this issue.

  • User transaction requirements have been addressed only to a limited degree in the view integration literature. In a review of database schema and view integration methodologies [6], only two works [30], [31], [32] incorporate queries as processing requirements in the integration process, both of which use the functional data model. Another work that considers processing requirements (to a limited extent) is Ref. [33], in which queries are mapped to the integrated schema rather than used to determine the result of integration. Hence, that approach does not ensure that the integrated schema will be complete in terms of supporting user transaction requirements.
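The redundant-cycle problem noted above can be made concrete with a small sketch. The check below is our illustration, not the paper's algorithm: it treats a schema as a set of (entity, relationship, entity) triples and flags a relationship as derivable when its endpoints are also connected by a two-step path through the remaining relationships. A real test would also have to consider cardinalities and, as discussed later, the user transaction requirements.

```python
# Illustrative only: a naive derivability test over an ER-like schema
# represented as (entity, relationship, entity) triples.

Schema = set[tuple[str, str, str]]

def derivable(schema: Schema, rel: tuple[str, str, str]) -> bool:
    """True if rel's endpoints are also connected by a two-step path
    through the other relationships, i.e. rel closes a cycle and is a
    candidate for redundancy."""
    a, _, b = rel
    others = schema - {rel}
    # Entities reachable from a in one step through other relationships.
    middles = {t for (s, _, t) in others if s == a}
    return any(s in middles and t == b for (s, _, t) in others)

schema = {
    ("student", "enrolls_in", "course"),
    ("course", "offered_by", "department"),
    ("student", "studies_in", "department"),  # closes a cycle
}
print(derivable(schema, ("student", "studies_in", "department")))  # True
print(derivable(schema, ("student", "enrolls_in", "course")))      # False
```

Whether a derivable relationship should actually be removed is a separate decision: as the paper argues, it must also not be needed to support any user transaction.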

In this paper we develop a methodology for designing the conceptual model starting from a functional model expressed in natural language sentences. While we describe the methodology using a traditional database application, it can be extended to other real-world design problems, such as Web engineering and data warehousing. We adapt the following criteria for the design process from [17], [18], [20], [22]:

  • 1.

    Completeness. The data model must include all data elements that are referenced in the functional specifications of the system. Thus, a database user can request, modify, and store information about any data item that may be required to carry out the functions depicted in the functional model.

  • 2.

    Minimality. The database designer has to ascertain that no data item is represented more than once in the data model, thus avoiding data redundancy, which leads to inconsistencies and wasted storage space.

  • 3.

    Sensitivity. In incremental development of the system, it would be necessary to determine if a new data item is already incorporated into the data model, in order to ensure the completeness and minimality criteria discussed above. Thus, there is a need to go from the data model to the functional model. On the other hand, if an existing system is being modified, the functional model itself may undergo changes, which would require the data model to be updated. Thus, it may also require us to traverse from the functional model to the data model. In either case, a linkage between the two models is necessary. In our proposed method this linkage is established while developing the data model, thereby eliminating the need for any further effort once the two models are created.

  • 4.

    We propose a simultaneous n-ary integration method, in which any number of views can be integrated simultaneously. This reduces the number of restructuring steps to one, as opposed to n−1, as in traditional view integration methods.

  • 5.

    We propose a method for identification of redundant relationships in a conceptual schema. Redundant relationships are those that can be derived from others present in the model, and are not required to support user transaction requirements.

  • 6.

    We propose a method for comparing the conceptual schema with user transaction requirements, and making suitable modifications. This could include adding additional data elements and eliminating redundant ones. This step ensures that the schema is complete and minimal with respect to the user transaction requirements.
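Criteria 4 and 6 above can be sketched in a few lines. This is a simplification, not the proposed method: a view is reduced to a set of data-element names, n-ary integration becomes a single union over all views, and the transaction check reports elements a transaction needs but the schema lacks (completeness) alongside elements no transaction touches (candidates for elimination). All names are illustrative.

```python
# Sketch of simultaneous n-ary integration and the transaction check,
# under the simplifying assumption that a view is a set of element names.

def integrate_nary(views: list[set[str]]) -> set[str]:
    """All n views are merged in one pass, so restructuring decisions
    are made once rather than n-1 times as in binary integration."""
    return set().union(*views)

def check_transactions(schema: set[str], transactions: list[set[str]]):
    """Compare the schema against transaction requirements.
    Returns (missing, unused): elements required by some transaction but
    absent from the schema, and schema elements no transaction uses."""
    needed = set().union(*transactions) if transactions else set()
    return needed - schema, schema - needed

views = [{"customer", "order"}, {"order", "product"}, {"product", "supplier"}]
schema = integrate_nary(views)
missing, unused = check_transactions(schema, [{"customer", "order", "invoice"}])
print(missing)  # {'invoice'}
```

The order in which `views` is listed does not affect the result, which is exactly the non-associativity problem of binary integration that simultaneous integration avoids.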

We applied the theory of Conceptual Dependencies, developed by Schank [27], [28], to interpret natural language expressions. The theory provides a method for developing conceptual representations of natural language sentences and has primarily been used in natural language understanding and generation systems. The methodology we describe identifies data elements from the functional model expressed as natural language descriptions, locates missing pieces of information, combines the individual data elements into an overall conceptual schema, and establishes object granularity in the data model.
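To illustrate the parsing style the paper builds on, the toy below walks a hand-written transition network over a three-word sentence and fills registers for a conceptual-dependency-like structure. The grammar, the registers, and the mapping of verbs to Schank's ATRANS primitive are all drastically simplified assumptions on our part; a real ATN carries recursive subnetworks and far richer tests and actions.

```python
# Toy ATN: arcs map a state to (test, register, next_state) triples;
# registers accumulate the pieces of the conceptual dependency.
ARCS = {
    "S": [(lambda w: w.isalpha() and w[0].isupper(), "actor", "V")],
    "V": [(lambda w: w in {"sends", "gives", "transfers"}, "act", "O")],
    "O": [(lambda w: w.islower(), "object", "END")],
}

def parse(sentence: str):
    """Walk the network word by word, filling registers; returns a
    minimal conceptual-dependency-like dict, or None on failure."""
    state, registers = "S", {}
    for word in sentence.split():
        for test, register, nxt in ARCS.get(state, []):
            if test(word):
                registers[register] = word
                state = nxt
                break
        else:
            return None  # no arc accepts this word
    # The verbs above all denote transfer of possession: Schank's ATRANS.
    return {"primitive": "ATRANS", **registers} if state == "END" else None

print(parse("Alice sends payment"))
# → {'primitive': 'ATRANS', 'actor': 'Alice', 'act': 'sends', 'object': 'payment'}
```

The resulting structure corresponds, very loosely, to one row of a unit conceptual dependency table in the pipeline the abstract describes.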

The rest of the paper is organized as follows: In Section 2 we describe our methodologies and a prototype implementation, and in Section 3 we present our conclusions.

Methodology

In this section we describe the methodology for developing the conceptual schema from a natural language specification of the functional requirements. We first describe the target conceptual schema, then the theory of conceptual dependencies, then the steps necessary to develop the conceptual schema from the conceptual dependency diagrams, and finally the implementation of a prototype system.

Conclusions

In this paper we have described a methodology for developing a conceptual data model from functional specifications expressed in natural languages. The importance of developing such a methodology has been emphasized in the literature time and again. The methodology utilizes the theory of conceptual dependencies that has been applied to multiple natural language understanding systems. In addition to identifying the various data items from natural language specifications, the methodology

Aryya Gangopadhyay is an Assistant Professor of Information Systems at the University of Maryland Baltimore County (USA). He has a B.Tech. from the Indian Institute of Technology, an M.S. in Computer Science from the New Jersey Institute of Technology, and a PhD in Computer Information Systems from Rutgers University. His research interests include electronic commerce, multimedia databases, data warehousing and mining, geographic information systems, and database security. He has authored and co-authored two books, many book chapters, and numerous papers in journals such as Decision Support Systems, Quarterly Journal of Electronic Commerce, IEEE Computer, IEEE Transactions on Knowledge and Data Engineering, Journal of Management Information Systems, Journal of Global Information Management, Electronic Markets: The International Journal of Electronic Commerce and Business Media, AI in Engineering, and Topics in Health Information Management, and has presented papers at many national and international conferences.

References (37)

  • R.C. Schank. Conceptual dependency theory.
  • N. Adam et al. A form-based natural language front-end to a CIM database. IEEE Trans Knowledge Data Engng (1997).
  • N. Adam et al. A form-based approach to natural language processing. J Management Inform Syst (1994).
  • Adam N, Gangopadhyay A. An N-ary view integration method using conceptual dependencies. In: Proceedings of the Hawaii...
  • C. Batini et al. A methodology for data schema integration in the entity relationship model. IEEE Trans Software Engng (1984).
  • C. Batini et al. A comparative analysis of methodologies for database schema integration. ACM Comput Surv (1987).
  • Bouzeghoub M, Gardarin G, Metais E. Database design tools: an expert system approach. In: Proceedings of the...
  • M. Bouzeghoub et al. View integration by semantic unification and transformation of data structures.
  • P. Buneman et al. Theoretical aspects of schema merging (1992).
  • Carswell JL, Navathe SB. SA-ER: A methodology that links structured analysis and entity-relationship modeling for...
  • P. Chen. The entity-relationship model: toward a unified view of data. ACM Trans Database Syst (1974).
  • P. Chen. From ancient Egyptian language to future conceptual modeling.
  • P. Chen. English sentence structure and entity-relationship diagrams. Inform Sci (1983).
  • P.P. Chen. English, Chinese, and ER diagram. Data Knowledge Engng (1997).
  • R. Elmasri et al. Fundamentals of database systems (2000).
  • J. Feng et al. The notion of classes of a path in ER schema.
  • Geller J, Mehta A, Perl Y, Neuhold E, Sheth A. Algorithms for structural schema integration. In: Ng P, Ramamoorthy CV,...
  • S. Hatchman. Practitioner perceptions on the use of some semantic concepts in the entity-relationship model. Eur J Inform Syst (1995).