Research Article · Open Access

Learning-based Relaxation of Completeness Requirements for Data Entry Forms

Published: 15 March 2024


Abstract

Data entry forms use completeness requirements to specify the fields that are required or optional to fill for collecting necessary information from different types of users. However, because of the evolving nature of software, some required fields may not be applicable for certain types of users anymore. Nevertheless, they may still be incorrectly marked as required in the form; we call such fields obsolete required fields. Since obsolete required fields usually have “not-null” validation checks before submitting the form, users have to enter meaningless values in such fields to complete the form submission. These meaningless values threaten the quality of the filled data and could negatively affect stakeholders or learning-based tools that use the data. To avoid users filling meaningless values, existing techniques usually rely on manually written rules to identify the obsolete required fields and relax their completeness requirements. However, these techniques are ineffective and costly.

In this article, we propose LACQUER, a learning-based automated approach for relaxing the completeness requirements of data entry forms. LACQUER builds Bayesian Network models to automatically learn conditions under which users had to fill meaningless values. To improve its learning ability, LACQUER identifies the cases where a required field is only applicable for a small group of users and uses SMOTE, an oversampling technique, to generate more instances for such fields to effectively mine dependencies on them. During the data entry session, LACQUER predicts the completeness requirement of a target field based on the already-filled fields and their conditional dependencies in the trained model.

Our experimental results show that LACQUER can accurately relax the completeness requirements of required fields in data entry forms with precision values ranging between 0.76 and 0.90 on different datasets. LACQUER can prevent users from filling 20% to 64% of meaningless values, with negative predictive values (i.e., the ability to correctly predict a field as “optional”) between 0.72 and 0.91. Furthermore, LACQUER is efficient; it takes at most 839 ms to predict the completeness requirement of an instance.


1 INTRODUCTION

Software designers use data entry forms to collect inputs of users who interact with software systems [27, 45]. To correctly collect the necessary information from users, designers typically define the completeness requirements of fields in data entry forms. These completeness requirements specify the fields that are required or optional to fill for different types of users.

However, as the software system and the application requirements change, data entry forms change, too. Such changes may result in some fields, previously marked as required, becoming inapplicable for certain types of users. We call obsolete required fields the fields whose “required” attribute does not remain valid with respect to the current application requirements. Although such fields are set as “required” in the form, the correct completeness requirement should be “optional.”

When obsolete required fields are included in an input form, since the system usually has client-side validation checks [51] to ensure that all the required fields have been filled in, users are obliged to fill the required fields with meaningless values (e.g., “@,” “n/a”) to be able to submit the form [3, 30]. We have observed this phenomenon both on a popular biomedical information collection platform NCBI [4], in which more than half of the users have filled meaningless values in required fields, and in a dataset provided by our industrial partner in the financial domain.

Obsolete required fields represent an extra burden for the users, costing additional time when filling the input form, and might lead users to interrupt the data entry process, with potential loss of business opportunities (e.g., a prospective client giving up during the registration phase due to the complexity of the input form). Moreover, the meaningless values filled through these obsolete required fields are then transferred to the software system using them and may affect the overall data quality of the system [37]. For example, given a categorical field (which is an obsolete required field), the user can choose the first value in a combo box just to skip filling this field. Even though the value is chosen from the list of candidate values, this value is meaningless, since the field should not have to be filled in the first place [3]. This value can be used as an input by machine learning-based tools (for example, an automated form-filling tool [5]), which can then lead to more errors (e.g., wrongly predicting the values of some fields).

To automatically relax completeness requirements and avoid meaningless values, existing work has proposed adaptive form tools [6, 19, 49], which enable form designers to set required fields as optional when certain conditions hold. These tools first require form designers to define a complete and final set of completeness requirements, capturing the conditions for which a field should be required or optional. Then, they use intermediate representations such as XML [6] and dynamic condition response graphs [49] to represent the completeness requirements rules and implement adaptive behaviors. In addition, there are commercial tools (e.g., Gravity Forms [42], Google Forms [25]) that assist designers in designing adaptive forms, where fields can be displayed or hidden based on the value of already filled fields in the form. Similar to existing research approaches, these commercial tools assume that designers already have a complete and final set of completeness requirements describing the adaptive behavior of the form during the design phase.

However, due to the complexity of the domain (with hundreds of fields) and the evolution of the software, identifying a priori a comprehensive set of completeness requirements is not a viable solution. Moreover, even if they could be identified, such completeness requirements could become quickly obsolete, limiting the use of existing adaptive form tools.

To solve this problem, we propose LACQUER, a Learning-bAsed Completeness reQUirEments Relaxation approach, to automatically learn the conditions under which completeness requirements can be relaxed (i.e., when a required field can become optional). The basic idea of LACQUER is to build machine learning models to learn the conditions under which users had to fill meaningless values based on the data provided as input in past data entry sessions (hereafter called historical input instances). Using these models, the already-filled fields in a data entry form can then be used as features to predict whether a required field should become optional for certain users. LACQUER can be used during the form-filling process to refactor data entry forms by dynamically removing obsolete required fields at runtime, helping designers identify completeness requirements that should be relaxed.

LACQUER includes three phases: model building, form filling relaxation, and threshold determination. Given a set of historical input instances, the model building phase identifies the meaningless values filled by users and builds Bayesian network (BN) models to represent the completeness requirement dependencies among form fields (i.e., the conditions upon which users fill meaningless values). To improve its learning ability, LACQUER also identifies the cases where a required field is only applicable for a small group of users; it uses the synthetic minority oversampling technique SMOTE to generate more instances for such fields to effectively mine dependencies on them. Once the trained models are available, during the data entry session, the form filling relaxation phase predicts the completeness requirement of a target field based on the values of the already-filled fields and their conditional dependencies in the trained models. The predicted completeness requirement of a field and the corresponding predicted probability (endorsed based on an automatically determined threshold) are then used to implement adaptive behaviors of data entry forms.

The overall architecture of LACQUER has been inspired by LAFF [5], our previous work on automated form filling of data entry forms. The main similarities between these approaches derive from their shared challenges associated with the application domain (form filling). These challenges include (1) the arbitrary filling order and (2) partially filled forms. To address the first challenge, similar to LAFF, we use BNs to mine the relationships between filled fields and the target field to avoid training a separate model for each filling order. As for the second challenge, once again similar to LAFF, we use an endorser module to avoid providing inaccurate suggestions to the user when the form does not contain enough information for the model. More details about the similarities and differences between LACQUER and LAFF are provided in Section 6.

We evaluated LACQUER using form-filling records from both a public dataset and a proprietary dataset extracted from a production-grade enterprise information system in the financial domain. The experimental results show that LACQUER can accurately relax the completeness requirements of required fields in data entry forms with a precision value between 0.76 and 0.90 when predicting the truly required fields. In a sequential filling scenario, i.e., when users fill data entry forms in the default order determined by the form tab sequence, LACQUER can prevent users from providing meaningless values in 20% to 64% of the cases, with a negative predictive value (representing the ability of LACQUER to correctly predict a field as “optional”) between 0.72 and 0.91, significantly outperforming state-of-the-art rule-based approaches by at least 12 percentage points (pp) on the two datasets. Furthermore, LACQUER is efficient; it takes at most 839 ms to determine the completeness requirement of an input instance of the proprietary dataset.

To summarize, the main contributions of this article are:

The LACQUER approach, which addresses the problem of automated completeness requirements relaxation—an important challenge in designing data entry forms. To the best of our knowledge, LACQUER is the first work to combine BNs with oversampling and a probability-based endorser to provide accurate completeness requirement suggestions.

An extensive evaluation assessing the effectiveness and efficiency of LACQUER and comparing it with state-of-the-art baselines.1

The rest of the article is organized as follows: Section 2 provides a motivating example and explains the basic definitions of automated completeness requirements relaxation and its challenges. Section 3 introduces the basic machine learning algorithms used in this article. Section 4 describes the different steps and the core algorithms of LACQUER. Section 5 reports on the evaluation of LACQUER. Section 6 surveys related work. Section 7 discusses the usefulness and practical implication of LACQUER. Section 8 concludes the article.


2 COMPLETENESS REQUIREMENT RELAXATION FOR DATA ENTRY FORMS

In this section, we introduce the concepts related to data entry forms, provide a motivating example, precisely define the problem of automated completeness requirement relaxation for data entry forms, and discuss its challenges.

2.1 Data Entry Forms

Data entry forms are composed of fields of different types, such as textual, numerical, and categorical. Textual and numerical fields collect free text and numerical values, respectively (e.g., the name and the age of a private customer of an energy provider); categorical fields provide a list of options from which users have to choose (e.g., nationality). Form developers can mark form fields either as required or optional, depending on the importance of the information to be collected. This decision is made during the design phase of the form based on the application completeness requirements. Such requirements capture the input data that shall be collected for certain types of users; they are fulfilled by setting the required/optional property of the corresponding fields in a data entry form. In other words, the required fields (also called mandatory fields [46]) of a form collect input information considered as important to the stakeholders who plan to use the collected information; the absence of this information could affect the application usage. On the contrary, optional fields collect information that is nice to have but whose absence is acceptable. For example, an energy provider cannot open a customer account when the customer name is missing; hence, the corresponding input field in a data entry form should be marked as “required.” At the same time, an energy provider does not need to know the education level of a new private customer (though it could be useful for profiling), so the corresponding input field can be marked as “optional.”

Some required fields can be further classified as conditionally required, i.e., they are required only if certain conditions hold. For example, the field “marriage date” is required only if the value of the categorical field “civil status” is set to “married.” Data entry forms that support “conditionally required fields” are generally called adaptive forms [6] or context-sensitive forms [3], since they exhibit adaptive behaviors based on the values filled by users. More specifically, these types of forms are programmed so a field can be set from “required” to “optional” during the form-filling session, based on the input data; a change of this property also toggles the visibility of the field itself in the form. Such adaptive behaviors make the data entry form easier to use [3], since users can focus on the fields they actually need to fill in.

Before submitting a data entry form, the form usually conducts a client-side validation check [51]—using some scripting language or built-in features of the environment where the form is visualized, like HTML attributes—to ensure that all the required fields have been filled in.

In this work, we consider a simple representation of an input form, with basic input fields that can have only a unique value that can be selected or entered, such as a text box (e.g., <input type="text"> or <textarea> in HTML), a drop-down menu (e.g., <select> with single selection), or a radio button (e.g., <input type="radio"> in HTML). This allows us to assume that a field can only have one completeness requirement; in other words, a field cannot be optional and required at the same time.

We do not support forms with more sophisticated controls or fields that can handle multiple selections (e.g., a checkbox group for multiple-choice answers or a drop-down menu with multiple selection), as often found in surveys and questionnaires. Note that in this case a field can be both optional and required at the same time, depending on the number of selected values in the group.2 We plan to support this kind of complex controls as part of future work.

2.2 Motivating Example

Data entry forms are difficult to design [17] and subject to frequent changes [54]. These two aspects of data entry form design and development negatively impact the way developers deal with application completeness requirements in data entry forms.

For example, let us consider a data entry form in an energy provider information system, used for opening an account for business customers. For simplicity, we assume the form has only three required fields: “Company type” (categorical), “Field of activity” (categorical), and “Tax ID” (textual). Sometime after the deployment of the initial version of the system, the energy provider decides to support also the opening of customer accounts for non-profit organizations (NPOs). The developers update the form by adding (a) a new option “NPO” to the field “Company type” and (b) additional fields denoting information required for NPOs. After the deployment of the new form, a data entry operator of the energy provider (i.e., the end-user interfacing with the data entry form) notices a blocking situation when filling in the form for an NPO. Specifically, the form flags the field “Tax ID” as required; however, the company representative cannot provide one, since the company is exempted from paying taxes. The clerk is then obliged to fill in the required field with a meaningless value (e.g., “@”) to pass the validation check and be able to submit the form. Several weeks later, after noticing some issues in the generation of customers’ reports, the data quality division of the energy provider reviews the data collected through the data entry form, detecting the presence of meaningless values. A subsequent meeting with IT analysts and developers reveals that those values have been introduced because the data entry form design has not been updated to take into account the new business requirements (i.e., opening accounts for NPOs) and the corresponding completeness requirements (i.e., some NPOs in certain fields of activity do not have a tax ID). 
For example, the current (but obsolete) form design always flags “tax ID” as a required field; however, when the “Company type” field is set to “NPO” and the “Field of activity” field is either “charity” or “education,” the field “tax ID” should be optional.

These meaningless values filled during form filling negatively affect data quality [3], since they are considered as data entry errors and may lead to error propagation:

Data entry errors: Users fill obsolete required fields with incorrect data (meaningless values) to proceed quickly in the workflow of the data-entry form [3].

Error propagation: Meaningless value errors can propagate and create more errors [37], especially when these values are used in ML-based tools.

Meaningless value errors are difficult to identify, because such values can pass all validation checks of the data entry form. A business may establish the practice of using specific values (e.g., “@” and “-1”) when users do not need to fill some fields, as in the aforementioned example. However, even in this case the data quality team needs to carefully check the filled fields to ensure that all the data entry operators follow this convention, which is a time-consuming process.

Currently, there are some simple but rather impractical solutions to address the issue of filling meaningless values, namely a rule-based solution and a dictionary-based solution:

Rule-based solution: This solution defines for each field some rules capturing the conditions for which a required field can become optional, based on the values of the other form fields.

Dictionary-based solution: This solution sets all fields containing meaningless values as optional. More specifically, the data quality division could first create a dictionary of meaningless values (e.g., “@,” “$”). Users can then use such values when a field is not applicable in a certain form-filling scenario. Finally, the data quality division could analyze the historical input instances and mark a field as optional when users assign a value to it from the meaningless values dictionary. Such information could then be used to refactor the data entry form, setting the corresponding input field as optional.
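The scan behind the dictionary-based solution can be sketched in a few lines of Python; the field names, the history contents, and the meaningless-value dictionary below are illustrative assumptions, not part of any existing tool:

```python
# Hypothetical sketch of the dictionary-based solution: mark a field as a
# candidate for "optional" when some historical instance assigned it a
# value from an agreed-upon dictionary of meaningless values.
MEANINGLESS = {"@", "$", "n/a", "-1"}  # assumed convention, per the text

def fields_to_relax(historical_instances):
    """Return the fields that received at least one meaningless value."""
    candidates = set()
    for instance in historical_instances:  # each instance: field -> value
        for field, value in instance.items():
            if value in MEANINGLESS:
                candidates.add(field)
    return candidates

history = [
    {"Company type": "NPO", "Field of activity": "charity", "Tax ID": "@"},
    {"Company type": "LLC", "Field of activity": "energy", "Tax ID": "T190"},
]
print(fields_to_relax(history))  # {'Tax ID'}
```

Note that such a scan flags fields unconditionally: it cannot recover the conditions (values of other fields) under which a flagged field should remain required.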

However, the two solutions are not practical. Given the evolving nature of software [22, 52], the rule-based solution is not scalable and maintainable, especially when the number of fields (and their possible values, for categorical fields) increases. Moreover, as is the case for our industrial partner, it is difficult also for domain experts to formulate the completeness requirement of new fields, since they have to decide the exact impact of different field combinations on the new fields. Regarding the dictionary-based solution, the completeness requirement of a field usually depends on the values of other filled fields [3] (such as the aforementioned example of Tax ID) and cannot be detected only by looking at special/meaningless characters. This simple solution cannot help domain experts identify these useful conditions.

Therefore, we contend it is necessary to develop automated methods to learn such conditions directly from the data provided as input in past data entry sessions, so completeness requirements of form fields can be automatically relaxed during new data entry sessions. Moreover, the learned conditions could also help designers identify completeness requirements that should be relaxed.

2.3 Problem Definition

In this article, we deal with the problem of completeness requirement relaxation for data entry forms. The problem can be informally defined as deciding whether a required field in a form can be considered optional based on the values of the other fields and the values provided as input in previous data entry sessions for the same form. We formally define this problem as follows:

Let us assume we have a data entry form with n fields \(F=\lbrace f_1, f_2, \ldots , f_n\rbrace\). Taking into account the required/optional attribute of each field, the set of fields can be partitioned into two groups: required fields (denoted by R) and optional fields (denoted by \(\bar{R}\)), where \(\bar{R} \cup R=F\) and \(\bar{R} \cap R = \emptyset\). Let \(\mathit {VD}_i\) represent the value domain of field \(f_i\), which excludes empty values. Each field \(f_i\) in F can take a value from a domain \(V_i\), where \(V_i = \mathit {VD}_i\) if the field is required and \(V_i = \mathit {VD}_i \cup \lbrace \bot \rbrace\) if the field is optional (\(\bot\) is a special element representing an empty value).

Let \(R^{c}\subseteq R\) be the set of conditionally required fields, which are required only when a certain condition \(\mathit {Cond}\) is satisfied. For a field \(f_k\in R^{c}\), we define the condition \(\mathit {Cond}_k\) as the conjunction of predicates over the value of some other fields; more formally, \(\mathit {Cond}_k=\bigwedge _{1 \le i \le n, i \ne k} h(f_i, v_i^c)\), where \(f_i \in F, v_i^c \in V_i\), and h is a predicate over the field \(f_i\) with respect to the value \(v_i^c\).

During form filling, at any time t the fields can be partitioned into two groups: fields that have been filled completely (denoted by \(C_{t}\)) and unfilled fields (denoted by \(\bar{C_{t}}\)); let G be the operation that extracts a field from a form during form filling \(G(F)= f\), such that \((f \in C_t)\vee (f\in \bar{C_{t}})\) and \(C_{t} \cap \bar{C_{t}}= \emptyset\). By taking into account also the required/optional attribute, we have: filled required fields \((C_{t}\cap R)\), filled optional fields \((C_{t}\cap \bar{R})\), unfilled required fields \((\bar{C_{t}} \cap R)\), and unfilled optional fields \((\bar{C_{t}} \cap \bar{R})\).

When a form is about to be submitted (e.g., to be stored in a database), we define an input instance of the form to be \(I^F=\lbrace \langle f_1, v_1 \rangle , \ldots , \langle f_n, v_n\rangle \rbrace\) with \(f_i \in F\) and \(v_i \in V_i\); we use the subscript \(t_j\) as in \(I^F_{t_j}\) to denote that the input instance \(I^F\) was submitted at time \(t_j\). We use the notation \(I^F(t)\) to represent the set of historical input instances of the form that have been submitted up to a certain time instant t; \(I^F(t)=\lbrace I^F_{t_{i}}, I^F_{t_{j}}, \ldots , I^F_{t_{k}}\rbrace\), where \(t_i \lt t_j \lt t_k \lt t\). Hereafter, we drop the superscript F when it is clear from the context.

The completeness requirement relaxation problem can be defined as follows: Given a partially filled form \(F=\lbrace f_1, f_2, \ldots , f_n\rbrace\) for which, at time t, we know \(\bar{C_{t}} \ne \emptyset\), \(C_{t}\), and \(R^c\), a set of historical input instances \(I^F(t)\), and a target field \(f_{p}\in (R^{c}\cap \bar{C_{t}})\) to fill, with \(p \in 1 \dots n\), we want to build a model M predicting whether, at time t, \(f_p\) should become optional based on \(C_{t}\) and \(I^F(t)\).
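For concreteness, the partition into filled fields \(C_t\) and unfilled fields \(\bar{C_t}\) and a condition \(\mathit{Cond}_k\) can be encoded as follows; the field names and the concrete predicate are illustrative, borrowed from the motivating example rather than produced by LACQUER:

```python
# Illustrative encoding of the definitions above (names are assumptions).
# A condition Cond_k is a conjunction of predicates over filled fields.

def cond_tax_id(filled):
    """'Tax ID' stays required unless Company type is NPO and the
    Field of activity is charity or education (Section 2.2)."""
    return not (filled.get("Company type") == "NPO"
                and filled.get("Field of activity") in {"charity", "education"})

fields = {"Company Name", "Monthly revenue", "Company type",
          "Field of activity", "Tax ID"}                              # F
filled_t = {"Company type": "NPO", "Field of activity": "education"}  # C_t
unfilled_t = fields - filled_t.keys()                                 # C̄_t

assert filled_t.keys().isdisjoint(unfilled_t)  # C_t ∩ C̄_t = ∅
print(cond_tax_id(filled_t))  # False: "Tax ID" should become optional
```

LACQUER's goal is precisely to learn such conditions from the historical input instances \(I^F(t)\) instead of having them written by hand.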

Framing the problem definition scope.

In this problem definition, our goal is to relax the completeness requirements of a form by determining which obsolete required fields should become optional to avoid filling meaningless values. We do not consider the case in which optional fields could become required; we leave extending LACQUER to automatically decide the completeness requirement of all fields as part of future work.

Moreover, as mentioned in the motivating example, in this definition, we mainly focus on the case of filling data entry forms from scratch. We do not consider the case in which an existing instance in the database is updated (including an update of the timestamp); for example, following our motivating example, if a company changes its “Field of activity” to “charity,” then some fields like “tax ID” may become optional and do not need to be filled. LACQUER can be adapted to support this scenario and check whether the completeness requirement of some fields needs to be changed; we also leave this adaptation as part of future work.

Application to the running example.

Fig. 1 shows an example of a data entry form used to fill in the information needed to open an account for business customers with an energy provider. The form F is composed of five fields, including \(f_1\):“Company Name,” \(f_2\):“Monthly revenue,” \(f_3\):“Company type,” \(f_4\):“Field of activity,” and \(f_5\):“Tax ID.” All the fields are initially required (i.e., \(R=\lbrace f_1,f_2,f_3,f_4, f_5\rbrace\)). Values filled in these fields are then stored in a database. An example of the database is shown on the right side of Fig. 1. These values are collected during the data entry session with an automatically recorded timestamp indicating the submission time. Each row in the database represents an input instance (e.g., \(I^F_{20180101194321}=\lbrace \langle {\it ``Company Name^{\prime \prime }}, {\it UCI} \rangle , \ldots , \langle {\it ``Tax ID^{\prime \prime }}, {\it T190}\rangle \rbrace\)), where the column name corresponds to the field name in the form. The mapping can be obtained from the existing software design documentation or software implementation [5]. Using the data collected from different users, we can build a model M to learn possible relationships of completeness requirement between different fields. Let us assume a scenario where during the creation of a customer account using F, the energy provider clerk has entered Wish, 20, NPO, and education for fields \(f_1\) to \(f_4\), respectively. The field \(f_5\) (“Tax ID”) is the next field to be filled. Our goal is to automatically decide if field \(f_5\) is required or not based on the values filled in fields \(f_1\) to \(f_4\).

Fig. 1.

Fig. 1. The automated form-filling relaxation problem.

2.4 Towards Adaptive Forms: Challenges

Several tools for adaptive forms have been proposed [6, 19, 49]. These approaches use intermediate representations such as XML [6] and dynamic condition response graphs [49] to represent the completeness requirements rules and implement adaptive behaviors. Existing tools for adaptive forms usually assume that form designers already have, during the design phase, a complete and final set of completeness requirements, capturing the conditions for which a field should be required or optional.

However, this assumption is not valid in real-world applications. On the one hand, data entry forms are not easy to design [17]. Data entry forms need to reflect the data that need to be filled in an application domain. Due to time pressure and the complexity of the domain (e.g., the number of fields needed to be filled and their interrelation), it is difficult to identify all the completeness requirements when designing the data entry form [2, 12]. On the other hand, data entry forms are subject to change: A recent study [54] has shown that 49% of web applications will modify their data constraints in a future version. The frequent changes in data constraints may also make the existing completeness requirements obsolete.

Hence, the main challenge is how to create adaptive forms when the set of completeness requirements representing the adaptive behavior of a form is incomplete and evolving.


3 PRELIMINARIES

Before illustrating our approach, we first briefly introduce two basic machine-learning algorithms we rely on.

3.1 Bayesian Networks

Bayesian networks (BNs) are probabilistic graphical models (PGM) in which a set of random variables and their conditional dependencies are encoded as a directed acyclic graph: Nodes correspond to random variables and edges correspond to conditional probabilities.

The use of BNs for supervised learning [20] typically consists of two phases: structure learning and variable inference.

During structure learning, the graphical structure of the BN is automatically learned from a training set. First, the conditional probability between any two random variables is computed. Based on these probabilities, optimization-based search (e.g., hill climbing [21]) is applied to search the graphical structure. The search algorithm initializes a random structure and then iteratively adds or deletes its nodes and edges to generate new structures. For each new structure, the search algorithm calculates a fitness function (e.g., Bayesian information criterion, BIC [40]) based on the nodes’ conditional probabilities and on Bayes’ theorem [20]. Structure learning stops when it finds a graphical structure that minimizes the fitness function.

Fig. 2 shows an example of the BN structure learned based on the data submitted by the data entry form used in our example in Section 2.2. This BN contains three nodes corresponding to three fields in the data entry form: variable \(\mathit {Revenue}\) depends on variable \(\mathit {Company \ type}\); variable \(\mathit {Tax \ ID}\) depends on variables \(\mathit {Company \ type}\) and Revenue. For simplicity, we assume that the three variables are Boolean where a, b, and c denote that fields \(\mathit {Company \ type}\), \(\mathit {Revenue}\), and \(\mathit {Tax \ ID}\) are “required,” respectively, and \(\bar{a}\), \(\bar{b}\), and \(\bar{c}\) denote that these fields are optional.

Fig. 2.

Fig. 2. An example of BN and the probability functions of its nodes.

In the PGM, each node is associated with a probability function (in this case, encoded as a table), which represents the conditional probability between the node and its parent(s). For example, in Fig. 2 each variable has two values; the probability table for \(\mathit {Revenue}\) reflects the conditional probability \(P(\mathit {Revenue}\mid \mathit {Company \ type})\) between \(\mathit {Company \ type}\) and \(\mathit {Revenue}\) on these values.

Variable inference infers unobserved variables from the observed variables and the graphical structure of the BN using Bayes’ theorem [20]. For example, we can infer the probability of \(\mathit {Tax \ ID}\) to be required (i.e., \(\mathit {Tax \ ID} =c\)) when the completeness requirement of \(\mathit {Company \ type}\) is required (denoted by \(P(c \mid a)\)) as follows: \(\begin{equation*} \begin{aligned}P(c \mid a) &= \frac{P(a, c)}{P(a)} = \frac{P(a, b, c) + P(a, \overline{b}, c)}{P(a)} \\ &= \frac{P(c \mid a, b) P(b \mid a) P(a) + P(c \mid a, \overline{b}) P(\overline{b} \mid a) P(a)}{P(a)} \\ &= \frac{0.9*0.4*0.2 + 0.4*0.6*0.2}{0.2} =0.6. \end{aligned} \end{equation*}\)
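The derivation above can be replicated numerically in plain Python, reading the probability values off the tables in Fig. 2 exactly as they appear in the formula (note that \(P(a)\) cancels out, so it is not needed):

```python
# Inference of P(c | a) by marginalizing over Revenue:
# P(c | a) = P(c | a, b) P(b | a) + P(c | a, b') P(b' | a)
p_b_given_a = 0.4        # P(b | a)
p_c_given_ab = 0.9       # P(c | a, b)
p_c_given_a_notb = 0.4   # P(c | a, b')

p_c_given_a = (p_c_given_ab * p_b_given_a
               + p_c_given_a_notb * (1 - p_b_given_a))
print(round(p_c_given_a, 6))  # 0.6
```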

BNs have been initially proposed for learning dependencies among discrete random variables. They are also robust when dealing with missing observed variables; more specifically, variable inference can be conducted when some conditionally independent observed variables are missing [20]. Recently, they have been applied in the context of automated form filling [5].

3.2 Synthetic Minority Oversampling Technique (SMOTE)

A frequently encountered problem when training machine learning models on real-world data is that the number of instances per class can be imbalanced [32, 48]. To address this problem, many imbalanced learning approaches have been proposed in the literature. One of them is SMOTE [9]; it uses an oversampling method to modify the class distribution in a dataset (i.e., the ratio between instances in different classes). It synthesizes new minority class instances to improve the learning ability of machine learning algorithms on the minority class. SMOTE conducts the instance synthesis by means of interpolation between near neighbors. Initially, each instance in the dataset is represented as a feature vector. SMOTE starts by randomly selecting a minority class instance i from the dataset. It determines the k nearest neighbors of i among the remaining instances in the minority class by calculating their distance (e.g., the Euclidean distance) based on their feature vectors. SMOTE then synthesizes new instances using n instances randomly selected from the k neighbors; the selection is random to increase the diversity of the generated instances. For each selected instance, SMOTE computes a “difference vector” representing the difference between the feature vectors of the selected instance and instance i. SMOTE synthesizes a new instance by adding an offset to the feature vector of instance i, where the offset is the product of the difference vector and a random number between 0 and 1. SMOTE stops generating new instances when a predefined condition is satisfied (e.g., the number of instances in the majority and minority classes is the same).

Fig. 3 illustrates the application of SMOTE to create new minority class instances. As shown in the table on the right, instances \(i_1,i_2\), and \(i_3\) belong to the minority class “Optional” of our target field. As a preliminary step, SMOTE computes the Euclidean distance between all the minority instances: \(d({i_1},{i_2})= \sqrt {\left(39-42\right)^2} = 3\), \(d({i_1},{i_3})= \sqrt {\left(39-25\right)^2} = 14\), and \(d({i_2},{i_3})= \sqrt {\left(42-25\right)^2} = 17\). SMOTE starts by randomly picking one instance from the minority class (e.g., \(i_2\)). Assuming that the value of k is equal to 1, SMOTE selects the nearest instance to \(i_2\), which in our example is the instance \(i_1\). To create a new instance \(i_8\), SMOTE computes the \(\mathit {Difference \ vector}\) based on the feature vectors \(\mathit {Monthly \ revenue}_{i_2}\) and \(\mathit {Monthly \ revenue}_{i_1}\) and multiplies it by a random value \(\lambda\) between 0 and 1. The value of the “Monthly revenue” column in the synthetically created instance \(i_8\) is thus equal to \(\mathit {Monthly \ revenue}_{i_2} + \lambda \times \mathit {Difference \ vector}\). In our example, assuming that the value of \(\lambda\) is equal to 0.7, the new value of the “Monthly revenue” field for \(i_8\) is equal to \(42+((39 - 42) * 0.7)= 39.9\), rounded to 40.

Fig. 3.

Fig. 3. An example of SMOTE interpolation.
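SMOTE's interpolation step on the single-feature example of Fig. 3 can be sketched as follows in plain Python. For reproducibility, \(\lambda\) is fixed at 0.7 as in the text instead of being drawn at random:

```python
import random

def smote_interpolate(base, neighbor, lam=None):
    """Create one synthetic instance between a minority instance and a neighbor."""
    if lam is None:
        lam = random.random()  # offset factor in [0, 1)
    return [b + lam * (n - b) for b, n in zip(base, neighbor)]

# Minority-class feature vectors from Fig. 3 ("Monthly revenue" only).
i1, i2, i3 = [39.0], [42.0], [25.0]

# Euclidean distances between the minority instances.
dist = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
assert (dist(i1, i2), dist(i1, i3), dist(i2, i3)) == (3.0, 14.0, 17.0)

# With k = 1, the nearest neighbor of i2 is i1; fix lambda = 0.7 as in the text.
i8 = smote_interpolate(i2, i1, lam=0.7)
print([round(x, 1) for x in i8])  # [39.9], i.e., 42 + 0.7 * (39 - 42)
```

With more than one feature, the same element-wise interpolation applies to every component of the feature vector.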

Skip 4APPROACH Section

4 APPROACH

In this section, we present our machine learning approach for data entry form relaxation, named LACQUER (Learning-bAsed Completeness reQUirEments Relaxation).

As shown in Fig. 4, LACQUER includes three phases: model building, form filling relaxation, and threshold determination. LACQUER preprocesses the historical input instances related to a data entry form and identifies the meaningless values in them. The historical input instances are divided into two parts: historical input instances for training (training input instances) and historical input instances for tuning (tuning input instances), used for threshold determination. In the first phase, LACQUER builds BN models on the preprocessed training input instances to represent the completeness requirement dependencies between form fields. This phase occurs offline, before deploying LACQUER as a completeness requirement relaxation tool for data entry. The form-filling relaxation phase occurs during the data entry session and assumes that all the models have been built. During this phase, given a target field, LACQUER selects the BN model related to the target from all the BN models and predicts the completeness requirement of the target, taking into account the values of the filled fields captured during the form-filling process. To improve prediction accuracy, LACQUER includes an endorser module that seeks to provide users only with predictions whose confidence level is higher than a minimum threshold. The value of the threshold is automatically determined in the threshold determination phase.

Fig. 4.

Fig. 4. Main steps of the LACQUER approach.

LACQUER is inspired by our previous work on automated form filling [5]; the main differences between the two approaches are discussed in Section 6.

4.1 Pre-processing

The first two phases of LACQUER include a preprocessing step to improve the quality of the data in historical input instances as well as the current input instance. As mentioned in Section 2.1, data entry forms can contain fields that are not applicable to certain users; this is the main cause of the presence of missing values and meaningless values in historical input instances. Missing values occur when users skip filling an (optional) field during form filling. A meaningless value is defined as any value filled into a form field that can be accepted during the validation check but does not conform with the semantics of the field. For example, given a data entry form with a textual field “Tax ID,” if a user fills “n/a” in this field, then the value can be accepted during the submission of the instance3; however, it should be deemed meaningless, since “n/a” does not represent an actual “Tax ID.”

For missing values, we replace them with a dummy value “Optional” in the corresponding field. As for the meaningless values, we first create a dictionary containing possible meaningless values based on domain knowledge. This dictionary is used to match possible meaningless values in historical input instances; we replace the matched values with “Optional.” The rationale for this strategy is that it is common practice, within an enterprise, to suggest specific keywords to data entry operators for when a field is not applicable to them. For example, our industrial partner recommends that users fill such fields with special characters such as “@” and “$.” The overarching intuition behind replacing missing values and meaningless values with “Optional” is that, when data entry operators skip filling a field (resulting in a missing value in the form) or put a meaningless value, it usually means that this field is not applicable in the current context.

After detecting missing values and meaningless values, we preprocess the other filled values. For textual fields, we replace all valid values with a dummy value “Required,” reflecting the fact that data entry operators deemed these fields to be applicable. After preprocessing, all values in textual fields are therefore either “Required” or “Optional,” helping the model learn the completeness requirement based on this abstract representation. Numerical fields can be important for deciding the completeness requirement of other fields. For example, companies reaching a certain monthly revenue can have some specific required fields. For this reason, we apply data discretization to numerical fields to reduce the number of unique numeric values. Each numeric value is represented as an interval, which is determined using the widely used discretization method based on information gain analysis [7]. We do not preprocess categorical fields, since they have a finite number of candidate values. We keep the original values of categorical fields, since users who select the same category value may share common required information. Finally, we delete all the fields that are consistently marked as “Required” or “Optional,” because such fields do not provide any discriminative knowledge to the model.

During the data entry session, similar preprocessing steps are applied. We skip values filled in fields that were removed in historical input instances. We replace values in textual fields with “Required” and “Optional,” as described above. We also map numerical values onto intervals and keep values in categorical fields.
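A minimal sketch of these preprocessing rules in plain Python follows. The meaningless-value dictionary and the discretization intervals below are illustrative assumptions; in LACQUER they come from domain knowledge and information-gain-based discretization, respectively:

```python
# Illustrative meaningless-value dictionary (Section 4.1); the real one is
# built from domain knowledge (e.g., "n/a", "@", "$" from the examples).
MEANINGLESS = {"n/a", "na", "-", "@", "$"}

# Hypothetical discretization intervals for "Monthly revenue".
REVENUE_BINS = [(0, 20), (20, 22), (22, float("inf"))]

def preprocess_value(field_type, value):
    """Map one raw field value to its abstract representation."""
    if value is None or str(value).strip().lower() in MEANINGLESS:
        return "Optional"                      # missing or meaningless
    if field_type == "text":
        return "Required"                      # any valid text value
    if field_type == "numeric":                # map onto an interval
        for lo, hi in REVENUE_BINS:
            if lo <= float(value) < hi:
                return f"[{lo}, {hi})"
    return value                               # categorical: keep as-is

print(preprocess_value("text", "Wish"))        # Required
print(preprocess_value("text", "n/a"))         # Optional
print(preprocess_value("numeric", 20))         # [20, 22)
print(preprocess_value("categorical", "NPO"))  # NPO
```

The same function can be applied both to historical input instances and to values filled during the data entry session.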

The historical input instances are then divided into two parts that will be used separately for training (training input instances) and for threshold determination (tuning input instances).

Application to the running example.

Fig. 5 shows an example of historical input instances collected from the data entry form presented in Fig. 1. During the preprocessing phase, LACQUER identifies meaningless values in different fields (e.g., “n/a” and “@”) and replaces them with the dummy value Optional. For the remaining “meaningful” values, LACQUER replaces values in the textual field “Company name” with the dummy value Required; values in the field “Monthly revenue” are discretized into intervals. In addition to historical input instances, LACQUER also preprocesses the input instance filled during the data entry session. For example, as shown in Fig. 1, a user fills values Wish, 20, NPO, and Education in fields “Company name,” “Monthly revenue,” “Company type,” and “Field of activity,” respectively. LACQUER will replace the value filled in the field “Company name” with “Required,” since it is a meaningful value. LACQUER also maps the value in the field “Monthly revenue” onto the interval [20, 22).

Fig. 5.

Fig. 5. Example of pre-processed historical input instances.

4.2 Model Building

The model building phase aims to learn the completeness requirement dependencies between different fields from training input instances related to a data entry form.

During the data entry session, we consider the filled fields as features to predict the completeness requirement of the target field (i.e., optional or required). However, in our previous work [5], we have shown that, in an extreme scenario, users could follow any arbitrary order to fill the form, resulting in a large set of feature-target combinations. For example, given a data entry form with n fields, when we consider one of the fields as the target, we can get a total of up to \(2^{n-1}-1\) feature (i.e., filled fields) combinations. Under the assumption that a machine learning model must be trained and tested on identical features and targets [14], a model would need to be trained for each feature-target combination, leading to an impractically large number of models.

To deal with this problem, we select BNs as the machine learning models to capture the completeness requirement dependencies between filled fields and the target field, without training models on specific combinations of fields. As already discussed in our previous work [5], the reason is that BNs can infer the value of a target field using only information in the filled fields and the related PGM (see Section 3.1); BNs automatically deal with the missing conditionally independent variables (i.e., unfilled fields).

In this work, LACQUER learns the BN structure representing the completeness requirement dependencies from training input instances. Each field in the data entry form represents a node (random variable) in the BN structure; the edges between different nodes are the dependencies between different fields. To construct the optimal network structure, BN performs a search-based optimization based on the conditional probabilities of the fields and a fitness function. As in our previous work [5], we use hill climbing as the optimizer to learn the BN structure with a fitness function based on BIC [40].

Algorithm 1 illustrates the main steps of this phase. LACQUER takes as input a set of preprocessed historical input instances \(I^F(t)_{\mathit {train}}^\prime\) for training and learning the completeness requirement dependencies (e.g., the input instances in block of Fig. 6). Initially, for each field \(f_i\) in the list of fields extracted from \(I^F(t)_{\mathit {train}}^\prime\) (line 2), we create a temporary training set where we consider the field \(f_i\) as the target (line 4). Since we aim to predict whether the target field is required or optional during form filling, in the temporary training set, we keep the value “Optional” in the target field \(f_i\) and label other values as “Required” (block in Fig. 6). These two values are the classes according to which to predict \(f_i\).

Fig. 6.

Fig. 6. Workflow of the model building phase.

However, we may not be able to train effective classification models directly on this temporary training set, because of the imbalanced nature of input instances across classes. Users commonly enter correct and meaningful values during form filling; they only fill meaningless values in certain cases. As a result, the number of input instances having meaningless values (i.e., in the “Optional” class) is usually smaller than the number of input instances in the “Required” class. This can make the learning process inaccurate [29], since machine learning models may consider the minority class as noise [43]. The trained models could also over-classify the majority class due to its increased prior probability [29]. For example, in block of Fig. 6, considering that the column “\(f_5\): Tax ID” is the current target, the number of instances in class “Required” is three, which is higher than the single instance in class “Optional.” If we train a model on such an imbalanced dataset, then it might be difficult to learn the conditions (or dependencies) to relax this field as optional, due to the small number of “Optional” instances.

To solve this problem, we apply SMOTE (line 5) on the temporary training set \(\mathit {train}_{f_{i}}\) to generate an oversampled training set \(\mathit {train}_{f_{i}}^{\mathit {oversample}}\) (as shown in block in Fig. 6). After oversampling, both classes have the same number of input instances. We train a BN model \(M_i\) based on the oversampled training set for the target field \(f_i\) (line 6). For example, block in Fig. 6 represents the model built for the target field “Tax ID.” Following this step, we can obtain a BN model for each field. We save all the BN models in the dictionary \(\mathcal {M}\) (line 7), where the key represents the name of the field and the value is the corresponding trained BN model. The output of Algorithm 1 is the dictionary \(\mathcal {M}\).
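The per-target loop of Algorithm 1 can be sketched as follows. Here `oversample` and `train_bn` are hypothetical stand-ins for SMOTE and BN structure learning, injected as callbacks so the skeleton stays self-contained; they are not LACQUER's actual implementation:

```python
def build_models(train_instances, fields, oversample, train_bn):
    """Sketch of Algorithm 1: build one model per target field.

    oversample and train_bn are stand-ins for SMOTE and BN training.
    """
    models = {}
    for target in fields:
        # Relabel the target column (Fig. 6(B)): keep "Optional" values,
        # map every other value to "Required".
        tmp = [{**inst,
                target: "Optional" if inst[target] == "Optional" else "Required"}
               for inst in train_instances]
        balanced = oversample(tmp, target)           # rebalance the two classes
        models[target] = train_bn(balanced, target)  # fit one model per target
    return models

# Toy run with trivial stand-ins: identity "oversampling" and a "model"
# that just counts class frequencies for the target field.
instances = [{"f1": "Required", "f5": "Optional"},
             {"f1": "Required", "f5": "123456"},
             {"f1": "Optional", "f5": "789012"}]
counts = lambda data, t: {c: sum(r[t] == c for r in data)
                          for c in ("Required", "Optional")}
models = build_models(instances, ["f1", "f5"], lambda d, t: d, counts)
print(models["f5"])  # {'Required': 2, 'Optional': 1}
```

The returned dictionary plays the role of \(\mathcal {M}\): one entry per field, keyed by field name.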

Application to the running example.

Given the preprocessed training input instances shown in block in Fig. 6, LACQUER creates a temporary training set for each target (e.g., the field “Tax ID”), where LACQUER replaces the meaningful and meaningless values of the target field to Required and Optional, respectively (in block ). The temporary training set is oversampled using SMOTE to create a balanced training set where the number of instances of both Required and Optional classes is the same (block of Fig. 6). This oversampled training set is used to train a BN model for the target field “Tax ID.” An example of the trained BN model is presented in block in Fig. 6. After the model building phase, LACQUER outputs a model for each target. For the example of training input instances related to Fig. 1, LACQUER returns five distinct models where each model captures the completeness requirement dependencies for a given target.

4.3 Form-filling Relaxation

The form-filling relaxation phase is an online phase that occurs during the data entry session. In this phase, LACQUER selects the model \(M_{p}\) \(\in \mathcal {M}\) corresponding to the target field \(f_p\). This model is then used to predict the completeness requirement of the target \(f_p\) based on the filled fields \(C_{t}\). The main steps are shown in Algorithm 2.

The inputs of the algorithm are the dictionary of trained models \(\mathcal {M}\), the set \(C_{t}\) representing the filled fields during the entry session and their values, the target field \(f_p\), and the endorsing threshold \(\theta _p\) for \(f_p\). The algorithm starts by applying the preprocessing techniques outlined in Section 4.1 to the set of the filled fields in \(C_{t}\) to obtain a new set of preprocessed filled fields \(C_{t}^{\prime }\) (line 1). LACQUER then selects the model \(M_p\) from \(\mathcal {M}\) (line 2), since this model is trained for the target field \(f_p\) based on the oversampled data with a balanced number of instances for each class of \(f_p\). With the selected model, LACQUER predicts the completeness requirement for \(f_p\) (line 3) and gets the top-ranked completeness requirement based on the prediction probability (line 4).

Endorsing.

During the data entry session, values in filled fields do not always provide enough knowledge for the model to accurately predict the completeness requirement of a given target field. This happens because when training BN models, there may not be enough information in the training input instances to learn the dependencies among some fields with specific values.

However, in the context of form-filling relaxation, it is important to provide accurate completeness requirement suggestions. On the one hand, wrongly predicting optional fields as required adds more constraints to the data entry form; users will be obliged to fill fields with meaningless values. On the other hand, wrongly predicting a required field as optional can result in missing information. To prevent this situation, LACQUER includes a heuristic-based endorser module that decides if the predicted completeness requirement is correct or not. Since our main goal is to relax the completeness requirement by predicting when a required field should be optional, we mainly use the endorser to endorse the prediction where the target field is predicted as “Optional.” If the prediction is endorsed, we set the field to “Optional”; otherwise, we use its previous setting (“Required”).

Specifically, LACQUER checks if the top-ranked completeness requirement is equal to “Optional”; it saves the result in the Boolean flag \(\mathit {checkOP}_p\) (line 5). If the value of \(\mathit {checkOP}_p\) evaluates to true, then LACQUER analyzes the probability distribution of the predicted completeness requirement of the target field, since it reflects whether LACQUER has enough “confidence” in the prediction based on the current information. We check if the probability for the field to be “Optional” is lower than a threshold \(\theta _p\) for target \(f_p\) (line 7), saving the result in the Boolean flag checkProb. If the value of checkProb evaluates to true, then we change the value of the Boolean flag \(\mathit {checkOP}_p\) to false (line 9), since it implies that the model does not have enough “confidence” for variable inference and prediction; otherwise, LACQUER keeps the prediction as “Optional.” The threshold \(\theta _p\) is automatically determined and its value differs across targets (as discussed in Section 4.4). We use different threshold values because the predictions are made by models trained on different data, and the variance of the data can have a significant effect on prediction accuracy [53].
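The endorsing heuristic amounts to a simple confidence check on the predicted class distribution. A sketch with a hypothetical `endorse` function (not LACQUER's actual API):

```python
def endorse(prediction_probs, theta):
    """Decide the final completeness requirement for a target field.

    prediction_probs: dict mapping class name to predicted probability.
    theta: per-field endorsing threshold (Section 4.4).
    Returns "Optional" only when the model predicts it with enough
    confidence; otherwise falls back to the original "Required" setting.
    """
    top = max(prediction_probs, key=prediction_probs.get)
    check_op = (top == "Optional")                 # line 5 of Algorithm 2
    if check_op and prediction_probs["Optional"] < theta:
        check_op = False                           # lines 7-9: not confident
    return "Optional" if check_op else "Required"

# Running example: "Tax ID" predicted Optional with probability 0.80,
# endorsing threshold 0.70 -> the relaxation is endorsed.
print(endorse({"Optional": 0.80, "Required": 0.20}, theta=0.70))  # Optional
print(endorse({"Optional": 0.60, "Required": 0.40}, theta=0.70))  # Required
```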

Application to the running example.

Fig. 7 shows the process of predicting the completeness requirement of the field “Tax ID” based on the input values in Fig. 1. LACQUER first selects the model related to the current target for prediction (block in Fig. 7). Let us assume that, based on the BN variable inference, LACQUER predicts that field “Tax ID” has a probability of 0.80 to be Optional. Since the top predicted value is Optional, LACQUER activates the endorser module (block in Fig. 7) to decide whether the level of confidence is acceptable. For example, let us assume the automatically decided threshold value for field “Tax ID” is 0.70 (i.e., \(\theta _{\mathit {tax ID}}\)=0.70). Since the probability value of the “Optional” class (0.80) is higher than this threshold, the Boolean flag \(\mathit {checkOP}_\mathit {TaxID}\) remains true. LACQUER decides to set the field “Tax ID” to Optional.

Fig. 7.

Fig. 7. Workflow for form relaxation phase.

4.4 Endorser Threshold Determination

We automatically determine the value of the threshold in the endorser module for each target. This step occurs offline and assumes that the models in \(\mathcal {M}\) built during the model building phase are available. The threshold \(\theta _i\) for the target field i is determined with the set of preprocessed tuning input instances. The basic idea is that for each historical input instance in this subset, we consider all fields except field i to be filled and use the model trained for field i to predict its completeness requirement with different values of \(\theta _i\). We determine the value of \(\theta _i\) based on the value that achieves the highest prediction accuracy on tuning input instances.

The main steps are shown in Algorithm 3. The algorithm takes as input the set of preprocessed historical input instances for tuning \(I^F(t)_{\mathit {tune}}^\prime\) and the trained models \(\mathcal {M}\). For each field \(f_i\) in the list of fields extracted from \(I^F(t)_{\mathit {tune}}^\prime\) (line 2), we generate a temporary dataset \(I^F(t)_{\mathit {tune}_i}^\prime\) where the value of field \(f_i\) is transformed into “Optional” and “Required” using the method presented in Fig. 6(B) (line 5). We select the model corresponding to \(f_i\) from \(\mathcal {M}\) (line 6) and use the selected model to predict the completeness requirement of field \(f_i\) based on the values of other fields in \(I^F(t)_{\mathit {tune}_i}^\prime\) (line 8). While predicting, we try different thresholds, varying from 0 to 1 with a step equal to 0.05. For each threshold value, we compare the predicted completeness requirement with the actual completeness requirement of field \(f_i\) in each input instance of \(I^F(t)_{\mathit {tune}_i}^\prime\) to calculate the prediction accuracy (line 9). LACQUER selects the value of \(\theta _i\) that achieves the highest prediction accuracy value in \(I^F(t)_{\mathit {tune}_i}^\prime\) as the threshold for \(f_i\) (line 11). The algorithm ends by returning a dictionary containing the thresholds of all fields.
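The grid search of Algorithm 3 can be sketched as follows. `predict_optional_prob` is a hypothetical model callback standing in for the trained BN, and the accuracy computation stands in for line 9:

```python
def best_threshold(tuning_instances, predict_optional_prob, step=0.05):
    """Pick the endorsing threshold that maximizes accuracy on tuning data.

    tuning_instances: list of (filled_fields, true_label) pairs, where
    true_label is "Optional" or "Required".
    predict_optional_prob: callback returning P(Optional | filled_fields).
    """
    candidates = [round(i * step, 2) for i in range(int(1 / step) + 1)]
    best, best_acc = 0.0, -1.0
    for theta in candidates:                  # try thresholds 0, 0.05, ..., 1
        correct = 0
        for filled, label in tuning_instances:
            p_opt = predict_optional_prob(filled)
            predicted = "Optional" if p_opt >= theta else "Required"
            correct += (predicted == label)
        acc = correct / len(tuning_instances)
        if acc > best_acc:                    # keep the best-performing theta
            best, best_acc = theta, acc
    return best

# Toy model: pretend P(Optional) is stored with each instance.
tuning = [({"p": 0.9}, "Optional"), ({"p": 0.8}, "Optional"),
          ({"p": 0.6}, "Required"), ({"p": 0.3}, "Required")]
theta = best_threshold(tuning, lambda filled: filled["p"])
print(theta)  # 0.65: the smallest threshold separating the two classes
```

Running this per target field yields the dictionary of thresholds returned by Algorithm 3.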

Skip 5EVALUATION Section

5 EVALUATION

In this section, we report on the evaluation of our approach for automated completeness requirement relaxation. First, we assess the overall accuracy of LACQUER when predicting the completeness requirement of fields in data entry forms and compare it with state-of-the-art baselines. We then assess the performance of LACQUER, in terms of training time and prediction time, for practical applications. Last, we perform an ablation study to evaluate how the use of SMOTE (in the model building phase) and the heuristic-based endorser (in the form-filling relaxation phase) affects the accuracy of LACQUER.

More specifically, we evaluated LACQUER by answering the following research questions (RQs): RQ1 Can LACQUER provide accurate predictions for completeness requirement relaxation, and how does it compare with baseline algorithms? RQ2 Is the performance of LACQUER, in terms of training and prediction time, suitable for practical applications? RQ3 What is the impact of using SMOTE and the heuristic-based endorser on the effectiveness of LACQUER?

5.1 Dataset and Settings

Datasets.

We selected the datasets used for the evaluation of LACQUER according to the following criteria: (1) data should be collected from a real data entry form; (2) the form fields should have different completeness requirements (i.e., required and optional); and (3) the data entry form should have obsolete required fields, where users could use meaningless values to pass the validation checks.

We identified two datasets meeting these criteria: one publicly available in the biomedical domain (dubbed NCBI) and another proprietary dataset, extracted from a production-grade enterprise information system, provided by our industrial partner (dubbed PEIS). Each dataset consists of data collected from one real-world data entry form.

Other datasets used in related work on adaptive data entry forms (see also Section 6) were either not mentioned [19, 50], unavailable [6], or confidential [49]. In addition, we also analyzed datasets from surveys conducted in countries with transparency policies (e.g., the USA “National Survey on Drug Use and Health” [8]). However, these surveys do not contain a detailed specification defining the completeness requirement of each field and thus the corresponding dataset does not meet our selection criterion #2.

Both datasets are represented by a data table where each row corresponds to an input instance filled by a user and each column represents a specific field in the data entry form; an input instance represents all the field values as submitted by a user.

The NCBI dataset is composed of metadata for diverse types of biological samples from multiple species [4]; it has been used in previous work on automated form filling [5, 33]. This dataset provides the design of the corresponding data entry form in the CEDAR workbench [24] with the list of completeness requirements for different fields. Following the evaluation methodology described in previous work [33], we considered a specific subset from the NCBI dataset related to the species “Homo sapiens” for evaluation. We downloaded the dataset from the official NCBI website.4

As shown in Table 1, the NCBI dataset contains 235,538 instances5 and has 26 fields, 6 of which are required. These 6 fields are always required and are not subject to any additional conditions. We identify the meaningless values in the required fields using the strategy presented in Section 4.1, i.e., matching the actual value in the data against the dictionary of meaningless values obtained from domain knowledge. In Table 1, next to each field, we indicate the ratio of instances having missing or meaningless values. The ratio of meaningless values6 varies from 0.1 (for biomaterial-provider) to 0.543 (for age). The case where the ratio of meaningless values is equal to 0 (i.e., sample-name) represents the situation in which the field was correctly filled for all the instances in the dataset.

Table 1.
Dataset | # of fields | # of instances | # of required fields | Name of required fields (% of missing and meaningless values)
NCBI | 26 | 235,538 | 6 | sample-name (0), tissue (0.130), isolate (0.351), sex (0.351), biomaterial-provider (0.1), age (0.543)
PEIS | 33 | 73,082 | 19 | legal name (0), contact name (0), first name (0.113), place of birth (0.127), native country (0.127), status (0), year of study (0.94), function (0), employer name (0.35), name of school/university (0.84), type of contract (0), contract start date (0.668), date of end of contract (0.974), field of activity (0), code NACE (0.123), primary activity (0), country of activity (0), percentage of activity (0)

Table 1. Information about the Fields in the Datasets

Based on the ratio of meaningless values in Table 1, we find that the number of instances for meaningless and valid values is imbalanced for most of the fields. For example, the ratio of meaningless values for tissue is 0.130. The field age has more meaningless values with a ratio of 0.543. The reason for this relatively high ratio could be that the completeness requirement (i.e., “Required”) of this field does not conform with the actual need in the real world; that is, the field age is not required when the actual concept of “age” does not apply to a certain type of biomaterial (e.g., for protein TM-1 [47]).

The PEIS dataset contains the data filled through the web-based data entry form during the process of creating a new customer account. The dataset was extracted from the database of our industrial partner. Similar to the NCBI dataset, each row in the table represents an instance and each column represents a form field. We identified the mapping between column names in the table and field names in the data entry form using the available software documentation.

As shown in Table 1, the PEIS dataset has 33 fields, 19 of which are required (including conditionally required ones). In this dataset, 9 of the required fields do not have missing/meaningless values (i.e., the ratio of meaningless values is 0). For the rest of the fields, the ratio of instances with missing or meaningless values ranges from 0.113 to 0.974. The reason behind the high ratio of meaningless values in some fields is that those fields are conditionally required: they are rarely required in real scenarios, which leads to many missing values.

Dataset Preparation.

For the two datasets, we consider all the required fields as targets, since we aim to learn the conditions to relax them as optional (to avoid meaningless values and improve the overall data quality). However, we do not consider fields where the ratio of missing and meaningless values is 0, as they have no relaxation conditions to learn. We split each dataset into three subsets containing 80%, 10%, and 10% of input instances based on their submission time, used respectively for training, tuning, and testing. The input instances (excluding submission time) in the training set are used to train LACQUER. The tuning set is used to determine the endorser threshold for each field following the strategy explained in Section 4.4.
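The time-ordered split can be sketched as follows; the field name `submitted_at` is an assumption for illustration, not a column name from the actual datasets:

```python
def time_ordered_split(instances, train=0.8, tune=0.1):
    """Split input instances into train/tune/test subsets by submission time."""
    ordered = sorted(instances, key=lambda inst: inst["submitted_at"])
    n = len(ordered)
    n_train = int(n * train)
    n_tune = int(n * tune)
    return (ordered[:n_train],                       # oldest 80%: training
            ordered[n_train:n_train + n_tune],       # next 10%: tuning
            ordered[n_train + n_tune:])              # newest 10%: testing

# Ten toy instances with increasing timestamps -> an 8 / 1 / 1 split.
data = [{"submitted_at": t, "id": t} for t in range(10)]
train_set, tune_set, test_set = time_ordered_split(data)
print(len(train_set), len(tune_set), len(test_set))  # 8 1 1
```

Splitting by time (rather than randomly) avoids training on instances submitted after the ones used for testing.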

As for the testing input instances, since there is no information on the actual form-filling order, we simulated two form-filling orders for data entry, including “sequential filling” and “partial random filling.”

The former corresponds to filling data entry forms in the default order, as determined by the form tab sequence, e.g., the navigation order determined by the HTML attribute tabindex in web UI designs [18]. It simulates the logical order many users follow to fill out forms, especially when they use a keyboard to navigate form fields [35]. The latter represents the case where designers group some semantically related fields together and add controls to force users to fill a group of fields sequentially [10]; outside these groups, users can fill fields randomly.

We simulated partial random filling by randomly generating a field order for each testing input instance while respecting the sequential order of the fields in the same group. When there is no grouping or controls in the form, the partial random filling scenario reduces to a (complete) random filling scenario. The reason to simulate the partial random filling scenario is that, by capturing the fields’ grouping information, it is more realistic than a (complete) random filling scenario.
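One way to generate such a partial random order, keeping grouped fields contiguous and in their sequential order, is sketched below (an illustrative procedure; the paper's exact simulation may differ):

```python
import random

def partial_random_order(fields, groups, rng=random):
    """Generate a filling order where grouped fields stay contiguous and ordered.

    fields: all form fields, in tab order.
    groups: list of field groups that must be filled sequentially.
    """
    grouped = {f for g in groups for f in g}
    # Treat each group as one unit; ungrouped fields are units of their own.
    units = [list(g) for g in groups] + [[f] for f in fields if f not in grouped]
    rng.shuffle(units)                       # randomize only across units
    return [f for unit in units for f in unit]

fields = ["f1", "f2", "f3", "f4", "f5"]
order = partial_random_order(fields, groups=[["f3", "f4"]])
print(order)  # e.g., ['f1', 'f3', 'f4', 'f2', 'f5']; f3 always just before f4
```

With `groups=[]`, every field is its own unit and the procedure degenerates into complete random filling.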

In both form filling scenarios, the filled fields considered by LACQUER are the fields that precede each target. For each target field, we labeled as “Optional” the instances in which the target field contains missing or meaningless values; otherwise, they are labeled as “Required.” “Optional” and “Required” are the two classes that we consider as ground truth.

Dataset Preparation - Application Example.

Fig. 8 illustrates an example application of our dataset preparation method. The table on the left-hand side of the picture represents the information submitted during the data entry session through the data entry form introduced in our motivating example in Section 2.3. We split this dataset into a training set (80% of instances), a tuning set (10% of instances), and a testing set (10% of instances); let us assume the last row in the table is an instance in the testing set. The testing set is then processed to simulate the two form-filling scenarios. The sequential filling scenario uses the filling order following the tabindex value of the form fields. Assuming the tabindex order for the example is \(f_1\rightarrow f_2 \rightarrow f_3 \rightarrow f_4 \rightarrow f_5\), we can generate two test instances S1 and S2 (shown in the top right box of Fig. 8) to predict the completeness requirement of \(f_2\) and \(f_5\), respectively. The partial random filling scenario takes into account the controls or grouping of fields specified by the designer. For example, let us assume that “\(f_3\): company type” and “\(f_4\): field of activity” belong to the same group of fields named “Business activities”: this means that \(f_3\) and \(f_4\) should be filled sequentially. A possible filling order, randomly generated taking into account this constraint, is then \(f_1 \rightarrow (f_3 \rightarrow f_4) \rightarrow f_2 \rightarrow f_5\). The bottom right box in the figure shows the corresponding generated test instances PR1 and PR2.


Fig. 8. Example of filling orders.

Implementation and Settings.

We implemented LACQUER as a Python program. We performed the experiments on the NCBI dataset on a computer running macOS 10.15.5 with a 2.30 GHz Intel Core i9 processor and 32 GB of memory. The experiments on the PEIS dataset were performed on a server running CentOS 7.8 with a 2.60 GHz Intel Xeon E5-2690 processor and 125 GB of memory.

5.2 Effectiveness (RQ1)

To answer RQ1, we assessed the accuracy of LACQUER in predicting the correct completeness requirement for each target field in the dataset. To the best of our knowledge, there are no implementations of techniques for automatically relaxing completeness requirements; therefore, we selected as baselines two rule-based algorithms that can be used to solve the form-filling completeness requirement relaxation problem: association rule mining (ARM) [33] and repeated incremental pruning to produce error reduction (Ripper); these rule-based algorithms can provide form-filling relaxation suggestions under different filling orders. ARM mines association rules of the form “if antecedent then consequent” with a minimum level of support and confidence, where the antecedent includes the values of certain fields and the consequent indicates the completeness requirement of a target field for that antecedent. ARM matches the filled fields with the antecedents of the mined association rules and suggests the consequent of the matched rules. Ripper is a propositional rule-based classification algorithm [11]; it creates a rule set by progressively adding rules to an empty set until all the positive instances are covered [31]. Ripper also includes a pruning phase to remove rules leading to bad classification performance; it has been used in a variety of classification tasks in software engineering [23, 48]. Similar to ARM, Ripper suggests the consequent of the matched rule to users.

Methodology. We used Precision (\(\mathit {Prec}\)), Recall (\(\mathit {Rec}\)), Negative Predictive Value (\(\mathit {NPV}\)), and Specificity (\(\mathit {Spec}\)) to assess the accuracy of different algorithms. These metrics can be computed from a confusion matrix that classifies the prediction results into true positive (TP), false positive (FP), true negative (TN), and false negative (FN). In our context, TP means that a field is correctly predicted as required, FP means that a field is misclassified as required, TN means that a field is correctly predicted as optional, and FN means that a field is misclassified as optional. Based on the confusion matrix, we have \(\mathit {Prec}=\frac{\mathit {TP}}{\mathit {TP}+\mathit {FP}}\), \(\mathit {Rec}=\frac{\mathit {TP}}{\mathit {TP}+\mathit {FN}}\), \(\mathit {NPV}=\frac{\mathit {TN}}{\mathit {TN}+\mathit {FN}}\), and \(\mathit {Spec}=\frac{\mathit {TN}}{\mathit {TN}+\mathit {FP}}\). Precision is the ratio of correctly predicted required fields over all the fields predicted as required. Recall is the ratio of correctly predicted required fields over the number of actual required fields. NPV represents the ratio of correctly predicted optional fields over all the fields predicted as optional. Finally, specificity represents the ratio of correctly predicted optional fields over the number of actual optional fields.
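The four metrics follow directly from the confusion-matrix counts; the counts below are illustrative only:

```python
# Compute Prec, Rec, NPV, and Spec from confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    return {
        "Prec": tp / (tp + fp),  # correctly predicted required / predicted required
        "Rec":  tp / (tp + fn),  # correctly predicted required / actual required
        "NPV":  tn / (tn + fn),  # correctly predicted optional / predicted optional
        "Spec": tn / (tn + fp),  # correctly predicted optional / actual optional
    }

m = metrics(tp=80, fp=20, tn=30, fn=10)  # illustrative counts
print(m["Prec"], m["NPV"])  # → 0.8 0.75
```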

We chose these metrics because they can evaluate the ability of an algorithm in predicting both required fields (using precision and recall) and optional fields (using NPV and specificity). A high value of precision and recall means that an algorithm can correctly predict most of required fields (i.e., the positive class); hence, we can avoid business loss caused by missing information. A high value of NPV and specificity means that an algorithm can correctly predict most of the optional fields (i.e., the negative class); users will have fewer unnecessary constraints during form filling. In other words, we can avoid users filling meaningless values that may affect the data quality.

In our application scenario, we aim to relax a set of obsolete required fields to “optional,” while keeping the truly required fields. Therefore, LACQUER needs to achieve high precision and recall values, to preserve most of the truly required fields and avoid business loss. Meanwhile, the NPV value should be high, meaning that LACQUER can correctly avoid users filling meaningless values when relaxing the completeness requirements. Concerning specificity, even a relatively low value is still useful: for instance, a specificity value of 50% means that LACQUER can halve the data quality issues caused by meaningless values.

In the case of ARM, we set the minimum acceptable support and confidence to 5 and 0.3, respectively, as done in previous work [5, 33] in which it was applied in the context of form filling.
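These two thresholds can be illustrated with a toy rule filter; we assume here, as the integer value 5 suggests, that support is an absolute rule count, and the helper function and counts are hypothetical:

```python
# Toy sketch of ARM's rule filter with minimum support 5 and
# minimum confidence 0.3.
def keep_rule(rule_count: int, antecedent_count: int,
              min_support: int = 5, min_confidence: float = 0.3) -> bool:
    """rule_count: instances matching antecedent AND consequent;
    antecedent_count: instances matching the antecedent alone."""
    support = rule_count
    confidence = rule_count / antecedent_count
    return support >= min_support and confidence >= min_confidence

print(keep_rule(rule_count=6, antecedent_count=20))  # 6 >= 5 and 0.3 >= 0.3
print(keep_rule(rule_count=4, antecedent_count=20))  # rejected: support 4 < 5
```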

Results.

Table 2 shows the accuracy of the various algorithms for the two form-filling scenarios. LACQUER substantially outperforms Ripper in terms of precision and recall scores (columns Prec and Rec) for both the sequential filling and partial random filling scenarios on both datasets (from +13 pp to +32 pp in precision and from +15 pp to +35 pp in recall). When we compare LACQUER with ARM, the two have similar precision and recall scores on the NCBI dataset; however, LACQUER performs much better than ARM on the PEIS dataset (by at least +16 pp in precision and +17 pp in recall).

Table 2.

(Seq = sequential filling; PR = partial random filling)

| Dataset | Alg. | Prec (Seq) | Rec (Seq) | NPV (Seq) | Spec (Seq) | Prec (PR) | Rec (PR) | NPV (PR) | Spec (PR) | Train (s) | Predict avg (ms) | Predict min–max (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NCBI | Ripper | 0.63 | 0.79 | 0.17 | 0.20 | 0.69 | 0.83 | 0.25 | 0.16 | 349.29 | 0.18 | 0.18–0.19 |
| NCBI | ARM | 0.75 | 0.98 | 0.81 | 0.16 | 0.82 | 0.86 | 0.39 | 0.28 | 11.98 | 5.06 | 3–12 |
| NCBI | LACQUER | 0.76 | 0.98 | 0.91 | 0.20 | 0.84 | 0.98 | 0.76 | 0.37 | 145.98 | 75.83 | 33–144 |
| PEIS | Ripper | 0.66 | 0.73 | 0.60 | 0.29 | 0.58 | 0.62 | 0.84 | 0.56 | 240.37 | 0.24 | 0.15–0.54 |
| PEIS | ARM | 0.72 | 0.80 | 0.24 | 0.24 | 0.72 | 0.80 | 0.25 | 0.25 | 153.78 | 1.59 | 2–20 |
| PEIS | LACQUER | 0.88 | 0.98 | 0.72 | 0.62 | 0.90 | 0.97 | 0.75 | 0.64 | 1,210.70 | 307 | 179–839 |

Table 2. Effectiveness for Form-filling Relaxation

When looking at the NPV and specificity scores, on the NCBI dataset LACQUER and Ripper have the same specificity value for sequential filling; however, LACQUER provides more accurate suggestions, since it outperforms Ripper in terms of NPV score with an improvement of +74 pp. Concerning the partial random filling scenario on the NCBI dataset, LACQUER outperforms Ripper by +51 pp and +21 pp in terms of NPV and specificity scores, respectively. On the same dataset, when comparing LACQUER with ARM, the results show that LACQUER always outperforms ARM for both form-filling scenarios, from +10 pp to +37 pp in terms of NPV score and from +4 pp to +9 pp in terms of specificity score. As for the PEIS dataset, for sequential filling LACQUER substantially outperforms the two baselines, from +12 pp to +48 pp in terms of NPV score and from +33 pp to +38 pp in terms of specificity score. For partial random filling, Ripper achieves the highest NPV score, outperforming LACQUER by +9 pp; however, LACQUER outperforms both baselines in terms of specificity score by +8 pp to +39 pp.

Looking at the specificity score when applying LACQUER to the PEIS and NCBI datasets, we notice a difference ranging from +27 pp to +42 pp. This difference means that LACQUER can find more optional values in the PEIS dataset than in the NCBI dataset. We believe the main reason behind this difference is the quality of the training set. Recall that PEIS is a proprietary dataset from the banking domain. Data entry operators in the bank follow corporate guidelines on the recommended values to use when a field is not applicable, e.g., special characters such as “@” or “$” (see Section 4.1), resulting in higher-quality data than the NCBI dataset. The latter, in fact, is a public dataset where anyone can submit data using the corresponding data entry form; users do not follow any rule for inserting special values when a field is not applicable. For this reason, the endorser module of LACQUER tends to remove more likely-inaccurate suggestions, predicting fields as optional only with high confidence. This explains the high NPV value on the NCBI dataset, which is +19 pp higher than that on the PEIS dataset for the sequential filling scenario and +1 pp higher for the partial random filling scenario.

We applied Fisher’s exact test with a level of significance \(\alpha = 0.05\) to assess the statistical significance of the differences between LACQUER and the baselines. The null hypothesis is that there is no significant difference between the prediction results of LACQUER and those of a baseline algorithm on the test instances. Given the output of each algorithm on the test instances used during our evaluation, we created contingency tables summarizing the decisions of LACQUER vs. ARM and of LACQUER vs. Ripper for each form-filling scenario. Each contingency table captures the relationship between LACQUER and a baseline in terms of frequency counts of the possible outputs (0: “Optional”; 1: “Required”); that is, it counts the number of times both algorithms provide the same prediction (both predict a test instance as 0 or as 1) and the number of times their predictions differ (one algorithm predicts 1 while the other predicts 0, and vice versa). These contingency tables are then used by Fisher’s exact test to compute the p-value for rejecting or accepting the null hypothesis. The result of the statistical test shows that LACQUER always achieves a significantly higher number of correct predictions than the baselines for the two form-filling scenarios on both datasets (p-value < 0.05).
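Such a test can be run with scipy; the counts below are purely hypothetical and serve only to illustrate the shape of the 2×2 contingency table:

```python
# Sketch of the significance test on a 2x2 contingency table of
# agreement/disagreement counts between LACQUER and a baseline.
from scipy.stats import fisher_exact

#                  baseline: Required  baseline: Optional
table = [[520, 35],   # LACQUER: Required (hypothetical counts)
         [12, 133]]   # LACQUER: Optional
odds_ratio, p_value = fisher_exact(table)
print(p_value < 0.05)  # reject the null hypothesis at alpha = 0.05?
```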

These results have to be interpreted with respect to the usage scenario of a form-filling relaxation tool. Incorrect suggestions can affect the use of data entry forms and the quality of input data. The NPV and specificity values achieved by LACQUER show that its suggestions can help users correctly relax the completeness requirements of 20% to 64% of the fields in data entry forms. Meanwhile, LACQUER correctly preserves most (\(\ge 97\%\)) of the fields that are required to be filled, avoiding missing information (as indicated by the high precision and recall scores).

The answer to RQ1 is that LACQUER performs significantly better than the baseline algorithms. LACQUER can correctly relax at least 20% of required fields (with an NPV value above 0.72), while preserving the completeness constraints on most of the truly required fields (with a recall of at least 0.97 and a precision of at least 0.76).

5.3 Performance (RQ2)

To answer RQ2, we measured the time needed to perform the training and predict the completeness requirement of target fields. The training time evaluates the ability of LACQUER to efficiently update its models when new input instances are added daily to the set of historical input instances. The prediction time evaluates the ability of LACQUER to timely suggest the completeness requirement during the data entry phase.

Methodology.

We used the same baselines and form-filling scenarios as for RQ1. The training time represents the time needed to build BN models (for LACQUER) or to learn rules (for ARM and Ripper). The prediction time is the average time needed to provide suggestions for target fields. We deployed LACQUER and the baselines locally to avoid the impact of data transmission time when measuring the prediction time.

Results.

The results are presented in columns Train and Predict in Table 2. Column Train represents the training time in seconds. Column Predict contains two sub-columns representing the average time and the minimum/maximum time (in milliseconds) needed to make a prediction on one test instance.

As shown in Table 2, Ripper has the highest training time on the NCBI dataset, with 349.29 s. The training time of LACQUER (145.98 s) lies between those of Ripper (349.29 s) and ARM (11.98 s) on the NCBI dataset. For the PEIS dataset, the training times of Ripper and ARM are 240.37 s and 153.78 s, respectively; the training time of LACQUER is the highest: 1,210.70 s (about 20 minutes).

In terms of prediction time, LACQUER takes longer than ARM and Ripper to predict the completeness requirement of a field. On average, LACQUER takes 75.83 ms and 307 ms on the NCBI and PEIS datasets, respectively. The prediction time of ARM and Ripper depends on the number of rules used for matching the filled fields: The smaller the number of rules, the shorter the prediction time. For LACQUER, the prediction time depends mostly on the complexity of BNs used when predicting. Such complexity can be defined in terms of the number of nodes and the number of dependencies among the different nodes in the BNs.

Taking into account the intended usage of our approach, the results can be interpreted as follows. Since the training phase occurs offline and periodically to train the different BN models, the training time of 1,210.70 s is acceptable from a practical standpoint; it allows for daily (or even hourly) retraining of LACQUER in contexts (such as enterprise software) with thousands of entries every day. Since LACQUER is used during data entry, a short prediction time is important to preserve the interactive nature of a form-filling relaxation tool. The prediction time of LACQUER is acceptable according to human-computer interaction principles [26], which prescribe a response time lower than 1 s for tools that provide users with seamless interaction. In addition, this prediction time is comparable to the one achieved by our previous work on automated form filling [5]. Hence, LACQUER is suitable for deployment in real enterprise systems.

The answer to RQ2 is that the performance of LACQUER, with a training time per form of about 20 minutes and a prediction time of at most 839 ms per target field, is suitable for practical application in data-entry scenarios.

5.4 Impact of SMOTE and Endorser (RQ3)

LACQUER is based on two main modules: (1) the SMOTE oversampling module, which addresses the class imbalance problem by synthetically creating new minority-class instances in the training set (Section 4.2), and (2) the endorser module, which implements a heuristic that keeps only the “optional” predictions having a certain level of confidence. To answer this RQ, we assessed the impact of these two modules on the effectiveness of LACQUER.

Methodology.

We compared the effectiveness of LACQUER with that of three variants, which together with LACQUER represent all its possible configurations: LACQUER-S, LACQUER-E, and LACQUER-SE. LACQUER-S is the configuration in which the SMOTE oversampling module is disabled and LACQUER provides predictions based on the imbalanced training set. LACQUER-E is the configuration in which the endorser module is disabled and LACQUER directly returns predictions to the user, without checking whether they have the required confidence for predicting fields as optional. LACQUER-SE is the configuration in which both modules are disabled; this variant corresponds to using a plain BN. The different configurations are shown in Table 3 under column Module, where the two sub-columns S and E refer to the “SMOTE” and “Endorser” modules; the symbols “✓” and “✗” specify whether a variant includes a certain module. We ran LACQUER and the three variants using the same settings and evaluation metrics as in RQ1.

Table 3.

(S = SMOTE module; E = endorser module; Seq = sequential filling; PR = partial random filling)

| ID | S | E | NCBI Seq Prec | NCBI Seq Rec | NCBI Seq NPV | NCBI Seq Spec | NCBI PR Prec | NCBI PR Rec | NCBI PR NPV | NCBI PR Spec | PEIS Seq Prec | PEIS Seq Rec | PEIS Seq NPV | PEIS Seq Spec | PEIS PR Prec | PEIS PR Rec | PEIS PR NPV | PEIS PR Spec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LACQUER-SE | ✗ | ✗ | 0.78 | 0.88 | 0.36 | 0.32 | 0.84 | 0.89 | 0.44 | 0.33 | 0.90 | 0.92 | 0.70 | 0.67 | 0.96 | 0.96 | 0.75 | 0.58 |
| LACQUER-E | ✓ | ✗ | 0.78 | 0.82 | 0.56 | 0.32 | 0.85 | 0.77 | 0.39 | 0.65 | 0.89 | 0.94 | 0.67 | 0.76 | 0.92 | 0.88 | 0.57 | 0.85 |
| LACQUER-S | ✗ | ✓ | 0.76 | 0.98 | 0.51 | 0.19 | 0.83 | 0.98 | 0.66 | 0.22 | 0.88 | 0.99 | 0.70 | 0.55 | 0.91 | 0.99 | 0.77 | 0.52 |
| LACQUER | ✓ | ✓ | 0.76 | 0.98 | 0.91 | 0.20 | 0.84 | 0.98 | 0.76 | 0.37 | 0.89 | 0.99 | 0.74 | 0.64 | 0.90 | 0.97 | 0.75 | 0.64 |

Table 3. Effectiveness of LACQUER with Different Modules

Results.

As shown in Table 3, both modules have an impact on the effectiveness of LACQUER. The SMOTE oversampling module improves the ability of BNs to identify more optional fields: it improves the specificity score of a plain BN by at least +9 pp on the two datasets (LACQUER-E vs. LACQUER-SE), except for the sequential filling scenario on the NCBI dataset, where the specificity score stays the same. The endorser module mainly removes inaccurate “optional” predictions and keeps the corresponding fields as required, to prevent missing information. This module leads to an increase in the recall value compared to the plain BN (LACQUER-S vs. LACQUER-SE); recall increases by at least +9 pp on the NCBI dataset in both scenarios. The improvement is smaller for the PEIS dataset, where recall increases by +7 pp and +3 pp for the sequential and partial random filling scenarios, respectively. The endorser module also affects specificity, which decreases by at most 13 pp on both datasets when the endorser is used; the reason behind such a decrease is that the endorser module removes possibly inaccurate “optional” predictions.

Comparing the results of LACQUER (with both modules enabled) with those of a plain BN (i.e., LACQUER-SE) on the NCBI dataset, the former improves NPV by +55 pp (0.91 vs. 0.36) for the sequential filling scenario and by +32 pp (0.76 vs. 0.44) for the partial random filling scenario. Since the endorser module considers the non-endorsed instances as required, it also increases recall, by +10 pp and +9 pp for the sequential and partial random filling scenarios, respectively. For the PEIS dataset, we find a slight increase in NPV of +4 pp and an increase in recall of +6 pp with sequential filling. For the partial random filling scenario, both LACQUER and LACQUER-SE have similar results, except for a higher specificity value (+6 pp) and a lower precision value (−6 pp) for LACQUER. This loss in precision is expected, since LACQUER keeps the default completeness requirement (i.e., required) for any instance whose prediction confidence is low (i.e., the probability is lower than the endorser threshold). These instances may include some truly optional cases with low prediction confidence; hence, considering them as required may slightly reduce the precision value.
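This fallback behavior can be sketched as follows (the function name and the numeric values are hypothetical; the per-target threshold is determined automatically by LACQUER):

```python
# Sketch of the endorser heuristic: an "Optional" prediction is endorsed
# only if its confidence reaches the threshold; otherwise the default
# completeness requirement ("Required") is kept.
def endorse(prediction: str, p_optional: float, threshold: float) -> str:
    if prediction == "Optional" and p_optional < threshold:
        return "Required"  # low confidence: keep the default requirement
    return prediction

print(endorse("Optional", 0.55, threshold=0.7))  # → Required
print(endorse("Optional", 0.85, threshold=0.7))  # → Optional
```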

The answer to RQ3 is that the SMOTE oversampling module and the endorser module improve the effectiveness of LACQUER.

5.5 Threats to Validity

To increase the generalizability of our results, LACQUER should be further evaluated on different datasets from different domains. To partially mitigate this threat, we evaluated LACQUER on two datasets with different data quality: the PEIS dataset, which is proprietary and of high quality, and the NCBI dataset, which is public and was obtained from an environment with looser data quality controls.

The size of the training set is a common threat to all AI-based approaches. We do not expect this problem to be a strong limitation for LACQUER, since it mainly targets enterprise software systems that can receive thousands of entries per day.

Since LACQUER needs to run online during the data entry session, it is important to ensure seamless interaction with users. In our experiments (Section 5.3), LACQUER was deployed locally, and the response time of its predictions complies with human-computer interaction standards. However, the prediction time depends on the deployment method (e.g., local or cloud-based). This is not necessarily a problem, since different engineering methods, such as parallel computing or caching previous predictions, can help reduce the prediction time.

5.6 Data Availability

The implementation of LACQUER, the NCBI dataset, and the scripts used for the evaluation are available at https://figshare.com/articles/software/LACQUER-replication-package/21731603; LACQUER is distributed under the MIT license. The PEIS dataset cannot be distributed due to an NDA.


6 RELATED WORK

In this section, we discuss the work related to our approach. First, we review the existing approaches dealing with adaptive forms. Next, we provide a detailed comparison between LACQUER and LAFF. We conclude the section by presenting some tangential works that use BN to solve software engineering problems.

6.1 Adaptive Forms

The approach proposed in this article is mainly related to approaches that implement adaptive forms for producing context-sensitive form-based interfaces. These approaches progressively add (remove) fields to (from) the forms, depending on the values that the user enters. They use form specification languages [19] or form definition languages [6] to allow form designers to describe the dynamically changing behavior of form fields. Such a behavior is then implemented through dedicated graphical user interface programming languages (such as Tcl/Tk) [50] or through server-side validation [6]. The dynamic behavior of a form has also been modeled using a declarative, business process-like notation (DCR—Dynamic Condition Response graph [49]), where nodes in the graph represent fields and edges show the dynamic relations among fields (e.g., guarded transitions); the process declarative description is then executed by a process execution engine that displays the form. However, all these works assume that designers already have a complete and final set of completeness requirements describing the adaptive behavior of the form during the design phase, which can be expressed through (adaptive) form specification/definition languages or tools. In contrast, LACQUER can automatically learn the different completeness requirements from the historical input instances filled by users without requiring any knowledge from the form designers.

Although some approaches [1, 16] try to automatically generate data entry forms based on the schema of the database tables linked to a form (e.g., using column name and primary keys), they can only generate some “static” rules for fields. For example, if a column is “not null” in the schema, then they can set the corresponding field in the form as (always) required. In contrast, LACQUER aims to learn conditions from the data so completeness requirements of form fields can be automatically and dynamically relaxed during new data entry sessions.

Table 4.

| Aspect | LAFF | LACQUER |
|---|---|---|
| Goal | Providing form-filling suggestions for the values to be filled in categorical fields | Determining the completeness requirements of form fields |
| Challenges | Arbitrary filling order; partial filling | Arbitrary filling order; partial filling; highly imbalanced dataset |
| Preprocessing | Textual fields are removed | Values in textual fields are transformed into binary values (“Required” or “Optional”); meaningless values are identified and replaced with the value “Optional” |
| Model building | Creates k + 1 models: a global model and k local models (one per cluster of data) | Creates n models, one for each field (target) |
| Prediction target | Categorical fields; LAFF can predict the value for both optional and required fields | Required fields only; all textual, numerical, and categorical fields can be targets |
| Endorser | Uses two heuristics, based on prediction confidence and on dependencies between filled fields and the target; the threshold value is manually decided based on domain expertise | Based only on prediction confidence; the threshold value is automatically determined during the threshold determination phase |

Table 4. Main Differences between LAFF and LACQUER

6.2 Comparing LACQUER with LAFF

The overall architecture (including the use of the endorser module) of LACQUER has been inspired by LAFF, a recent approach for automated form filling of data entry forms [5]. In this subsection, we explain the similarities and differences between the two approaches.

Similarities between LAFF and LACQUER.

Both LAFF and LACQUER are approaches that can be used during the form-filling process. The main similarities between these approaches derive from the main challenges of form filling, i.e., dealing with (1) an arbitrary filling order and (2) partially filled forms.

The first challenge arises from the fact that users can fill a data entry form following an arbitrary order. Therefore, the filled fields (i.e., the features in our ML models) and the target field keep changing, leading to a large number of feature-target combinations. To avoid training a separate machine learning model on each feature-target combination, in this work, we are inspired by LAFF and use BNs to mine the relationships between filled fields and the target field.
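The intuition can be sketched with a toy joint distribution over two fields (the field values and probabilities below are made up; LACQUER’s actual models are BNs, which represent such a joint distribution compactly): with one probabilistic model, either field can act as the target and the other as evidence, so no per-ordering model is needed.

```python
# Toy joint distribution over (company type, completeness of a second field).
joint = {
    ("company", "Required"): 0.45,
    ("company", "Optional"): 0.05,
    ("person",  "Required"): 0.05,
    ("person",  "Optional"): 0.45,
}

def infer(joint, target_pos, evidence_pos, evidence_val):
    """P(target | evidence) by marginalizing the joint distribution:
    either position can play the role of the target."""
    posterior, total = {}, 0.0
    for assignment, p in joint.items():
        if assignment[evidence_pos] != evidence_val:
            continue
        total += p
        value = assignment[target_pos]
        posterior[value] = posterior.get(value, 0.0) + p
    return {value: p / total for value, p in posterior.items()}

# Field 0 filled, field 1 is the target ...
print(infer(joint, target_pos=1, evidence_pos=0, evidence_val="company"))
# → {'Required': 0.9, 'Optional': 0.1}
# ... or the reverse filling order, with the same model.
print(infer(joint, target_pos=0, evidence_pos=1, evidence_val="Optional"))
# → {'company': 0.1, 'person': 0.9}
```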

As for the second challenge, LAFF addresses it using an endorser module. The main idea of the endorser module is to avoid providing inaccurate suggestions to the user when the form does not contain enough information for the model. Avoiding inaccurate suggestions is important for both approaches to gain the trust of users; for example, wrongly determining to relax a required field by making it optional may lead to missing information, thus hindering data completeness. For this reason, the second similarity between LAFF and LACQUER is the use of an endorser module.

Differences between LAFF and LACQUER.

Table 4 shows the main differences between LACQUER and LAFF in terms of goal, challenges, preprocessing, model building, and prediction.

The main goal of LACQUER is to determine the completeness requirements of form fields. In contrast, LAFF provides form-filling suggestions for the values to be filled in categorical fields. Concerning the challenges, in addition to the shared ones discussed above, the completeness requirement relaxation problem has an additional challenge when the dataset is highly imbalanced; we addressed this challenge in LACQUER by applying SMOTE.

The preprocessing step of the two approaches is completely different. Specifically, LAFF removes all textual fields from the data. In contrast, LACQUER transforms the values in textual fields into binary values. After the preprocessing, textual fields can only have one of two values: “Required” and “Optional.” Moreover, the preprocessing step of LACQUER identifies meaningless values and replaces the matched values in the data with the value “Optional” (see Section 4.1).

As for the model building phase, LAFF and LACQUER create a different set of BN models. LAFF creates \(k+1\) models, including a global model and k local models. The global model represents the BN created on the whole training data; the k local models are the BNs created based on the clusters of training data that share similar characteristics. The optimal number of clusters k is automatically determined with the elbow method. LACQUER creates n models where n represents the number of fields (targets) in the data entry form.

Finally, the differences regarding the prediction phase can be viewed from two perspectives: the type of targets and the endorser module. Concerning the target, LAFF only predicts possible values for categorical fields, no matter whether this field is optional or required. In contrast, LACQUER targets all types of required fields (e.g., textual, numerical, and categorical fields) to relax their completeness requirements. The endorser modules of LAFF and LACQUER differ as follows:

The endorser module of LAFF endorses predictions based on two heuristics: the prediction confidence and the dependencies between the filled fields and the target. In contrast, the endorser of LACQUER is based only on the prediction confidence.

LAFF uses a threshold that must be manually determined, based on domain expertise, to endorse predictions, whereas LACQUER includes a phase that automatically determines the threshold for each target.

6.3 Using Bayesian Networks in Software Engineering Problems

Besides LAFF, BNs have been applied to different software engineering problems spanning a wide range of software development phases, such as project management (e.g., to estimate the overall contribution that each new software feature to be implemented would bring to the company [34]), requirements engineering (e.g., to predict requirement complexity to assess the effort needed to develop and test a requirement [44]), implementation (for code auto-completion [39]), quality assurance (e.g., for defect prediction [13, 28]), and software maintenance [41].

The main reason for using BNs in software engineering (SE) problems is their ability to address the challenges of dealing with “large volume datasets” and “incomplete data entries.” First, software systems usually generate large amounts of data [41]. For instance, to improve software maintenance, companies need to analyze large amounts of software execution data (e.g., traces and logs) to identify unexpected behaviors such as performance degradation. To address this challenge, Rey Juárez et al. [41] used BNs to build an analysis model on the data, since BNs can deal with large, high-dimensional datasets while keeping the model size small and the training time low. Second, incomplete data is a common problem in SE [15, 38]. For example, some metrics in defect prediction datasets might be missing for some software modules. To address this challenge, Okutan and Yıldız [38] and Del Águila and Del Sagrado [15] used BNs to train prediction models, because of their ability to perform inference on incomplete data entries. These two challenges confirm our choice of using BNs to solve the completeness requirement relaxation problem; they are aligned with the challenges of form filling. During data entry sessions, a form is usually partially filled, and LACQUER needs to make decisions based on incomplete data. Besides, in our context, we need to deal with large datasets, since we mainly target enterprise software systems that can collect a huge number of entries every day.


7 DISCUSSION

7.1 Usefulness

The main goal of LACQUER is to prevent the entering of meaningless values by relaxing the data entry form completeness requirements. To assess the capability of LACQUER, we evaluated it with two real-world datasets, including a public dataset from the biomedical domain and a proprietary dataset from the banking domain. These two datasets are related to existing data entry forms.

Experimental results show that LACQUER outperforms the baselines in determining completeness requirements, with a specificity score of at least 0.20 and an NPV score higher than 0.72. In the context of completeness requirement relaxation, these results mean that LACQUER can correctly (i.e., NPV \(\ge\) 0.72) prevent the filling of at least 20% of meaningless values. In addition, LACQUER can correctly determine (with precision above 0.76) when a field should be required, with a recall value of at least 0.97; this recall value means that LACQUER can determine almost all the required fields. The high precision value shows that LACQUER rarely predicts optional fields as required; in other words, LACQUER does not add much extra burden on users by imposing more restrictions during the form-filling process.

As discussed in Section 5.2, LACQUER can determine more optional fields (i.e., a higher specificity) in the PEIS dataset than in the NCBI dataset due to the higher data quality of the former. Since we target data entry functionalities in enterprise software, we expect to find similar conditions in other contexts in which data entry operators follow corporate guidelines for selecting appropriate values that should be filled when a field is not applicable. In such contexts, LACQUER is expected to provide results that are similar to those achieved on the PEIS dataset.

7.2 Practical Implications

This subsection discusses the practical implications of LACQUER for different stakeholders: software developers, end-users, and researchers.

7.2.1 Software Developers.

LACQUER can help developers refactor data entry forms, which typically have many historical input instances and obsolete completeness requirements. LACQUER does not require developers to define a complete set of rules about the completeness requirements of form fields; developers can integrate it into a given data entry form as an independent tool. Deploying LACQUER requires providing a mapping between the field names in the data entry form and the column names in the dataset. This mapping needs to be provided only once and can be easily derived from the Object Relational Mapping (ORM) layer and software design documentation. In addition to the mapping, deploying LACQUER requires a dictionary of meaningless values, i.e., the values that should be used during the data entry process when a field is not applicable. We expect this dictionary to be found in the user manual of the data entry software or in corporate guidelines, as was the case for the PEIS dataset.
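As a rough sketch, the two deployment inputs described above could be expressed as follows; all field names, column names, and values are hypothetical and not part of LACQUER's actual API.

```python
# Hypothetical deployment configuration for LACQUER (illustrative names):
deployment_config = {
    # One-off mapping from form field IDs to dataset column names,
    # typically recoverable from the ORM layer or design documents.
    "field_to_column": {
        "txtMonthlyIncome": "monthly_income",
        "cmbCompanyType": "company_type",
        "cmbFieldOfActivity": "field_of_activity",
    },
    # Dictionary of meaningless values, as listed in the user manual
    # or corporate data-entry guidelines.
    "meaningless_values": {
        "monthly_income": ["-1", "999999"],
        "company_type": ["N/A", "UNKNOWN"],
    },
}

def is_meaningless(field_id, value, config=deployment_config):
    """Check whether a submitted value is a known meaningless placeholder."""
    column = config["field_to_column"].get(field_id)
    return value in config["meaningless_values"].get(column, [])
```

Keeping both inputs in one configuration object makes the integration a drop-in step: the form only needs to forward field IDs and values to the tool.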

7.2.2 End-users.

During the form-filling process, obsolete required fields in the data entry form can affect the data accuracy, since users have to enter meaningless values to skip filling these obsolete fields. LACQUER can automatically decide when a field should be required or not based on the filled fields and historical input instances. Our experiments show that LACQUER can correctly determine between 20% and 64% of optional fields, which reduces the user effort and the time taken during the form-filling process.

7.2.3 Researchers.

To avoid predicting required fields as optional, LACQUER includes an endorser module that decides whether a prediction is accurate enough to be provided to the user. We propose a novel strategy to automatically determine the threshold used in the endorser module; hence, the endorser does not require any configuration from a domain expert. We believe that such an endorser module can be adopted by researchers building other recommender systems.
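The idea behind threshold-based endorsement can be sketched as follows. This is a simplified illustration, not LACQUER's actual threshold-selection strategy: here the threshold is the smallest probability value that yields no false "optional" decision on held-out validation data.

```python
def select_threshold(val_probs, val_labels):
    """Pick the smallest threshold with no false 'optional' on validation data.
    val_probs: predicted P(field is optional); val_labels: 1 = truly optional."""
    for t in sorted(set(val_probs)):
        # Endorse "optional" only when the probability reaches the threshold;
        # accept t if every endorsed prediction is truly optional.
        if all(label == 1 for p, label in zip(val_probs, val_labels) if p >= t):
            return t
    return 1.0  # endorse nothing if no safe threshold exists

def endorse(prob_optional, threshold):
    """Relax the field only when the model is confident enough; otherwise
    fall back to the original 'required' behaviour."""
    return "optional" if prob_optional >= threshold else "required"
```

A conservative fallback to "required" is what keeps recall on required fields high: an unendorsed prediction costs the user one extra (possibly meaningless) value, whereas a wrong relaxation loses data.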

7.3 Combining LACQUER with LAFF

Despite the differences explained in Section 6, LACQUER and LAFF are complementary in practice. Both approaches can be combined as an AI-based assistant for form filling to help users fill forms and ensure better data quality.

Fig. 9 shows a possible scenario that uses both approaches together during a form-filling session. In this example, we assume that the user follows the sequential filling order. First, after filling in the company name field, LACQUER can already check whether the “monthly income” field is required or not. Since “monthly income” is a numerical field, LAFF cannot perform a prediction (LAFF only supports categorical fields). In this example, LACQUER determines that the field is required; hence, the user should fill it out. The “Company type” and “Field of activity” fields are both categorical. For these two fields, LACQUER first determines the completeness requirement of each field based on the filled fields. Once the user clicks on a field, LAFF is enabled to provide a ranked list of possible values for that field. If LACQUER's decision on a field is “optional,” LAFF can still be activated to provide suggestions as long as the user wants to fill in the field. Finally, let us assume that the “Tax ID” field (a numerical one) is optional by design. In this case, neither LAFF nor LACQUER is enabled, since there is no need for LACQUER to relax a completeness requirement and the field is numerical and thus not supported by LAFF.

Fig. 9. Use case to combine LACQUER and LAFF together during form filling.
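The per-field logic of this scenario can be summarized in a few lines; this is a simplified sketch of the decision rule, while an actual integration would be event-driven within the form's UI.

```python
def assistants_for(field_type, optional_by_design=False):
    """Which assistants are active for one field (simplified logic of Fig. 9).
    field_type: 'categorical' or 'numerical'."""
    active = []
    if not optional_by_design:
        active.append("LACQUER")  # decide whether the field is required
    if field_type == "categorical":
        active.append("LAFF")     # suggest values once the field gets focus
    return active

# The fields from the example scenario:
print(assistants_for("numerical"))                           # monthly income
print(assistants_for("categorical"))                         # company type
print(assistants_for("numerical", optional_by_design=True))  # tax ID
```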


8 CONCLUSION

In this article, we proposed LACQUER, an approach to automatically relax the completeness requirement of data entry forms by deciding when a field should be optional based on the filled fields and historical input instances. LACQUER applies Bayesian Networks on an oversampled dataset (using SMOTE) to learn the completeness requirement dependencies between fields. Moreover, LACQUER uses a heuristic-based endorser module to ensure that it only provides accurate suggestions.
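To give an intuition of the oversampling step, the following is a minimal SMOTE-style sketch [9] on continuous toy features. It is a pure-Python illustration of the idea, interpolating between a minority-class sample and one of its k nearest neighbours; LACQUER relies on an existing SMOTE implementation, not on this simplified version.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Synthesize n_new minority-class points by interpolating between a
    random sample and one of its k nearest neighbours (SMOTE-style sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment between x and nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]
new_points = smote_like(minority, n_new=4)
```

Each synthetic point lies on the segment between two existing minority samples, so the oversampled region stays within the minority class's neighbourhood instead of duplicating instances verbatim.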

We evaluated LACQUER on two datasets, one proprietary dataset from the banking domain and one public dataset from the biomedical domain. Our results show that LACQUER can correctly determine 20% to 64% of optional fields and determine almost all the required fields (with a recall value of 0.97). LACQUER takes at most 839 ms to provide a suggestion, which complies with human-computer interaction principles to ensure a seamless interaction with users.

As part of future work, we plan to conduct a user study to analyze the effect of LACQUER on reducing meaningless values and the effort spent by users during the form-filling process. We also plan to add an automated module that can detect meaningless values entered by users during form filling, when such values have not been specified by the form designer. Furthermore, we plan to integrate LACQUER into platforms for the design of data entry forms [25, 36, 42] to help designers perform form refactoring. These platforms currently rely on rules defined by designers to specify completeness requirements during the design phase. LACQUER can relieve designers from the task of defining such rules, since it only requires them to indicate the required fields; during form filling, LACQUER will automatically suggest the completeness requirements of these fields. LACQUER can also be extended to support sophisticated input fields that handle multiple selections, such as drop-down menus and checkbox groups. Finally, we plan to extend LACQUER to support updates of existing data entries, as well as to determine whether fields previously marked as optional should become required.

ACKNOWLEDGMENTS

We thank Anne Goujon, Michael Stanisiere, and Fernand Lepage for their help with the PEIS dataset; we thank Clément Lefebvre Renard and Andrey Boytsov for their comments on earlier drafts of the article.

Footnotes

1. The implementation of LACQUER and the scripts used for evaluation are available at https://figshare.com/s/0fdeae041e728e6d0a01; see also Section 5.6.
2. An example of a complex input control is the case where users need to select at least three options from a multiple choice answer field (e.g., a checkbox group). Any option chosen before reaching the minimum number of selected values would be considered “required”; however, the same option chosen after the first three would be considered “optional.”
3. We assume the validation check does not check for the well-formedness of the string corresponding to the Tax ID.
4. https://ftp.ncbi.nlm.nih.gov/biosample/
5. The number of instances is different from that indicated in our previous work [5], since the preprocessing step in that work retained only instances with at least three fields being filled. In contrast, in this work, we keep fields with missing values to analyze completeness requirements.
6. Required fields in the NCBI dataset have no missing values, since they are always required.
7. Due to the data protection policy of our partner, we were obliged to run the experiments on the PEIS dataset using an on-premise, dedicated server that, however, could not be used to store external data (like the NCBI dataset).

REFERENCES

[1] Atia M. Albhbah and Mick J. Ridley. 2010. Using RuleML and database metadata for automatic generation of web forms. In ISDA’10. IEEE, 790–794.
[2] Maysoon Aldekhail and Djamal Ziani. 2017. Intelligent method for software requirement conflicts identification and removal: Proposed framework and analysis. Int. J. Comput. Sci. Netw. Secur. 17, 12 (2017), 91–95.
[3] Alexander Avidan and Charles Weissman. 2012. Record completeness and data concordance in an anesthesia information management system using context-sensitive mandatory data-entry fields. Int. J. Medic. Inform. 81, 3 (2012), 173–181.
[4] Tanya Barrett, Karen Clark, Robert Gevorgyan, Vyacheslav Gorelenkov, Eugene Gribov, Ilene Karsch-Mizrachi, Michael Kimelman, Kim D. Pruitt, Sergei Resenchuk, Tatiana Tatusova, Eugene Yaschenko, and James Ostell. 2012. BioProject and BioSample databases at NCBI: Facilitating capture and organization of metadata. Nucleic Acids Research 40, D1 (2012), D57–D63.
[5] Hichem Belgacem, Xiaochen Li, Domenico Bianculli, and Lionel Briand. 2022. A machine learning approach for automated filling of categorical fields in data entry forms. ACM Trans. Softw. Eng. Methodol. 32, 2 (Apr. 2022), 40 pages.
[6] Morten Bohøj, Niels Olof Bouvin, and Henrik Gammelmark. 2011. AdapForms: A framework for creating and validating adaptive forms. In ICWE’11. Springer, Berlin, 105–120.
[7] Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. 1984. Classification and Regression Trees. CRC Press, Boca Raton, FL.
[8] Brooklyn Lupari. 2015. National Survey on Drug Use and Health. Retrieved from https://catalog.data.gov/dataset/national-survey-on-drug-use-and-health-nsduh-2015
[9] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16 (2002), 321–357.
[10] Kuang Chen, Harr Chen, Neil Conway, Joseph M. Hellerstein, and Tapan S. Parikh. 2011. Usher: Improving data quality with dynamic forms. IEEE Trans. Knowl. Data Eng. 23, 8 (2011), 1138–1153.
[11] William W. Cohen. 1995. Fast effective rule induction. In Machine Learning Proceedings 1995. Elsevier, San Francisco, CA, 115–123.
[12] Fabiano Dalpiaz, Ivor Van Der Schalk, Sjaak Brinkkemper, Fatma Başak Aydemir, and Garm Lucassen. 2019. Detecting terminological ambiguity in user stories: Tool and experimentation. Inf. Softw. Technol. 110 (2019), 3–16.
[13] Karel Dejaeger, Thomas Verbraken, and Bart Baesens. 2012. Toward comprehensible software fault prediction models using Bayesian network classifiers. IEEE Trans. Softw. Eng. 39, 2 (2012), 237–257.
[14] Ofer Dekel, Ohad Shamir, and Lin Xiao. 2010. Learning to classify with missing and corrupted features. Mach. Learn. 81, 2 (2010), 149–178.
[15] Isabel M. Del Águila and José Del Sagrado. 2016. Bayesian networks for enhancement of requirements engineering: A literature review. Requir. Eng. 21 (2016), 461–480.
[16] A. Elbibas and M. J. Ridley. 2004. Developing web entry forms based on METADATA. In Proceedings of the International Workshop on Web Quality in Conjunction with ICWE. Citeseer, Trinity College Dublin, Dublin, 113–118.
[17] Sergio Firmenich, Vincent Gaits, Silvia Gordillo, Gustavo Rossi, and Marco Winckler. 2012. Supporting users tasks with personal information management and web forms augmentation. In ICWE’12. Springer, Berlin, 268–282.
[18] Susan Fowler and Victor Stanwick. 2004. Web Application Design Handbook: Best Practices for Web-based Software. Morgan Kaufmann, Amsterdam, Boston, MA.
[19] Martin R. Frank and Pedro Szekely. 1998. Adaptive forms: An interaction technique for entering structured data. Knowl.-based Syst. 11, 1 (1998), 37–45.
[20] Nir Friedman, Dan Geiger, and Moises Goldszmidt. 1997. Bayesian network classifiers. Mach. Learn. 29, 2-3 (1997), 131–163.
[21] José A. Gámez, Juan L. Mateo, and José M. Puerta. 2011. Learning Bayesian networks by hill climbing: Efficient methods based on progressive restriction of the neighborhood. Data Min. Knowl. Discov. 22, 1-2 (2011), 106–148.
[22] C. Ghezzi. 2017. Of software and change. J. Softw.: Evolut. Process 29, 9 (2017), e1888.
[23] Baljinder Ghotra, Shane McIntosh, and Ahmed E. Hassan. 2015. Revisiting the impact of classification techniques on the performance of defect prediction models. In ICSE’15, Vol. 1. IEEE, 789–800.
[24] Rafael S. Gonçalves, Martin J. O’Connor, Marcos Martínez-Romero, Attila L. Egyedi, Debra Willrett, John Graybeal, and Mark A. Musen. 2017. The CEDAR workbench: An ontology-assisted environment for authoring metadata that describe scientific experiments. In ISWC’17 (LNCS, Vol. 10588). Springer International Publishing, Cham, 103–110.
[25] Google LLC. 2016. Google Forms. Retrieved from https://docs.google.com/forms/
[26] Carrie Heeter. 2000. Interactivity in the context of designed experiences. J. Interact. Advert. 1, 1 (2000), 3–14.
[27] Caroline Jarrett and Gerry Gaffney. 2009. Forms that Work: Designing Web Forms for Usability. Morgan Kaufmann, Amsterdam, Boston, USA.
[28] Kawal Jeet, Nitin Bhatia, and Rajinder Singh Minhas. 2011. A Bayesian network based approach for software defects prediction. ACM SIGSOFT Softw. Eng. Notes 36, 4 (2011), 1–5.
[29] Justin M. Johnson and Taghi M. Khoshgoftaar. 2019. Survey on deep learning with class imbalance. J. Big Data 6, 1 (2019), 1–54.
[30] Oksana Kulyk, Benjamin Maximilian Reinheimer, and Melanie Volkamer. 2017. Sharing information with web services – A mental model approach in the context of optional information. In HAS’17. Springer, Cham, 675–690.
[31] Mario Linares-Vásquez, Collin McMillan, Denys Poshyvanyk, and Mark Grechanik. 2014. On using machine learning to automatically classify software applications into domain categories. Empir. Softw. Eng. 19, 3 (2014), 582–618.
[32] Ruchika Malhotra and Megha Khanna. 2017. An empirical study for software change prediction using imbalanced data. Empir. Softw. Eng. 22, 6 (2017), 2806–2851.
[33] Marcos Martínez-Romero, Martin J. O’Connor, Attila L. Egyedi, Debra Willrett, Josef Hardi, John Graybeal, and Mark A. Musen. 2019. Using association rule mining and ontologies to generate metadata recommendations from multiple biomedical databases. Datab. J. Biol. Datab. Curat. 2019 (2019), 25 pages.
[34] Emilia Mendes, Mirko Perkusich, Vitor Freitas, and João Nunes. 2018. Using Bayesian network to estimate the value of decisions within the context of value-based software engineering. In EASE’18. Association for Computing Machinery, New York, NY, 90–100.
[35] Microsoft. 2013. Change the Default Tab Order for Controls on a Form. Retrieved from https://support.microsoft.com/en-us/office/change-the-default-tab-order-for-controls-on-a-form-03d1599a-debf-4b66-a95b-e3e744210afe
[36] Momentive Inc. 1999. SurveyMonkey. Retrieved from https://www.surveymonkey.com/
[37] Kıvanç Muşlu, Yuriy Brun, and Alexandra Meliou. 2015. Preventing data errors with continuous testing. In ISSTA’15. Association for Computing Machinery, New York, NY, 373–384.
[38] Ahmet Okutan and Olcay Taner Yıldız. 2014. Software defect prediction using Bayesian networks. Empir. Softw. Eng. 19 (2014), 154–181.
[39] Sebastian Proksch, Johannes Lerch, and Mira Mezini. 2015. Intelligent code completion with Bayesian networks. ACM Trans. Softw. Eng. Methodol. 25, 1 (2015), 1–31.
[40] Adrian E. Raftery. 1995. Bayesian model selection in social research. Sociol. Methodol. 25 (1995), 111–163.
[41] Santiago del Rey Juárez, Silverio Juan Martínez Fernández, and Antonio Salmerón Cerdán. 2023. Bayesian network analysis of software logs for data-driven software maintenance. IET Softw. 3, 17 (2023), 1–19.
[42] Rocketgenius Inc. 2007. Gravity Forms. Retrieved from https://www.gravityforms.com/
[43] Seyed Ehsan Roshan and Shahrokh Asadi. 2020. Improvement of Bagging performance for classification of imbalanced datasets using evolutionary multi-objective optimization. Eng. Applic. Artif. Intell. 87 (2020), 103319.
[44] Halima Sadia, Syed Qamar Abbas, and Mohammad Faisal. 2022. A Bayesian network-based software requirement complexity prediction model. In ICCMDE’21. Springer, 197–213.
[45] Andrew Sears and Ying Zha. 2003. Data entry for mobile devices using soft keyboards: Understanding the effects of keyboard size and user tasks. J. Hum.-comput. Interact. 16, 2 (2003), 163–184.
[46] Mirjam Seckler, Silvia Heinz, Javier A. Bargas-Avila, Klaus Opwis, and Alexandre N. Tuch. 2014. Designing usable web forms: Empirical evaluation of web form improvement guidelines. In CHI’14. Association for Computing Machinery, New York, NY, 1275–1284.
[47] Jiuling Song, Yonghe Zhou, Juren Zhang, and Kewei Zhang. 2017. Structural, expression and evolutionary analysis of the non-specific phospholipase C gene family in Gossypium hirsutum. BMC Genom. 18, 1 (2017), 1–15.
[48] Qinbao Song, Yuchen Guo, and Martin Shepperd. 2018. A comprehensive investigation of the role of imbalanced learning for software defect prediction. IEEE Trans. Softw. Eng. 45, 12 (2018), 1253–1269.
[49] Rasmus Strømsted, Hugo A. López, Søren Debois, and Morten Marquard. 2018. Dynamic evaluation forms using declarative modeling. BPM (Dissert./Demos/Industr.) 2196 (2018), 172–179.
[50] Paul Thistlewaite and Steve Ball. 1996. Active forms. Comput. Netw. ISDN Syst. 28, 7-11 (1996), 1355–1364.
[51] Costas Vassilakis, Giorgos Laskaridis, Giorgos Lepouras, Stathis Rouvas, and Panagiotis Georgiadis. 2003. A framework for managing the lifecycle of transactional e-government services. Telemat. Inform. 20, 4 (2003), 315–329.
[52] Wan M. N. Wan-Kadir and Pericles Loucopoulos. 2004. Relating evolving business rules to software design. J. Syst. Archit. 50, 7 (2004), 367–382.
[53] Zeqing Wu and Weishen Chu. 2021. Sampling strategy analysis of machine learning models for energy consumption prediction. In SEGE’21. IEEE, 77–81.
[54] Junwen Yang, Utsav Sethi, Cong Yan, Alvin Cheung, and Shan Lu. 2020. Managing data constraints in database-backed web applications. In ICSE’20. Association for Computing Machinery, New York, NY, 1098–1109.


Published in

ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 3 (March 2024), 943 pages. ISSN: 1049-331X; EISSN: 1557-7392. DOI: 10.1145/3613618. Editor: Mauro Pezzé.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Received: 18 January 2023; Revised: 2 October 2023; Accepted: 20 November 2023; Online AM: 6 December 2023; Published: 15 March 2024.
