Rough set-based rule generation and Apriori-based rule generation from table data sets II: SQL-based environment for rule generation and decision support

This study follows the previous study entitled 'Rough set-based rule generation and Apriori-based rule generation from table data sets: A survey and a combination', and is the second study on 'Rough set-based rule generation and Apriori-based rule generation from table data sets'. The theoretical aspects were described in the previous study; here, the aspects of application, namely an SQL-based environment for rule generation and decision support, are described. First, the implementation of the rule generator defined in the previous study is explained; then, the application of the obtained rules to decision support is considered. In particular, the following two issues are focused on: (i) a rule generator in SQL for table data sets with uncertainty; (ii) the manipulation in decision support for the cases (ii-a) in which an obtained rule matches the specified condition and (ii-b) in which no obtained rule matches the specified condition. The authors connect such cases with decision support and realise an effective decision support environment in SQL.


Introduction
Many table data sets that store instances of past occurrences are available today. Data mining techniques allow us to investigate the instances in such table data sets and to learn the trends and characteristics of the data. In addition, rule-based decision support is a very active research field. In fact, we found over 7700 papers in the Scopus system [1] with the keyword 'rule-based decision support'. The Scopus system also reported the number of papers per research area, with the following composition ratios: 35% in computer science, 24% in engineering, 13% in medicine, 11% in mathematics, 5% in decision sciences, 5% in social sciences, 4% in business and management, 3% in biological sciences, etc. In these papers, fuzzy sets and rough sets appear to be very important.
When NISs were proposed, the research purpose was to establish effective question-answering from tables with information incompleteness. Here, however, we focus on rule generation from NISs and on decision support based on the obtained rules. The RNIA framework maintains its logical aspects: the core rule generation algorithm, termed the NIS-Apriori algorithm [16,19], is sound and complete [20,21] for minimal rules. Therefore, this algorithm misses no minimal rule; every minimal rule is certainly generated by this algorithm, and this is assured theoretically. Few data mining algorithms possess such logical properties.
In the RNIA framework, we define certain rules and possible rules based on possible world semantics [20,21]. These definitions seem natural and standard, but the number of possible worlds may exceed 10^100 in the Mammographic data set and the Congressional Voting data set [22]. It is impossible to examine the constraints over more than 10^100 cases sequentially. Even though this examination seems intractable, the NIS-Apriori algorithm provides a solution: the algorithm is independent of the number of possible worlds [15,16]. Without such a property, it would be hard to compute rules defined by possible world semantics.
The main issue in this paper is to propose the three phases (i), (ii), and (iii) in Figs. 1 and 2 and their prototype system in SQL.
(i) Rule generation phase: We specify two threshold values α and β and generate rules, from which we learn the trends and characteristics of the data set. The theoretical discussion and the novelty of the rule generation phase are described in the twinned paper [23]; here we trace the rule generation process by using the Lenses data set in the UCI machine-learning repository [22] to show the overall flow.
(ii) Search phase for the obtained rules: For the specified condition part, if a rule matches it, we take the rule's decision part as the decision. (Since we usually impose β > 0.5 as a constraint, we obtain one decision for the specified condition part.)
(iii) Question-answering phase: If there is no rule with the same condition part, all implications with that condition part are searched, and we conclude one decision from the searched results with their support(t) and accuracy(t) values.
If phase (ii) is applicable to the specified condition part, decision making is executed much faster than in phase (iii). Therefore, it is useful to apply phase (ii); however, there may be no rule that matches the specified condition part, so it is necessary to prepare phase (iii). Phase (iii) may take time to execute, but it returns all implications with the specified condition part, together with their support(t) and accuracy(t) values. That is all the information for decision making obtainable from the table data set.
This paper is twinned with the previous paper entitled 'Rough set-based rule generation and Apriori-based rule generation from table data sets: A survey and a combination' [23]. Thus, to avoid duplication, we omit some definitions and issues in this paper and mainly describe the rule generator in SQL and the application of the obtained rules to decision support. This paper is organised as follows: Section 2 describes rules and the rule generator in DISs, and Section 3 describes rules and the rule generator in NISs. Section 4 clarifies the problem of decision support functionality in DISs and NISs, and investigates decision support procedures in SQL. Section 5 concludes the paper.

Rules from DISs and rule generator in SQL
This section focuses on rule generation from DISs. We especially handle the Lenses data set as an example of DIS.

Rules from DISs
The Lenses data set stores symptoms of the eyes and diagnoses for patients. This table data set consists of 24 objects (tuples), four condition attributes (age, spec, asti, and tear), and one decision attribute class with three attribute values: 1 (hard contact lenses), 2 (soft contact lenses), and 3 (no contact lenses).
Each attribute value is a categorical value, so statistics such as the mean or the variance are hardly meaningful. For such a table data set, we consider rule-based decision support. The Lenses data set in Fig. 3 is an example of a DIS; each attribute value is a single value, and there is no missing value [24].
A pair [A, val_A] (an attribute A and its attribute value val_A) is called a descriptor. For the decision attribute Dec and a set CON of condition attributes, we term a formula t: ∧_{A∈CON} [A, val_A] ⇒ [Dec, val] an implication. For specified threshold values 0 < α, β ≤ 1, we say t is (a candidate of) a rule if t satisfies the next two criteria [6,10,13,25]:

support(t) = |eq(t)| / |OB| ≥ α, accuracy(t) = |eq(t)| / |eq(∧_{A∈CON} [A, val_A])| ≥ β.

Here, OB is a set of objects, eq(*) is the equivalence class defined by the formula *, and |*| is the cardinality of the set *.
Since t1 is located in the rule area, t1 is a candidate of a rule. Similarly, for an implication t2: [asti,2]∧[spec,2] ⇒ [class,1], we have |eq(t2)| = 1, |eq([asti,2]∧[spec,2])| = 6, support(t2) = 1/24 ≒ 0.04, and accuracy(t2) = 1/6 ≒ 0.17. Since t2 is not located in the rule area, it is not a candidate of a rule. The support(t) value means the occurrence ratio of the implication t: if t occurs frequently, t is reliable. On the other hand, the accuracy(t) value means the consistency ratio of t: the higher the accuracy(t) value, the more reliable t is.
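As a concrete sketch of these two criterion values, the equivalence-class counting can be written in a few lines of Python. (This is an illustration with a four-object toy table, not the authors' SQL code or the real Lenses data; the names eq, support, and accuracy mirror the definitions above.)

```python
# A tiny DIS (toy data): each object maps attributes to values.
table = {
    1: {"asti": 2, "spec": 2, "class": 1},
    2: {"asti": 2, "spec": 2, "class": 3},
    3: {"asti": 1, "spec": 2, "class": 3},
    4: {"asti": 2, "spec": 1, "class": 3},
}

def eq(descriptors):
    """The set of objects satisfying every [attribute, value] descriptor."""
    return {ob for ob, row in table.items()
            if all(row[a] == v for a, v in descriptors)}

def support(condition, decision):
    # support(t) = |eq(t)| / |OB|
    return len(eq(condition + [decision])) / len(table)

def accuracy(condition, decision):
    # accuracy(t) = |eq(t)| / |eq(condition part)|
    return len(eq(condition + [decision])) / len(eq(condition))

# t: [asti,2] ^ [spec,2] => [class,1]
cond, dec = [("asti", 2), ("spec", 2)], ("class", 1)
print(support(cond, dec))   # 1/4 = 0.25
print(accuracy(cond, dec))  # 1/2 = 0.5
```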

Rule generator in SQL for DISs
We employed the DIS-Apriori algorithm, which is described in detail in the twinned paper. In this sub-section, we trace the following procedure apriv2 by using the Lenses data set. This trace gives a useful illustration of the rule generation process.
The number of attribute values and the names of the attributes differ from one table data set to another. We therefore first translate each table into a table in the rdf format [26-28]; Fig. 5 shows a part of the rdf table. In the rdf table, each attribute name is handled as a value instead of as an attribute name. In an SQL procedure, we need to specify conditions by using attribute names, so if we employed the original table directly, the procedure would depend upon those attribute names; that is, we would need one procedure for each original table. In the rdf table, by contrast, we employ only three fixed attribute names: object, attrib, and value. We implemented the DIS-Apriori algorithm in SQL by using this rdf table, and termed the resulting procedures apriv2(dec_attrib, |OB|, α, β) and tapri(dec_attrib, dec_attrib_value, |OB|, α, β). In the procedure tapri, we can specify a decision attribute value (a target attribute value). Fig. 6 shows the process of rule generation. We execute a procedure by a 'call procedure_name' statement at the SQL prompt 'mysql>'. The lenses_rdf procedure generates an rdf table, and apriv2 generates tables: rules are stored in the tables rule1, rule2, and rule3, and some temporary tables remain. In the following, we describe the role of some temporary tables and trace the execution by using the Lenses data set. The apriv2 procedure takes three steps.
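The translation into the three-column rdf format can be sketched as follows. (A sketch only: sqlite3 stands in for the authors' MySQL environment, and the table lenses here is a four-object toy fragment with an assumed layout, not the real data set.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# A four-object fragment in the original table layout (toy values).
cur.execute("CREATE TABLE lenses (object INTEGER, age INTEGER, "
            "tear INTEGER, class INTEGER)")
cur.executemany("INSERT INTO lenses VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 3), (2, 1, 2, 2), (3, 2, 1, 3), (4, 2, 2, 1)])

# Translation into the rdf format: one (object, attrib, value) triple
# per cell, so later procedures refer only to the fixed columns
# object, attrib, and value, never to table-specific attribute names.
cur.execute("CREATE TABLE rdf (object INTEGER, attrib TEXT, value INTEGER)")
for attrib in ("age", "tear", "class"):
    cur.execute("INSERT INTO rdf "
                f"SELECT object, '{attrib}', {attrib} FROM lenses")

# Descriptor occurrence counts, the raw material for the support test.
for row in cur.execute("SELECT attrib, value, COUNT(*) FROM rdf "
                       "GROUP BY attrib, value"):
    print(row)
```

With this fixed schema, one generic procedure serves every original table, which is exactly the motivation stated above.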
DIS_Step 1) Generation of tables con1, rule1, and rest1 (Fig. 7). The condition descriptors satisfying support ≥ α are stored in the table con1. The total number |OB| = 24 and α = 0.2 are specified, so a descriptor satisfies the constraint support ≥ α if it occurs at least five (≥ 24 × 0.2 = 4.8) times. The apriv2 procedure handles such descriptors, which we call considerable descriptors, and ignores any descriptor occurring four times or fewer. Actually, the descriptor [class, 1] occurs only four times, so [class, 1] is not stored in the table deci. Therefore, there is no rule whose decision part is [class, 1].
The apriv2 procedure combines each descriptor in the tables con1 and deci and examines 18 (= 9 × 2) implications. If an implication t satisfies the rule constraint, t is added to the table rule1. If t satisfies support(t) ≥ 0.2 and accuracy(t) < 0.6, t is added to the table rest1 in Fig. 8; we call such an implication a considerable implication. In DIS_Step 2, apriv2 picks up the considerable implications in rest1 and generates new considerable implications in con2. Similarly, apriv2 generates the two tables rule2 and rest2 in Fig. 9. Since rest2 is empty, the rule generation process substantially finishes at this step.
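The first pass, examining every [A,v] ⇒ [class,w] pair and splitting the results into rules and considerable implications, can be expressed as a single SQL query over the rdf table. (A sketch under toy data in sqlite3; the real apriv2 builds the intermediate tables con1, deci, rule1, and rest1 in MySQL rather than Python lists.)

```python
import sqlite3

ALPHA, BETA = 0.2, 0.6            # the threshold values used in the trace
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE rdf (object INTEGER, attrib TEXT, value INTEGER)")
triples = [(1, "tear", 1), (2, "tear", 1), (3, "tear", 1),
           (4, "tear", 2), (5, "tear", 2),
           (1, "class", 3), (2, "class", 3), (3, "class", 1),
           (4, "class", 2), (5, "class", 2)]
cur.executemany("INSERT INTO rdf VALUES (?, ?, ?)", triples)
n = 5  # |OB|

# One pass over all [A,v] => [class,w] implications:
# |eq(t)| comes from the self-join, |eq([A,v])| from the subquery.
cur.execute("""
    SELECT c.attrib, c.value, d.value,
           COUNT(*) * 1.0 / ? AS support,
           COUNT(*) * 1.0 / (SELECT COUNT(*) FROM rdf c2
                             WHERE c2.attrib = c.attrib
                               AND c2.value  = c.value) AS accuracy
    FROM rdf c JOIN rdf d ON c.object = d.object
    WHERE c.attrib <> 'class' AND d.attrib = 'class'
    GROUP BY c.attrib, c.value, d.value
""", (n,))

rule1, rest1 = [], []             # rules vs. considerable implications
for a, v, w, supp, acc in cur.fetchall():
    if supp >= ALPHA:
        (rule1 if acc >= BETA else rest1).append((a, v, w))
print(rule1)
print(rest1)
```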
As we described in the twinned paper, the procedure apriv2 misses no minimal rule. In total, we obtained six implications satisfying support(t) ≥ 0.2 and accuracy(t) ≥ 0.6. Fig. 10 shows a part of apriv2 in SQL; this part generates the table rule1.

Rules from NISs and rule generator in SQL
This section focuses on rule generation from NISs. We especially handle the Credit Approval data set in the UCI machine learning repository [22] as an example of an NIS.

Rules from NISs
This table data set consists of 690 objects (tuples), 15 condition attributes a1, a2, ..., a15, and one decision attribute a16 with the two attribute values 1 and 2.
In the Credit Approval data set, there are 67 missing values, expressed by the '?' symbol in Fig. 11. The criterion values support(t) and accuracy(t) are affected by the values assigned to the '?' symbols [15,24]. Of course, the rules depend upon the missing values, so rule generation from such table data sets [11,12,18,21] differs from rule generation from DISs. We have dealt with this problem in RNIA.
We introduced the concept of derived DISs for an NIS: if we replace each '?' symbol with one possible value, we obtain one DIS, and we term such a DIS a derived DIS. An example of an NIS and its derived DISs is given in the twinned paper [23]. We employ the usual definition of a rule in a DIS [13] and extend it to certain rules and possible rules in an NIS as follows [15,16]:
A rule in a DIS) An implication t is a rule in a DIS if t satisfies support(t) ≥ α and accuracy(t) ≥ β for the given α and β.
A certain rule in an NIS) An implication t is a certain rule if t is a rule in each derived DIS from the NIS for the given α and β.
A possible rule in an NIS) An implication t is a possible rule if t is a rule in at least one derived DIS from the NIS for the given α and β.
Generally, it is hard to generate certain rules and possible rules if there are a huge number of possible worlds; in the Mammographic data set, the number of possible worlds exceeds 10^100. However, we have proved the following properties [16,19] and described them in the twinned paper.
(i) There exists a derived DIS which causes both the support(t) value and the accuracy(t) value to be minimum. We term such values minsupp and minacc for t.
(ii) There exists a derived DIS which causes both the support(t) value and the accuracy(t) value to be maximum. We term such values maxsupp and maxacc for t.
(iii) There is a method, independent of the number of possible worlds, to calculate minsupp, minacc, maxsupp, and maxacc for t. The formulas are given in the twinned paper.
(iv) The complexity of the NIS-Apriori algorithm is at most about twice that of the DIS-Apriori algorithm. Since the DIS-Apriori algorithm is an adjusted Apriori algorithm, the NIS-Apriori algorithm runs in the same time order as Agrawal's Apriori algorithm. Even though the NIS-Apriori algorithm handles rules defined over possible worlds, whose number increases exponentially, it escapes exponential time complexity.
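The definitions of certain and possible rules, and of minsupp, maxsupp, minacc, and maxacc, can be checked by brute force on a very small NIS. (This enumeration over all derived DISs is exactly what NIS-Apriori avoids; the three-object data and attribute names below are illustrative only.)

```python
from itertools import product

# A tiny NIS (toy data): each cell lists its possible values,
# so a '?' cell simply has more than one candidate.
nis = [
    {"a5": ["p"],      "a16": [1]},
    {"a5": ["p", "g"], "a16": [1]},   # a '?' in attribute a5
    {"a5": ["g"],      "a16": [2]},
]

def derived_diss(nis):
    """Enumerate every derived DIS: one choice per uncertain cell."""
    keys = list(nis[0])
    cells = [vals for row in nis for vals in row.values()]
    for choice in product(*cells):
        it = iter(choice)
        yield [{k: next(it) for k in keys} for _ in nis]

def supp_acc(dis, cond, dec):
    """support(t) and accuracy(t) of [cond] => [dec] in one DIS."""
    (ca, cv), (da, dv) = cond, dec
    n_cond = sum(1 for r in dis if r[ca] == cv)
    n_both = sum(1 for r in dis if r[ca] == cv and r[da] == dv)
    return n_both / len(dis), (n_both / n_cond if n_cond else 1.0)

cond, dec = ("a5", "p"), ("a16", 1)
values = [supp_acc(d, cond, dec) for d in derived_diss(nis)]
minsupp, maxsupp = min(s for s, _ in values), max(s for s, _ in values)
minacc, maxacc = min(a for _, a in values), max(a for _, a in values)

alpha, beta = 0.3, 0.6
certain = all(s >= alpha and a >= beta for s, a in values)   # rule in EVERY derived DIS
possible = any(s >= alpha and a >= beta for s, a in values)  # rule in SOME derived DIS
print(minsupp, maxsupp, minacc, maxacc, certain, possible)
```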
In the subsequent sections, we employ the above results and realise a rule generator from NISs. Furthermore, we realise some decision support procedures by using the obtained rules.

Rule generator in SQL for NISs
For rule generation from NISs, we realised the procedures step1, step2, and step3. In step1, the DIS-Apriori algorithm with the minsupp and minacc values is called for certain rule generation, and then the DIS-Apriori algorithm with the maxsupp and maxacc values is called for possible rule generation. In step2 and step3, similar procedures are executed. Fig. 12 shows the generated tables: the tables c** store certain rules and the tables p** store possible rules. Fig. 13 shows the certain rules in c1_rule and the possible rules in p2_rule.

Decision support environment in SQL
In Sections 2 and 3, we described the rule generators for DISs and NISs, respectively, and clarified phase (i) in Figs. 1 and 2. This section clarifies phases (ii) and (iii) in Figs. 1 and 2, namely the environment for decision support by the obtained rules.

Redundancy between rules
We say that an implication A∧B ⇒ C is redundant to the implication A ⇒ C. After obtaining rules, we view them as logical implications in propositional logic. If A ⇒ C holds, then A∧B ⇒ C automatically holds, because (A∧B ⇒ C) ≡ (A ⇒ C) ∨ (B ⇒ C). Thus, we regard (A∧B ⇒ C) as a rule if A ⇒ C is recognised as a rule. (If we put a large number of descriptors in the condition part, the constraint support ≥ α may no longer be satisfied, so such an implication may actually fail to be a rule.) We employ this redundancy concept and generate rules with a minimal condition part (a minimal complex) in [24]. Otherwise, there would be a large number of rules redundant to one rule, like A∧B ⇒ C, A∧D ⇒ C, A∧E ⇒ C, ..., for a rule A ⇒ C. We are interested in rules with a minimal condition part.
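The redundancy check itself is a proper-subset test on condition parts: an implication is dropped when a strictly smaller condition with the same decision is already a rule. A minimal sketch (the rule layout and the helper minimal are hypothetical, not the implemented SQL):

```python
# Each rule: (frozenset of condition descriptors, decision descriptor).
rules = [
    (frozenset([("a5", "p")]),              ("a16", 1)),
    (frozenset([("a5", "p"), ("a7", "h")]), ("a16", 1)),  # redundant to the first
    (frozenset([("a5", "g"), ("a7", "h")]), ("a16", 2)),
]

def minimal(rules):
    """Keep only rules whose condition part has no proper subset
    that is itself a rule with the same decision part."""
    keep = []
    for cond, dec in rules:
        if not any(c < cond and d == dec for c, d in rules):
            keep.append((cond, dec))
    return keep

for cond, dec in minimal(rules):
    print(sorted(cond), "=>", dec)
```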
In decision support, we need to pay attention to this redundancy (Fig. 14).

(5) (Definitions of the procedures sd_rule1, sd_rule2, and sd_rule3.)
These procedures search the rules stored in the tables rule1, rule2, and rule3. Let us consider Figs. 6 and 8: for such tables, we apply sd_rule1 and sd_rule2 in Fig. 14. The role of the described procedures may not be important for a small number of rules. However, if we employ lower α and β values, we obtain a large number of rules, and examining the constraints manually is ineffective. In such cases, the realised sd_rule1, sd_rule2, and sd_rule3 become very important.

(6) (Definitions of the procedures sn_rule1, sn_rule2, and sn_rule3.)
The procedures sn_rule1, sn_rule2, and sn_rule3 are almost the same as sd_rule1, sd_rule2, and sd_rule3. The three procedures search the rules stored in the tables c1_rule, c2_rule, c3_rule, p1_rule, p2_rule, and p3_rule. For such tables, we apply sn_rule1 and sn_rule2 in Fig. 15.
The first query is 'to decide the decision for [a5, p] by the obtained rules'. Since t: [a5, p] ⇒ [a16, 1] is stored in c1_rule and p1_rule, the decision [a16, 1] is obtained by using t, and we know the degree of t by minsupp(t), minacc(t), maxsupp(t), and maxacc(t). The second query is 'to decide the decision for [a5, g]∧[a7, h] by the obtained rules'. In this case, no rule with [a5, g] or [a7, h] is stored in c1_rule or p1_rule, but a rule with [a5, g]∧[a7, h] is stored in p2_rule. Even though the support value may not be high enough, we will probably conclude [a16, 2].
Thus, the proposed and implemented procedures are effective for decision support. The procedures search the obtained rules in tables, so the execution time is short. However, if the condition does not match any obtained rule, we have no information for that condition.

Decision support environment for DISs: phase (iii)

(7) (Definitions of the procedures sdrdf1, sdrdf2, and sdrdf3.)
Each procedure in the formulas (7) searches the rdf table instead of the tables with the obtained rules. Therefore, it takes more execution time than sd_rule1, sd_rule2, and sd_rule3. However, we obtain all information about the specified condition.
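The two-stage behaviour, a fast rule-table look-up (phase (ii)) falling back to a scan of the rdf table (phase (iii)), can be sketched as follows. (sqlite3 stands in for MySQL; decide is a hypothetical wrapper and the data are toy values, not the implemented procedures themselves.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE rdf (object INTEGER, attrib TEXT, value TEXT)")
cur.executemany("INSERT INTO rdf VALUES (?, ?, ?)",
    [(1, "a5", "p"), (1, "a16", "1"),
     (2, "a5", "p"), (2, "a16", "1"),
     (3, "a5", "g"), (3, "a16", "2"),
     (4, "a5", "g"), (4, "a16", "1")])
cur.execute("CREATE TABLE rule1 (attrib TEXT, value TEXT, dec TEXT)")
cur.execute("INSERT INTO rule1 VALUES ('a5', 'p', '1')")  # [a5,p] => [a16,1]

def decide(attrib, value, n=4):
    # Phase (ii): a fast look-up in the stored rule table.
    cur.execute("SELECT dec FROM rule1 WHERE attrib = ? AND value = ?",
                (attrib, value))
    hit = cur.fetchone()
    if hit:
        return ("rule", hit[0])
    # Phase (iii): no matching rule -- scan the rdf table and report
    # support and accuracy for every implication with this condition.
    cur.execute("""
        SELECT d.value,
               COUNT(*) * 1.0 / ? AS support,
               COUNT(*) * 1.0 / (SELECT COUNT(*) FROM rdf c2
                                 WHERE c2.attrib = ? AND c2.value = ?) AS acc
        FROM rdf c JOIN rdf d ON c.object = d.object
        WHERE c.attrib = ? AND c.value = ? AND d.attrib = 'a16'
        GROUP BY d.value
    """, (n, attrib, value, attrib, value))
    return ("search", cur.fetchall())

print(decide("a5", "p"))   # answered by a stored rule
print(decide("a5", "g"))   # falls back to the rdf search
```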
In Fig. 16, the procedure sd_rule2 is executed at first, and we have no information on the specified condition.

Decision support environment for NISs: phase (iii)
Similarly to the previous subsection, we implemented the procedures snrdf1, snrdf2, and snrdf3. Each procedure in the formulas (8) searches the nrdf table instead of the tables with the obtained rules. Therefore, it takes more execution time than sn_rule1, sn_rule2, and sn_rule3. However, we obtain all information about the implications with the specified condition.
In Fig. 17, the procedure sn_rule1 is executed at first, and we have no information on the condition [a1, a]; namely, no rule with the condition [a1, a] is stored in the rule tables.

Decision support in DISs: A case of the phishing data set
This subsection shows an example using the Web Phishing data set in the UCI machine learning repository [22]. This data set stores characteristics of web sites and consists of 1353 objects, nine condition attributes, and one decision attribute Result with three attribute values: 1 (Legitimate), 0 (Suspicious), and −1 (Phishy). The apriv2 procedure generated 21 rules for support(t) ≥ 0.1 and accuracy(t) ≥ 0.7; it took about 400 s. Fig. 19 shows decision support by using the obtained results.

Decision support in NISs: A case of the congressional voting data set
This subsection shows an example using the Congressional Voting data set, which stores voting records of the US Congress. This data set consists of 435 objects, 16 condition attributes, and one decision attribute a1 whose attribute values are the political parties rep(ublican) and dem(ocrat).
The step1 procedure generated 18 certain rules in c1_rule and 21 possible rules in p1_rule for support(t) ≥ 0.3 and accuracy(t) ≥ 0.7; it took 18.71 s. Similarly, step2 generated two certain rules in c2_rule and eight possible rules in p2_rule; it took 17.43 s. No rule was generated by step3. The certain rules satisfy support(t) ≥ 0.3 and accuracy(t) ≥ 0.7 in each of the more than 10^100 derived DISs. In particular, the two certain rules [a5, n] ⇒ [a1, dem] and [a5, y] ⇒ [a1, rep] are very strong: given a person's answer to the attribute a5, we can easily conclude his or her supported party by using the two tables in Fig. 20.

Discussion about the decision support functionality
We applied the implemented procedures to the obtained tables, and we found one curious case, shown in Fig. 21: two inconsistent possible rules. The reason lies in the definition of a rule. The constraint was support(t) ≥ 0.2 and accuracy(t) ≥ 0.5, and this accuracy threshold seems too low. In possible rule generation, we calculate the maxacc(t) value in the most suitable case for t; therefore, we obtain the two inconsistent possible rules in Fig. 22. This is not wrong; such cases may occur by the definition of possible rules. Actually, when we changed the condition to accuracy(t) ≥ 0.7, such curious cases disappeared. In decision support, the application of phases (ii) and (iii) in Figs. 1 and 2 will be effective. In the RNIA framework, rules and decisions are evaluated by the criterion values support and accuracy, and the proposed and implemented procedures provide all information on support(t) and accuracy(t). However, if there is little difference in the support and accuracy values, as in Figs. 21 and 22, it may be difficult to settle on one decision; we may need other criterion values in such cases.

Concluding remarks
This paper clarified rule-based decision support in RNIA and reported its prototype system in SQL. The definitions of certain rules and possible rules seem natural; however, few software tools handle them, because the rules are defined over all derived DISs, whose number may exceed 10^100. Without an effective property, it would be hard to obtain such rules. The NIS-Apriori algorithm affords a solution to this problem, and we implemented the prototype, termed NIS-Apriori in SQL. This algorithm forms the core part for handling uncertainty, and we applied it to the decision support environment. Rule generation and decision support based on NISs have not been investigated previously; therefore, research related to the NIS-Apriori algorithm and NISs is novel.
For the application of the obtained rules, we proposed some procedures and implemented them in SQL. If there are a large number of rules, it is hard to carry out phase (ii) in Figs. 1 and 2 manually; we gave a solution to this problem. Searching the tables with rules is less time-consuming, but we cannot always obtain a decision for an arbitrary condition; in most such cases, we have no information. Thus, we realised procedures to search the rdf or nrdf tables. The implemented procedures will be effective for decision support: if a person's response to the attribute a5 is 'no', he or she will support the Democratic party, and if the response is 'yes', he or she will support the Republican party.