Mining Frequent Itemsets of Novel Characters Based on Association Rules

This article belongs to the field of data mining. It takes the novel "White Deer Plain" as its research object and uses association rule mining to study the characters of the book and reveal the relationships among them. Weka is used as an auxiliary tool, and the Apriori algorithm is used to mine the association rules. The full text of the work is entered into a database and divided by chapter. Through co-occurrence analysis, the character names in the text are extracted as keywords and assigned weights. The keywords appearing in one chapter are treated as one record, and the keyword lists of all chapters are combined to form the data set. The database is scanned multiple times to find the frequent itemsets in the constructed data set, from which association rules between characters are derived. The method helps readers clarify and grasp the relationships between the characters before reading the full text, which can save reading time.


Overseas Research Status
Association rule mining has two basic algorithms: Apriori and FP-growth. Association rules were first proposed by R. Agrawal et al. in 1993. In 1994, they established the theory of item set lattice space and, on the basis of that theory, put forward the well-known Apriori algorithm, which was first proposed to solve the shopping basket problem. Apriori is still widely discussed as a classical algorithm for mining association rules, and many researchers have since studied association rule mining [1]. The main research directions for the Apriori algorithm to date include multi-cycle (hierarchical) mining algorithms, incremental updating algorithms, distributed and parallel mining algorithms, multi-level association rule mining algorithms, multi-value association rule mining algorithms, and association rule mining algorithms based on concept lattices.

Domestic Research Status
Research on association rules in China concentrates mainly on three aspects: algorithms, theory, and practical application. There are fewer research areas and examples than abroad. Wang Zhigang and other scholars have improved the classical Apriori algorithm and greatly raised its efficiency. Distributed parallel algorithms that combine the advantages of the FP-tree [2] and the Apriori algorithm have also been studied.

Basic Principles of Association Rules
Association rule mining is one of the methods of data mining. Let I = {i1, i2, …, im} be a set of items and T = {t1, t2, …, tn} be a transaction data set, in which each transaction tj is a subset of I. In association analysis, a set containing zero or more items is called an itemset. If an itemset L contains k items, then L is called a k-itemset [3]. Association rules are evaluated by support and confidence. For a transaction database D, the support of the rule A => B is the percentage of transactions in D that contain both A and B, that is, support(A => B) = P(A ∪ B), where A ∪ B means that a transaction contains every item of A and of B. The confidence is the percentage of the transactions in D containing A that also contain B, that is, the conditional probability confidence(A => B) = P(B | A). The minimum support and minimum confidence thresholds are set by the user. A rule that satisfies both thresholds is considered interesting. Mining association rules is divided into two steps: the first step mines the frequent itemsets, and the second step derives association rules from them. The process of mining association rules is shown in Figure 1 below.

Figure 1. The process of mining association rules (stages: data sources, database, data processing).
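
The definitions of support and confidence above can be illustrated with a small worked example. The sketch below is only an illustration: the transactions and the item labels "A", "B" and "C" are hypothetical placeholders and are not taken from the paper's data set.

```python
# Hypothetical toy data: each transaction is the set of items in one record.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Rule {A} => {B}: 3 of the 5 transactions contain both A and B,
# and 4 of the 5 transactions contain A.
print(support({"A", "B"}, transactions))
print(confidence({"A"}, {"B"}, transactions))
```

With these toy transactions the support of {A} => {B} is 3/5 and its confidence is about 3/4, so the rule would pass a minimum support of 0.5 and a minimum confidence of 0.7.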

Apriori algorithm
This paper uses the Apriori algorithm to mine association rules. The basic idea of the algorithm [4] has two steps. The first step finds all frequent itemsets, that is, the itemsets whose support is at least the predefined minimum support. The second step generates strong association rules from the frequent itemsets; these rules must satisfy both the minimum support and the minimum confidence. In the second step, all rules that contain only the items of a given frequent itemset are generated, with only one item on the right-hand side of each rule, and only the rules that meet the user-specified minimum confidence are kept. To find the frequent itemsets, the Apriori algorithm first scans the data set to determine the support of each item and obtains the set L1 of all frequent 1-itemsets. It then uses L1 to generate candidate 2-itemsets, scans the database to count the support of every candidate, and obtains the set L2 of frequent 2-itemsets. This procedure is repeated until no new frequent k-itemsets are generated.
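
The level-wise procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Weka implementation used in the paper: the function names `apriori` and `rules`, the toy transactions, and the use of "greater than or equal to" for both thresholds are choices made here for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for every frequent itemset."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def frequent(candidates):
        # One scan of the data set: count each candidate, keep the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    level = frequent(items)  # L1: frequent 1-itemsets
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)  # Lk: frequent k-itemsets
        result.update(level)
        k += 1
    return result


def rules(freq, min_confidence):
    """Generate rules with exactly one item on the right-hand side."""
    out = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            lhs = itemset - {item}
            conf = sup / freq[lhs]  # lhs is frequent, so it is in freq
            if conf >= min_confidence:
                out.append((lhs, item, sup, conf))
    return out


# Hypothetical toy data, not taken from the novel.
toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
freq = apriori(toy, min_support=0.6)
for lhs, rhs, sup, conf in rules(freq, min_confidence=0.7):
    print(sorted(lhs), "=>", rhs, round(sup, 2), round(conf, 2))
```

The loop stops as soon as a level produces no frequent itemsets, which corresponds to the stopping condition in the text. The lookup `freq[lhs]` relies on the fact that every subset of a frequent itemset is itself frequent.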

Mining results
The research object of this paper is the novel White Deer Plain (Bailuyuan), and the data mining tool Weka is used to mine association rules. The first 10 chapters of the novel are chosen as the corpus. Character keywords are extracted from each of these chapters, each chapter is treated as one record, and the records are combined into the data set. The character keywords are shown in Table 1 below.
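
One simple way to turn chapters into records is to check which character names occur in each chapter. The sketch below assumes plain substring matching and ignores the keyword weights; the function name `chapter_records`, the sample sentences, and the names "Alpha", "Beta" and "Gamma" are hypothetical placeholders rather than the paper's actual extraction procedure or data.

```python
def chapter_records(chapters, names):
    """Return one record per chapter: the set of names that occur in it."""
    return [{name for name in names if name in text} for text in chapters]


# Hypothetical placeholder chapters and character names.
chapters = [
    "Alpha met Beta at the gate.",
    "Beta left the village.",
    "Gamma and Alpha talked all night.",
]
names = ["Alpha", "Beta", "Gamma"]

records = chapter_records(chapters, names)
for number, record in enumerate(records, start=1):
    print(number, sorted(record))
```

Each resulting set plays the role of one transaction in the data set, so a list of ten such sets would correspond to the ten chapters of the corpus.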
The data set is then written as a script in the ARFF format. Frequent itemsets are mined with the Apriori algorithm; the parameter settings are shown in Figure 3 below.
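
The paper does not reproduce its ARFF script, so the following is only a sketch of how such a file might be generated. It assumes the common market-basket encoding in which each character is a nominal attribute with the single value `t` and absence is written as the missing value `?`; the function name `to_arff`, the relation name and the item labels are hypothetical.

```python
def to_arff(records, names, relation="characters"):
    """Serialize chapter records as ARFF text, one attribute per name."""
    lines = [f"@relation {relation}", ""]
    for name in names:
        # Names are quoted so that names containing spaces stay valid.
        lines.append(f"@attribute '{name}' {{t}}")
    lines += ["", "@data"]
    for record in records:
        # 't' marks a name present in the chapter, '?' marks it as absent.
        lines.append(",".join("t" if name in record else "?" for name in names))
    return "\n".join(lines) + "\n"


# Hypothetical records for two chapters and two characters.
arff_text = to_arff([{"A", "B"}, {"B"}], ["A", "B"])
print(arff_text)
```

Writing absence as a missing value rather than as a second nominal value keeps the miner from reporting itemsets made of absent characters, which is one reason this encoding is often chosen for basket-style data.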
Mining yields 14 frequent 1-itemsets, as shown in Table 2 below; 9 frequent 2-itemsets, as shown in Table 3 below; and 4 frequent 3-itemsets, as shown in Table 4 below. The frequent 4-itemsets are shown in Table 5 below, and the five best association rules are shown in Table 6.