Exploring the relationship between hadith narrators in Book of Bukhari through SPADE algorithm

As one of the sources of law in Muslim society, hadith is very important to study. Unlike most hadith-related research, which focuses on content, we examine the relationship patterns between hadith narrators. In hadith science, the series of narrators who transmit a hadith is referred to as a sanad. The sanad must connect to the Prophet as the primary source of the hadith. Research on the relationships between narrators is therefore fundamental because these relationships affect the quality and validity of a hadith. This paper analyzes the patterns of hadith narrators using Sequential Pattern Discovery using Equivalence Classes (SPADE). Because the narrators and the content are still mixed in the hadith books we use, we first separate the narrator data from the content. This study therefore provides detailed information on the steps in analyzing the patterns of hadith narrators. Some of the highlights of this paper are:
• Algorithm 1 provides the detailed steps in data preprocessing to obtain the "clean data" needed to analyze the pattern of narrator relationships.
• Algorithm 2 provides a detailed description of analyzing the pattern between hadith narrators using SPADE.

relates to the continuity of the sanad in a hadith. A study to find the relationship between narrators has been conducted using the FP-Growth and ECLAT algorithms [11]. Another study judged hadiths according to the quality of their sanad using various computational and Natural Language Processing (NLP) methods [12]. Hadith experts have long determined the continuity of the sanad with elaborate methods grounded in data and religious scholarship; advances in data analysis now make it possible to study the results of that work computationally.
From a data mining point of view, the relationship among hadith narrators forms an "if-then" rule. The rule, in this case, means that if narrator X narrates a hadith, then narrators Y and Z also narrate the hadith in the same sanad. This can occur not only in one hadith but can also be repeated in other hadiths. The pattern is similar to sales transaction data, where customers buy an itemset simultaneously. However, the narrators in one sanad are ordered sequentially according to their generation. Thus, a closer examination of the narrator data requires a method that finds association patterns while observing the order between items. Conventional association rules cannot accommodate data in which transactions occur sequentially. One data mining method that looks for patterns of relationships between items while considering the time sequence of events in large amounts of data is Sequential Pattern Mining [13]. Such patterns can be found when the data are large and successive events occur several times. The Sequential Pattern Mining algorithm used here is Sequential Pattern Discovery using Equivalence Classes (SPADE), a technique that uses a vertical data format ordered by time [14].
Throughout this paper, we use SPADE to study the relationship among narrators in all hadiths in the Book of Bukhari. However, because the hadith data we use still mix sanad and matan, we need to preprocess the data first. The preprocessing has several stages: selecting the data to be used, transforming the data, cleaning the chosen data of unnecessary elements, and formatting the data to suit the analysis with SPADE. This stage is reasonably complex because the available information is still very "dirty." In addition to the mixed sanad and matan, the same narrator's name is sometimes written inconsistently, and names are accompanied by titles or personal pronouns. This inconsistency can interfere with the pattern analysis because the same narrator can be treated as different narrators. Furthermore, some hadiths consist of multiple sanad chains, which should be separated. Therefore, at this stage we cleaned the data in as much detail as possible to maximize the quality of the pattern analysis using SPADE. Once the data are "clean enough," we continue with the SPADE analysis. In this stage, we conduct three main processes: determining the frequent 1-sequences, determining the frequent 2-sequences, and determining the overall frequent k-sequences to obtain patterns of narrator relationships from the book of hadith that we analyzed. After obtaining the relationship patterns, we evaluate our analysis using the minimum support, confidence, and lift ratio. The flow of the research is shown in Fig. 1. We provide the details of each step of our analysis in the following sections so that our study can be easily reproduced.

Data collection
The hadith data used in this research are taken from the Book of Bukhari, one of the nine books of hadith, which consists of 6638 hadiths. The raw data can be accessed at https://github.com/sutanlab/hadith-api . We chose the Book of Bukhari because it is regarded as one of the two best collections among Sunni Muslims and has the highest level of authenticity among the nine books of hadith. The raw data still consist of both matan and sanad.

Data preprocessing
In the database obtained from https://github.com/sutanlab/hadith-api , the Book of Bukhari hadith collection is stored in .json format. There are two types of sanad chain, i.e., single and multiple sanad chains. Fig. 2 illustrates the difference between them. A multiple sanad chain is separated into n single sanad chains, where n is the number of chains, and each is then treated like any other single sanad chain. After the separation process, we construct a data frame consisting of three variables, i.e., "number", "arab", and "id". Variable "number" is the serial number of the hadith in the Book of Bukhari, from hadith number 1 to hadith number 6638. Variable "arab" contains the matan and sanad of each hadith in Arabic. Variable "id" contains the same content as variable "arab" but in Bahasa (the translation of the hadith into Bahasa, an Indonesian language). In this research, we only use the variables "number" and "id". For the variable "id", which consists of matan and sanad (the sequential flow of hadith narrators), we then reduce the data to the hadith sanad only.
In general, there are five steps taken in data preprocessing for the hadith data, which are as follows:
1. Separation process. Hadith data that consist of a multiple sanad chain are separated into n single sanad chains.
2. Data selection. The data included in the analysis are the variable "number" and the variable "id". Cleaning data for variable "id": we only used the sanad of each hadith, i.e., the sequence of narrators for each hadith number; the other data, namely the matan or editorial content of the hadith, are omitted.
3. Variable transformation. We transformed the variables "number" and "id" into a data frame table consisting of i rows and j columns. The number of rows is the number of hadiths in the database, while the number of columns is the number of narrator names in each hadith, arranged sequentially from the 1st column to the jth column.
4. Cleaning data for hadith narrators. In this phase, we cleaned the names of the hadith narrators in the variable "id". We omitted personal pronouns and prayer sentences that follow the names of the narrators, such as "radliyallahuanhu" (and other similar prayers), "the wife of the Prophet sallallahu alaihi wasallam", "his uncle", "my uncle", "the son of your prophet's uncle", "nephew", "his grandfather", "his father", "father", "his mother", "my mother", etc. In addition, we removed the spaces in the names of the narrators and changed the names to lowercase. All of these cleaning steps are carried out to avoid data redundancies that can affect analytical decisions.
5. Data formatting. We modified the data into a transaction data format and formed a vertical sequence.
Based on Table 1, the first hadith in the Book of Bukhari (id.bukhari 1) is narrated by umarbinalkhaththab as the first narrator (id.rawi.bukhari 1). The second narrator of the first hadith (id.rawi.bukhari 2) is alqamahbinwaqashallaitsi, and so on. The data formatting step is mandatory for sequential pattern mining using the SPADE algorithm.
In simple terms, the data preprocessing steps are summarized in Algorithm 1 .
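The cleaning and formatting steps (steps 4 and 5) can be illustrated with a minimal Python sketch. It is not Algorithm 1 itself: the function names `clean_narrator` and `to_vertical`, the abbreviated phrase list, and the spaced spellings of the two sample names are assumptions made for illustration; only the cleaned forms and the listed phrases come from the text above.

```python
import re

# Abbreviated, illustrative list of phrases to strip (the text mentions
# these and "other similar" ones; a real list would be longer).
NOISE_PHRASES = [
    "radliyallahuanhu",
    "the wife of the prophet sallallahu alaihi wasallam",
    "the son of your prophet's uncle",
    "his grandfather",
    "his father",
    "his mother",
    "my mother",
    "his uncle",
    "my uncle",
    "nephew",
    "father",
]


def clean_narrator(name):
    """Lowercase a narrator name, drop noise phrases, and remove all spaces."""
    text = name.lower()
    # Remove longer phrases first so "his father" does not leave "his" behind.
    for phrase in sorted(NOISE_PHRASES, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text)
    return re.sub(r"\s+", "", text)


def to_vertical(chains):
    """Turn {hadith_number: [narrator, ...]} into vertical-format rows.

    Each row is (sequence_ID, event_ID, size, item); size is 1 because
    every position in a single sanad chain holds exactly one narrator.
    """
    rows = []
    for number in sorted(chains):
        for position, narrator in enumerate(chains[number], start=1):
            rows.append((number, position, 1, clean_narrator(narrator)))
    return rows


# Hypothetical raw spellings of the first two narrators of hadith 1.
sample = {1: ["Umar bin Al Khaththab radliyallahuanhu",
              "Alqamah bin Waqash Al Laitsi"]}
for row in to_vertical(sample):
    print(row)
```

Whole-word matching is used so that a phrase such as "father" is removed only when it stands alone, not when it happens to be part of a name.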

SPADE analysis
After obtaining the "clean data" of hadith sanad, we studied the pattern of relationships between narrators using Sequential Pattern Discovery using Equivalence Classes (SPADE), one of the sequential pattern mining algorithms that use a vertical data format. The sequence database is converted into a series of sequences in the vertical data format [itemset: (sequence_ID, event_ID)]. In this context, sequence_ID represents the hadith number, while event_ID represents the order of the narrators in the same hadith, ordered by generational proximity to the Prophet Muhammad PBUH. In other words, each itemset has a corresponding sequence identifier and event identifier, and the event identifier serves as the timestamp of the itemset. The pairs (sequence_ID, event_ID) for each itemset form an ID_list [14, 15]. Table 1 is an example of the vertical data format, where the id.bukhari column is the sequence_ID and the id.rawi.bukhari column is the event_ID. Furthermore, the size.bukhari column gives the number of items for each event_ID, and the rawi.bukhari column gives the item, i.e., the hadith narrator.

Algorithm 2 (excerpt)
Determine the frequent 2-sequence and frequent 3-sequence (minconf = 0.001)
    ruleInduction(cseq, confidence = 0.001, control = list(verbose = TRUE))
Determine the strongest rules
    sort(subset(rules, subset = lift > 1), by = "lift")

The steps in finding frequent sequences and determining the frequent rules are as follows [14].

Determine the frequent 1-sequence
Frequent 1-sequences in a database with a vertical format (sequence database) are determined by examining each itemset in the sequence database. Each itemset has an ID_list, i.e., a list of (sequence_ID, event_ID) pairs. The support count of an itemset is incremented for each new sequence_ID in its ID_list, and the support value is then determined using Eq. (1). The sequences with a support value greater than the minimum support are included in the frequent 1-sequences. For example, suppose that we have the vertical ID_list data transaction provided in Table 2 and we set the minimum support value to 0.5.

Table 2 Example of vertical ID_list data transaction.
Sequence_ID  Event_ID  Size  Item
1            1         1     Narrator_A
1            2         1     Narrator_B
1            3         1     Narrator_C
2            1         1     Narrator_B
2            2         1     Narrator_C
2            3         1     Narrator_D
3            1         1     Narrator_A
3            2         1     Narrator_B
3            3         1     Narrator_C
3            4         1     Narrator_D
3            5         1     Narrator_E
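As an illustration of this first step, the following Python sketch builds the ID_lists for the Table 2 example and keeps the items whose support exceeds 0.5. The helper names `build_id_lists` and `support` are assumptions introduced here, and the sketch is independent of the R functions used in the actual analysis.

```python
from collections import defaultdict

# Table 2 rows as (sequence_ID, event_ID, item); the size column is always 1.
TABLE_2 = [
    (1, 1, "Narrator_A"), (1, 2, "Narrator_B"), (1, 3, "Narrator_C"),
    (2, 1, "Narrator_B"), (2, 2, "Narrator_C"), (2, 3, "Narrator_D"),
    (3, 1, "Narrator_A"), (3, 2, "Narrator_B"), (3, 3, "Narrator_C"),
    (3, 4, "Narrator_D"), (3, 5, "Narrator_E"),
]


def build_id_lists(rows):
    """Map each item to its ID_list of (sequence_ID, event_ID) pairs."""
    id_lists = defaultdict(list)
    for sid, eid, item in rows:
        id_lists[item].append((sid, eid))
    return dict(id_lists)


def support(id_list, n_sequences):
    """Fraction of sequences whose sequence_ID appears in the ID_list."""
    return len({sid for sid, _ in id_list}) / n_sequences


id_lists = build_id_lists(TABLE_2)
n_sequences = len({sid for sid, _, _ in TABLE_2})
min_support = 0.5

supports = {item: support(pairs, n_sequences) for item, pairs in id_lists.items()}
# Keep items whose support is strictly greater than the minimum support.
frequent_1 = sorted(item for item, s in supports.items() if s > min_support)
print(frequent_1)
```

Under these assumptions, Narrator_E appears in only one of the three sequences and is therefore dropped, while Narrator_A to Narrator_D are retained.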

Determine the frequent 2-sequences
The frequent 1-sequences are used to search for the frequent 2-sequences. Each frequent 1-sequence is combined with the other frequent 1-sequences. For example, when the 1-sequence X is combined with the 1-sequence Y, the possible 2-sequences are X, Y, meaning that X and Y appear together in the same transaction; X → Y, meaning that item Y appears after X; and Y → X, meaning that item X appears after Y. For each combination of frequent 1-sequences, the ID_lists are checked to determine whether the sequence_ID is the same. If it is, the event_ID of X is compared with the event_ID of Y. If the two are equal, the entry is included in the ID_list of the 2-sequence X, Y. If the event_ID of Y is greater than that of X, the entry is included in the 2-sequence X → Y; otherwise, it is included in Y → X. As with the frequent 1-sequences, the support count is incremented for each new sequence_ID. The 2-sequences with a support value greater than the minimum support are included in the frequent 2-sequences. Based on the example in the first step, we obtain the frequent 2-sequences from the combinations of Narrator_A, Narrator_B, Narrator_C, and Narrator_D and compute their support values from Table 2.

Determine the frequent k-sequences
The frequent (k − 1)-sequences with the same prefix are joined to find the frequent k-sequences. To discover 3-sequences, for example, we combine frequent 2-sequences with the same prefix, and to find 4-sequences, we combine frequent 3-sequences with the same prefix. The support value is then checked for each candidate. The k-sequences with a support value greater than the minimum support are included in the frequent k-sequences. The search terminates when no frequent (k − 1)-sequences remain that can be combined or when no further frequent sequences are found.
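The joins described above can be sketched in Python on the Table 2 example. This is a simplified illustration rather than the implementation used in the study: the names `temporal_join` and `mine` are assumptions, and because every event in the narrator data holds exactly one item (size 1), only the "X → Y" temporal join is implemented and the same-event case "X, Y" is left out.

```python
from collections import defaultdict

TABLE_2 = [
    (1, 1, "Narrator_A"), (1, 2, "Narrator_B"), (1, 3, "Narrator_C"),
    (2, 1, "Narrator_B"), (2, 2, "Narrator_C"), (2, 3, "Narrator_D"),
    (3, 1, "Narrator_A"), (3, 2, "Narrator_B"), (3, 3, "Narrator_C"),
    (3, 4, "Narrator_D"), (3, 5, "Narrator_E"),
]


def temporal_join(left, right):
    """ID_list of the sequence 'left, then later the last item of right'.

    An entry (sid, eid) of `right` is kept when `left` has an entry in the
    same sequence with a smaller event_ID.
    """
    joined = set()
    for sid_l, eid_l in left:
        for sid_r, eid_r in right:
            if sid_l == sid_r and eid_r > eid_l:
                joined.add((sid_r, eid_r))
    return sorted(joined)


def mine(rows, min_support):
    """Return {sequence tuple: support} for all frequent sequences."""
    id_lists = defaultdict(list)
    for sid, eid, item in rows:
        id_lists[item].append((sid, eid))
    n_sequences = len({sid for sid, _, _ in rows})

    def sup(pairs):
        return len({sid for sid, _ in pairs}) / n_sequences

    # Frequent 1-sequences, keyed by a one-element tuple.
    level = {(item,): pairs for item, pairs in id_lists.items()
             if sup(pairs) > min_support}
    frequent = {}
    while level:
        frequent.update({seq: sup(pairs) for seq, pairs in level.items()})
        next_level = {}
        for seq_a, pairs_a in level.items():
            for seq_b, pairs_b in level.items():
                if seq_a[:-1] != seq_b[:-1]:
                    continue  # join only sequences sharing the same prefix
                pairs = temporal_join(pairs_a, pairs_b)
                if sup(pairs) > min_support:
                    next_level[seq_a + (seq_b[-1],)] = pairs
        level = next_level  # stops when no frequent k-sequence was found
    return frequent


frequent = mine(TABLE_2, min_support=0.5)
for seq, value in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(" → ".join(seq), round(value, 3))
```

Joining only sequences with a common prefix keeps the candidate set small, which is the role the equivalence classes play in SPADE; the loop ends as soon as a level yields no frequent sequence.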
Algorithm 2 summarizes the steps we conducted in analyzing the narrator data using the SPADE algorithm.

Formation of rules
Sequences that meet the minimum support are used to create rules. A 1-sequence contains only one item, so it is not used to build a rule. In a 2-sequence, the first item is the antecedent and the second item is the consequent. In sequences with more than two items (k-sequences), the last item is the consequent and the preceding items form the antecedent. For example, the 4-sequence A → B → C → D yields the rule A → B → C ⇒ D. Next, the confidence value is calculated (Eq. (3)). The accepted rules are those with a confidence value greater than the specified minimum confidence. For each accepted rule, the lift ratio is then calculated using Eq. (4).
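The way a frequent sequence is split into a rule can be shown with a short Python sketch. The function names `sequence_to_rule` and `format_rule` are hypothetical and serve only to illustrate the convention described above.

```python
def sequence_to_rule(sequence):
    """Split a k-sequence into (antecedent, consequent); None for k < 2."""
    if len(sequence) < 2:
        return None  # a 1-sequence has no antecedent, so it forms no rule
    return tuple(sequence[:-1]), sequence[-1]


def format_rule(rule):
    """Render a rule in the notation 'A → B ⇒ C'."""
    antecedent, consequent = rule
    return " → ".join(antecedent) + " ⇒ " + consequent


print(format_rule(sequence_to_rule(("A", "B", "C", "D"))))
```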
The level of importance of a rule is determined by the following parameters [16, 17]:

Support
Support is the percentage of item combinations in the database. Support for the rule "X ⇒ Y" is the probability that the attributes X and Y occur together. The support value for one item is as follows:

\[ \mathrm{Support}(X) = P(X) = \frac{n(X)}{n(S)} \qquad (1) \]

where
P(X): probability of occurrence of X
n(X): the number of transactions containing X
n(S): the total number of transactions

The support value of two items is obtained from the following formula:

\[ \mathrm{Support}(X \Rightarrow Y) = P(X \cap Y) = \frac{n(X \cap Y)}{n(S)} \qquad (2) \]

where
P(X ∩ Y): probability that X and Y occur simultaneously
n(X ∩ Y): the number of simultaneous occurrences of X and Y

Support in this study is defined as the probability that several items (narrators) narrate a hadith in one sanad (interrelated), relative to all hadiths narrated. To select the narrators that contribute to narrating the hadith, a minimum support value must be determined. Minimum support is a parameter that sets the lower limit on the frequency of events, or support count, that a group of data must meet to be used as a rule.

Confidence
Confidence is defined as the likelihood that several events co-occur given that one of them is known to have occurred. The confidence value of a rule is calculated using the formula below:

\[ \mathrm{Confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{P(X \cap Y)}{P(X)} \qquad (3) \]

where
P(Y | X): the conditional probability of event Y given X

In this study, confidence is the probability that several items (narrators) narrate a hadith in one sanad (interrelated) given that one of the narrators is known to have reported the hadith. A minimum confidence value was also specified in this study.

Lift Ratio
The Lift Ratio is a metric for determining how strong the rules generated by a sequential pattern mining algorithm are. The lift ratio can take any value between 0 and infinity. Unlike support and confidence, for which a minimum value is specified, the lift ratio has no minimum value. When the lift ratio is less than one, the antecedent (X) has a negative effect on the consequent (Y). When the lift ratio equals one, the antecedent and the consequent are independent: they may frequently appear together, but the consequent (Y) does not depend on the antecedent (X). When the lift ratio is greater than one, the antecedent (X) has a positive effect on the consequent (Y) [18, 19]. The formula for the lift ratio is as follows:

\[ \mathrm{Lift}(X \Rightarrow Y) = \frac{\mathrm{Confidence}(X \Rightarrow Y)}{\mathrm{Support}(Y)} = \frac{P(X \cap Y)}{P(X)\,P(Y)} \qquad (4) \]
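The three measures can be computed for the small Table 2 example with the following Python sketch. It assumes the standard definitions (support as the fraction of sequences containing the ordered pattern, confidence as the ratio of the rule count to the antecedent count, and lift as confidence divided by the support of the consequent); the names `occurs_in`, `count`, and `rule_metrics` are introduced here for illustration only.

```python
# The three sequences of Table 2, each ordered by event_ID.
SEQUENCES = [
    ["Narrator_A", "Narrator_B", "Narrator_C"],
    ["Narrator_B", "Narrator_C", "Narrator_D"],
    ["Narrator_A", "Narrator_B", "Narrator_C", "Narrator_D", "Narrator_E"],
]


def occurs_in(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in the same order."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, which enforces the order.
    return all(item in remaining for item in pattern)


def count(pattern):
    """Number of sequences that contain the ordered pattern."""
    return sum(occurs_in(pattern, seq) for seq in SEQUENCES)


def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent.

    Assumes the antecedent occurs in at least one sequence.
    """
    n = len(SEQUENCES)
    both = count(list(antecedent) + [consequent])
    support = both / n
    confidence = both / count(list(antecedent))
    lift = confidence / (count([consequent]) / n)
    return support, confidence, lift


for antecedent, consequent in [(["Narrator_A"], "Narrator_B"),
                               (["Narrator_D"], "Narrator_E"),
                               (["Narrator_A"], "Narrator_D")]:
    s, c, l = rule_metrics(antecedent, consequent)
    print(" → ".join(antecedent), "⇒", consequent,
          f"support={s:.3f} confidence={c:.3f} lift={l:.3f}")
```

In this toy data a confidence of 1 does not by itself imply a high lift: when the consequent appears in every sequence, as Narrator_B does, the lift stays at one regardless of the antecedent.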

Results
With a minimum support of 0.001 and a minimum confidence of 0.001, we obtained 1374 rules. We then summarized the rules by selecting those with a confidence value of 1, meaning that whenever X (the antecedent) narrates a hadith, Y (the consequent) also narrates it in the same sanad. On this basis, we obtained the ten rules provided in Table 3.
The first rule is a frequent 2-sequence rule with a support value of 0.0024, meaning that 0.24%, or 16 hadiths, in the Book of Bukhari are narrated by Malik bin Al Huwairits and Abu Qilabah. The confidence value of 1 means that if Malik bin Al Huwairits narrated a hadith, then it is also narrated by Abu Qilabah with a confidence level of 100%. The lift ratio of 119.6452 (greater than one) means that the relationship between the two hadith narrators is strong. Historical records support this relationship: the two narrators were teacher and student, with Malik bin Al Huwairits as the teacher of Abu Qilabah. Moreover, Malik bin Al Huwairits is a hadith narrator from the generation of the Prophet's companions.
Meanwhile, Abu Qilabah, whose real name is Abdullah bin Zaid 'Amru bin Nabil, is a hadith narrator from the generation of tabi'in (early Muslims who lived after the companions of the Prophet and did not experience the life of the Prophet Muhammad). This fact shows that the relationship between the two hadith narrators is strong. Different patterns are shown by the second, third, fifth, sixth, seventh, eighth, and tenth rules, where the sanad in these rules consists of more than two people. For example, in the fifth rule, Ma'mar narrated a hadith that Hammam bin Munabbih and Abu Hurairah had previously narrated. For the other rules with more than two narrators, the nature of the relationship is the same as for the fifth rule.
Based on the acquired rules, we collected information about the themes narrated by the hadith narrators in each rule and summarized it in Fig. 4, which shows the distribution of themes in the hadiths narrated by the narrators in the ten acquired rules. The results show that the narrators in rule 1 and rule 9 predominantly narrated hadiths on the call to prayer, while those in rule 2 predominantly narrated hadiths on Tawheed. The dominant themes of the other rules can be identified from the highest bar in Fig. 4.

Conclusion and perspective
This research discussed the relationship patterns between hadith narrators in the Book of Bukhari using the SPADE algorithm. The study describes the work steps in detail so that other researchers can easily reproduce the procedure. From the 6638 recorded hadiths in the Book of Bukhari, we obtained ten rules that describe the relationship patterns between the narrators, with a minimum support of 0.001 and a confidence value of 1. We also identified the themes narrated by the hadith narrators and summarized the dominant theme of each rule.
For further research, the SPADE analysis can be developed into a web-based application. Users would enter the name of a narrator and the name of the book of hadith, together with the minimum support and minimum confidence. The application could then show the sanad chain of narrators and the rules along with their support, confidence, and lift ratio values, as well as the themes narrated by the narrators. For the convenience of users, the web-based application could also display graphs that illustrate the themes narrated by the narrators in the rules formed by the SPADE analysis.