1 Introduction

Traditional information filtering (IF) models were developed based on a term-based user profile approach (see [15, 20, 23]). The advantage of term-based profiles is efficient computational performance, as well as mature theories for term weighting that have emerged over the last couple of decades from the information retrieval (IR) and machine learning communities. However, term-based profiles suffer from the problems of polysemy and synonymy, and because IF systems are sensitive to data sets, significantly improving their effectiveness remains a challenging issue.

Over the years, people have often held the hypothesis that phrases should perform better than words, as phrases are more discriminative and arguably carry more “semantics”. At first, this hypothesis did not fare well in the history of IR [11, 27, 28]. Recently, language modeling approaches have gone beyond the term-based model that underlies BM25 by considering term dependencies in phrases (N-grams) for information retrieval [18, 35]. Although phrases are less ambiguous and more discriminative than individual terms, the likely reasons for the discouraging performance include: (1) phrases have inferior statistical properties to words, since they occur with low frequency; (2) the theory of computing probabilities based on term dependencies is not practical; (3) some language model-based feedback methods cannot naturally handle negative feedback; and (4) many of the phrases are redundant and noisy.

To overcome the limitations of term-based approaches, pattern mining based techniques have been used for information filtering, since data mining has developed techniques (e.g., maximal patterns, closed patterns and master patterns) for removing redundant and noisy patterns. One special filtering task was to extract usage patterns from Web logs [4, 47]. Other promising techniques were pattern taxonomy models (PTM) [32, 37], which discovered closed sequential patterns in text documents, where a pattern is a set of terms that frequently appears in paragraphs.

Pattern based approaches have shown encouraging improvements in effectiveness [36]. However, two challenging issues arise when pattern mining techniques are introduced into IF systems. The first is how to deal with low-frequency patterns, because the measures used in data mining (e.g., “support” and “confidence”) to learn the patterns turn out to be unsuitable in the filtering stage [15]. The second is how to effectively use negative feedback to revise the extracted features (including patterns and terms) for information filtering.

Many people believe that plenty of negative information is available and that negative documents are very useful, because they can help users to search for accurate information [35]. However, whether negative feedback can indeed largely improve filtering accuracy is still an open question. The existing methods of using both positive and negative feedback for IF can be grouped into two approaches. The first is to revise terms that appear in both positive and negative samples (e.g., Rocchio based models and SVM [23] based filtering models); this heuristic is natural if terms are assumed to be isolated atoms. The second approach is based on how often terms appear or do not appear in positive and negative samples (e.g., probabilistic models [2], and BM25 [23]). However, people usually view terms from multiple perspectives when they attempt to find what they want. They normally use two dimensions (“specificity” and “exhaustivity”) to decide the relevance of documents, paragraphs or terms. For example, “JDK” is a specific term for “Java Language”, whereas “LIB” is more general than “JDK” because it is also frequently used for C and C++.

Based on this observation, this paper proposes a pattern mining based approach for using both positive and negative feedback. It first extracts an initial list of terms from positive documents and selects some constructive negative documents (called offenders). It then extracts terms from negative patterns in the selected negative documents and classifies all terms into three categories: positive specific terms, general terms, and negative specific terms. In this way, different revision strategies can be applied to terms in different categories. In the implementation, only the weights of positive specific terms are increased, while the weights of negative specific terms are decreased based on their occurrences in the discovered negative patterns. Substantial experiments show that the proposed approach achieves very encouraging performance.

The remainder of this paper is organized as follows. Section 2 gives a detailed overview of related work. Section 3 reviews the concepts of pattern taxonomy mining. Section 4 introduces the equations for evaluating term weights based on discovered patterns. Section 5 describes the proposed method of using negative feedback. The empirical results and discussion are reported in Sect. 6, and the last section presents concluding remarks.

2 Related work

Different from IR systems, IF systems are commonly personalized to support the long-term information needs of users [3]. The main difference between IR and IF is that IR systems use “queries” whereas IF systems use “user profiles”. Filtering tasks include adaptive filtering and batch or routing filtering; in this paper, the focus is on batch or routing filtering. Adaptive filtering involves feedback to dynamically adapt IF systems [9, 17, 33, 42, 44], and the popular way is to update training sets in a batch classifier fashion. In this paper, we also evaluate the performance of the proposed approach for adaptive filtering.

Normally, IF systems tend to learn a map \({rank:\mathbb{D}\rightarrow \mathbb{R}}\) such that rank(d) corresponds to the relevance of a document d, where \({\mathbb{D}}\) denotes a set of documents and \({\mathbb{R}}\) is the set of real numbers. In [20], rank was divided into two functions such that \(rank = f_2 \circ f_1\), where \(f_1: \mathbb{D}\rightarrow \{C_1, \ldots, C_m\}\) and \({f_2: \{C_1, \ldots, C_m\} \rightarrow \mathbb{R}}\) are maps and \(C_1, C_2, \ldots, C_m\) are clusters. This method used a set of clusters produced by a classification method, e.g., a neural network [19]. The aim of the filtering track in TREC [23] was to measure the ability of IF systems to build profiles, using sets of training documents, that separate relevant and non-relevant documents. The basic term-based IF models used in TREC 2002 were SVM, Rocchio’s algorithm, probabilistic models, and BM25.

Feedback techniques are frequently used in the IR community to improve accuracy. There are several strategies for incorporating user feedback information into information retrieval: relevance feedback, pseudo-relevance feedback, implicit feedback and negative feedback [6, 29, 34, 38]. A common objective of these strategies is to design IR models that obtain more accurate term weights, based on user feedback, for a given query.

Term-based models are the most widely used approaches. A term-based model is based on the bag-of-words or N-gram representation, which uses terms as elements and evaluates term weights based on term appearances or frequencies in feedback. For example, Rocchio-style classifiers [12], ranking SVM [22], and BM25 for structured documents [25] are popular IF systems, and they can naturally handle both positive and negative feedback information. However, research on term-based models has arguably hit somewhat of a wall in terms of effectiveness improvement, possibly due to the ambiguity problem mentioned earlier. In addition, modeling the real dependencies between terms is very difficult.

Language models have been developed to consider term dependencies. In a language model, the key elements are the probabilities of word sequences, which include both terms and phrases (or sentences) [31]. They are often approximated by N-gram models, such as unigram, bigram or trigram models, so that term dependencies can be considered easily. Language modeling approaches include model-based methods and relevance models [18]. The former find models that can best describe the features in positive documents while considering a background model [45]; the latter try to model the notion of relevance at a more general level [10]. Language modeling approaches have been well developed for information retrieval, especially for query expansion techniques [18, 35, 39], and they are quite effective for exploiting positive feedback information. However, they cannot naturally handle negative feedback.

Pattern mining has been extensively studied in the data mining community for many years, and a variety of efficient algorithms, such as Apriori-like algorithms [1], PrefixSpan [21], and FP-tree [5], have been proposed. These works have mainly focused on developing efficient mining algorithms for discovering patterns in databases. Usually, existing data mining techniques return numerous discovered patterns (e.g., sets of terms) from a training set, but large numbers of them are redundant [40]. The challenging issue is thus how to effectively deal with the large number of discovered patterns and terms, many of which are noisy.

Closed patterns have turned out to be a promising alternative to phrases [7, 32], because closed patterns enjoy good statistical properties, as terms do. To effectively use closed patterns for information filtering, closed sequential patterns have been used in pattern taxonomy models (PTM) [32, 36, 37], which deploy closed sequential patterns into a vector consisting of a set of terms and a term-weight distribution. The pattern deploying method has shown encouraging improvements in effectiveness compared with traditional probabilistic models, Rocchio based methods and N-grams. Similar research also appeared in [41], which developed a new methodology for the post-processing of pattern mining, pattern summarization, that groups patterns into clusters and then composes the patterns in the same cluster into a master pattern consisting of a set of terms and a term-weight distribution.

These approaches introduced data mining techniques to information filtering; however, too many noisy patterns adversely affect PTM systems [15]. The major research issue is how to use both positive and negative feedback to significantly reduce the effects of noisy patterns. Traditional data mining techniques can make only limited progress on effectiveness, because they address this problem only at the pattern level. This paper instead considers the human perspective on relevance and uses a two-dimensional concept to classify terms into three groups: positive specific terms, general terms and negative specific terms. From this perspective, term weights can be evaluated accurately based on their appearances in both positive and negative patterns.

Our conference paper [13] was the first study on the problem of mining negative relevance feedback for information filtering. In this paper, we extend that study by adding more examples, discussing more related work, and extending the experiments to cover the proposed iterative learning algorithm and a statistical analysis. We also conducted new experiments applying the proposed approach to adaptive filtering.

3 Pattern taxonomy mining

In this paper, we assume that all documents are split into paragraphs, so a given document d yields a set of paragraphs PS(d). Let D be a training set of documents, consisting of a set of positive documents, \(D^{+}\), and a set of negative documents, \(D^{-}\). Let \(T = \{t_1, t_2, \ldots, t_m\}\) be a set of terms (or keywords) extracted from the set of positive documents, \(D^{+}\).

3.1 Frequent and closed patterns

Given a termset X, a set of terms, in document d, \(\ulcorner X \urcorner\) is used to denote the covering set of X for d, which includes all paragraphs dp ∈ PS(d) such that \(X \subseteq dp\), i.e., \(\ulcorner X \urcorner = \{dp \mid dp\in PS(d), X\subseteq dp\}\). Its absolute support is the number of occurrences of X in PS(d), that is, \(sup_{a}(X)=|\ulcorner X\urcorner|\). Its relative support is the fraction of the paragraphs that contain the pattern, that is, \(sup_{r}(X)=\frac{|\ulcorner X \urcorner|}{|PS(d)|}\). A termset X is called a frequent pattern if its \(sup_{a}\) (or \(sup_{r}\)) \(\geq min\_sup\), a minimum support.

Table 1 lists a set of paragraphs for a given document d, where \(PS(d)=\{dp_1, \ldots, dp_6\}\) and duplicate terms are removed. Letting min_sup = 3 gives rise to the ten frequent patterns illustrated in Table 2. Normally not all frequent patterns are useful [32, 40]. For example, pattern \(\{t_{3}, t_{4}\}\) always occurs with term \(t_6\) in paragraphs (see Table 1); therefore, we want to keep only the larger pattern.

Table 1 A set of paragraphs
Table 2 Frequent patterns and covering sets

Given a termset X, its covering set \(\ulcorner X \urcorner\) is a subset of paragraphs. Similarly, given a set of paragraphs \(Y\subseteq PS(d)\), we can define its termset, which satisfies

$$ termset(Y) = \{t | \forall dp \in Y \Rightarrow t\in dp\}. $$

The closure of X is defined as follows:

$$ Cls(X) = termset(\ulcorner X \urcorner). $$

A pattern X (also a termset) is called closed if and only if X = Cls(X).

Let X be a closed pattern. We have

$$ sup_{a}(X_1) < sup_{a}(X) $$
(1)

for all patterns \(X_1 \supset X\).
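For concreteness, the following is a minimal Python sketch of these definitions (covering set, support, closure and the closedness test). The paragraph data is hypothetical, constructed only to be consistent with the counts described in the text, and is not necessarily the actual Table 1.

```python
from itertools import combinations

# PS(d): each paragraph as a set of distinct terms (duplicates removed).
# Hypothetical data, chosen to yield ten frequent and three closed patterns.
PS = [
    {"t1", "t2"},
    {"t3", "t4", "t6"},
    {"t3", "t4", "t5", "t6"},
    {"t3", "t4", "t6"},
    {"t1", "t2", "t6", "t7"},
    {"t1", "t2", "t6"},
]

def covering_set(X, paragraphs):
    """⌜X⌝: all paragraphs dp with X ⊆ dp."""
    return [dp for dp in paragraphs if set(X) <= dp]

def sup_a(X, paragraphs):
    """Absolute support |⌜X⌝|."""
    return len(covering_set(X, paragraphs))

def closure(X, paragraphs):
    """Cls(X) = termset(⌜X⌝): terms shared by every covering paragraph."""
    cov = covering_set(X, paragraphs)
    return set.intersection(*cov) if cov else set()

def is_closed(X, paragraphs):
    return set(X) == closure(X, paragraphs)

min_sup = 3
terms = sorted(set().union(*PS))
frequent = [set(X) for r in range(1, len(terms) + 1)
            for X in combinations(terms, r) if sup_a(X, PS) >= min_sup]
closed = [X for X in frequent if is_closed(X, PS)]
print(len(frequent), closed)
# -> 10 frequent patterns; closed: {t6}, {t1,t2} and {t3,t4,t6}
```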

3.2 Pattern taxonomy

Patterns can be structured into a taxonomy by using the is-a (or subset) relation and closed patterns. For example, Table 2 contains ten frequent patterns, but only three of them are closed: \(\langle t_3, t_4, t_6\rangle\), \(\langle t_1, t_2\rangle\), and \(\langle t_6 \rangle\). Simply put, a pattern taxonomy is described as a set of pattern-absolute support pairs, for example \(\hbox{PT}=\{\langle t_3, t_4, t_6 \rangle_3, \langle t_1, t_2 \rangle_3, \langle t_6 \rangle_5\}\), where non-closed patterns are pruned. After pruning, some direct “is-a” relations may change; for example, pattern \(\{t_6\}\) becomes a direct sub-pattern of \(\{t_3, t_4, t_6\}\) after the non-closed patterns \(\langle t_3, t_6\rangle\) and \(\langle t_4, t_6\rangle\) are pruned.

Smaller patterns in the taxonomy, for example pattern \(\{t_6\}\), are usually more general, because they can be used frequently in both positive and negative documents; larger patterns in the taxonomy, for example pattern \(\{t_3, t_4, t_6\}\), are usually more specific, since they may be used only in positive documents.

3.3 Closed sequential patterns

A sequential pattern \(s = \langle t_1,\ldots, t_r\rangle\) (\(t_i \in T\)) is an ordered list of terms. A sequence \(s_1 = \langle x_1,\ldots, x_i\rangle\) is a sub-sequence of another sequence \(s_2 = \langle y_1,\ldots, y_j\rangle\), denoted by \(s_1 \sqsubseteq s_2\), iff \(\exists j_1,\ldots, j_i\) such that \(1\leq j_1 < j_2 < \cdots < j_i \leq j\) and \(x_1 = y_{j_1}, x_2 = y_{j_2}, \ldots, x_i = y_{j_i}\). Given \(s_1 \sqsubseteq s_2\), we usually say \(s_1\) is a sub-pattern of \(s_2\), and \(s_2\) is a super-pattern of \(s_1\). In the following, we simply say patterns for sequential patterns.

Given a pattern (an ordered termset) X in document d, \(\ulcorner X \urcorner\) is still used to denote the covering set of X, which includes all paragraphs ps ∈ PS(d) such that \(X\sqsubseteq ps\), i.e., \(\ulcorner X \urcorner = \{ps \mid ps\in PS(d), X \sqsubseteq ps\}\). Its absolute and relative supports are defined in the same way as for normal patterns.

A sequential pattern X is called a frequent pattern if its relative support is \(\geq min\_sup\), a minimum support. The property of closed patterns (see Eq. 1) can be used to define closed sequential patterns: a frequent sequential pattern X is called closed if there is no super-pattern \(X_1\) of X such that \(sup_{a}(X_1) = sup_{a}(X)\).
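A small sketch of the sub-sequence relation and the closedness test may help; `is_subsequence` is a hypothetical helper (not the SPMining implementation), and paragraphs are assumed to be ordered tuples of terms.

```python
def is_subsequence(s1, s2):
    """s1 ⊑ s2: the terms of s1 occur in s2 in order (gaps allowed)."""
    it = iter(s2)
    return all(x in it for x in s1)  # membership test consumes the iterator

def sup_a_seq(X, paragraphs):
    """Absolute support of a sequential pattern over ordered paragraphs."""
    return sum(1 for ps in paragraphs if is_subsequence(X, ps))

def is_closed_seq(X, frequent_patterns, paragraphs):
    """Closed: no super-pattern has the same absolute support."""
    sx = sup_a_seq(X, paragraphs)
    return not any(len(Y) > len(X) and is_subsequence(X, Y)
                   and sup_a_seq(Y, paragraphs) == sx
                   for Y in frequent_patterns)

print(is_subsequence(("t3", "t6"), ("t3", "t4", "t6")))  # True
```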

4 Deploying patterns on terms

The evaluation of term supports (weights) in this paper differs from that of term-based approaches. In a term-based approach, a given term’s weight is evaluated based on its appearance in documents; in the pattern mining approach, terms are weighted according to their appearance in discovered patterns.

To improve the efficiency of pattern taxonomy mining, the algorithm SPMining\((D^{+}, min\_sup)\) [32] was proposed (and also used in [15, 37]) to find the closed sequential patterns of every document \(d \in D^{+}\), using the well-known Apriori property to reduce the search space. For each positive document \(d \in D^{+}\), the SPMining algorithm discovers all closed sequential patterns with respect to a given min_sup.

Let \(SP_1, SP_2, \ldots, SP_{|D^{+}|}\) be the sets of discovered closed sequential patterns for all documents \(d_i \in D^{+}\) \((i = 1, \ldots, |D^{+}|)\). For a given term t, its support in these discovered patterns can be described as follows:

$$ support(t, D^{+}) = \sum_{i=1}^{|D^{+}|}{\frac{|\{p|p\in SP_i, t \in p\}|}{\sum_{p\in SP_i}{|p|}}} $$

Table 3 illustrates a real example of pattern taxonomies for a set of positive documents \(D^{+}=\{d_1, d_2, \ldots, d_5\}\). For example, the term global appears in three documents (\(d_2\), \(d_3\) and \(d_5\)); therefore, its support can be calculated from the patterns in these three documents’ pattern taxonomies:

$$ support(global,D^{+}) = \frac{2}{4}+\frac{1}{3}+\frac{1}{3}=\frac{7}{6}. $$
Table 3 Example of sets of discovered closed sequential patterns in pattern taxonomies, where the minimum absolute support is 2

After the supports of terms have been computed from the training set, the following rank is assigned to an incoming document d and can be used to decide its relevance:

$$ rank(d) = \sum_{t \in T} weight(t)\tau(t,d) $$

where weight(t) = support(t, \(D^{+}\)), and τ(t, d) = 1 if t ∈ d, otherwise τ(t, d) = 0.
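The support and rank computations can be sketched directly; `SP` below is an assumed list of per-document closed-pattern sets, with hypothetical pattern shapes chosen to reproduce the worked example for the term “global”.

```python
def support(t, SP):
    """support(t, D+): Sect. 4 deployment over per-document pattern sets."""
    total = 0.0
    for SP_i in SP:                        # closed patterns of document d_i
        denom = sum(len(p) for p in SP_i)  # sum of pattern lengths
        if denom:
            total += sum(1 for p in SP_i if t in p) / denom
    return total

def rank(d, weight):
    """rank(d) = Σ weight(t)·τ(t, d); d is the set of terms of a document."""
    return sum(w for t, w in weight.items() if t in d)

# Hypothetical pattern shapes: "global" sits in 2 of d2's patterns (total
# pattern length 4) and in one length-3 pattern in each of d3 and d5.
SP = [[("w",)],
      [("global", "x"), ("global", "y")],
      [("global", "u", "v")],
      [("z",)],
      [("global", "p", "q")]]
print(support("global", SP))  # 2/4 + 1/3 + 1/3 = 7/6 ≈ 1.1667
```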

5 Mining negative feedback

In general, the concept of relevance is subjective, and people can normally describe the relevance of a topic (or document) along two dimensions, specificity and exhaustivity, where “specificity” describes the extent to which the topic focuses on what users want, and “exhaustivity” describes the extent to which the topic discusses what users want. This is easy for human beings, but it is very difficult for IF systems to use the two dimensions. In this section, we first discuss how to use the two dimensions to understand the different roles of the selected terms. We then present an algorithm for both negative document selection and term weight revision.

5.1 Specific and general terms

Formally, let \(DP^{+}\) be the union of all discovered positive patterns of the pattern taxonomies of \(D^{+}\), and \(DP^{-}\) be the union of all discovered negative patterns of the pattern taxonomies of \(D^{-}\), where a closed sequential pattern of \(D^{-}\) is called a negative pattern. Given a term t ∈ T, its exhaustivity is the number of discovered patterns in both \(DP^{+}\) and \(DP^{-}\) that contain t, and its specificity is the number of discovered patterns in \(DP^{+}\), but not in \(DP^{-}\), that contain t. Based on this understanding, we classify terms into three groups: a general term appears in both positive and negative patterns, whereas a positive (or negative) specific term appears only in patterns discovered in positive (or negative) documents.

Based on the above discussion, we have the following definitions for the set of general terms GT, the set of positive specific terms \(T^{+}\), and the set of negative specific terms \(T^{-}\):

$$ \begin{aligned} GT&= \{t \mid \exists p_1 \in DP^{+}, \exists p_2 \in DP^{-}: t \in p_1 \cap p_2\},\\ T^{+}&= \{t \mid t \notin GT, \exists p \in DP^{+}: t \in p\}, \hbox{ and}\\ T^{-}&= \{t \mid t \notin GT, \exists p \in DP^{-}: t \in p\}. \end{aligned} $$

It is easy to verify that GT, \(T^{+}\) and \(T^{-}\) are pairwise disjoint. Therefore, \((GT, T^{+}, T^{-})\) is a partition of all terms that appear in patterns.
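A direct transcription of this partition, assuming `DP_pos` and `DP_neg` are collections of patterns (tuples or sets of terms):

```python
def partition_terms(DP_pos, DP_neg):
    """Partition all pattern terms into (GT, T+, T-) as defined above."""
    pos_terms = set().union(*DP_pos) if DP_pos else set()
    neg_terms = set().union(*DP_neg) if DP_neg else set()
    GT = pos_terms & neg_terms                 # general terms
    return GT, pos_terms - GT, neg_terms - GT  # (GT, T+, T-)
```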

To describe user profiles for a given topic, we normally believe that specific terms are very useful for distinguishing the topic from other topics. However, some experimental results show that using only specific terms is not good enough to improve filtering performance, because user information needs cannot simply be covered by documents that contain only the specific terms. Therefore, the best way is to use the specific terms mixed with some of the general terms.

5.2 Strategies of revision

Now that terms can be classified into three categories, we first show the basic process of revising the discovered features in the training set. This process helps to explain the proposed strategies for revising discovered features in the different categories.

The process first extracts the initial features, including terms and patterns, from the positive documents in the training set. It then selects some negative samples (called offenders) from the set of negative documents in the training set and extracts negative features, including both terms and negative patterns, from the selected negative documents, using the same pattern mining technique as is used for feature extraction from the positive documents. It then revises the initial features to obtain the revised features. The process can be repeated several times: selecting negative documents, extracting negative features and revising the revised features.

Algorithm NFMining(D) describes the details of the revision strategies, where we assume that the number of negative documents is greater than the number of positive documents. For a given training set \(D=\{D^{+}, D^{-}\}\), we assume that the initial features, \(\langle T, DP^{+}, DP^{-}\rangle\), have been extracted from the positive documents \(D^{+}\) before the algorithm starts, where we let \(DP^{-} = \emptyset\). We also let the experimental parameter α = −1, which is used for calculating the weights of terms in negative patterns.

Step 1 initializes the set of general terms GT, the set of positive specific terms \(T^{+}\) and the set of negative specific terms \(T^{-}\), where loop controls the number of revisions. Steps 2 and 3 calculate the weights of all terms in T.

Steps 4 and 5 rank the documents in the set of negative documents, where, if t is a negative specific term, its weight is the revised weight calculated in steps 10 and 11. The weight function can be described as follows:

$$ weight(t)= \left\{ \begin{array}{ll} \hbox{its revised weight}, & \hbox{if } t\in T^{-} \\ support(t, D^{+}), & \hbox{otherwise} \end{array}\right. $$

Steps 6 and 7 sort the negative documents by their rank values and select the offenders, a subset of the negative documents. If a document’s rank is less than or equal to 0, the document is clearly negative to the system; a document with a high rank is an offender, because it forces the system to make mistakes. The offenders are normally defined as the top-K negative documents in the sorted \(D^{-}\) [14]. In this paper, we let \(K=\lceil \frac{|D^{+}|}{3}\rceil\). In the first revision (loop = 0), we ignore the top-j negative documents for offender selection, since the initial features come only from positive documents and we believe that positive features are more important than negative features at the beginning, where \(j=\lfloor \frac{|D^{-}|} {|D^{+}|}\rfloor\), the largest integer less than or equal to \(\frac{|D^{-}|}{|D^{+}|}\).
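A sketch of this selection step, assuming `rank_fn(d, weight)` implements the ranking of steps 4 and 5:

```python
from math import ceil, floor

def select_offenders(D_neg, D_pos, weight, rank_fn, loop):
    """Steps 6-7: sort D- by rank, skip the top-j documents in the first
    loop, and return the next K = ceil(|D+|/3) documents as offenders."""
    ranked = sorted(D_neg, key=lambda d: rank_fn(d, weight), reverse=True)
    j = floor(len(D_neg) / len(D_pos)) if loop == 0 else 0
    K = ceil(len(D_pos) / 3)
    return ranked[j:j + K]
```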

Steps 8 and 9 extract the negative features (\(DP^{-}\), \(T_0\)) from the selected negative documents \(D_{3}^{-}\): algorithm SPMining\((D_{3}^{-}, min\_sup)\) is called to discover the negative patterns \(DP^{-}\), and \(T_0\) collects all terms in the patterns of \(DP^{-}\).

Steps 10 to 12 revise the weights of the negative specific terms. These steps go through a loop three times, with the iteration controlled by step 13. In each loop, when a negative specific term is extracted for the first time, the algorithm simply negates its support obtained from the selected negative documents; otherwise, the algorithm accumulates its weight as follows:

$$ weight(t)=\alpha\times support(t,D_{3}^{-}) + weight(t). $$
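Transcribed as code, steps 10 and 11 look roughly as follows; the `t not in weight` guard is an added safeguard for terms first seen in a later loop, whose weights have not been set yet.

```python
ALPHA = -1.0  # the experimental parameter α of the text

def revise_negative(T0, T, weight, support_neg, loop):
    """Steps 10-11: set or accumulate weights of negative specific terms."""
    for t in T0 - T:                        # negative specific candidates
        if loop == 0 or t not in weight:
            weight[t] = ALPHA * support_neg(t)
        else:
            weight[t] = ALPHA * support_neg(t) + weight[t]
```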

After three loops, the algorithm partitions T into general terms GT and positive specific terms \(T^{+}\) in steps 14 and 15. It also revises the weights of the positive specific terms in steps 16 and 17, using the following equation:

$$ weight(t)=weight(t)*(1+\frac{|\{d|d\in D^+,t\in d\}|}{|D^+|}) $$

Finally, step 18 updates T to include the negative specific terms.

NFMining calls SPMining three times, and the total number of negative documents used across the three calls is \(O(|D^{+}|)\); therefore, mining patterns in the selected negative documents takes the same order of computation time as SPMining takes for mining patterns in the positive documents. NFMining also takes time for sorting \(D^{-}\), assigning weights to terms and partitioning terms into groups. The time complexity of these operations is \(O(|D^{-}|(\log|D^{-}|+|T|) + |T|^2)\).

The algorithm uses three loops to mine negative specific terms and their corresponding weights. After each loop, the number of negative specific terms, \(|T^{-}|\), is clearly no less than it was before the loop, because of the operation \(T^{-} = T^{-} \cup (T_0 - T)\) in step 12. We expect three loops to produce enough negative specific terms to reduce the side effects of general terms; we discuss this question in more detail in Sect. 6.4.

6 Evaluation

In this section, we first discuss the data collection used in our experiments, then describe the baseline models and their implementation, and finally present the experimental results and discussion.

Algorithm NFMining(D)

Input: A training set, \(\{D^{+}, D^{-}\}\), parameter α = −1;

           extracted features \(\langle T, DP^{+}, DP^{-}\rangle\), \(DP^{-} = \emptyset\);

           support function and minimum support min_sup.

Output: Updated term set T and function weight.

Method:

1: \(GT = \emptyset, T^{+} = \emptyset, T^{-} = \emptyset\), loop = 0;

2: foreach t ∈ T do

3:       weight(t) = support(t, \(D^{+}\));

4: foreach d ∈ \(D^{-}\) do

5:       \(rank(d)=\sum_{t \in d \cap (T\cup T^{-})} weight(t)\);

6: let \(D^{-} = \{d_0, d_1, \ldots, d_{|D^{-}|-1}\}\) in descending ranking order,

     let \(j=\lfloor \frac{|D^{-}|}{|D^{+}|}\rfloor\) if loop = 0, otherwise j = 0;

7: \(D_{3}^{-} = \{d_i | d_i \in D^{-}, j \leq i < \lceil \frac{|D^{+}|}{3}\rceil + j\}\);

8: \(DP^{-} =SPMining\,(D_{3}^{-}, min\_sup)\); //find negative patterns

9: \(T_0 = \{t \in p \mid p \in DP^{-}\}\); // all terms in negative patterns

10: foreach t ∈ (\(T_0 - T\)) do

11:     if (loop = 0) then weight(t) = α × support(t, \(D_{3}^{-}\))

         else weight(t) = α × support(t, \(D_{3}^{-}\)) + weight(t);

12: \(T^{-} = T^{-} \cup (T_0 - T)\), loop++;

13: if loop < 3 then goto step 4;

14: foreach t ∈ T do //term partition

15:     if (t ∈ \(T^{-}\)) then GT = GT ∪ {t}

          else \(T^{+} =T^{+} \cup \{t\}\);

16: foreach t ∈ T + do

17:     \(weight(t)=weight(t)*(1+ \frac{|\{d|d\in D^+,t\in d\}|}{|D^+|})\);

18: \(T = T \cup T^{-}\);
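For readers who prefer code to pseudocode, the following condensed driver ties the earlier sketches (`select_offenders`, `revise_negative` and the Sect. 4 `support` function) together. It is a sketch under stated assumptions: documents are sets of terms, `sp_mining` stands in for SPMining and is assumed to return one set of closed patterns per selected document, and steps 14 and 15 are realized here via the Sect. 5.1 definition of general terms.

```python
def nf_mining(D_pos, D_neg, T, support_pos, sp_mining, min_sup, rank_fn):
    """Condensed NFMining driver (a sketch, not the exact implementation)."""
    weight = {t: support_pos(t) for t in T}                    # steps 2-3
    T_set, T_neg, neg_terms_seen = set(T), set(), set()        # step 1
    for loop in range(3):                                      # steps 4-13
        D3 = select_offenders(D_neg, D_pos, weight, rank_fn, loop)  # 6-7
        SP_neg = sp_mining(D3, min_sup)                        # step 8
        T0 = {t for SP_i in SP_neg for p in SP_i for t in p}   # step 9
        revise_negative(T0, T_set, weight,
                        lambda t: support(t, SP_neg), loop)    # steps 10-11
        T_neg |= T0 - T_set                                    # step 12
        neg_terms_seen |= T0
    GT = T_set & neg_terms_seen    # general terms per Sect. 5.1 (steps 14-15)
    for t in T_set - GT:           # positive specific terms (steps 16-17)
        df = sum(1 for d in D_pos if t in d)
        weight[t] *= 1 + df / len(D_pos)
    return T_set | T_neg, weight                               # step 18
```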

6.1 Data

Reuters Corpus Volume 1 (RCV1) was used to test the effectiveness of the proposed model. The RCV1 corpus consists of all and only the English-language news stories produced by Reuters journalists between August 20, 1996, and August 19, 1997, 806,791 documents in total. The document collection is divided into training sets and testing sets.

TREC (2002) developed and provided 100 topics for the filtering track, aiming at building robust filtering systems. The topics are of two types: (1) the first set of 50 topics was developed by assessors of the National Institute of Standards and Technology (NIST) (the assessor topics), with relevance judgements made by NIST assessors; (2) the second set of 50 topics was constructed artificially from intersections of pairs of Reuters categories (the intersection topics) [30].

Different from the assessor topics, the relevance judgements for the intersection topics were made by machine learning methods rather than by human beings; the assessor topics are therefore more reliable, and the quality of the intersection topics is not as good [23, 30]. For this reason, we use all 50 assessor topics in this paper.

Documents in the RCV1 collection are marked up in XML. To avoid bias in the experiments, all of the meta-data in the collection has been ignored and the documents are treated as plain text. During preprocessing, stop-words are removed according to a given stop-words list and terms are stemmed by applying the Porter stemming algorithm [16].

6.2 Baseline models and setting

In this paper, we select three term-based baseline models because they are frequently used with both positive and negative documents: a Rocchio model, a BM25-based IF model, and an SVM-based model. The PTM model is also used, to measure the gain that negative feedback brings to pattern mining. The proposed approach is called the Negative PaTtern Mining model (N-PTM); it first discovers closed sequential patterns from positive documents and deploys the discovered patterns on their terms, and then discovers negative patterns from negative documents to group and revise the features extracted from the positive documents, as described in Sect. 5.

The Rocchio algorithm [26] has been widely adopted in text categorization and information filtering. It builds a profile representing the concept of a topic from a set of relevant (positive) and irrelevant (negative) documents. The centroid \(\vec{c}\) of a topic is generated as follows:

$$ \alpha \frac{1}{|D^+|}\sum_{\overrightarrow{d}\in D^+}\frac{\overrightarrow{d}}{||\overrightarrow{d}||}-\beta \frac{1}{|D^-|}\sum_{\overrightarrow{d}\in D^-}\frac{\overrightarrow{d}}{||\overrightarrow{d}||} $$

Two settings for α and β are common: α = 16 and β = 4, or α = β = 1.0. We tested both and found α = β = 1.0 to be the better setting, so we use α = β = 1.0 in the above equation.
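A minimal sketch of this centroid with α = β = 1.0; representing documents as dicts mapping terms to tf*idf weights is an implementation assumption, not something the text prescribes.

```python
import math

def l2_normalize(d):
    norm = math.sqrt(sum(v * v for v in d.values()))
    return {t: v / norm for t, v in d.items()} if norm else d

def rocchio_centroid(D_pos, D_neg, alpha=1.0, beta=1.0):
    """Centroid of the equation above: the α-weighted mean of normalized
    positive documents minus the β-weighted mean of negative ones."""
    c = {}
    for sign, docs, coef in ((+1, D_pos, alpha / len(D_pos)),
                             (-1, D_neg, beta / len(D_neg))):
        for d in docs:
            for t, v in l2_normalize(d).items():
                c[t] = c.get(t, 0.0) + sign * coef * v
    return c
```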

BM25 [8, 24] is one of the state-of-the-art retrieval functions used in document retrieval. The term weights are estimated using the following BM25-based equation:

$$ W(t) = \frac{tf\cdot (k_1+1)}{k_1\cdot ((1-b)+b\frac{DL}{AVDL}) + tf}\cdot \log \frac{\frac{(r+0.5)}{(n-r+0.5)}} {\frac{(R-r+0.5)}{(N-n-R+r+0.5)}} $$

where N is the total number of documents in the training set; R is the number of positive documents in the training set; n is the number of documents that contain term t; r is the number of positive documents that contain term t; tf is the term frequency; DL and AVDL are the document length and average document length, respectively; and \(k_1\) and b are experimental parameters (set to 1.2 and 0.75, respectively, in this paper).
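The equation transcribes directly into code:

```python
import math

def bm25_weight(tf, DL, AVDL, n, r, N, R, k1=1.2, b=0.75):
    """W(t) as above: a length-normalized tf part times the RSJ log part."""
    tf_part = tf * (k1 + 1) / (k1 * ((1 - b) + b * DL / AVDL) + tf)
    rsj = math.log(((r + 0.5) / (n - r + 0.5)) /
                   ((R - r + 0.5) / (N - n - R + r + 0.5)))
    return tf_part * rsj
```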

Information filtering can also be regarded as a special instance of text classification [28]. SVM is a statistical method that finds a hyperplane that best separates two classes; it achieved the best performance on the Reuters-21578 data collection for document classification [43]. The decision function in SVM is defined as:

$$ h(x) = sign (w \cdot x + b)= \left\{ \begin{array}{ll} +1 & \hbox{ if } {( w \cdot x + b)} > 0 \\ -1 & \hbox{ otherwise} \end{array}\right. $$

where x is the input object; b ∈ ℜ is a threshold and \(w =\sum^l_{i=1}y_i\alpha_ix_i\) for the given training data \((x_1,y_1),\ldots, (x_l, y_l)\), where \(x_i \in \Re^n\) and \(y_i\) equals +1 (−1) if document \(x_i\) is labeled positive (negative); \(\alpha_i \in \Re\) is the weight of the training example \(x_i\) and satisfies the following constraints:

$$ \forall_i : \alpha_i \geqslant 0 \quad \hbox{and} \quad \sum^l_{i=1} \alpha_iy_i =0 $$
(2)

To compare with the other baseline models, we use SVM to rank documents rather than to make binary decisions; for this purpose, the threshold b can be ignored. We also believe that all positive documents in the training set should have the same importance to user information needs, because the training set is only divided into positive and negative documents. So we first assign the same \(\alpha_i\) value (i.e., 1) to each positive document, and then determine the same \(\alpha_i\) value (i.e., \(\acute{\alpha}\)) for each negative document based on Eq. 2, which gives \(\acute{\alpha} = \frac{|D^{+}|}{|D^{-}|}\). Therefore, we use the following weighting function to estimate the similarity between a testing document and a given topic:

$$ weight(d) = w \cdot d $$

where \(\cdot\) means inner product; d is the term vector of the testing document; and

$$ w =\left(\sum_{d_i\in D^{+}}d_i\right) + \left(\sum_{d_j\in D^{-}}d_j\acute{\alpha}\right). $$
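A sketch of this ranking profile; note that \(\acute{\alpha} = |D^{+}|/|D^{-}|\) follows from Eq. 2 when every positive document has \(\alpha_i = 1\), and that the \(y_j = -1\) sign of the negative examples is applied explicitly in the code.

```python
def svm_profile(D_pos, D_neg):
    """w = Σ y_i α_i x_i with α_i = 1 for positives and α' = |D+|/|D-|
    for negatives; documents are dicts mapping terms to weights."""
    alpha_neg = len(D_pos) / len(D_neg)   # α' derived from Eq. 2
    w = {}
    for d in D_pos:
        for t, v in d.items():
            w[t] = w.get(t, 0.0) + v
    for d in D_neg:
        for t, v in d.items():
            w[t] = w.get(t, 0.0) - alpha_neg * v   # y_j = -1 applied here
    return w

def svm_rank(d, w):
    """weight(d) = w · d, the inner product used for ranking."""
    return sum(w.get(t, 0.0) * v for t, v in d.items())
```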

For each topic, we also choose 150 terms from the positive documents based on tf*idf values for all term-based baseline models.

The PTM model is also selected as a baseline because we want to verify that mining negative feedback can significantly improve the performance of PTM. The maximum size of the term set T is 4,000 for PTM. We also set min_sup = 0.2 (relative support) for both PTM and N-PTM.

The performance of PTM depends on the number of closed patterns, which is determined by the minimum support [36]. If the minimum support is very small, many noisy patterns are introduced into the system; if it is very large, many useful patterns may be missed. For RCV1, with min_sup = 0.2 the total number of frequent sequential patterns is 36,202, of which 28,733 are closed; PTM thus removes about 20% of the frequent patterns. In this paper, we use a fixed minimum support value, min_sup = 0.2, as suggested by [36].

6.3 Results

Effectiveness was measured in four different ways: the F-beta (\(F_\beta\)) measure, mean average precision (MAP), the break-even point (b/p), and interpolated average precision (IAP) at 11 points.

\(F_\beta\) is calculated by the following function:

$$ F_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R} $$

The parameter β = 1 is used in our study, which means that recall and precision are weighted equally. Mean average precision is calculated by measuring precision at each relevant document first, and then averaging over all topics. The break-even (b/p) point indicates the value at which precision equals recall. The larger the b/p, MAP, IAP or \(F_\beta\) score, the better the system performs. The 11-points measure is also used to compare the performance of different systems, by averaging precision at 11 standard recall levels (recall = 0.0, 0.1, ..., 1.0).
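For concreteness, a small sketch of two of these measures, \(F_\beta\) and 11-point IAP, for a single topic (averaging over topics is omitted):

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta as in the equation above."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

def iap_11_points(relevance):
    """11-point IAP for one topic; relevance holds 1/0 flags in rank order."""
    total_rel = sum(relevance)
    if total_rel == 0:
        return [0.0] * 11
    hits, pr = 0, []
    for i, rel in enumerate(relevance, 1):
        hits += rel
        pr.append((hits / total_rel, hits / i))   # (recall, precision)
    return [max((p for r, p in pr if r >= level), default=0.0)
            for level in (k / 10 for k in range(11))]
```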

Statistical analysis is also applied to the experimental results. The t-test assesses whether the means of two groups are statistically different from each other; the paired two-tailed t-test is used in this paper. If DIF represents the difference between paired observations, the hypotheses are H0: DIF = 0 (the difference between the two observations is 0) and Ha: DIF ≠ 0 (the difference is not 0). N is the sample size of the group, and the test statistic is t with N − 1 degrees of freedom (df). If the p value associated with t is low (<0.05), there is evidence to reject the null hypothesis, and thus evidence that the difference in means across the paired observations is significant. The N-PTM model is compared with the PTM, Rocchio, BM25, and SVM models for each of b/p, MAP, IAP and \(F_{\beta=1}\) over all 50 topics.

6.3.1 N-PTM vs baseline models

Table 4 illustrates the results of all models against the five measures for all assessor topics. Compared with PTM, which uses positive documents only, the proposed N-PTM model uses both positive and negative feedback, and it clearly outperforms PTM on all five measures. The proposed N-PTM model is also compared in Table 4 with the term-based baseline models, Rocchio, BM25, and SVM, which likewise use both positive and negative feedback. The 11-points results on all assessor topics are reported in Fig. 1.

Table 4 Results for all assessor topics on RCV1
Fig. 1 Comparison between the proposed method and other approaches

As shown in Table 4 and Fig. 1, the proposed new model (N-PTM) has achieved the best performance results for the assessor topics.

We also conducted t-tests to compare the proposed model with all baseline models; the results are listed in Table 5, and the percentage changes are shown in Table 6. Compared with these baseline models, the proposed approach achieves excellent performance, with a 13.73% (max 17.34%, min 8.76%) average percentage change over all five measures.

Table 5 P value for all models comparing with N-PTM
Table 6 Percentage change over all baseline models

These statistical results indicate that the improvements made by the proposed model are statistically significant. Therefore, we conclude that mining negative relevance feedback for information filtering is an exciting achievement for pattern-based approaches.

In the training phase, N-PTM and PTM obviously take more time than the other, term-based models because of mining patterns in paragraphs. However, in the testing phase all models take O(|T| × |d|) per incoming document d, and in our experiments the number of terms used by each model for each topic is fewer than 300 on average. Therefore, there is no significant difference between these models in testing-phase time complexity.

6.3.2 Adaptive filtering

In this section, we design experiments to test the adaptive performance of the proposed N-PTM model. We expect these experiments to show performance consistent with the batch results of the previous section.

For each topic, the system starts from an initial training set and then adds a window of new training documents. The window size is 25, and each window of new training documents is selected randomly from the testing set. To test the robustness of the proposed model, we conduct the adaptation process six times (six windows) for the same initial training set. Table 7 shows the results of the N-PTM models, which combine the new training documents with the initial ones into one larger training set and then retrain the system.

Table 7 Adaptive N-PTM models for all assessor topics

We also tested the adaptive performance of the term-based baseline models under the same settings and found that the Rocchio model achieved the best performance among them. Table 8 shows the results of the adaptive Rocchio models for the six windows. The experimental results show that the adaptive N-PTM models also achieve excellent performance, with a 9.10% (max 12.00%, min 6.28%) average percentage change over all five measures on the assessor topics. We believe that the performance of the N-PTM model is consistent and highly significant for all five measures on the RCV1 data collection.

Table 8 Adaptive Rocchio models for all assessor topics

6.4 Discussion

The main process of the proposed approach consists of two steps: offender selection and the revision of term weights. Obviously, not all negative documents are suitable to be selected as offenders; the offenders are the most useful negative documents, those that help balance the percentages of general terms and specific terms in the extracted features. Informally, the negative documents with high rank values are called offenders.

Table 9 shows statistical information for N-PTM with different values of K, including the average numbers of offenders, extracted terms and their weights, and the resulting performance. The 11-points results on all assessor topics for the different values of K are reported in Fig. 2. It is obvious that \(K=\lceil \frac{|D^{+}|}{3}\rceil\) is the best choice. This statistical information illustrates that the proposed method of offender selection meets the design objectives.

Table 9 Statistical information for N-PTM with different values of K
Fig. 2 Comparison between using all negative documents and using only the selected offenders

As mentioned in Sect. 5.2, we use three loops to obtain negative specific terms in order to reduce the side effects of using general terms. Table 10 illustrates the performance for different numbers of loops in Algorithm NFMining (up to 6 loops). The table shows that, on average, the system achieves its best result after the third loop.

Table 10 Performance of each loop for all assessor topics

Table 11 shows the average numbers of positive documents, negative documents, offenders and extracted terms in the training sets for the loops of the algorithm. Based on the proposed model, we set \(K=\lceil \frac{|D^{+}|}{3}\rceil\); that is, the number of offender documents should be equal to or less than the number of positive documents. We also use the loops to assess the closeness of offender documents to the positive documents: if a document is very close to the positive documents, it will be ranked at the top for the next loop. As shown in Table 11, the average number of positive documents is about 13 and the average number of negative documents is about 41; however, the average number of offender documents selected in each loop is only 4 or 3. The table also illustrates that only \(15.74\%=\frac{6.5}{41.3}\) of the negative documents are selected as offenders, that is, the proposed method is very efficient at reducing the space of negative documents.

Table 11 Extracted features in the loops of algorithm SPMining, where min_sup = 0.2

For the revision of term weights, the proposed method first classifies the extracted terms into general terms and specific terms, which is a distinguishing advantage compared with other methods [35, 46]. The normal belief is that specific terms are more interesting than general terms for a given topic. Therefore, the proposed method increases the weights of positive specific terms when it conducts the revision using negative documents.

General terms not only appear frequently in positive documents, but also appear frequently in some negative documents, because negative documents may to some extent discuss what users want. To reduce the side effects of using general terms in the extracted features, the proposed method adds negative specific terms (with negative weights) to the extracted features through the loops (see Algorithm NFMining).

Table 11 also shows the average numbers of extracted general terms #GT, positive specific terms #\(T^{+}\) and negative specific terms #\(T^{-}\), and their average weights. Before revision, about \(61\% = \frac{56.04}{56.04+36.33}\) of the weight mass is distributed to general terms, although general terms make up only \(31.8\% =\frac{50}{50+107}\) of all terms extracted from positive documents.

After revision, 126 negative specific terms on average are added into T (see Table 11), with an average weight of −26.54. In this way, the negative specific terms can reduce the side effects of general terms when both appear in negative documents, because now only \(45\% = \frac{56.04-26.54}{56.04-26.54+36.33}\) of the weight mass is distributed to general terms, considering that positive specific terms receive weight 36.33 on average and general terms receive 29.5 = 56.04 − 26.54 on average.

The above analysis illustrates that the proposed algorithm for finding negative specific terms meets the design objective of reducing the side effects of using general terms.

7 Conclusions

Negative relevance feedback is very useful for information filtering; however, whether negative feedback can largely improve filtering accuracy has been an open question. This paper presents a pattern mining based approach to this question. It introduces a method to select negative documents (called offenders) that are close to the extracted features of the positive documents. It also proposes an approach that classifies extracted terms into three groups, positive specific terms, general terms and negative specific terms, and, from this perspective, presents an iterative algorithm to revise the extracted features. Compared with the state-of-the-art models, the results of experiments on the RCV1 data collection demonstrate that the effectiveness of information filtering can be significantly improved by the proposed approach, and that the performance is also consistent for adaptive filtering. This research provides a promising methodology for evaluating term weights based on discovered patterns (rather than documents) in both positive and negative relevance feedback.