rFerns: An Implementation of the Random Ferns Method for General-Purpose Machine Learning

In this paper I present an extended implementation of the Random ferns algorithm contained in the R package rFerns. It differs from the original in its ability to consume categorical and numerical attributes instead of only binary ones. Also, instead of using a simple attribute subspace ensemble, it employs bagging and thus produces an error approximation and a variable importance measure modelled after the Random forest algorithm. I also present benchmark results which show that although the accuracy of Random ferns is mostly lower than that achieved by Random forest, its speed and the good quality of the importance measure it provides make rFerns a reasonable choice for specific applications.


Introduction
Random ferns is a machine learning algorithm proposed by [11] for matching the same elements between two images of the same scene, allowing one to recognise certain objects or trace them in videos. The original motivation behind this method was to create a simple and efficient algorithm by extending the Naïve Bayes classifier; still, the authors acknowledged its strong connection to decision tree ensembles like the Random forest algorithm [2].
Since its introduction, Random ferns has been applied in numerous computer vision applications, like image recognition [1], action recognition [10] or augmented reality [14]. However, it has not gathered attention outside this field; thus, this work aims to bring this algorithm to a much wider spectrum of applications. In order to do that, I propose a generalised version of the algorithm, implemented as an R [13] package rFerns.
The paper is organised as follows. Section 2 briefly recalls the Bayesian derivation of the original version of Random ferns, presents the decision tree ensemble interpretation of the algorithm and lists the modifications leading to the rFerns variant. Next, in Section 3, I present the rFerns package and discuss the Random ferns incarnation of two important features of Random forest: the internal error approximation and the attribute importance measure. Section 4 contains an assessment of rFerns on several well-known machine learning problems. The results and computational performance of the algorithm are compared with the Random forest implementation contained in the randomForest package [8]. The paper is concluded in Section 5.

Random ferns algorithm
Following the original derivation, let us consider a classification problem based on a dataset $(X_{i,j}, Y_i)$ with $p$ binary attributes $X_{\cdot,j}$ and $n$ objects $X_{i,\cdot}$ equally distributed over $C$ disjoint classes (those assumptions will be relaxed in the later part of the paper). The generic Maximum a Posteriori (MAP) Bayes classifier classifies the object $X_{i,\cdot}$ as

$$Y^p_i = \arg\max_y P(Y_i = y \mid X_{i,1}, X_{i,2}, \ldots, X_{i,p}); \qquad (1)$$

according to the Bayes theorem, and because the classes are equally likely, it is equal to

$$Y^p_i = \arg\max_y P(X_{i,1}, X_{i,2}, \ldots, X_{i,p} \mid Y_i = y). \qquad (2)$$

Although this formula is exact, it is not practically usable due to the huge ($2^p$) number of possible $X_{i,\cdot}$ value combinations, most likely much larger than the available number of training objects $n$, which makes a reliable estimation of the probability impossible. The simplest solution to this problem is to assume complete independence of the attributes, which brings us to the Naïve Bayes classification, where

$$Y^p_i = \arg\max_y \prod_j P(X_{i,j} \mid Y_i = y). \qquad (3)$$
The original Random ferns classifier [11] is an in-between solution defining a series of $K$ random selections of $D$ features ($\mathbf{j}_k \in \{1..p\}^D$, $k = 1, \ldots, K$) treated using a corresponding series of simple exact classifiers (ferns), whose predictions are assumed independent and thus combined in a naïve way, i.e.,

$$Y^p_i = \arg\max_y \prod_k P(X_{i,\mathbf{j}_k} \mid Y_i = y), \qquad (4)$$

where $X_{i,\mathbf{j}_k}$ denotes $X_{i,j_k^1}, X_{i,j_k^2}, \ldots, X_{i,j_k^D}$. This way one can still represent more complex interactions in the data, possibly achieving better accuracy than in the purely naïve case. On the other hand, a fern defined this way is still very simple and manageable for a range of $D$ values.
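The training of the Random ferns classifier is performed by estimating the probabilities $P(X_{i,\mathbf{j}_k} \mid Y_i = y)$ with empirical probabilities calculated from a training dataset $(X^t_{i,j}, Y^t_i)$ of size $n_t \times p$. Namely, one uses the frequencies of each class in each subspace of the attribute space defined by $\mathbf{j}_k$, assuming a Dirichlet prior, i.e.,

$$\hat{P}(X_{i,\mathbf{j}_k} \mid Y_i = y) = \frac{1 + \#\{m \in L_{i,\mathbf{j}_k} : Y^t_m = y\}}{C + \#L_{i,\mathbf{j}_k}}, \qquad (5)$$

where $\#$ denotes the number of elements in a set and

$$L_{i,\mathbf{j}_k} = \{ l \in [1; n_t] : \forall_{d \in [1; D]}\; X^t_{l,j_k^d} = X_{i,j_k^d} \} \qquad (6)$$

is the set of training objects in the same leaf of fern $k$ as object $i$.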
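The estimation step above can be illustrated with a minimal base R sketch of a single fern trained on binary attributes. It is not the rFerns implementation; the names trainFern and leafOf, as well as the toy data, are hypothetical and serve only to show how leaves are indexed and how the Dirichlet-smoothed frequencies arise.

```r
# Train one fern on binary attributes (illustrative sketch, not rFerns code).
# X: 0/1 matrix (n_t x p), Y: factor of classes, j: the D attribute indexes.
trainFern <- function(X, Y, j) {
  D <- length(j)
  C <- nlevels(Y)
  # Leaf index in 1..2^D: read the D selected attributes as a binary number.
  leafOf <- function(X) {
    bits <- 1 * (X[, j, drop = FALSE] != 0)
    as.vector(bits %*% 2^(seq_len(D) - 1)) + 1
  }
  leaves <- factor(leafOf(X), levels = seq_len(2^D))
  # counts[leaf, class]: number of training objects of a class in a leaf.
  counts <- unclass(table(leaves, Y))
  # Dirichlet-smoothed estimate: (1 + class count) / (C + leaf size).
  prob <- (1 + counts) / (C + rowSums(counts))
  list(j = j, prob = prob, leafOf = leafOf)
}

# Hypothetical toy data: the class depends on attributes 1 and 2 only.
set.seed(1)
X <- matrix(rbinom(200 * 6, 1, 0.5), ncol = 6)
Y <- factor(ifelse(X[, 1] == 1 & X[, 2] == 1, "a", "b"))
fern <- trainFern(X, Y, j = c(1, 2, 5))
print(round(fern$prob, 3))
```

Each row of the resulting matrix corresponds to one of the $2^D$ leaves and sums to one; a leaf that received no training objects falls back to the uniform value $1/C$.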

Ensemble of decision trees interpretation
A fern implements a partition of the feature space into regions corresponding to all possible combinations of values of the attributes $\mathbf{j}_k$. This way it is equivalent to a binary decision tree of depth $D$ for which all splitting criteria on tree level $d$ are identical and split according to the attribute of index $j^d$, as shown in Figure 1. Consequently, because the attribute subsets $\mathbf{j}_k$ are generated randomly, the whole Random ferns classifier is equivalent to a random subspace [6] ensemble of $K$ constrained decision trees.
Most ensemble classifiers combine the predictions of their members through majority voting; this is also the case for Random ferns when one considers scores $S_{i,\mathbf{j}_k}(y)$ defined as

$$S_{i,\mathbf{j}_k}(y) = \log P(X_{i,\mathbf{j}_k} \mid Y_i = y) + \log C. \qquad (7)$$

This mapping effectively converts the MAP rule into majority voting,

$$Y^p_i = \arg\max_y \sum_k S_{i,\mathbf{j}_k}(y). \qquad (8)$$

The addition of $\log C$ ensures that a fern which has no knowledge about the probability of classes for some object gives it a vector of scores equal to zero.
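A small numeric illustration of this voting rule may be useful. The leaf probabilities below are made-up values for one object in three ferns, not results from any dataset; the third fern stands for a leaf with no training objects, where only the uniform prior $1/C$ remains.

```r
# Scores of one object in K = 3 ferns for C = 2 classes. Each row holds the
# estimated P(X | Y = y) of the leaf the object fell into (made-up numbers).
C <- 2
leafProb <- rbind(
  fern1 = c(a = 0.80, b = 0.20),
  fern2 = c(a = 0.30, b = 0.70),
  fern3 = c(a = 1 / C, b = 1 / C)  # uninformed leaf: only the prior remains
)

# Score = log-probability shifted by log C, so an uninformed fern votes zero.
scores <- log(leafProb) + log(C)
print(scores)

# Summing scores over ferns and taking the maximum gives the prediction.
total <- colSums(scores)
predicted <- names(which.max(total))
print(predicted)
```

Because the logarithm is monotonic and the shift by $\log C$ is the same for every class, the class with the highest summed score is also the one with the highest product of probabilities.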

Introduction of bagging
Using the ensemble of trees interpretation, in the rFerns implementation I was able to additionally combine random subspace with bagging, as it was shown to improve the accuracy of similar ensemble classifiers [3,2,12]. This method restricts the training of each fern to a bag, a collection of objects selected randomly by sampling with replacement $n_t$ objects from the original training dataset, thus changing Equation 6 into

$$L_{i,\mathbf{j}_k} = \{ l \in B_k : \forall_{d \in [1; D]}\; X^t_{l,j_k^d} = X_{i,j_k^d} \}, \qquad (9)$$

where $B_k$ is a vector of indexes of the objects in the $k$-th fern's bag.
In such a set-up, the probability that a certain object is not included in a bag is $(1 - 1/n_t)^{n_t}$; thus each fern has a set of, on average, $n_t (1 - 1/n_t)^{n_t}$ objects ($n_t e^{-1} \approx 0.368\, n_t$ for large $n_t$) which were not used to build it. They form out-of-bag (OOB) subsets, which will be denoted here as $B^*_k$.
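The expected size of the OOB subset can be checked numerically. The following base R sketch compares the closed-form fraction with a simulation of bag drawing; the function name oobFraction, the seed and the number of repetitions are arbitrary choices made for this illustration.

```r
# Closed-form probability that an object is left out of a bag of size n.
oobFraction <- function(n) (1 - 1 / n)^n

# Simulated counterpart: draw bags with replacement and measure the
# fraction of objects that never got selected.
set.seed(1)
n <- 1000
simulated <- mean(replicate(200, {
  bag <- sample(n, n, replace = TRUE)
  1 - length(unique(bag)) / n
}))

print(c(exact = oobFraction(n), simulated = simulated, limit = exp(-1)))
```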

Generalisation beyond binary attributes
As the original version of the Random ferns algorithm was formulated for datasets containing only binary attributes, the rFerns implementation had to introduce a way to also cope with continuous and categorical ones. In the Bayesian classification view, this issue should be resolved by postulating and fitting some probability distribution over each attribute. However, this approach introduces additional assumptions and possible problems connected to the reliability of fitting.
In the decision tree ensemble view, each non-terminal tree node maps a certain attribute to a binary split using some criterion function, which is usually a greater-than comparison with some threshold value $\xi$ in the case of continuous attributes (i.e., $f_\xi : x \to (x > \xi)$) and a test of whether it belongs to some subset of possible categories $\Xi$ in the case of categorical attributes (i.e., $f_\Xi : x \to (x \in \Xi)$).
In most Classification And Regression Trees (CART) and CART-based algorithms (including Random forest) the $\xi$ and $\Xi$ parameters of those functions are greedily optimised based on the training data to maximise the 'effectiveness' of the split, usually measured by the information gain in the decision it provides. However, in order to retain the stochastic nature of Random ferns, the rFerns implementation generates them at random, similar to the Extra-trees algorithm by [4]. Namely, when a continuous attribute is selected for the creation of a fern level, a threshold $\xi$ is generated as the mean of two randomly selected values of it. Correspondingly, for a categorical attribute $\Xi$ is set to a random one of all possible subsets of all categories of this attribute, except for the two containing respectively all and none of the categories.
Obviously, completely random generation of splits can be less effective than optimising them in terms of the accuracy of the final classifier; the gains in computational efficiency may also be minor due to the fact that it does not change the complexity of the split building. However, this way the classifier can escape certain overfitting scenarios and unveil more subtle interactions. This and the more even usage of attributes may be beneficial both for the robustness of the model and the accuracy of the importance measure it provides.
While in this generalisation the scores depend on the thresholds $\xi$ and $\Xi$, from now on I will denote them as $S_{i,F_k}$, where $F_k$ contains $\mathbf{j}_k$ and the necessary thresholds.
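The random split generation described above can be sketched in base R as follows. This is an illustration rather than the package's internal code: the function names are hypothetical, and drawing the two values without replacement is an assumption, as the text does not specify it.

```r
# Random threshold for a continuous attribute: the mean of two randomly
# selected values (drawn without replacement here; this is an assumption).
randomThreshold <- function(x) {
  mean(x[sample.int(length(x), 2)])
}

# Random category subset for a factor: any subset of levels except the
# empty one and the full one, found by rejecting the two trivial masks.
randomSubset <- function(f) {
  lev <- levels(f)
  stopifnot(length(lev) >= 2)
  repeat {
    mask <- sample(c(TRUE, FALSE), length(lev), replace = TRUE)
    if (any(mask) && !all(mask)) return(lev[mask])
  }
}

set.seed(1)
xi <- randomThreshold(iris$Sepal.Length)
Xi <- randomSubset(iris$Species)

# The resulting binary splits, one logical value per object.
continuousSplit <- iris$Sepal.Length > xi
categoricalSplit <- iris$Species %in% Xi
```

Rejecting the two trivial masks leaves every remaining subset equally likely, which matches the "random one of all possible subsets" wording.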

Unbalanced classes case
When the distribution of the classes in the training decision vector becomes less uniform, its contribution to the final predictions of a Bayes classifier increases, biasing learning towards the recognition of larger classes. Moreover, the imbalance may reach the point where it overwhelms the impact of the attributes, making the whole classifier always vote for the largest class.
The original Random ferns algorithm was developed under the assumption that the classes are of equal size; however, such a case is very rare in general machine learning, and so the rFerns implementation has to cope with that problem as well. Thus, it internally enforces the balance of class impacts by dividing the counts of objects of a certain class in the current leaf by the fraction of objects of that class in the bag of the current fern. This is equivalent to the standard procedure of oversampling under-represented classes so that the numbers of objects of each class are equal within the bag.
Obviously, there exist exceptional use cases when such a heuristic may be undesired, for instance when the cost of misclassification is not uniform. Then, it might be reversed or replaced with another prior by modifying the raw scores before the voting is applied.
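The effect of this balancing can be seen on a small made-up example in base R. The bag composition and leaf assignments below are hypothetical, and the sketch covers only the division by class fractions; how the package combines it with the prior is not shown here.

```r
# A bag with unbalanced classes and the leaf each bagged object fell into
# (made-up data for illustration).
bagY    <- factor(c(rep("big", 90), rep("small", 10)))
bagLeaf <- c(rep(1, 60), rep(2, 30), rep(1, 2), rep(2, 8))

# Raw per-leaf class counts: rows are leaves, columns are classes.
counts <- unclass(table(bagLeaf, bagY))

# Divide each class column by the fraction of that class in the bag.
classFraction <- as.vector(table(bagY)) / length(bagY)
balanced <- sweep(counts, 2, classFraction, "/")

print(counts)
print(balanced)
```

In leaf 2 the raw counts favour the larger class (30 against 8), while after balancing the smaller class dominates, as it would if it had been oversampled to the same size.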

rFerns package
The training of a Random ferns model is performed by the rFerns function; it requires two parameters, the number of ferns K and the depth of each one D, which should be passed via the ferns and depth arguments respectively. If not given, K = 1000 and D = 5 are assumed. The current version of the package supports depths in the range 1..15. The training set can be given either explicitly, by passing the predictor data frame and the decision vector, or via the usual formula interface:

R> model <- rFerns(Species ~ ., data = iris, ferns = 1000, depth = 5)
R> model <- rFerns(iris[, -5], iris[, 5])
The result is an S3 object of class rFerns, containing the ferns' structures $F_k$ and the fitted score vectors for all leaves.
To classify new data, one should use the predict method of the rFerns class. It will pull the dataset down each fern, assigning each object the score vector from the leaf it ended in, sum the scores over the ensemble and find the predicted classes.
For instance, one can set aside the even objects of the iris data as a test set and train the model on the rest. Adding scores=TRUE to the predict call makes it return raw class scores.
This way one can, for example, extract the scores of the first three objects of each class in the test set.
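The workflow just described might look as follows. This is a sketch using only the calls named in the text (rFerns with explicit predictors and decision, and predict with scores=TRUE); the variable names and the seed are arbitrary, and no output is shown because it varies between runs.

```r
library(rFerns)

# Odd objects of iris form the training set, even ones the test set.
trainSet <- iris[c(TRUE, FALSE), ]
testSet  <- iris[c(FALSE, TRUE), ]

set.seed(77)  # arbitrary seed, meant only to make the run repeatable
model <- rFerns(trainSet[, -5], trainSet[, 5], ferns = 1000, depth = 5)

# Predicted classes and the fraction of correctly classified test objects.
predicted <- predict(model, testSet[, -5])
accuracy <- mean(predicted == testSet$Species)

# Raw class scores. Rows 1:3, 26:28 and 51:53 are the first three test
# objects of each class, since iris stores 50 objects per class in order.
testScores <- predict(model, testSet[, -5], scores = TRUE)
print(testScores[c(1:3, 26:28, 51:53), ])
```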

Error estimate
By design, machine learning methods usually produce highly biased results when tested on the training data; therefore, one needs to perform external validation to reliably assess their accuracy. However, in a bagging ensemble we can perform a sort of internal cross-validation in which the prediction for each training set object is built by the voting of only those base classifiers which did not use this object for their training, i.e., which had it in their OOB subsets. This idea was originally used in the Random forest algorithm, and can be trivially transferred to any bagging ensemble, including the rFerns version of Random ferns. In this case the OOB predictions $Y^*_i$ will be given by

$$Y^*_i = \arg\max_y \sum_{k : i \in B^*_k} S_{i,F_k}(y) \qquad (10)$$

and can be compared with the true classes $Y_i$ to calculate the OOB approximation of the overall error. On the R level, OOB predictions are always calculated when training an rFerns model; when its corresponding object is printed, the overall OOB error and confusion matrix are shown, along with the training parameters:

R> print(model)
Forest of 1000 ferns of a depth 5.

One can also access raw OOB predictions and scores by executing the predict method without providing new data to be classified. Note that for very small values of K some objects may manage to appear in every bag and thus get an undefined OOB prediction.
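Accessing the OOB predictions and scores could be sketched as below, relying only on the behaviour described above (calling predict without new data). The variable names are arbitrary, and the error is computed with na.rm = TRUE as a precaution against the undefined predictions mentioned in the text.

```r
library(rFerns)

model <- rFerns(iris[, -5], iris[, 5], ferns = 1000, depth = 5)

# Without new data, predict returns the OOB predictions for the training
# objects; scores = TRUE gives the corresponding raw OOB scores.
oobPredicted <- predict(model)
oobScores <- predict(model, scores = TRUE)

# OOB approximation of the overall error, skipping undefined predictions.
oobError <- mean(oobPredicted != iris$Species, na.rm = TRUE)
print(oobError)
```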

Importance measure
In addition to the error approximation, Random forest also uses the OOB objects to calculate the attribute importance. It is defined as a difference in the accuracy on the original OOB subset and OOB subset with the values of a certain attribute permuted, averaged over all trees in the ensemble.
Such a measure can also be grafted onto any bagging ensemble, including rFerns; moreover, one can make use of scores and replace the difference in accuracy with the mean difference in the score of the correct class, this way extracting importance information even from the OOB objects that are misclassified. Precisely, the Random ferns importance of an attribute $a$ defined this way equals

$$I_a = \frac{1}{\#A(a)} \sum_{k \in A(a)} \frac{1}{\#B^*_k} \sum_{i \in B^*_k} \left( S_{i,F_k}(Y_i) - S^p_{i,F_k}(Y_i) \right), \qquad (11)$$

where $A(a) = \{k : a \in \mathbf{j}_k\}$ is the set of ferns that use attribute $a$ and $S^p_{i,F_k}$ is $S_{i,F_k}$ estimated on a permuted $X^t$ in which the values of attribute $a$ have been shuffled. One should also note that the fully stochastic nature of selecting attributes for building individual ferns guarantees that the attribute space is evenly sampled and thus all attributes, even marginally relevant ones, are included in the model for a large enough ensemble.
The calculation of the variable importance can be triggered by adding importance=TRUE to the call to rFerns; then, the necessary calculations will be performed during the training process and the obtained importance scores will be placed in the importance element of the rFerns object.
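A minimal usage sketch of the importance calculation, based only on the argument and element named above, could look like this; the exact layout of the returned importance element is not described in the text, so it is simply printed.

```r
library(rFerns)

# Train with the importance calculation switched on.
model <- rFerns(iris[, -5], iris[, 5], ferns = 1000, depth = 5,
                importance = TRUE)

# The importance scores are stored in the importance element of the model.
importanceScores <- model$importance
print(importanceScores)
```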

Accuracy
For each of the testing sets, I have built 10 Random ferns models for each of the depths in the range {1..15} and a number of ferns equal to 5000, and collected the OOB error approximations. Next, I have used those results to find the optimal depth for each set ($D_b$); for simplicity, I selected the value for which the mean OOB error from all iterations was minimal.
Finally, I have verified the error approximation by running 10-fold stochastic cross-validation. Namely, the set was randomly split into test and training subsets, composed respectively of 10% and 90% of the objects; the classifier was then trained on the training subset and its performance was assessed using the test subset. This procedure was repeated ten times.
As a comparison, I have also built and cross-validated 10 Random forest models with 5000 trees. The ensemble size was selected so that both algorithms would manage to converge for all problems.
The results of those tests are collected in Table 1. One can see that, as in the case of Random forest, the OOB error approximation is a good estimate of the final classifier error. It also serves well as an optimisation target for the fern depth selection; only in the case of the Sonar data did the naïve selection of the depth giving the minimal OOB error lead to a suboptimal final classifier, although one should note that the minimum was not significant in this case.
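The stochastic cross-validation procedure can be sketched generically in base R. The function stochasticCV and its fit and classify arguments are hypothetical names introduced for this illustration; a trivial majority-class classifier stands in for the real model so that the sketch runs without additional packages.

```r
# Repeated random 90%/10% splits; returns one test error per repetition.
# fit(x, y) trains a model, classify(model, x) returns predicted classes.
stochasticCV <- function(x, y, fit, classify,
                         repeats = 10, testFraction = 0.1) {
  n <- nrow(x)
  replicate(repeats, {
    test <- sample.int(n, round(testFraction * n))
    model <- fit(x[-test, , drop = FALSE], y[-test])
    mean(classify(model, x[test, , drop = FALSE]) != y[test])
  })
}

# Stand-in classifier: always predicts the most frequent training class.
fitMajority <- function(x, y) names(which.max(table(y)))
classifyMajority <- function(model, x) rep(model, nrow(x))

set.seed(1)
errors <- stochasticCV(iris[, -5], iris[, 5], fitMajority, classifyMajority)
print(mean(errors))
```

To assess Random ferns in the same way, one would presumably pass a wrapper around rFerns as fit and the predict method as classify.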
Based on the OOB approximations, forest outperforms ferns in all but one case; yet the results of cross-validation show that those differences are in practice masked by the natural variability of both classifiers. Only in case of the Satellite data Random forest clearly achieves almost two times smaller error.

Importance
To test the importance measure, I have used two sets for which the importance of attributes should follow a certain pattern. Each object in the DNA set [9] represents a 60-residue DNA sequence in such a way that each consecutive triplet of attributes encodes one residue. Some of the sequences contain a boundary between exon and intron (or intron and exon) regions of the sequence; the objective is to recognise and classify those sequences. All sequences were aligned in such a way that the boundary always lies between the 30th and 31st residue; since the biological process of recognition is local, the most important attributes should be those describing residues in the vicinity of the boundary.
Objects in the Sonar set [5] correspond to echoes of a sonar signal bounced off either a rock or a metal cylinder (a model of a mine). They are represented as power spectra, thus each consecutive attribute value corresponds to the signal power contained within a consecutive frequency interval. This way one may expect that there are frequency bands in which echoes significantly differ between classes, which would manifest as a set of peaks in the importance measure vector. For both of these sets, I have calculated the importance measure using 1000 ferns of depth 10. As a baseline, I have used the importance calculated using the Random forest algorithm with 1000 trees.
The results are presented in Figure 2 and Figure 3. The importance measures obtained in both cases are consistent with the expectations based on the sets' structures; for DNA, one can notice a maximum around attributes 90-96, corresponding to the actual cleavage site location. For Sonar, the importance scores reveal a band structure which likely corresponds to the actual frequency intervals in which the echoes differ between rock and metal.
Both results are also qualitatively in agreement with those obtained from Random forest models. The quantitative differences come from the completely different formulations of both measures and possibly the higher sensitivity of ferns resulting from their fully stochastic construction.

Computational performance
In order to compare the training times of the rFerns and randomForest codes, I have trained both models on all 7 benchmark sets for 5000 ferns/trees and, in the case of ferns, depths 10 and $D_b$. Then I have repeated this procedure, this time making both algorithms calculate the importance measure during training. I have repeated both tests 10 times to stabilise the results and collected the mean execution times; the results are collected in Table 2. The results show that the usage of rFerns may result in significant speedups in certain applications; the best speedups are achieved for the sets with a larger number of objects, which is caused by the fact that the training time of Random ferns scales linearly with the number of objects, while that of Random forest scales as $\sim n \log n$.
Also, the importance can be calculated significantly faster by rFerns than by randomForest, and the gain increases with the size of the set.

Table 2: Training times of the rFerns and randomForest models made for 5000 base classifiers, with and without importance calculation. Times are given as a mean over 10 repetitions.

rFerns is least effective for sets which require large depths of the fern; in the case of the Vowel and Vehicle sets it was even slower than Random forest. However, one should note that while the complexity of Random ferns scales as $\sim 2^D$, its accuracy usually decreases much less dramatically when decreasing $D$ from its optimal value; this way one may expect an effective trade-off between speed and accuracy.

Conclusions
In this paper, I have presented rFerns, a general-purpose implementation of Random ferns, a fast, ensemble-based classification method. Slight modifications of the original algorithm allowed me to additionally implement the OOB error approximation and the attribute importance measure.
The presented benchmarks showed that such an algorithm can achieve accuracies comparable to the Random forest algorithm while usually being much faster, especially for large datasets.
Also, the importance measure proposed in this paper can be calculated very quickly and proved to behave in a desired way and be in agreement with the results of Random forest; however, an in-depth assessment of its quality and usability for feature selection and similar problems requires further research.