Abstract
Directional data are ubiquitous in science. These data have some special properties that rule out the use of classical statistics. Therefore, different distributions and statistics, such as the univariate von Mises and the multivariate von Mises–Fisher distributions, should be used to deal with this kind of information. We extend the naive Bayes classifier to the case where the conditional probability distributions of the predictive variables follow either of these distributions. We consider the simple scenario, where only directional predictive variables are used, and the hybrid case, where discrete, Gaussian and directional distributions are mixed. The classifier decision functions and their decision surfaces are studied at length. Artificial examples are used to illustrate the behavior of the classifiers. The proposed classifiers are then evaluated over eight datasets, showing competitive performance against other naive Bayes classifiers that use Gaussian distributions or discretization to manage directional data.
Notes
1. The source code is available at: http://www.unc.edu/sungkyu.
2. The Oriana software is available at: http://www.kovcomp.co.uk/oriana.
3. The Texas Commission on Environmental Quality website is available at: http://www.tceq.state.tx.us.
References
Agresti A (2007) An introduction to categorical data analysis, 2nd edn. Wiley, New York
Amayri O, Bouguila N (2013) Beyond hybrid generative discriminative learning: spherical data classification. Pattern Anal Appl, in press
Banerjee A, Dhillon IS, Ghosh J, Sra S (2005) Clustering on the unit hypersphere using von Mises–Fisher distributions. J Mach Learn Res 6:1345–1382
Berens P (2009) CircStat: a MATLAB toolbox for circular statistics. J Stat Softw 31(10):1–21
Berkholz DS, Krenesky PB, Davidson JR, Karplus PA (2010) Protein geometry database: a flexible engine to explore backbone conformations and their relationships to covalent geometry. Nucleic Acids Res 38(Suppl 1):D320–D325
Blackard JA, Dean DJ (1999) Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Comput Electron Agric 24(3):131–151
Bock RK, Chilingarian A, Gaug M, Hakl F, Hengstebeck T, Jiřina M, Klaschka J, Kotrč E, Savický P, Towers S, Vaiciulis A, Wittek W (2004) Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope. Nucl Instrum Methods in Phys Res Sect A-Accel Spectrom Detect Assoc Equip 516(2–3):511–528
Bøttcher SG (2004) Learning Bayesian networks with mixed variables. PhD thesis, Aalborg University
Bouckaert RR (2004) Estimating replicability of classifier learning experiments. In: Brodley CE (ed) Proceedings of the 21st international conference on machine learning, ACM
Damien P, Walker S (1999) A full Bayesian analysis of circular data using the von Mises distribution. Can J Stat Rev Can Stat 27(2):291–298
de Haas–Lorentz GL (1913) Die Brownsche Bewegung und einige verwandte Erscheinungen. Friedr. Vieweg und Sohn
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Devlaminck D, Waegeman W, Bauwens B, Wyns B, Santens P, Otte G (2010) From circular ordinal regression to multilabel classification. In: Proceedings of the 2010 workshop on preference learning, European conference on machine learning
Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero–one loss. Mach Learn 29:103–130
Downs TD (2003) Spherical regression. Biometrika 90(3):655–668
Downs TD, Mardia KV (2002) Circular regression. Biometrika 89(3):683–697
Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley, New York
Eben K (1983) Classification into two von Mises distributions with unknown mean directions. Aplikace Matematiky 28(3):230–237
El Khattabi S, Streit F (1996) Identification analysis in directional statistics. Comput Stat Data Anal 23:45–63
Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Bajcsy R (ed) Proceedings of the 13th international joint conference on artificial intelligence, Morgan Kaufmann, San Mateo, pp 1022–1027
Figueiredo A (2009) Discriminant analysis for the von Mises–Fisher distribution. Commun Stat Simul Comput 38(9):1991–2003
Figueiredo A, Gomes P (2006) Discriminant analysis based on the Watson distribution defined on the hypersphere. Stat: J Theor Appl Stat 40(5):435–445
Fisher NI (1987) Statistical analysis of spherical data. Cambridge University Press, Cambridge
Fisher NI (1993) Statistical analysis of circular data. Cambridge University Press, Cambridge
Fisher NI, Lee AJ (1992) Regression models for an angular response. Biometrics 48:665–677
Fisher RA (1953) Dispersion on a sphere. Proc R Soc Lond Ser A Math Phys Sci 217:295–305
Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml
Frank E, Trigg L, Holmes G, Witten IH (2000) Technical note: naive Bayes for regression. Mach Learn 41(1):5–25
Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Mach Learn 29:131–163
García S, Herrera F (2008) An extension on “Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. J Mach Learn Res 9:2677–2694
Guttorp P, Lockhart RA (1988) Finding the location of a signal: a Bayesian analysis. J Am Stat Assoc 83:322–330
Güvenir HA, Acar B, Demiröz G, Çekin A (1997) A supervised machine learning algorithm for arrhythmia analysis. In: Murray A, Swiryn S (eds) Computers in cardiology 1997, pp 433–436
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor 11(1):10–18
Hornik K, Grün B (2013) On conjugate families and Jeffreys priors for von Mises–Fisher distributions. J Stat Plan Infer 143(5):992–999
Jaakkola TS (1997) Variational methods for inference and estimation in graphical models. PhD thesis, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology
Jammalamadaka SR, SenGupta A (2001) Topics in circular statistics. World Scientific, Singapore
Johnson RA, Wehrly TE (1978) Some angular-linear distributions and related regression models. J Am Stat Assoc 73(363):602–606
Jossinet J (1996) Variability of impedivity in normal and pathological breast tissue. Med Biol Eng Comput 34(5):346–350
Kadous MW (2002) Temporal classification: extending the classification paradigm to multivariate time series. PhD thesis, School of Computer Science and Engineering, University of New South Wales
Kato S, Shimizu K, Shieh G (2008) A circular-circular regression model. Stat Sin 18(2):633–645
Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. The MIT Press, Cambridge
Kovach WL (1989) Quantitative methods for the study of lycopod megaspore ultrastructure. Rev Palaeobot Palynol 57(3–4):233–246
Langley P, Sage S (1994) Induction of selective Bayesian classifiers. In: López de Mántaras R, Poole D (eds) Proceedings of the 10th conference on uncertainty in artificial intelligence, Morgan Kaufmann, San Mateo, pp 399–406
Lévy P (1939) L’addition des variables aléatoires définies sur une circonférence. Bull Soc Math Fr 67:1–41
López-Cruz PL, Bielza C, Larranaga P (2011) The von Mises naive Bayes classifier for angular data. In: Proceedings of the 14th conference of the Spanish Association for Artificial Intelligence, CAEPIA 2011, LNCS 7023, pp 145–154
Mardia KV (1975) Statistics of directional data. J R Stat Soc Ser B Stat Methodol 37(3):349–393
Mardia KV (2006) On some recent advancements in applied shape analysis and directional statistics. In: Barber S, Baxter PD, Mardia KV (eds) Systems biology and statistical bioinformatics, Leeds University Press, Leeds, pp 9–17
Mardia KV (2010) Bayesian analysis for bivariate von Mises distributions. J Appl Stat 37(3):515–528
Mardia KV, Jupp PE (2000) Directional statistics. Wiley, New York
Minsky M (1961) Steps toward artificial intelligence. Proc Inst Radio Eng 49:8–30
von Mises R (1918) Über die “Ganzzahligkeit” der Atomgewichte und verwandte Fragen. Phys Z 19:490–500
Mooney JA, Helms PJ, Jolliffe IT (2003) Fitting mixtures of von Mises distributions: a case study involving sudden infant death syndrome. Comput Stat Data Anal 41(3–4):505–513
Morales M, Rodríguez C, Salmerón A (2007) Selective naive Bayes for regression based on mixtures of truncated exponentials. Int J Uncertainty Fuzziness Knowl Based Syst 15(6):697–716
Morris JE, Laycock PJ (1974) Discriminant analysis of directional data. Biometrika 61(2):335–341
Pazzani MJ (1995) Searching for dependencies in Bayesian classifiers. In: Fisher D, Lenz HJ (eds) Learning from Data: Artificial Intelligence and Statistics V. In: Proceedings of the 5th International Workshop on Artificial Intelligence and Statistics, Springer, pp 239–248
Pearl J (1988) Probabilistic reasoning in intelligent systems. Morgan Kaufmann, San Mateo
Peot MA (1996) Geometric implications of the naive Bayes assumption. In: Horvitz E, Jensen FV (eds) Proceedings of the 12th conference on uncertainty in artificial intelligence, Morgan Kaufmann, San Mateo, pp 414–419
Pérez A, Larrañaga P, Inza I (2006) Supervised classification with conditional Gaussian networks: increasing the structure complexity from naive Bayes. Int J Approx Reason 43:1–25
Perrin F (1928) Étude mathématique du mouvement Brownien de rotation. Ann Sci Ec Norm Super 45:1–51
Rivest LP, Chang T (2006) Regression and correlation for 3 × 3 rotation matrices. Can J Stat Rev Can Stat 34(2):187–202
Romero V, Rumí R, Salmerón A (2006) Learning hybrid Bayesian networks using mixtures of truncated exponentials. Int J Approx Reason 42:54–68
Sahami M (1996) Learning limited dependence Bayesian classifiers. In: Simoudis E, Han J, Fayyad UM (eds) Proceedings of the 2nd international conference on knowledge discovery and data mining, AAAI Press, pp 335–338
SenGupta A, Roy S (2005) A simple classification rule for directional data. In: Balakrishnan N, Nagaraja HN, Kannan N (eds) Advances in ranking and selection, multiple comparisons, and reliability, statistics for industry and technology, Birkhäuser, Boston, pp 81–90
SenGupta A, Ugwuowo FI (2011) A classification method for directional data with application to the human skull. Commun Stat Theory Methods 40:457–466
Shenoy PP, West JC (2011) Inference in hybrid Bayesian networks using mixtures of polynomials. Int J Approx Reason 52(5):641–657
da Silva JE, Marques de Sá J, Jossinet J (2000) Classification of breast tissue by electrical impedance spectroscopy. Med Biol Eng Comput 38(1):26–30
Sra S (2012) A short note on parameter approximation for von Mises–Fisher distributions: and a fast implementation of \(I_s(x)\). Comput Stat 27(1):177–190
Wood AT (1994) Simulation of the von Mises–Fisher distribution. Commun Stat Simul Comput 23(1):157–164
Zemel RS, Williams CKI, Mozer MC (1995) Lending direction to neural networks. Neural Netw 8(4):503–512
Appendices
Appendix 1: von Mises NB classifier decision function
1.1 vMNB with one predictive variable
We start by equating the posterior probabilities of the two class values using the probability density function of the von Mises distribution (1):
$$ p(C=1)\frac{e^{\kappa_{\varPhi|1}\cos(\phi-\mu_{\varPhi|1})}}{2\pi I_0(\kappa_{\varPhi|1})} = p(C=2)\frac{e^{\kappa_{\varPhi|2}\cos(\phi-\mu_{\varPhi|2})}}{2\pi I_0(\kappa_{\varPhi|2})}. $$
Simplify the constant 2π, take logarithms and arrange all terms on the same side of the equation:
$$ \ln{p(C=1)} - \ln{I_0(\kappa_{\varPhi|1})} + \kappa_{\varPhi|1}\cos(\phi-\mu_{\varPhi|1}) - \ln{p(C=2)} + \ln{I_0(\kappa_{\varPhi|2})} - \kappa_{\varPhi|2}\cos(\phi-\mu_{\varPhi|2}) = 0. $$
Substitute \(\cos(\beta-\gamma) = \cos(\beta)\cos(\gamma)+\sin(\beta)\sin(\gamma)\) and operate the logarithms:
$$ \kappa_{\varPhi|1}(\cos\phi\cos{\mu_{\varPhi|1}} + \sin\phi\sin{\mu_{\varPhi|1}}) - \kappa_{\varPhi|2}(\cos\phi\cos{\mu_{\varPhi|2}} + \sin\phi\sin{\mu_{\varPhi|2}}) + \ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})}} = 0. $$
Arrange using \(\cos\phi\) and \(\sin\phi\) as common terms:
$$ (\kappa_{\varPhi|1}\cos{\mu_{\varPhi|1}} - \kappa_{\varPhi|2}\cos{\mu_{\varPhi|2}})\cos\phi + (\kappa_{\varPhi|1}\sin{\mu_{\varPhi|1}} - \kappa_{\varPhi|2}\sin{\mu_{\varPhi|2}})\sin\phi = -\ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})}}. $$
Substitute
$$ \begin{aligned} a &= \kappa_{\varPhi|1}\cos{\mu_{\varPhi|1}} - \kappa_{\varPhi|2}\cos{\mu_{\varPhi|2}},\\ b &= \kappa_{\varPhi|1}\sin{\mu_{\varPhi|1}} - \kappa_{\varPhi|2}\sin{\mu_{\varPhi|2}},\\ D &= -\ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})}} \end{aligned} $$
and get:
$$ a\cos\phi + b\sin\phi = D. $$
Trigonometrically, this is equivalent to:
$$ T\cos(\phi - \alpha) = D, $$
where \(T=\sqrt{a^2+b^2}, \cos{\alpha}=a/T, \sin{\alpha}=b/T, \tan{\alpha}=b/a. \) Isolating ϕ from the equation, we get:
$$ \phi = \alpha \pm \arccos(D/T). $$
The NB classifier finds two angles that bound the class regions.
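The derivation above can be checked numerically. The following sketch is illustrative only (the function name and its interface are ours, not part of the original study): it computes the constants a, b, D, T and α as defined above and returns the two decision angles, using NumPy's `np.i0` for the modified Bessel function \(I_0\).

```python
import numpy as np

def vm_decision_angles(mu1, kappa1, p1, mu2, kappa2, p2):
    """Decision angles of the von Mises NB classifier with one predictive
    variable: solves T*cos(phi - alpha) = D for the two boundary angles."""
    a = kappa1 * np.cos(mu1) - kappa2 * np.cos(mu2)
    b = kappa1 * np.sin(mu1) - kappa2 * np.sin(mu2)
    D = -np.log(p1 * np.i0(kappa2) / (p2 * np.i0(kappa1)))
    T = np.hypot(a, b)                 # T = sqrt(a^2 + b^2)
    alpha = np.arctan2(b, a)           # cos(alpha) = a/T, sin(alpha) = b/T
    if abs(D) > T:                     # no boundary: one class wins everywhere
        return None
    delta = np.arccos(D / T)
    return alpha - delta, alpha + delta
```

For equiprobable classes with a shared concentration parameter (Case 1 below), the returned angles reduce to the bisector of the two mean directions and its antipode.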
1.2 Particular cases
We have also derived these angles when the conditional probability distributions share one of the parameters. We consider that the classes are equiprobable. If they are not equiprobable, the prior probabilities of the class values influence the value of D, modifying the class subregions so that more likely classes have larger subregions.
- Case 1: \(\kappa_{\varPhi|1} = \kappa_{\varPhi|2} = \kappa_\varPhi \hbox{ and } \mu_{\varPhi|1} \neq \mu_{\varPhi|2}. \) When the concentration parameter is the same in the two distributions, we have the following values for the constants:
$$ \begin{aligned} a &= \kappa_\varPhi(\cos{\mu_{\varPhi|1}} - \cos{\mu_{\varPhi|2}}),\\ b &= \kappa_\varPhi(\sin{\mu_{\varPhi|1}} - \sin{\mu_{\varPhi|2}}),\\ D &= -\ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})}} = - \ln{1} = 0. \end{aligned} $$Substituting in the expression of the arccosine, we get:
$$ \arccos(D/T) = \arccos{0} = \pi/2. $$To compute α, we take the trigonometric identities:
$$ \begin{aligned} \cos\beta - \cos\gamma &= -2\sin\left(\frac{1}{2}(\beta + \gamma)\right)\sin\left(\frac{1}{2}(\beta - \gamma)\right),\\ \sin\beta - \sin\gamma &= 2\sin\left(\frac{1}{2}(\beta - \gamma)\right)\cos\left(\frac{1}{2}(\beta + \gamma)\right), \end{aligned} $$which we substitute in the following expression:
$$ \begin{aligned} \tan\alpha &= \frac{b}{a} = \frac{\kappa_\varPhi(\sin{\mu_{\varPhi|1}} - \sin{\mu_{\varPhi|2}})}{\kappa_\varPhi(\cos{\mu_{\varPhi|1}} - \cos{\mu_{\varPhi|2}})}\\ &= \frac{2\sin(\frac{1}{2}(\mu_{\varPhi|1}-\mu_{\varPhi|2}))\cos(\frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}))}{-2\sin(\frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}))\sin(\frac{1}{2}(\mu_{\varPhi|1}-\mu_{\varPhi|2}))}\\ &= -\frac{\cos(\frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}))}{\sin(\frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}))}\\ &= -\cot\left(\frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2})\right)\\ &= \tan\left(\frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}) + \frac{\pi}{2}\right),\\ \alpha &= \frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}) + \frac{\pi}{2}. \end{aligned} $$Now we can compute the decision angles found by the classifier:
$$ \phi = \alpha \pm \arccos(D/T) = \frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}) + \frac{\pi}{2} \pm \frac{\pi}{2}. $$The two decision angles are:
$$ \begin{aligned} \phi' &= \frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}),\\ \phi'' &= \frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}) + \pi. \end{aligned} $$These two angles correspond to the bisector angle of the two mean directions.
- Case 2: \(\kappa_{\varPhi|1} \neq \kappa_{\varPhi|2} \hbox{ and } \mu_{\varPhi|1} = \mu_{\varPhi|2} = \mu_\varPhi. \) In this scenario the mean directions are equal, so the constants reduce to:
$$ \begin{aligned} a &= (\kappa_{\varPhi|1} - \kappa_{\varPhi|2})\cos{\mu_\varPhi},\\ b &= (\kappa_{\varPhi|1} - \kappa_{\varPhi|2})\sin{\mu_\varPhi},\\ D &= -\ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})}} = -\ln{\frac{I_0(\kappa_{\varPhi|2})}{I_0(\kappa_{\varPhi|1})}},\\ T &= \sqrt{a^2 + b^2}\\ &= \sqrt{(\kappa_{\varPhi|1} - \kappa_{\varPhi|2})^2\cos^2{\mu_\varPhi} + (\kappa_{\varPhi|1} - \kappa_{\varPhi|2})^2\sin^2{\mu_\varPhi}}\\ &= \sqrt{(\kappa_{\varPhi|1} - \kappa_{\varPhi|2})^2(\cos^2{\mu_\varPhi} + \sin^2{\mu_\varPhi})}\\ &= |\kappa_{\varPhi|1} - \kappa_{\varPhi|2}| = \kappa_{\varPhi|1} - \kappa_{\varPhi|2}, \end{aligned} $$where we assume, without loss of generality, that \(\kappa_{\varPhi|1} > \kappa_{\varPhi|2}.\)
We compute α by substituting in the expression:
$$ \tan{\alpha} = \frac{b}{a} = \frac{(\kappa_{\varPhi|1} - \kappa_{\varPhi|2})\sin{\mu_\varPhi}}{(\kappa_{\varPhi|1} - \kappa_{\varPhi|2})\cos{\mu_\varPhi}} = \tan{\mu_\varPhi}, \qquad \alpha = \mu_\varPhi. $$
Therefore, the resulting decision angles are given by:
$$ \phi = \mu_\varPhi \pm \arccos(D/T) = \mu_\varPhi \pm \arccos\left(\frac{1}{\kappa_{\varPhi|1} - \kappa_{\varPhi|2}}\ln{\frac{I_0(\kappa_{\varPhi|1})}{I_0(\kappa_{\varPhi|2})}}\right). $$
Clearly, the two angles are defined with respect to the common mean direction, and their distance to that mean direction depends on the concentration parameter values.
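As a numerical illustration of Case 2 (again an illustrative sketch with our own naming, assuming equiprobable classes and \(\kappa_{\varPhi|1} > \kappa_{\varPhi|2}\)), the decision angles lie symmetrically around the shared mean direction:

```python
import numpy as np

def vm_case2_angles(mu, kappa1, kappa2):
    """Decision angles when the two class-conditional von Mises distributions
    share the mean direction mu (equiprobable classes, kappa1 > kappa2)."""
    D = -np.log(np.i0(kappa2) / np.i0(kappa1))   # prior terms cancel
    T = kappa1 - kappa2
    delta = np.arccos(D / T)
    return mu - delta, mu + delta
```

At both returned angles the two class-conditional log-densities coincide; the more concentrated class keeps the subregion around the common mean direction.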
1.3 vMNB with two predictive variables
In this scenario, we have two circular predictive variables \(\varPhi\) and \(\varPsi. \) The domain defined by these variables is a torus (−π, π] × (−π, π]. As in the simpler case above, we compute the decision surfaces induced by the classifier by equating the posterior probabilities of the two class values
$$ p(C=1|\phi,\psi) = p(C=2|\phi,\psi). $$
Using Bayes’ rule and the conditional independence assumption, we get
$$ p(C=1)f(\phi|C=1)f(\psi|C=1) = p(C=2)f(\phi|C=2)f(\psi|C=2). $$
We substitute the von Mises density (1) and get:
$$ p(C=1)\frac{e^{\kappa_{\varPhi|1}\cos(\phi-\mu_{\varPhi|1})}}{2\pi I_0(\kappa_{\varPhi|1})}\frac{e^{\kappa_{\varPsi|1}\cos(\psi-\mu_{\varPsi|1})}}{2\pi I_0(\kappa_{\varPsi|1})} = p(C=2)\frac{e^{\kappa_{\varPhi|2}\cos(\phi-\mu_{\varPhi|2})}}{2\pi I_0(\kappa_{\varPhi|2})}\frac{e^{\kappa_{\varPsi|2}\cos(\psi-\mu_{\varPsi|2})}}{2\pi I_0(\kappa_{\varPsi|2})}. $$
We simplify the constant 2π, take logarithms and arrange all the terms on the same side of the equation:
$$ \kappa_{\varPhi|1}\cos(\phi-\mu_{\varPhi|1}) - \kappa_{\varPhi|2}\cos(\phi-\mu_{\varPhi|2}) + \kappa_{\varPsi|1}\cos(\psi-\mu_{\varPsi|1}) - \kappa_{\varPsi|2}\cos(\psi-\mu_{\varPsi|2}) + \ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})I_0(\kappa_{\varPsi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})I_0(\kappa_{\varPsi|1})}} = 0. $$
We substitute the trigonometric identity \(\cos(\beta-\gamma) = \cos(\beta)\cos(\gamma)+\sin(\beta)\sin(\gamma)\) and arrange the terms:
$$ (\kappa_{\varPhi|1}\cos{\mu_{\varPhi|1}} - \kappa_{\varPhi|2}\cos{\mu_{\varPhi|2}})\cos\phi + (\kappa_{\varPhi|1}\sin{\mu_{\varPhi|1}} - \kappa_{\varPhi|2}\sin{\mu_{\varPhi|2}})\sin\phi + (\kappa_{\varPsi|1}\cos{\mu_{\varPsi|1}} - \kappa_{\varPsi|2}\cos{\mu_{\varPsi|2}})\cos\psi + (\kappa_{\varPsi|1}\sin{\mu_{\varPsi|1}} - \kappa_{\varPsi|2}\sin{\mu_{\varPsi|2}})\sin\psi = -\ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})I_0(\kappa_{\varPsi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})I_0(\kappa_{\varPsi|1})}}. $$
We define the following constants:
$$ \begin{aligned} a_\varPhi &= \kappa_{\varPhi|1}\cos{\mu_{\varPhi|1}} - \kappa_{\varPhi|2}\cos{\mu_{\varPhi|2}},\\ b_\varPhi &= \kappa_{\varPhi|1}\sin{\mu_{\varPhi|1}} - \kappa_{\varPhi|2}\sin{\mu_{\varPhi|2}},\\ a_\varPsi &= \kappa_{\varPsi|1}\cos{\mu_{\varPsi|1}} - \kappa_{\varPsi|2}\cos{\mu_{\varPsi|2}},\\ b_\varPsi &= \kappa_{\varPsi|1}\sin{\mu_{\varPsi|1}} - \kappa_{\varPsi|2}\sin{\mu_{\varPsi|2}},\\ D &= -\ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})I_0(\kappa_{\varPsi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})I_0(\kappa_{\varPsi|1})}} \end{aligned} $$and substitute them to get
$$ a_\varPhi\cos\phi + b_\varPhi\sin\phi + a_\varPsi\cos\psi + b_\varPsi\sin\psi = D. $$
The Cartesian coordinates of the points defined by the angles ϕ and ψ on the surface of a torus are
$$ \begin{aligned} x &= (L + l\cos\phi)\cos\psi,\\ y &= (L + l\cos\phi)\sin\psi,\\ z &= l\sin\phi, \end{aligned} $$where L is the distance from the center of the torus to the center of the revolving circumference that generates the torus, and l is the radius of the revolving circumference. We isolate the trigonometric functions and get
$$ \sin\phi = \frac{z}{l}, \quad \cos\phi = \pm\frac{\sqrt{l^2 - z^2}}{l}, \quad \cos\psi = \frac{x}{L + l\cos\phi}, \quad \sin\psi = \frac{y}{L + l\cos\phi}. $$
Substituting these expressions, we get the two following equations corresponding to the two signs of cosϕ:
Operating and arranging the terms, we get
These expressions are quadratic in z. Therefore, we conclude that von Mises NB with two predictive variables is a much more complex and flexible classifier than von Mises NB with one predictive variable.
Appendix 2: von Mises–Fisher NB classifier decision function
To study the decision function for the von Mises–Fisher NB classifier we proceed as in Appendix 1. We equate the posterior probabilities of the class values using the probability density function in Eq. (1):
$$ p(C=1)\frac{\kappa_{{\mathbf{X}}|1}^{\frac{n}{2}-1}}{(2\pi)^{\frac{n}{2}} I_{\frac{n}{2}-1}(\kappa_{{\mathbf{X}}|1})}\,e^{\kappa_{{\mathbf{X}}|1}{\boldsymbol{\mu}}_{{\mathbf{X}}|1}^{\rm T}{\mathbf{X}}} = p(C=2)\frac{\kappa_{{\mathbf{X}}|2}^{\frac{n}{2}-1}}{(2\pi)^{\frac{n}{2}} I_{\frac{n}{2}-1}(\kappa_{{\mathbf{X}}|2})}\,e^{\kappa_{{\mathbf{X}}|2}{\boldsymbol{\mu}}_{{\mathbf{X}}|2}^{\rm T}{\mathbf{X}}}. $$
Simplify the constants and take logarithms:
$$ \ln{p(C=1)} + \left(\tfrac{n}{2}-1\right)\ln{\kappa_{{\mathbf{X}}|1}} - \ln{I_{\frac{n}{2}-1}(\kappa_{{\mathbf{X}}|1})} + \kappa_{{\mathbf{X}}|1}{\boldsymbol{\mu}}_{{\mathbf{X}}|1}^{\rm T}{\mathbf{X}} = \ln{p(C=2)} + \left(\tfrac{n}{2}-1\right)\ln{\kappa_{{\mathbf{X}}|2}} - \ln{I_{\frac{n}{2}-1}(\kappa_{{\mathbf{X}}|2})} + \kappa_{{\mathbf{X}}|2}{\boldsymbol{\mu}}_{{\mathbf{X}}|2}^{\rm T}{\mathbf{X}}. $$
Arrange all the terms on the same side of the equation and operate the logarithms to get the following hyperplane equation:
$$ (\kappa_{{\mathbf{X}}|1}{\boldsymbol{\mu}}_{{\mathbf{X}}|1} - \kappa_{{\mathbf{X}}|2}{\boldsymbol{\mu}}_{{\mathbf{X}}|2})^{\rm T}{\mathbf{X}} + \ln{\frac{p(C=1)\kappa_{{\mathbf{X}}|1}^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{\mathbf{X}}|2})}{p(C=2)\kappa_{{\mathbf{X}}|2}^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{\mathbf{X}}|1})}} = 0. $$
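The hyperplane coefficients can be computed directly. The sketch below is illustrative only (function names are ours) and fixes n = 3, where the Bessel function of order \(\tfrac{n}{2}-1 = \tfrac{1}{2}\) has the closed form \(I_{1/2}(\kappa) = \sqrt{2/(\pi\kappa)}\sinh{\kappa}\):

```python
import numpy as np

def besseli_half(kappa):
    """I_{1/2}(kappa), the Bessel order that appears when n = 3."""
    return np.sqrt(2.0 / (np.pi * kappa)) * np.sinh(kappa)

def vmf_hyperplane(mu1, kappa1, p1, mu2, kappa2, p2, n=3):
    """Coefficients (w, d) of the decision hyperplane w^T x + d = 0 of the
    von Mises-Fisher NB classifier with one predictive variable on S^{n-1}."""
    w = kappa1 * np.asarray(mu1, float) - kappa2 * np.asarray(mu2, float)
    d = np.log(p1 * kappa1 ** (n / 2 - 1) * besseli_half(kappa2)
               / (p2 * kappa2 ** (n / 2 - 1) * besseli_half(kappa1)))
    return w, d
```

With equal concentrations and equal priors the bias term d vanishes, recovering the homogeneous hyperplane of Case 1 below.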
2.1 Particular cases
Considering that both class values have the same prior probability and that one of the parameters has the same value in both distributions, Case 1 and Case 2 can be simplified as follows. When the prior probabilities differ, the hyperplane moves away from the mean direction of the more likely class value, making its subregion larger.
- Case 1: \(\kappa_{{\mathbf{X}}|1} = \kappa_{{\mathbf{X}}|2} = \kappa_{\mathbf{X}} \hbox{ and } {\boldsymbol{\mu}}_{{\mathbf{X}}|1} \neq {\boldsymbol{\mu}}_{{\mathbf{X}}|2}.\) When the distributions share the concentration parameter, we get the expression:
$$ (\kappa_{{\mathbf{X}}} {{\boldsymbol{\mu}}}_{{{\mathbf{X}}}|1} - \kappa_{{\mathbf{X}}} {{\boldsymbol{\mu}}}_{{{\mathbf{X}}}|2})^{\rm T} {{\mathbf{X}}} + \ln{\frac{p(C=1)\kappa_{{\mathbf{X}}}^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{\mathbf{X}}})} {p(C=2)\kappa_{{\mathbf{X}}}^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{\mathbf{X}}})}} = 0. $$The logarithm reduces to 0 and we can take \(\kappa_{\mathbf{X}}\) as common term:
$$ \kappa_{{\mathbf{X}}} ({\boldsymbol{\mu}}_{{{\mathbf{X}}}|1} - {\boldsymbol{\mu}}_{{{\mathbf{X}}}|2})^{\rm T} {{\mathbf{X}}} = 0. $$Therefore, given that κ > 0 (otherwise the distributions are uniform), the hyperplane equation reduces to:
$$ ({\boldsymbol{\mu}}_{{{\mathbf{X}}}|1} - {\boldsymbol{\mu}}_{{{\mathbf{X}}}|2})^{\rm T} {{\mathbf{X}}} = 0. $$That equation specifies a hyperplane that contains the origin point (\(\mathbf{0}\)) and goes through the midpoint of the arc that connects the points of the hypersphere defined by the mean directions \({\boldsymbol{\mu}}_{{\mathbf{X}}|1}\) and \({\boldsymbol{\mu}}_{{\mathbf{X}}|2}.\)
- Case 2: \(\kappa_{{\mathbf{X}}|1} \neq \kappa_{{\mathbf{X}}|2} \hbox{ and } {\boldsymbol{\mu}}_{{\mathbf{X}}|1} = {\boldsymbol{\mu}}_{{\mathbf{X}}|2} = {\boldsymbol{\mu}}_{\mathbf{X}}.\) In the case where the mean directions have the same value, we can derive the following equation:
$$ \begin{aligned} &(\kappa_{{{\mathbf{X}}}|1} {\boldsymbol{\mu}}_{{\mathbf{X}}}^{\rm T} - \kappa_{{{\mathbf{X}}}|2} {\boldsymbol{\mu}}_{{\mathbf{X}}}^{\rm T}) {{\mathbf{X}}}\\ &\quad + \ln{\frac{p(C=1)(\kappa_{{{\mathbf{X}}}|1})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|2})} {p(C=2)(\kappa_{{{\mathbf{X}}}|2})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|1})}} = 0. \end{aligned} $$We can take \({\boldsymbol{\mu}}_{\mathbf{X}}^{\rm T}\) as a common term:
$$ (\kappa_{{{\mathbf{X}}}|1} - \kappa_{{{\mathbf{X}}}|2}) {\boldsymbol{\mu}}_{{\mathbf{X}}}^{\rm T} {{\mathbf{X}}} + \ln{\frac{(\kappa_{{{\mathbf{X}}}|1})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|2})} {(\kappa_{{{\mathbf{X}}}|2})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|1})}} = 0. $$Dividing by (\(\kappa_{{\mathbf{X}}|1} - \kappa_{{\mathbf{X}}|2}\)), we get:
$$ {\boldsymbol{\mu}}_{{\mathbf{X}}}^{\rm T} {{\mathbf{X}}} + \frac{1}{\kappa_{{{\mathbf{X}}}|1} - \kappa_{{{\mathbf{X}}}|2}}\ln{\frac{(\kappa_{{{\mathbf{X}}}|1})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|2})} {(\kappa_{{{\mathbf{X}}}|2})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|1})}} = 0. $$
The hyperplane defined by that equation is perpendicular to the shared mean direction vector \({\boldsymbol{\mu}}_{\mathbf{X}},\) and its position is given by the relationships between the concentration parameters.
Appendix 3: Mutual information computation
The mutual information between two variables X and Y is defined as
$$ \hbox{MI}(X,Y) = \int\!\!\int \rho(x,y)\ln{\frac{\rho(x,y)}{\rho(x)\rho(y)}}\,dx\,dy, $$where ρ is a generalized probability function.
In supervised classification problems, we have to estimate \(\hbox{MI}(X_i,C)\) from a set of data pairs \(\left(x_i^{(j)},c^{(j)}\right),j=1,\ldots,m. \) When \(X_i\) is a discrete variable, an estimator of the mutual information in (10) is given by
$$ \widehat{\hbox{MI}}(X_i,C) = \sum_{x_i}\sum_{c} \widehat{p}(x_i,c)\ln{\frac{\widehat{p}(x_i,c)}{\widehat{p}(x_i)\widehat{p}(c)}}, $$where \(\widehat{p}\) are the probabilities estimated from the counts in the dataset.
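The plug-in estimator for the discrete case can be written directly from the counts (a minimal sketch; `mi_discrete` is our own naming, not code from the study):

```python
import numpy as np
from collections import Counter

def mi_discrete(x, c):
    """Plug-in estimate of MI(X_i, C): probabilities are relative counts."""
    m = len(x)
    joint = Counter(zip(x, c))
    px, pc = Counter(x), Counter(c)
    # p(x,c) * log(p(x,c) / (p(x) p(c))) with the m's cancelled out
    return sum((n / m) * np.log(n * m / (px[xi] * pc[ci]))
               for (xi, ci), n in joint.items())
```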
When the predictive variable \(X_i\) is continuous, we take an approach consistent with the conditional independence assumption and model the conditional probability densities of \(X_i|C=c\) as Gaussian or von Mises distributions, depending on the nature of the variable, i.e., linear or angular. Therefore, the marginal density of \(X_i\) is a mixture of Gaussian or von Mises distributions, respectively. Algorithm 1 shows the process for computing \(\hbox{MI}(X_i,C).\)
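For an angular variable, the integral in the mutual information can be approximated by quadrature over (−π, π], with the von Mises mixture as the marginal density. This is only a sketch of the kind of computation Algorithm 1 performs; the function names and the plain uniform-grid quadrature are our assumptions, not the paper's algorithm:

```python
import numpy as np

def vm_pdf(phi, mu, kappa):
    """von Mises density; np.i0 is the modified Bessel function I_0."""
    return np.exp(kappa * np.cos(phi - mu)) / (2.0 * np.pi * np.i0(kappa))

def mi_von_mises(priors, mus, kappas, n_grid=20000):
    """MI(X_i, C) when X_i | C=c ~ vM(mu_c, kappa_c): sum over classes of
    p(c) * integral of f(x|c) * log(f(x|c) / f(x)), with f(x) the mixture."""
    phi = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    cond = [vm_pdf(phi, m, k) for m, k in zip(mus, kappas)]
    marginal = sum(p * fc for p, fc in zip(priors, cond))
    integrand = sum(p * fc * np.log(fc / marginal)
                    for p, fc in zip(priors, cond))
    return integrand.mean() * 2.0 * np.pi   # uniform-grid quadrature
```

Identical class-conditional distributions give zero mutual information, and well-separated mean directions push it toward its upper bound \(\ln{2}\) for two equiprobable classes.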
Appendix 4: Dataset analysis and preprocessing
A thorough inspection of the datasets for supervised classification available in the UCI Machine Learning Repository [28] revealed only 5 out of 135 datasets containing some variable measured in angles (bottom half of Table 6). We found no reference to these directional data having been given special treatment. For this reason, we assume that they have been studied as linear continuous variables without taking into account their special properties. We omitted the Breast Tissue dataset [39, 67] from the study because it was not clear whether the “PhaseAngle” variable really represents an angle and how it was measured. Additionally, another four datasets not included in the UCI repository were considered for evaluation (top half of Table 6). A description of the datasets used in this study follows:
UCI datasets
- Australian Sign Language (Auslan): Identification of 95 Australian Sign Language signs using the position (x, y, z) and orientation angles (roll, pitch, yaw) of both hands [40]. Therefore, 12 measurements are studied. According to [40], the bending measurements are not very reliable, so they were omitted as predictive variables. This is a time series classification problem. The position and orientation of the hands are measured at different times, yielding approximately 54 data frames for each sign. We resampled a set of 10 evenly distributed frames and used them as predictive variables. According to the description, there are 95 different signs (class values), and each sign is repeated 27 times. However, the her sign appears only three times, whereas the his-hers sign appears 24 times. Therefore, we have assumed that they are the same sign and have considered them all as his-hers signs.
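The frame-resampling step described above can be sketched as follows (a hypothetical helper of our own, not the preprocessing code actually used in the study):

```python
import numpy as np

def resample_frames(frames, k=10):
    """Select k evenly distributed frames from a variable-length sequence."""
    idx = np.linspace(0, len(frames) - 1, num=k).round().astype(int)
    return [frames[i] for i in idx]
```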
- MAGIC Gamma Telescope (MAGIC): Discrimination of the images of hadronic showers initiated by primary gammas from those caused by cosmic rays in the upper atmosphere [7]. The images of the hadronic showers captured by the telescope are preprocessed and modeled as ellipses. The predictive variables describe the shape of the ellipses. The dataset includes one angular variable that captures the angle between the major axis of the ellipse and the vector connecting the center of the ellipse with the center of the camera.
- Arrhythmia: Identification of the presence and absence of cardiac arrhythmia from electrocardiograms (ECG). The original dataset has 16 class values: one for healthy items, 14 types of cardiac arrhythmias and one class value for unclassified items [33]. We erased the unclassified items and built a binary class (normal vs. arrhythmia). The predictive variables describe clinical measurements, patient data and ECG recordings. The angular variables describe the vector angles on the front plane of four ECG waves. We removed variable 14, which had more than 83% missing values, and used Weka’s ReplaceMissingValues filter [22] to fill in the missing values of variables 11–13 and 15 with the mode. We also removed some non-informative discrete and continuous variables.
- Covertype: Prediction of the kind of trees that grow in a specific area given some attributes describing the geography of the land [6]. The two angular variables describe the aspect (orientation) of the land from the true north and the slope of the ground. The original dataset has 581,012 samples; we used Weka’s supervised resampling method (without replacement) to reduce the dataset to 100,000 samples.
4.1 Other datasets
- Megaspores: Classification of megaspores into two classes (their group in the biological taxonomy) according to the angle of their wall elements [43]. The dataset is an example included in the Oriana software (footnote 2).
- Protein1: Prediction of the secondary structure of a single amino acid, using the dihedral angles (ϕ, ψ) of the residue as predictive information. We only considered α-helix and β-sheet structures, making the class binary. The data were retrieved from the protein geometry database [5].
- Protein10: Prediction of the secondary structure of an amino acid, using the dihedral angles (ϕ, ψ) and the planarity angle (ω). We considered the three angles in ten consecutive residues. We classified the four most common structures: α-helices, β-sheets, bends and turns. The data were retrieved from the protein geometry database [5].
- Temperature: Prediction of the outdoor temperature from the season, wind speed and wind direction. We used hourly measurements from a weather station located in the city of Houston. Data for the year 2010 were retrieved, and we removed the hours with missing values for any of the four variables. The information was collected from the Texas Commission on Environmental Quality website (footnote 3). The class variable (outdoor temperature) was measured in degrees Fahrenheit and discretized into the following three values: low (T ≤ 50), medium (50 < T < 70) and high (T ≥ 70).
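The discretization of the class variable can be written as a simple rule (an illustrative snippet, not code from the study):

```python
def temperature_class(t):
    """Map an outdoor temperature in degrees Fahrenheit to the class value:
    low (T <= 50), medium (50 < T < 70) or high (T >= 70)."""
    if t <= 50:
        return "low"
    if t < 70:
        return "medium"
    return "high"
```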
Cite this article
López-Cruz, P.L., Bielza, C. & Larrañaga, P. Directional naive Bayes classifiers. Pattern Anal Applic 18, 225–246 (2015). https://doi.org/10.1007/s10044-013-0340-z