
Directional naive Bayes classifiers


Abstract

Directional data are ubiquitous in science. These data have some special properties that rule out the use of classical statistics. Therefore, different distributions and statistics, such as the univariate von Mises and the multivariate von Mises–Fisher distributions, should be used to deal with this kind of information. We extend the naive Bayes classifier to the case where the conditional probability distributions of the predictive variables follow either of these distributions. We consider the simple scenario, where only directional predictive variables are used, and the hybrid case, where discrete, Gaussian and directional distributions are mixed. The classifier decision functions and their decision surfaces are studied at length. Artificial examples are used to illustrate the behavior of the classifiers. The proposed classifiers are then evaluated over eight datasets, showing competitive performances against other naive Bayes classifiers that use Gaussian distributions or discretization to manage directional data.


Notes

  1. The source code is available at: http://www.unc.edu/sungkyu.

  2. The Oriana software is available at: http://www.kovcomp.co.uk/oriana.

  3. The Texas Commission on Environmental Quality website is available at: http://www.tceq.state.tx.us.

References

  1. Agresti A (2007) An introduction to categorical data analysis, 2nd edn. Wiley, New York

  2. Amayri O, Bouguila N (2013) Beyond hybrid generative discriminative learning: spherical data classification. Pattern Anal Appl, in press

  3. Banerjee A, Dhillon IS, Ghosh J, Sra S (2005) Clustering on the unit hypersphere using von Mises–Fisher distributions. J Mach Learn Res 6:1345–1382


  4. Berens P (2009) CircStat: a MATLAB toolbox for circular statistics. J Stat Softw 31(10):1–21


  5. Berkholz DS, Krenesky PB, Davidson JR, Karplus PA (2010) Protein geometry database: a flexible engine to explore backbone conformations and their relationships to covalent geometry. Nucleic Acids Res 38(Suppl 1):D320–D325


  6. Blackard JA, Dean DJ (1999) Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Comput Electron Agric 24(3):131–151


  7. Bock RK, Chilingarian A, Gaug M, Hakl F, Hengstebeck T, Jiřina M, Klaschka J, Kotrč E, Savický P, Towers S, Vaiciulis A, Wittek W (2004) Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope. Nucl Instrum Methods in Phys Res Sect A-Accel Spectrom Detect Assoc Equip 516(2–3):511–528


  8. Bøttcher SG (2004) Learning Bayesian networks with mixed variables. PhD thesis, Aalborg University

  9. Bouckaert RR (2004) Estimating replicability of classifier learning experiments. In: Brodley CE (ed) Proceedings of the 21st international conference on machine learning, ACM

  10. Damien P, Walker S (1999) A full Bayesian analysis of circular data using the von Mises distribution. Can J Stat Rev Can Stat 27(2):291–298


  11. de Haas–Lorentz GL (1913) Die Brownsche Bewegung und einige verwandte Erscheinungen. Friedr. Vieweg und Sohn

  12. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30


  13. Devlaminck D, Waegeman W, Bauwens B, Wyns B, Santens P, Otte G (2010) From circular ordinal regression to multilabel classification. In: Proceedings of the 2010 workshop on preference learning, European conference on machine learning

  14. Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero–one loss. Mach Learn 29:103–130


  15. Downs TD (2003) Spherical regression. Biometrika 90(3):655–668


  16. Downs TD, Mardia KV (2002) Circular regression. Biometrika 89(3):683–697


  17. Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York

  18. Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley, New York

  19. Eben K (1983) Classification into two von Mises distributions with unknown mean directions. Aplikace Matematiky 28(3):230–237


  20. El Khattabi S, Streit F (1996) Identification analysis in directional statistics. Comput Stat Data Anal 23:45–63


  21. Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Bajcsy R (ed) Proceedings of the 13th international joint conference on artificial intelligence, Morgan Kaufmann, San Mateo, pp 1022–1027

  22. Figueiredo A (2009) Discriminant analysis for the von Mises–Fisher distribution. Commun Stat Simul Comput 38(9):1991–2003


  23. Figueiredo A, Gomes P (2006) Discriminant analysis based on the Watson distribution defined on the hypersphere. Stat: J Theor Appl Stat 40(5):435–445


  24. Fisher NI (1987) Statistical analysis of spherical data. Cambridge University Press, Cambridge

  25. Fisher NI (1993) Statistical analysis of circular data. Cambridge University Press, Cambridge

  26. Fisher NI, Lee AJ (1992) Regression models for an angular response. Biometrics 48:665–677


  27. Fisher RA (1953) Dispersion on a sphere. Proc R Soc Lond Ser A Math Phys Sci 217:295–305


  28. Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml

  29. Frank E, Trigg L, Holmes G, Witten IH (2000) Technical note: naive Bayes for regression. Mach Learn 41(1):5–25


  30. Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Mach Learn 29:131–163


  31. García S, Herrera F (2008) An extension on “Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. J Mach Learn Res 9:2677–2694


  32. Guttorp P, Lockhart RA (1988) Finding the location of a signal: a Bayesian analysis. J Am Stat Assoc 83:322–330


  33. Güvenir HA, Acar B, Demiröz G, Çekin A (1997) A supervised machine learning algorithm for arrhythmia analysis. In: Murray A, Swiryn S (eds) Computers in cardiology 1997, pp 433–436

  34. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explorations 11(1)

  35. Hornik K, Grün B (2013) On conjugate families and Jeffreys priors for von Mises–Fisher distributions. J Stat Plan Infer 143(5):992–999


  36. Jaakkola TS (1997) Variational methods for inference and estimation in graphical models. PhD thesis, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology

  37. Jammalamadaka SR, SenGupta A (2001) Topics in circular statistics. World Scientific, Singapore

  38. Johnson RA, Wehrly TE (1978) Some angular-linear distributions and related regression models. J Am Stat Assoc 73(363):602–606


  39. Jossinet J (1996) Variability of impedivity in normal and pathological breast tissue. Med Biol Eng Comput 34(5):346–350


  40. Kadous MW (2002) Temporal classification: extending the classification paradigm to multivariate time series. PhD thesis, School of Computer Science and Engineering, University of New South Wales

  41. Kato S, Shimizu K, Shieh G (2008) A circular-circular regression model. Stat Sin 18(2):633–645


  42. Koller D, Friedman N (2009) Probabilistic graphical models. Principles and techniques. The MIT Press, Boston

  43. Kovach WL (1989) Quantitative methods for the study of lycopod megaspore ultrastructure. Rev Palaeobot Palynol 57(3–4):233–246


  44. Langley P, Sage S (1994) Induction of selective Bayesian classifiers. In: López de Mántaras R, Poole D (eds) Proceedings of the 10th conference on uncertainty in artificial intelligence, Morgan Kaufmann, San Mateo, pp 399–406

  45. Lévy MP (1939) L’addition des variables aléatoires définies sur une circonférence. Bull Soc Math Fr 67:1–41


  46. López-Cruz PL, Bielza C, Larrañaga P (2011) The von Mises naive Bayes classifier for angular data. In: Proceedings of the 14th conference of the Spanish Association for Artificial Intelligence, CAEPIA 2011, LNCS 7023, pp 145–154

  47. Mardia KV (1975) Statistics of directional data. J R Stat Soc Ser B Stat Methodol 37(3):349–393


  48. Mardia KV (2006) On some recent advancements in applied shape analysis and directional statistics. In: Barber S, Baxter PD, Mardia KV (eds) Systems biology and statistical bioinformatics, Leeds University Press, Leeds, pp 9–17

  49. Mardia KV (2010) Bayesian analysis for bivariate von Mises distributions. J Appl Stat 37(3):515–528


  50. Mardia KV, Jupp PE (2000) Directional statistics. Wiley, New York

  51. Minsky M (1961) Steps toward artificial intelligence. Proc Inst Radio Eng 49:8–30


  52. von Mises R (1918) Über die “Ganzzahligkeit” der Atomgewichte und verwandte Fragen. Phys Z 19:490–500


  53. Mooney JA, Helms PJ, Jolliffe IT (2003) Fitting mixtures of von Mises distributions: a case study involving sudden infant death syndrome. Comput Stat Data Anal 41(3–4):505–513


  54. Morales M, Rodríguez C, Salmerón A (2007) Selective naive Bayes for regression based on mixtures of truncated exponentials. Int J Uncertainty Fuzziness Knowl Based Syst 15(6):697–716


  55. Morris JE, Laycock PJ (1974) Discriminant analysis of directional data. Biometrika 61(2):335–341


  56. Pazzani MJ (1995) Searching for dependencies in Bayesian classifiers. In: Fisher D, Lenz HJ (eds) Learning from data: artificial intelligence and statistics V, Proceedings of the 5th international workshop on artificial intelligence and statistics, Springer, pp 239–248

  57. Pearl J (1988) Probabilistic reasoning in intelligent systems. Morgan Kaufmann, San Mateo

  58. Peot MA (1996) Geometric implications of the naive Bayes assumption. In: Horvitz E, Jensen FV (eds) Proceedings of the 12th conference on uncertainty in artificial intelligence, Morgan Kaufmann, San Mateo, pp 414–419

  59. Pérez A, Larrañaga P, Inza I (2006) Supervised classification with conditional Gaussian networks: increasing the structure complexity from naive Bayes. Int J Approx Reason 43:1–25


  60. Perrin F (1928) Étude mathématique du mouvement Brownien de rotation. Ann Sci Ec Norm Super 45:1–51


  61. Rivest LP, Chang T (2006) Regression and correlation for 3 × 3 rotation matrices. Can J Stat Rev Can Stat 34(2):187–202


  62. Romero V, Rumí R, Salmerón A (2006) Learning hybrid Bayesian networks using mixtures of truncated exponentials. Int J Approx Reason 42:54–68


  63. Sahami M (1996) Learning limited dependence Bayesian classifiers. In: Simoudis E, Han J, Fayyad UM (eds) Proceedings of the 2nd international conference on knowledge discovery and data mining, AAAI Press, pp 335–338

  64. SenGupta A, Roy S (2005) A simple classification rule for directional data. In: Balakrishnan N, Nagaraja HN, Kannan N (eds) Advances in ranking and selection, multiple comparisons, and reliability, statistics for industry and technology, Birkhäuser, Boston, pp 81–90

  65. SenGupta A, Ugwuowo FI (2011) A classification method for directional data with application to the human skull. Commun Stat Theory Methods 40:457–466


  66. Shenoy PP, West JC (2011) Inference in hybrid Bayesian networks using mixtures of polynomials. Int J Approx Reason 52(5):641–657


  67. da Silva JE, Marques de Sá J, Jossinet J (2000) Classification of breast tissue by electrical impedance spectroscopy. Med Biol Eng Comput 38(1):26–30


  68. Sra S (2012) A short note on parameter approximation for von Mises–Fisher distributions: and a fast implementation of I_s(x). Comput Stat 27(1):177–190


  69. Wood AT (1994) Simulation of the von Mises–Fisher distribution. Commun Stat Simul Comput 23(1):157–164


  70. Zemel RS, Williams CKI, Mozer MC (1995) Lending direction to neural networks. Neural Netw 8(4):503–512



Author information


Corresponding author

Correspondence to Pedro L. López-Cruz.

Appendices

Appendix 1: von Mises NB classifier decision function

1.1 vMNB with one predictive variable

We start by equating the posterior probabilities of the two class values, using the probability density function of the von Mises distribution (1):

$$ \begin{aligned} &p(C=1)\frac{1}{2 \pi I_0(\kappa_{\varPhi|1})} \exp(\kappa_{\varPhi|1} \cos{(\phi - \mu_{\varPhi|1})})\\ &\quad = p(C=2)\frac{1}{2 \pi I_0(\kappa_{\varPhi|2})} \exp(\kappa_{\varPhi|2} \cos{(\phi - \mu_{\varPhi|2})}). \end{aligned} $$

Cancel the constant 2π, take logarithms and collect all terms on the same side of the equation:

$$ \begin{aligned} &\kappa_{\varPhi|1}\cos(\phi - \mu_{\varPhi|1}) - \kappa_{\varPhi|2}\cos(\phi - \mu_{\varPhi|2})\\ &\quad + \ln{\frac{p(C=1)}{I_0(\kappa_{\varPhi|1})}} - \ln{\frac{p(C=2)}{I_0(\kappa_{\varPhi|2})}} = 0. \end{aligned} $$

Substitute \(\cos(\beta-\gamma) = \cos(\beta)\cos(\gamma)+\sin(\beta)\sin(\gamma)\) and combine the logarithms:

$$ \begin{aligned} &\kappa_{\varPhi|1}\left[\cos{\phi}\cos{\mu_{\varPhi|1}}+\sin{\phi}\sin{\mu_{\varPhi|1}}\right]\\ &\quad -\kappa_{\varPhi|2}\left[\cos{\phi}\cos{\mu_{\varPhi|2}}+\sin{\phi}\sin{\mu_{\varPhi|2}}\right]\\ &\quad + \ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})}} = 0. \end{aligned} $$

Collect the terms in \(\cos{\phi}\) and \(\sin{\phi}\):

$$ \begin{aligned} &(\kappa_{\varPhi|1}\cos{\mu_{\varPhi|1}} - \kappa_{\varPhi|2}\cos{\mu_{\varPhi|2}})\cos{\phi}\\ &\quad + (\kappa_{\varPhi|1}\sin{\mu_{\varPhi|1}} - \kappa_{\varPhi|2}\sin{\mu_{\varPhi|2}})\sin{\phi}\\ &\quad + \ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})}} = 0. \end{aligned} $$

Substitute

$$ \begin{aligned} a &= \kappa_{\varPhi|1}\cos{\mu_{\varPhi|1}} - \kappa_{\varPhi|2}\cos{\mu_{\varPhi|2}},\\ b &= \kappa_{\varPhi|1}\sin{\mu_{\varPhi|1}} - \kappa_{\varPhi|2}\sin{\mu_{\varPhi|2}},\\ D &= -\ln\frac{p(C=1)I_0(\kappa_{\varPhi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})}, \end{aligned} $$

and get:

$$ a\cos{\phi} + b\sin{\phi} = D. $$

Trigonometrically, this is equivalent to:

$$ T\cos(\phi-\alpha) = D, $$

where \(T=\sqrt{a^2+b^2}\), \(\cos{\alpha}=a/T\), \(\sin{\alpha}=b/T\), and \(\tan{\alpha}=b/a\). Solving for \(\phi\), we get:

$$ \begin{aligned} \phi' = \alpha + \arccos(D/T),\\ \phi'' = \alpha - \arccos(D/T). \end{aligned} $$

The NB classifier finds two angles that bound the class regions.
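
As a quick check of this derivation, the following Python sketch (ours, not taken from the paper) computes the two decision angles from the class priors and the class-conditional von Mises parameters; scipy.special.i0 evaluates \(I_0(\kappa)\).

import numpy as np
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def vm_nb_decision_angles(p1, p2, mu1, kappa1, mu2, kappa2):
    """Decision angles of the von Mises NB classifier with one predictive variable."""
    a = kappa1 * np.cos(mu1) - kappa2 * np.cos(mu2)
    b = kappa1 * np.sin(mu1) - kappa2 * np.sin(mu2)
    D = -np.log(p1 * i0(kappa2) / (p2 * i0(kappa1)))
    T = np.hypot(a, b)            # T = sqrt(a^2 + b^2)
    alpha = np.arctan2(b, a)      # cos(alpha) = a/T, sin(alpha) = b/T
    if abs(D) > T:                # the decision function never crosses zero:
        return None               # the same class is predicted on the whole circle
    delta = np.arccos(D / T)
    return alpha + delta, alpha - delta

# Equal concentrations and equiprobable classes (Case 1 below): the two angles are
# the bisector of the mean directions, pi/4, and its antipode, -3*pi/4 (i.e. 5*pi/4 mod 2*pi).
print(vm_nb_decision_angles(0.5, 0.5, mu1=0.0, kappa1=2.0, mu2=np.pi / 2, kappa2=2.0))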

1.2 Particular cases

We have also derived these angles for the particular cases where the conditional probability distributions share one of their parameters, assuming equiprobable classes. If the classes are not equiprobable, the prior probabilities only affect the value of D, shifting the decision angles so that the more likely class has a larger subregion.

  • Case 1: \(\kappa_{\varPhi|1} = \kappa_{\varPhi|2} = \kappa_\varPhi \hbox{ and } \mu_{\varPhi|1} \neq \mu_{\varPhi|2}. \) When the concentration parameter is the same in the two distributions, we have the following values for the constants:

    $$ \begin{aligned} a &= \kappa_\varPhi(\cos{\mu_{\varPhi|1}} - \cos{\mu_{\varPhi|2}}),\\ b &= \kappa_\varPhi(\sin{\mu_{\varPhi|1}} - \sin{\mu_{\varPhi|2}}),\\ D &= -\ln{\frac{p(C=1)I_0(\kappa_\varPhi)}{p(C=2)I_0(\kappa_\varPhi)}} = - \ln{1} = 0. \end{aligned} $$

    Substituting in the expression of the arccosine, we get:

    $$ \arccos(D/T) = \arccos{0} = \pi/2. $$

    To compute α, we take the trigonometric identities:

    $$ \begin{aligned} \cos\beta - \cos\gamma &= -2\sin\left(\frac{1}{2}(\beta + \gamma)\right)\sin\left(\frac{1}{2}(\beta - \gamma)\right),\\ \sin\beta - \sin\gamma &= 2\sin\left(\frac{1}{2}(\beta - \gamma)\right)\cos\left(\frac{1}{2}(\beta + \gamma)\right), \end{aligned} $$

    which we substitute in the following expression:

    $$ \begin{aligned} \tan\alpha &= \frac{b}{a} = \frac{\kappa_\varPhi(\sin{\mu_{\varPhi|1}} - \sin{\mu_{\varPhi|2}})}{\kappa_\varPhi(\cos{\mu_{\varPhi|1}} - \cos{\mu_{\varPhi|2}})}\\ &= \frac{2\sin(\frac{1}{2}(\mu_{\varPhi|1}-\mu_{\varPhi|2}))\cos(\frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}))}{-2\sin(\frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}))\sin(\frac{1}{2}(\mu_{\varPhi|1}-\mu_{\varPhi|2}))}\\ &= -\frac{\cos(\frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}))}{\sin(\frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}))}\\ &= -\cot\left(\frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2})\right)\\ &= \tan\left(\frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}) + \frac{\pi}{2}\right),\\ \alpha &= \frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}) + \frac{\pi}{2}. \end{aligned} $$

    Now we can compute the decision angles found by the classifier:

    $$ \phi = \alpha \pm \arccos(D/T) = \frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}) + \frac{\pi}{2} \pm \frac{\pi}{2}. $$

    The two decision angles are:

    $$ \begin{aligned} \phi' &= \frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}),\\ \phi'' &= \frac{1}{2}(\mu_{\varPhi|1}+\mu_{\varPhi|2}) + \pi. \end{aligned} $$

    These two angles correspond to the bisector of the two mean directions and its antipodal angle.

  • Case 2: \(\kappa_{\varPhi|1} \neq \kappa_{\varPhi|2} \hbox{ and } \mu_{\varPhi|1} = \mu_{\varPhi|2} = \mu_\varPhi. \) In this scenario the mean directions are equal, so the constants reduce to (assuming, without loss of generality, that \(\kappa_{\varPhi|1} > \kappa_{\varPhi|2}\), so that \(T > 0\)):

    $$ \begin{aligned} a &= (\kappa_{\varPhi|1} - \kappa_{\varPhi|2})\cos{\mu_\varPhi},\\ b &= (\kappa_{\varPhi|1} - \kappa_{\varPhi|2})\sin{\mu_\varPhi},\\ D &= -\ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})}} = -\ln{\frac{I_0(\kappa_{\varPhi|2})}{I_0(\kappa_{\varPhi|1})}},\\ T &= \sqrt{a^2 + b^2}\\ &= \sqrt{(\kappa_{\varPhi|1} - \kappa_{\varPhi|2})^2\cos^2{\mu_\varPhi} + (\kappa_{\varPhi|1} - \kappa_{\varPhi|2})^2\sin^2{\mu_\varPhi}}\\ &= \sqrt{(\kappa_{\varPhi|1} - \kappa_{\varPhi|2})^2(\cos^2{\mu_\varPhi} + \sin^2{\mu_\varPhi})}\\ &= \kappa_{\varPhi|1} - \kappa_{\varPhi|2}. \end{aligned} $$

We compute α by substituting in the expression:

$$ \begin{aligned} &\tan\alpha = \frac{b}{a} = \frac{(\kappa_{\varPhi|1} - \kappa_{\varPhi|2})\sin{\mu_\varPhi}}{(\kappa_{\varPhi|1} - \kappa_{\varPhi|2})\cos{\mu_\varPhi}} = \tan\mu_\varPhi,\\ & \alpha = \mu_\varPhi. \end{aligned} $$

Therefore, the resulting decision angles are given by:

$$ \begin{aligned} \phi &= \alpha \pm \arccos(D/T),\\ \phi' &= \mu_\varPhi + \arccos \frac{D}{\kappa_{\varPhi|1} - \kappa_{\varPhi|2}},\\ \phi'' &= \mu_\varPhi - \arccos \frac{D}{\kappa_{\varPhi|1} - \kappa_{\varPhi|2}}. \end{aligned} $$

Clearly, the two angles are defined with respect to the common mean direction, and their distance to that mean direction depends on the concentration parameter values.

1.3 vMNB with two predictive variables

In this scenario, we have two circular predictive variables \(\varPhi\) and \(\Psi\). The domain defined by these variables is the torus \((-\pi, \pi] \times (-\pi, \pi]\). As in the simpler case above, we compute the decision surfaces induced by the classifier by equating the posterior probabilities of the two class values

$$ p(C = 1 | \varPhi = \phi, \Psi = \psi) = p(C = 2 | \varPhi = \phi, \Psi = \psi). $$

Using Bayes’ rule and the conditional independence assumption, we get

$$ \begin{aligned} & p(C = 1) f_{\varPhi|C=1}(\phi;\mu_{\varPhi|1},\kappa_{\varPhi|1}) f_{\Uppsi|C=1}(\psi;\mu_{\Uppsi|1},\kappa_{\Uppsi|1})\\ &\quad = p(C = 2) f_{\varPhi|C=2}(\phi;\mu_{\varPhi|2},\kappa_{\varPhi|2}) f_{\Uppsi|C=2}(\psi;\mu_{\Uppsi|2},\kappa_{\Uppsi|2}). \end{aligned} $$

We substitute the von Mises density (1) and get:

$$ \begin{aligned} & p(C = 1) \frac{\exp(\kappa_{\varPhi|1} \cos(\phi-\mu_{\varPhi|1}))}{2 \pi I_0(\kappa_{\varPhi|1})} \frac{\exp(\kappa_{\Uppsi|1} \cos(\psi-\mu_{\Uppsi|1}))}{2 \pi I_0(\kappa_{\Uppsi|1})}\\ &\quad =p(C = 2) \frac{\exp(\kappa_{\varPhi|2} \cos(\phi-\mu_{\varPhi|2}))}{2 \pi I_0(\kappa_{\varPhi|2})} \frac{\exp(\kappa_{\Uppsi|2} \cos(\psi-\mu_{\Uppsi|2}))}{2 \pi I_0(\kappa_{\Uppsi|2})}. \end{aligned} $$

We simplify the constant 2π, take logarithms and arrange all the terms on the same side of the equation:

$$ \begin{aligned} & \kappa_{\varPhi|1} \cos(\phi-\mu_{\varPhi|1}) + \kappa_{\Uppsi|1} \cos(\psi-\mu_{\Uppsi|1})\\ &\quad - \kappa_{\varPhi|2} \cos(\phi-\mu_{\varPhi|2}) - \kappa_{\Uppsi|2} \cos(\psi-\mu_{\Uppsi|2})\\ &\quad + \ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})I_0(\kappa_{\Uppsi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})I_0(\kappa_{\Uppsi|1})}} = 0. \end{aligned} $$

We substitute the trigonometric identity \(\cos(\beta-\gamma) = \cos(\beta)\cos(\gamma)+\sin(\beta)\sin(\gamma)\) and arrange the terms:

$$ \begin{aligned} &(\kappa_{\varPhi|1} \cos{\mu_{\varPhi|1}} - \kappa_{\varPhi|2} \cos{\mu_{\varPhi|2}})\cos{\phi} \\& \quad + (\kappa_{\varPhi|1} \sin{\mu_{\varPhi|1}} - \kappa_{\varPhi|2} \sin{\mu_{\varPhi|2}})\sin{\phi} \\ &\quad + (\kappa_{\Uppsi|1} \cos{\mu_{\Uppsi|1}} - \kappa_{\Uppsi|2} \cos{\mu_{\Uppsi|2}})\cos{\psi} \\ &\quad + (\kappa_{\Uppsi|1} \sin{\mu_{\Uppsi|1}} - \kappa_{\Uppsi|2} \sin{\mu_{\Uppsi|2}})\sin{\psi} \\ &\quad + \ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})I_0(\kappa_{\Uppsi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})I_0(\kappa_{\Uppsi|1})}} = 0. \end{aligned} $$

We define the following constants:

$$ \begin{aligned} a &= \kappa_{\varPhi|1} \cos{\mu_{\varPhi|1}} - \kappa_{\varPhi|2} \cos{\mu_{\varPhi|2}},\\ b &= \kappa_{\varPhi|1} \sin{\mu_{\varPhi|1}} - \kappa_{\varPhi|2} \sin{\mu_{\varPhi|2}},\\ c &= \kappa_{\Uppsi|1} \cos{\mu_{\Uppsi|1}} - \kappa_{\Uppsi|2} \cos{\mu_{\Uppsi|2}},\\ d &= \kappa_{\Uppsi|1} \sin{\mu_{\Uppsi|1}} - \kappa_{\Uppsi|2} \sin{\mu_{\Uppsi|2}},\\ D &= -\ln{\frac{p(C=1)I_0(\kappa_{\varPhi|2})I_0(\kappa_{\Uppsi|2})}{p(C=2)I_0(\kappa_{\varPhi|1})I_0(\kappa_{\Uppsi|1})}}, \end{aligned} $$

and substitute them to get

$$ a\cos\phi + b\sin\phi + c\cos\psi + d\sin\psi = D. $$

The Cartesian coordinates of the points defined by the angles ϕ and ψ on the surface of a torus are

$$ \begin{aligned} x &= (L + l\cos\phi)\cos\psi,\\ y &= (L + l\cos\phi)\sin\psi,\\ z &= l\sin\phi, \end{aligned} $$

where L is the distance from the center of the torus to the center of the circle that is revolved to generate the torus, and l is the radius of that revolving circle. Isolating the trigonometric functions, we get

$$ \begin{aligned} \sin\phi &= z/l,\\ \cos\phi &= \pm \sqrt{1 - \sin^2\phi} = \pm \sqrt{1 - \left(\frac{z}{l}\right)^2} = \pm \frac{1}{l}\sqrt{l^2 - z^2},\\ \sin\psi &= \frac{y}{L + l\cos\phi},\\ \cos\psi &= \frac{x}{L + l\cos\phi}. \end{aligned} $$

Substituting these expressions, we get the following two equations, corresponding to the two signs of \(\cos{\phi}\):

$$ \begin{aligned} &\frac{a}{l}\sqrt{l^2 - z^2} + \frac{b}{l}z + \frac{c}{L+\sqrt{l^2 - z^2}}x\\ &\quad + \frac{d}{L+\sqrt{l^2 - z^2}}y - D = 0, \end{aligned} $$
$$ \begin{aligned} & -\frac{a}{l}\sqrt{l^2 - z^2} + \frac{b}{l}z + \frac{c}{L-\sqrt{l^2 - z^2}}x\\ & \quad + \frac{d}{L-\sqrt{l^2 - z^2}}y - D = 0. \end{aligned} $$

Operating and arranging the terms, we get

$$ \begin{aligned} &clx + dly -az^2 + bz\sqrt{l^2 - z^2} + bLz\\ &\quad + (aL - Dl)\sqrt{l^2 - z^2} + al^2 - DLl = 0,\\ & clx + dly -az^2 - bz\sqrt{l^2 - z^2} + bLz\\ &\quad - (aL - Dl)\sqrt{l^2 - z^2} + al^2 - DLl = 0. \end{aligned} $$

These expressions are quadratic in z. Therefore, we conclude that von Mises NB with two predictive variables is a much more complex and flexible classifier than von Mises NB with one predictive variable.
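
The decision surfaces can also be inspected numerically in the flat \((\phi, \psi)\) parameterization of the torus. The following sketch (ours, with illustrative parameter values) evaluates the decision function \(g(\phi,\psi) = a\cos\phi + b\sin\phi + c\cos\psi + d\sin\psi - D\) on a grid; \(g > 0\) selects class 1 and \(g = 0\) traces the decision curves.

import numpy as np
from scipy.special import i0

def vm_nb_2d_decision_function(p, mu_phi, kappa_phi, mu_psi, kappa_psi):
    """Return g(phi, psi); its sign gives the predicted class, g = 0 the boundary."""
    a = kappa_phi[0] * np.cos(mu_phi[0]) - kappa_phi[1] * np.cos(mu_phi[1])
    b = kappa_phi[0] * np.sin(mu_phi[0]) - kappa_phi[1] * np.sin(mu_phi[1])
    c = kappa_psi[0] * np.cos(mu_psi[0]) - kappa_psi[1] * np.cos(mu_psi[1])
    d = kappa_psi[0] * np.sin(mu_psi[0]) - kappa_psi[1] * np.sin(mu_psi[1])
    D = -np.log(p[0] * i0(kappa_phi[1]) * i0(kappa_psi[1])
                / (p[1] * i0(kappa_phi[0]) * i0(kappa_psi[0])))
    return lambda phi, psi: (a * np.cos(phi) + b * np.sin(phi)
                             + c * np.cos(psi) + d * np.sin(psi) - D)

# Illustrative class-conditional parameters (index 0 = class 1, index 1 = class 2).
g = vm_nb_2d_decision_function(p=[0.5, 0.5],
                               mu_phi=[0.0, np.pi], kappa_phi=[2.0, 1.0],
                               mu_psi=[np.pi / 2, -np.pi / 2], kappa_psi=[1.0, 3.0])
phi, psi = np.meshgrid(np.linspace(-np.pi, np.pi, 181), np.linspace(-np.pi, np.pi, 181))
predicted_class = np.where(g(phi, psi) > 0, 1, 2)   # class label at each grid point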

Appendix 2: von Mises–Fisher NB classifier decision function

To study the decision function of the von Mises–Fisher NB classifier, we proceed as in Appendix 1. We equate the posterior probabilities of the class values using the von Mises–Fisher probability density function:

$$ \begin{aligned} r({{\mathbf{X}}}) &= 0 \Leftrightarrow p(C=1)\frac{(\kappa_{{{\mathbf{X}}}|1})^{\frac{n}{2}-1}}{\sqrt{(2\pi)^n} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|1})} \exp(\kappa_{{{\mathbf{X}}}|1} {\boldsymbol{\mu}}_{{{\mathbf{X}}}|1}^{\rm T} {{\mathbf{X}}})\\ &= p(C=2)\frac{(\kappa_{{{\mathbf{X}}}|2})^{\frac{n}{2}-1}}{\sqrt{(2\pi)^n} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|2})} \exp(\kappa_{{{\mathbf{X}}}|2} {\boldsymbol{\mu}}_{{{\mathbf{X}}}|2}^{\rm T} {{\mathbf{X}}}). \end{aligned} $$

Simplify the constants and take logarithms:

$$ \begin{aligned} &\ln{\frac{p(C=1)(\kappa_{{{\mathbf{X}}}|1})^{\frac{n}{2}-1}}{I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|1})}} + \kappa_{{{\mathbf{X}}}|1} {\boldsymbol{\mu}}_{{{\mathbf{X}}}|1}^{\rm T} {{\mathbf{X}}} \\ &\quad =\ln{\frac{p(C=2)(\kappa_{{{\mathbf{X}}}|2})^{\frac{n}{2}-1}}{I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|2})}} + \kappa_{{{\mathbf{X}}}|2} {\boldsymbol{\mu}}_{{{\mathbf{X}}}|2}^{\rm T} {{\mathbf{X}}}. \end{aligned} $$

We arrange all the terms on the same side of the equation and combine the logarithms to get the following hyperplane equation:

$$ \begin{aligned} &(\kappa_{{{\mathbf{X}}}|1} {\boldsymbol{\mu}}_{{{\mathbf{X}}}|1} - \kappa_{{{\mathbf{X}}}|2} {\boldsymbol{\mu}}_{{{\mathbf{X}}}|2})^{\rm T} {{\mathbf{X}}}\\ &\quad + \ln{\frac{p(C=1)(\kappa_{{{\mathbf{X}}}|1})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|2})} {p(C=2)(\kappa_{{{\mathbf{X}}}|2})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|1})}} = 0. \end{aligned} $$
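
In other words, the decision boundary is a hyperplane with normal vector \(\kappa_{{\mathbf{X}}|1}{\boldsymbol{\mu}}_{{\mathbf{X}}|1} - \kappa_{{\mathbf{X}}|2}{\boldsymbol{\mu}}_{{\mathbf{X}}|2}\) and an intercept given by the logarithmic term. The following sketch (ours, not the authors' code) builds this hyperplane and classifies a unit vector by the sign of the decision function; scipy.special.iv evaluates \(I_{\frac{n}{2}-1}(\kappa)\).

import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind, order v

def vmf_nb_hyperplane(p1, p2, mu1, kappa1, mu2, kappa2):
    """Normal vector w and intercept w0 of the von Mises-Fisher NB decision hyperplane."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    v = mu1.size / 2.0 - 1.0                       # Bessel order n/2 - 1
    w = kappa1 * mu1 - kappa2 * mu2
    w0 = np.log(p1 * kappa1 ** v * iv(v, kappa2) / (p2 * kappa2 ** v * iv(v, kappa1)))
    return w, w0

def predict(x, w, w0):
    """Class 1 if the unit vector x lies on the positive side of the hyperplane."""
    return 1 if np.dot(w, x) + w0 > 0 else 2

# Example on the sphere (n = 3): equal concentrations and equiprobable classes give
# w0 = 0, so the boundary is the hyperplane through the origin that bisects the
# two mean directions (Case 1 below).
w, w0 = vmf_nb_hyperplane(0.5, 0.5, mu1=[1.0, 0.0, 0.0], kappa1=5.0,
                          mu2=[0.0, 1.0, 0.0], kappa2=5.0)
print(predict(np.array([0.8, 0.6, 0.0]), w, w0))   # closer to mu1, so class 1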

2.1 Particular cases

Considering that both class values have the same prior probability and that one of the parameters has the same value in both distributions, Case 1 and Case 2 can be simplified as follows. When the prior probabilities differ, the hyperplane moves away from the mean direction of the more likely class value, enlarging that class's subregion.

  • Case 1: \(\kappa_{{\mathbf{X}}|1} = \kappa_{{\mathbf{X}}|2} = \kappa_{\mathbf{X}} \hbox{ and } {\boldsymbol{\mu}}_{{\mathbf{X}}|1} \neq {\boldsymbol{\mu}}_{{\mathbf{X}}|2}.\) When the distributions share the concentration parameter, we get the expression:

    $$ (\kappa_{{\mathbf{X}}} {{\boldsymbol{\mu}}}_{{{\mathbf{X}}}|1} - \kappa_{{\mathbf{X}}} {{\boldsymbol{\mu}}}_{{{\mathbf{X}}}|2})^{\rm T} {{\mathbf{X}}} + \ln{\frac{p(C=1)\kappa_{{\mathbf{X}}}^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{\mathbf{X}}})} {p(C=2)\kappa_{{\mathbf{X}}}^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{\mathbf{X}}})}} = 0. $$

    The logarithm vanishes, and we can factor out \(\kappa_{\mathbf{X}}\):

    $$ \kappa_{{\mathbf{X}}} ({\boldsymbol{\mu}}_{{{\mathbf{X}}}|1} - {\boldsymbol{\mu}}_{{{\mathbf{X}}}|2})^{\rm T} {{\mathbf{X}}} = 0. $$

    Therefore, given that \(\kappa_{\mathbf{X}} > 0\) (\(\kappa_{\mathbf{X}} = 0\) would yield the uniform distribution on the hypersphere), the hyperplane equation reduces to:

    $$ ({\boldsymbol{\mu}}_{{{\mathbf{X}}}|1} - {\boldsymbol{\mu}}_{{{\mathbf{X}}}|2})^{\rm T} {{\mathbf{X}}} = 0. $$

    That equation specifies a hyperplane that contains the origin (\(\mathbf{0}\)) and passes through the midpoint of the arc that connects the points on the hypersphere defined by the mean directions \({\boldsymbol{\mu}}_{{\mathbf{X}}|1}\) and \({\boldsymbol{\mu}}_{{\mathbf{X}}|2}.\)

  • Case 2: \(\kappa_{{\mathbf{X}}|1} \neq \kappa_{{\mathbf{X}}|2} \hbox{ and } {\boldsymbol{\mu}}_{{\mathbf{X}}|1} = {\boldsymbol{\mu}}_{{\mathbf{X}}|2} = {\boldsymbol{\mu}}_{\mathbf{X}}.\) In the case where the mean directions have the same value, we can derive the following equation:

    $$ \begin{aligned} &(\kappa_{{{\mathbf{X}}}|1} {\boldsymbol{\mu}}_{{\mathbf{X}}}^{\rm T} - \kappa_{{{\mathbf{X}}}|2} {\boldsymbol{\mu}}_{{\mathbf{X}}}^{\rm T}) {{\mathbf{X}}}\\ &\quad + \ln{\frac{p(C=1)(\kappa_{{{\mathbf{X}}}|1})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|2})} {p(C=2)(\kappa_{{{\mathbf{X}}}|2})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|1})}} = 0. \end{aligned} $$

    We can take \({\boldsymbol{\mu}}_{\mathbf{X}}^{\rm T}\) as a common term:

    $$ (\kappa_{{{\mathbf{X}}}|1} - \kappa_{{{\mathbf{X}}}|2}) {\boldsymbol{\mu}}_{{\mathbf{X}}}^{\rm T} {{\mathbf{X}}} + \ln{\frac{(\kappa_{{{\mathbf{X}}}|1})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|2})} {(\kappa_{{{\mathbf{X}}}|2})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|1})}} = 0. $$

    Dividing by (\(\kappa_{{\mathbf{X}}|1} - \kappa_{{\mathbf{X}}|2}\)), we get:

    $$ {\boldsymbol{\mu}}_{{\mathbf{X}}}^{\rm T} {{\mathbf{X}}} + \frac{1}{\kappa_{{{\mathbf{X}}}|1} - \kappa_{{{\mathbf{X}}}|2}}\ln{\frac{(\kappa_{{{\mathbf{X}}}|1})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|2})} {(\kappa_{{{\mathbf{X}}}|2})^{\frac{n}{2}-1} I_{\frac{n}{2}-1}(\kappa_{{{\mathbf{X}}}|1})}} = 0. $$

The hyperplane defined by that equation is perpendicular to the shared mean direction vector \({\boldsymbol{\mu}}_{\mathbf{X}},\) and its distance from the origin is determined by the relationship between the two concentration parameters.

Appendix 3: Mutual information computation

The mutual information between two variables X and Y is defined as

$$ \begin{aligned} \hbox{MI}(X,Y) &= \int_X{\int_Y{\rho(x,y)\log{\frac{\rho(x,y)}{\rho(x)\rho(y)}}}}{\text{d}}x{\text{d}}y\\ &= {\mathbb{E}}_{(X,Y)}\left[\log\frac{\rho(x,y)}{\rho(x)\rho(y)}\right], \end{aligned} $$
(10)

where ρ is a generalized probability function.

In supervised classification problems, we have to estimate \(\hbox{MI}(X_i,C)\) from a set of data pairs \(\left(x_i^{(j)},c^{(j)}\right),j=1,\ldots,m. \) When \(X_i\) is a discrete variable, an estimator of the mutual information in (10) is given by

$$ \hbox{MI}(X_i,C) = \frac{1}{m}\sum_{j=1}^m{\log{\frac{\widehat{p}\left(x_i^{(j)},c^{(j)}\right)}{\widehat{p}\left(x_i^{(j)}\right)\widehat{p}\left(c^{(j)}\right)}}}, $$
(11)

where \(\widehat{p}\) are the probabilities estimated from the counts in the dataset.
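
A compact plug-in implementation of this estimator (ours, as a sketch) sums over the distinct value pairs instead of over the m samples, which is equivalent:

import numpy as np
from collections import Counter

def discrete_mutual_information(x, c):
    """Plug-in estimator of MI(X_i, C) from paired samples of a discrete X_i and the class C."""
    m = len(x)
    n_x, n_c, n_xc = Counter(x), Counter(c), Counter(zip(x, c))
    # Equivalent to averaging log[p(x_j, c_j) / (p(x_j) p(c_j))] over the m samples.
    return sum(n / m * np.log(n * m / (n_x[xi] * n_c[ci]))
               for (xi, ci), n in n_xc.items())

# Example: a perfectly informative binary predictor gives MI = ln 2 (about 0.693 nats).
print(discrete_mutual_information(["a", "a", "b", "b"], [1, 1, 2, 2]))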

When the predictive variable \(X_i\) is continuous, we take an approach consistent with conditional independence assumptions and we model the conditional probability densities of \(X_i | C = c\) as Gaussian or von Mises distributions, depending on the nature of the variable, i.e., linear or angular. Therefore, the marginal density of \(X_i\) is a mixture of Gaussian or von Mises distributions, respectively. Algorithm 1 shows the process for computing \(\hbox{MI}(X_i,C).\)

[Algorithm 1: computation of \(\hbox{MI}(X_i,C)\) when \(X_i\) is a continuous (linear or angular) predictive variable.]
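
Algorithm 1 itself is not reproduced here. As an illustration of the approach described above for an angular predictor, the following sketch (ours; scipy.stats.vonmises is used only for convenience, any von Mises estimator would do) fits a von Mises density per class, forms the marginal density as the prior-weighted mixture, and averages the log density ratio over the sample, i.e., the continuous analogue of (11).

import numpy as np
from scipy.stats import vonmises

def angular_mutual_information(x, c):
    """Estimate MI(X_i, C) for an angular X_i (in radians) and a discrete class C."""
    x, c = np.asarray(x, dtype=float), np.asarray(c)
    classes, counts = np.unique(c, return_counts=True)
    priors = counts / counts.sum()
    # Class-conditional von Mises densities f(x | C = k), fitted by maximum likelihood.
    params = {k: vonmises.fit(x[c == k], fscale=1)[:2] for k in classes}   # (kappa, mu)
    cond = np.array([vonmises.pdf(x, kappa, loc=mu)
                     for kappa, mu in (params[k] for k in classes)])       # shape (K, m)
    idx = {k: i for i, k in enumerate(classes)}
    f_own = cond[[idx[ci] for ci in c], np.arange(len(x))]                 # f(x_j | c_j)
    f_marg = priors @ cond                                                 # mixture marginal f(x_j)
    # Plug-in (Monte Carlo) estimate: average the log density ratio over the sample.
    return float(np.mean(np.log(f_own / f_marg)))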

Appendix 4: Dataset analysis and preprocessing

A thorough inspection of the datasets for supervised classification available in the UCI Machine Learning Repository [28] found only 5 out of 135 datasets containing some variable measured in angles (bottom half of Table 6). We found no reference to these directional data having been given special treatment. For this reason, we assume that they have been studied as linear continuous variables, without taking into account their special properties. We omitted the Breast Tissue dataset [39, 67] from the study because it was not clear whether the “PhaseAngle” variable really represents an angle and how it was measured. Additionally, another four datasets not included in the UCI repository were considered for evaluation (top half of Table 6). A description of the datasets used in this study follows:

Table 6 Datasets used in this study

UCI datasets

  • Australian Sign Language (Auslan): Identification of 95 Australian Sign Language signs using the position (x, y, z) and orientation angles (roll, pitch, yaw) of both hands [40]; therefore, 12 measurements are studied. According to [40], the bending measurements are not very reliable, so they were omitted as predictive variables. This is a time series classification problem: the position and orientation of the hands are measured at different times, yielding approximately 54 data frames for each sign. We resampled a set of 10 evenly distributed frames and used them as predictive variables. According to the description, there are 95 different signs (class values), and each sign is repeated 27 times. However, the her sign only appears three times, whereas the his-hers sign appears 24 times. Therefore, we have assumed that they are the same sign and have considered them all as his-hers signs.

  • MAGIC Gamma Telescope (MAGIC): Discrimination of the images of hadronic showers initiated by primary gammas from those caused by cosmic rays in the upper atmosphere [7]. The images of the hadronic showers captured by the telescope are preprocessed and modeled as ellipses, and the predictive variables describe the shape of these ellipses. The dataset includes one angular variable: the angle between the major axis of the ellipse and the vector connecting the center of the ellipse to the center of the camera.

  • Arrhythmia: Identification of the presence or absence of cardiac arrhythmia from electrocardiograms (ECG). The original dataset has 16 class values: one for healthy items, 14 types of cardiac arrhythmias and one for unclassified items [33]. We removed the unclassified items and built a binary class (normal vs. arrhythmia). The predictive variables describe clinical measurements, patient data and ECG recordings; the angular variables give the vector angles from the front plane of four ECG waves. We removed variable 14, which had more than 83% missing values, and used Weka's ReplaceMissingValues filter [34] to fill in the missing values of variables 11–13 and 15 with the mode. We also removed some non-informative discrete and continuous variables.

  • Covertype: Prediction of the kind of trees that grow in a specific area given attributes describing the geography of the land [6]. The two angular variables describe the aspect (orientation) of the land from true north and the slope of the ground. The original dataset has 581,012 samples; we used Weka's supervised resampling method (without replacement) to reduce the dataset to 100,000 samples.

4.1 Other datasets

 

  • Megaspores: Classification of megaspores into two classes (their group in the biological taxonomy) according to the angle of their wall elements [43]. The dataset is an example included in the Oriana software (see Note 2).

  • Protein1: Prediction of the secondary structure of a single amino acid (residue), using its dihedral angles (ϕ, ψ) as predictive information. We only considered α-helix and β-sheet structures, making the class binary. The data were retrieved from the Protein Geometry Database [5].

  • Protein10: Prediction of secondary structure using the dihedral angles (ϕ, ψ) and the planarity angle (ω) of ten consecutive residues, i.e., three angles per residue. We classified the four most common structures: α-helices, β-sheets, bends and turns. The data were retrieved from the Protein Geometry Database [5].

  • Temperature: Prediction of the outdoor temperature from the season, wind speed and wind direction. We used hourly measurements from a weather station located in the city of Houston. Data for the year 2010 were retrieved, and we removed the hours with missing values for any of the four variables. The information was collected from the Texas Commission on Environmental Quality website (see Note 3). The class variable (outdoor temperature) was measured in degrees Fahrenheit and discretized into three values: low (T ≤ 50), medium (50 < T < 70) and high (T ≥ 70).


Cite this article

López-Cruz, P.L., Bielza, C. & Larrañaga, P. Directional naive Bayes classifiers. Pattern Anal Applic 18, 225–246 (2015). https://doi.org/10.1007/s10044-013-0340-z
