Non-deterministic outlier detection method based on the variable precision rough set model

This study presents a method for the detection of outliers based on the Variable Precision Rough Set Model (VPRSM). The basis of this model is the generalisation of the standard concept of a set inclusion relation, on which the Rough Set Basic Model (RSBM) is based. The primary contribution of this study is the improvement in detection quality, which is achieved due to the generalisation allowed by a classification system that admits a certain degree of uncertainty. From this method, a computationally efficient algorithm is proposed. The experiments performed with a real scenario and a comparison of the results with the RSBM-based method demonstrate the effectiveness of the method as well as the algorithm's efficiency in diverse contexts, which also involve large amounts of data.

These concepts highlight that KDD-DM processes require increasingly efficient methods for the detection of outliers. In today's datasets, increasingly sophisticated data representation structures and forms of storage tend to appear. Therefore, work must be directed at obtaining effective detection models that respond to the challenges imposed by such particularities and by the use of new technologies in general.
The investigation of state-of-the-art techniques in this study has allowed us to identify the extent of the outlier detection problem based on its application in multiple contexts. Our conclusion is that its scope of application is wide and diverse. This diversity of application fields, in which the nature of the data and the contexts in which they are defined acquire different particularities, is perhaps one of the reasons that explain the wide variety of existing detection methods. Each method adjusts to the data and the contexts in which it will be applied; thus, it is challenging to conceive increasingly flexible detection methods that can be applied in different contexts.
With the goal of making outlier detection more efficient, researchers tend to apply new techniques. The Rough Set Basic Model (RSBM) proposed by Professor Z. Pawlak [2] in 1982 is based on a simple and solid mathematical basis: the equivalence relation theory, which describes partitions constituted by indiscernible types of objects. In recent years, this model has been successfully applied in diverse contexts. In [3], we proposed a method based on the RSBM that demonstrated the validity and potential of this model for the detection of outliers. However, it could also be confirmed that the RSBM only allows accurate classifications, and many problems generally require uncertainty to be admitted into a given classification along with having the capacity to generalise the conclusion obtained from more reduced datasets.
In this study, the initial hypothesis is that the Variable Precision Rough Set Model (VPRSM) [4] can provide a solution to the abovementioned problem. Relying on the non-deterministic character provided by the VPRSM and on the relaxation of the set inclusion concept, which allows the management of certain thresholds set by the user, we propose a new model in this study based on the VPRSM and create a new algorithm based on the algorithm presented in [3], which shows significant improvements in its generalisation and detection capacity while maintaining the spatial and temporal complexity levels that make it viable in practice.
The fundamental contribution of the RS theory is to facilitate classification analysis. The approximation, both upper and lower, becomes necessary because of the inability to establish complete classifications of objects that belong to a certain category with the knowledge available [65].
With a certain frequency, the information available only allows partial classifications to be made, and RS theory can be efficiently used to model this type of classification. However, from this theory, such a classification must be true [66], limiting the possibility of conceiving a classification with a controlled degree of uncertainty (i.e., the possibility that there is a certain error in the classification). This is not possible with the RSBM [...] data involved (U); thus, the RSBM is a deterministic model. However, in reality, there are multiple situations that require incorrect partial classifications to be considered. An incorrect partial classification rule also provides useful information and can establish the tendency of values if most of the available data to which the rule is applied can be correctly classified.
The primary objective of this study is to improve the method based on the RSBM [3] with the creation of a non-deterministic outlier detection method based on the VPRSM. This new method must remain computationally feasible, for which we conceive an algorithm that allows us to validate it. The starting hypothesis is that the VPRSM model broadens the application of the original method, which is based on the RSBM, to contexts in which a classification with a certain degree of uncertainty is required.

Detection method based on the VPRSM properties
The VPRSM is a generalisation of the RSBM [68] [69] and is derived from the RSBM without assuming anything additional.
The essence of the VPRSM model is given by the generalisation of the standard concept of the set inclusion relation [77].
This concept is too rigorous for representing a nearly complete set inclusion. Based on an extended concept for this relation defined in the VPRSM model [78], a certain degree of error is allowed to be established or foreseen.
In this section, we construct the proposed outlier detection method as we present and analyse the mathematical tools provided by the VPRSM model [4].
It becomes evident from the definition of the standard inclusion relation (see Definition 1) that there is no possibility of contemplating any type of declassification.
Definition 1 - Standard inclusion relation: Let U be a finite universe of objects and X, Y ⊆ U; X ≠ ∅ and Y ≠ ∅. Then, X is included in Y if and only if every object of X belongs to Y:

$$X \subseteq Y \Leftrightarrow \forall x \in X: x \in Y$$

The first step to overcome the limitations imposed by the RSBM consists of breaking free of the need to explicitly define the universal quantifier. The "measure of the degree of declassification" (see Definition 2) proposed in the VPRSM makes this possible.
Definition 2 - Measure of the degree of declassification: The measure of the degree of declassification relative to the set X with respect to set Y, c(X, Y), is the existing relative error when classifying a set of objects and is defined as:

$$c(X, Y) = 1 - \frac{|X \cap Y|}{|X|}, \quad X \neq \emptyset$$

This definition is evident because it can be observed that [...]. The numerical expression c(X, Y) is indicative of the relative classification error. The product c(X, Y)·|X| indicates the absolute classification error (i.e., the number of misclassified objects).

If the measure of relative declassification is used as a reference, the inclusion relation can be defined to obviate the need to explicitly set the general quantifier as follows: $X \subseteq Y \Leftrightarrow c(X, Y) = 0$.

Based on this definition, c(X, Y) can have values greater than 0 without being too high when the relation represents a majority. Thus, a majority of the objects of X must be classified in Y. The concept of the majority imposes the setting of a threshold, and in such a case, it is assumed that the majority implies that more than 50% of the elements of X should be common with Y. Thus, the specification of an admissible threshold of error in the classification is added to the definition of the inclusion relation [78].

Definition 3 - Majority inclusion relation: Let U be a finite universe of objects; 0 ≤ β < 0.5, where β is the admissible declassification error; and X, Y ⊆ U, X ≠ ∅, Y ≠ ∅. Then, X is said to be primarily included in Y, or X is included in Y with a β-error, $X \subseteq_\beta Y$, if and only if c(X, Y) ≤ β. From the same definition, it can be shown that β = 0 expresses a standard inclusion relation, which is called the total inclusion in this model.
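As an illustration of Definitions 2 and 3, the following minimal Python sketch computes the declassification measure and tests the majority inclusion relation on two small sets. It takes c(X, Y) in its usual VPRSM form, 1 - |X ∩ Y| / |X|. The function names `declassification` and `majority_included`, the example sets, and the convention of returning 0 for an empty X are assumptions made for this sketch, not part of the original algorithm.

```python
def declassification(x: set, y: set) -> float:
    """Relative classification error c(X, Y) = 1 - |X ∩ Y| / |X|.

    Returning 0.0 for an empty X is a convention assumed for this sketch.
    """
    if not x:
        return 0.0
    return 1.0 - len(x & y) / len(x)


def majority_included(x: set, y: set, beta: float = 0.0) -> bool:
    """X is included in Y with a beta-error iff c(X, Y) <= beta, 0 <= beta < 0.5."""
    if not 0.0 <= beta < 0.5:
        raise ValueError("beta must satisfy 0 <= beta < 0.5")
    return declassification(x, y) <= beta


# Hypothetical sets: four of the five objects of X also belong to Y.
X = {1, 2, 3, 4, 5}
Y = {1, 2, 3, 4, 9}

print(declassification(X, Y))          # relative error, close to 0.2
print(majority_included(X, Y, 0.0))    # standard (total) inclusion
print(majority_included(X, Y, 0.25))   # inclusion with a 0.25-error
```

With β = 0 the test reduces to the standard inclusion relation, so a single misclassified object is enough to reject the inclusion; with β = 0.25 the same pair of sets is accepted.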
From the new definition of the inclusion relation, the most representative concepts of the RSBM can be redefined as follows. Additionally, note that the β-negative region of X is the union of all the equivalence classes that can be classified within X' with a classification error not higher than β.
Considering that when β = 0, the standard RS model is a particular case of the VPRSM, the following proposition can be established, where other relations that are also fulfilled are expressed.
Proposition 5:
a) $\underline{X} \subseteq \underline{X}_\beta$: the lower approximation is a subset of the β-lower approximation.
b) $\overline{X}_\beta \subseteq \overline{X}$: the β-upper approximation is a subset of the upper approximation.
c) $BN_\beta \subseteq BN$: the β-boundary region is a subset of the boundary region.
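The three relations of the proposition can be checked numerically on a toy partition. The sketch below assumes the usual VPRSM region definitions: an equivalence class is β-positive when c(class, X) ≤ β, β-negative when c(class, X') ≤ β, and β-boundary otherwise, with the β-upper approximation taken as the union of the β-positive and β-boundary regions. The function name `beta_regions` and the data are hypothetical.

```python
from fractions import Fraction


def beta_regions(classes, concept, beta=Fraction(0)):
    """Return the (positive, boundary, negative) regions of `concept` for error `beta`."""
    positive, boundary, negative = set(), set(), set()
    for cls in classes:  # each cls is a non-empty equivalence class
        error = 1 - Fraction(len(cls & concept), len(cls))  # c(cls, concept)
        if error <= beta:
            positive |= cls
        elif 1 - error <= beta:  # c(cls, complement of concept) <= beta
            negative |= cls
        else:
            boundary |= cls
    return positive, boundary, negative


# Hypothetical partition of a 16-element universe and a concept X.
classes = [{1, 2, 3, 4, 5}, {6, 7, 8, 9}, {10, 11}, {12, 13, 14, 15, 16}]
X = {1, 2, 3, 4, 6, 7, 12}

pos0, bnd0, neg0 = beta_regions(classes, X)                  # beta = 0 (RSBM)
posb, bndb, negb = beta_regions(classes, X, Fraction(1, 4))  # beta = 0.25 (VPRSM)

assert pos0 <= posb                    # a) lower ⊆ beta-lower
assert (posb | bndb) <= (pos0 | bnd0)  # b) beta-upper ⊆ upper
assert bndb <= bnd0                    # c) beta-boundary ⊆ boundary
print(sorted(posb), sorted(bndb), sorted(negb))
```

Exact fractions are used instead of floats so that comparisons against the threshold are not affected by rounding when an error equals β exactly.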

Outlier detection algorithm
From the method proposed in the previous section, an algorithm must be built that can improve the detection quality and provide a wider range of applications while maintaining the spatial and temporal complexity levels obtained to date. Such a method would ensure its own viability in real environments where large amounts of data must be considered.

For the design of this new algorithm, we have started from the RSBM algorithm, which has already been tested and validated [3]. Using the theoretical framework provided by the VPRSM to implement the proposed method, we have modified the calculation of the significant regions of the original algorithm, particularly with regard to the determination of the β-inner boundaries (B_i, 1 ≤ i ≤ m). As already noted, in such a model, a certain β-error is allowed in the classification, which objectively translates into relaxing the inclusion relations when establishing the significant regions of the model in the analysis framework. Thus, the possibility of a nearly complete classification is given by relaxing its deterministic character based on the RSBM conception.

The β-error is added to the inputs of the algorithm implemented for the RSBM; therefore, the inputs for the VPRSM-based algorithm include the following parameters: the universe U, the concept C (represented by variable X in the algorithm), the criteria that distinguish the equivalence relations considered in the analysis (r_i, 1 ≤ i ≤ m), the established detection threshold value μ, and the β-error. The same data structures described for the RSBM-based algorithm [3] are maintained. The fundamental data structure used in the algorithm is the dictionary, which contains a set of pairs (i.e., keys and values), where the key is an arbitrary object to which one and only one object of the value type is associated. In the algorithm, keys are described by the results of applying a classifier to an arbitrary element of the universe. Such a classifier is associated with a particular equivalence relation.
Following the strategy of the original algorithm, the new algorithm is composed of two stages: the formation of the β-inner boundaries and an outlier detection process. In the following, each of these stages is shown and analysed using its pseudocode.

Stage 1 - Formation of the β-inner boundaries (BUILD.REGIONS): The temporal complexity of this stage is O(n·m·c), where c is the cost of classifying each element, n is the cardinality of the universe, and m is the number of equivalence relations considered in the analysis.
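The first stage can be sketched in Python as follows. This is not the authors' pseudocode, which is not reproduced here, but a minimal reading of the description: for each equivalence relation, a dictionary keyed by the classifier result groups the universe into classes, and the elements that meet the concept inside β-boundary classes form the inner boundary. The names `Patient` and `build_inner_boundaries`, the boundary test `beta < error < 1 - beta`, and the ten patients are assumptions for illustration; the data are not those of Table 1.

```python
from collections import defaultdict
from typing import NamedTuple


class Patient(NamedTuple):
    pid: int
    temperature: str
    headache: bool
    flu: bool


def build_inner_boundaries(universe, in_concept, classifiers, beta=0.0):
    """Return one beta-inner boundary (a set of elements) per classifier."""
    boundaries = []
    for classify in classifiers:             # m equivalence relations
        classes = defaultdict(list)          # key: classifier result, value: class members
        for item in universe:                # n elements, one classification (cost c) each
            classes[classify(item)].append(item)
        inner = set()
        for members in classes.values():
            hits = [e for e in members if in_concept(e)]
            error = 1 - len(hits) / len(members)  # c(class, concept)
            if beta < error < 1 - beta:           # class lies in the beta-boundary
                inner.update(hits)                # keep only elements that meet the concept
        boundaries.append(inner)
    return boundaries


# Hypothetical data, not the patients of Table 1.
patients = [
    Patient(1, "normal", False, False),
    Patient(2, "normal", True, False),
    Patient(3, "normal", False, True),
    Patient(4, "high", True, True),
    Patient(5, "high", True, True),
    Patient(6, "high", False, True),
    Patient(7, "high", True, True),
    Patient(8, "high", False, False),
    Patient(9, "very_high", True, True),
    Patient(10, "very_high", True, True),
]
classifiers = [lambda p: p.temperature, lambda p: p.headache]

for beta in (0.0, 0.25):
    result = build_inner_boundaries(patients, lambda p: p.flu, classifiers, beta)
    print(beta, [sorted(p.pid for p in b) for b in result])
```

Each element is classified once per relation and dictionary operations take constant time on average, so the sketch stays within the O(n·m·c) order stated above. In this toy data the "high" temperature class has 80% of its members in the concept, so it contributes to the inner boundary when β = 0 and leaves it when β = 0.25.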

Stage 2 - Outlier detection process: The set that contains all the elements that meet the concept and can be outlier candidates is built. From this set, all elements with a degree of exceptionality greater than the established detection threshold μ are classified as outliers.
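A possible reading of this second stage is sketched below. The degree of exceptionality is not defined in this section, so the sketch assumes it is the fraction of the m inner boundaries that contain the element. The text says "greater than" μ, yet outliers are still reported for μ = 1.0 in the experiments, so a non-strict comparison (degree ≥ μ) is assumed here. The function name `detect_outliers` and the example boundaries are hypothetical.

```python
from collections import Counter


def detect_outliers(inner_boundaries, mu):
    """Return (outliers, degrees) for a list of inner boundaries and a threshold mu."""
    m = len(inner_boundaries)
    # Candidates are the elements that appear in at least one inner boundary.
    counts = Counter(e for boundary in inner_boundaries for e in boundary)
    # Assumed degree of exceptionality: share of the m boundaries that contain e.
    degrees = {e: hits / m for e, hits in counts.items()}
    outliers = {e for e, degree in degrees.items() if degree >= mu}
    return outliers, degrees


# Hypothetical inner boundaries for m = 2 relations (elements are plain ids).
boundaries = [{3}, {3, 6}]
print(detect_outliers(boundaries, mu=1.0)[0])
print(detect_outliers(boundaries, mu=0.5)[0])
```

Raising μ keeps only the elements that are contradictory with respect to more relations, which is the refinement effect described in the experiments.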
With regard to the spatial complexity, the same order is also [...]

Computer Systems Science & Engineering

Figure 3. Representative regions for X corresponding to the RSBM [...]

By Definition 4a: class $\subseteq_\beta$ X $\Rightarrow$ class $\in \underline{X}_\beta$

Example of outlier detection in a data set by the VPRSM algorithm

The operation of the proposed algorithm is shown using an example that highlights the way in which the significant regions vary when a certain β-error is allowed; this example also describes how the classification is relaxed. In Section 4, the test and validation of the proposal will be addressed with a real dataset.
A universe U that represents 25 patients is considered (Table 1). In this table, a diagnosis is established for whether each patient suffers from flu or not as a function of the patient's temperature and of the presence of a headache or not.
Two criteria are defined, where each divides U into a determined number of equivalence classes: [...] with respect to r_1. The elements of both classes that fulfil C are those that make up the inner boundary. Fig. 6-b shows how the classification is made when β = 0, which is equivalent to the RSBM, and when allowing a declassification error of β = 0.25 (i.e., the VPRSM). Note that for r_1, none of the boundaries change even if the value of β varies.

However, when analysing what occurs with regard to r_2, it is observed that the introduction of a classification error can vary the boundary elements. The relation r_2 produces 3 equivalence classes for the universe U (Fig. 7-a). In equivalence class 2, 80% of the elements belong to the concept C. When the boundaries are built with β = 0, equivalence class 2 is within the boundaries between the elements that belong to the concept and those that do not. This occurs because there are elements in equivalence class 2 (the remaining 20%) that do not belong to the concept because they are not patients with flu. However, when a classification error is introduced (i.e., β = 0.25), equivalence class 2 enters the positive region because 80% of its elements belong to the concept (Fig. 7-b). This fact makes sense because equivalence class 2 can be considered to be positive with a degree of error of β = 0.25 if many elements of the class meet the concept C. As shown in the example, the introduction of an error in the classification of the elements that are or are not part of the concept can relax the relation definition and classify the elements with a certain margin of error.
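The change of region for equivalence class 2 follows directly from the declassification measure. The short derivation below assumes the usual form c(E, C) = 1 - |E ∩ C| / |E|; the symbol E_2 for equivalence class 2 and the overline for the complement of the concept are introduced here only for illustration.

```latex
\begin{align*}
c(E_2, C) &= 1 - \frac{|E_2 \cap C|}{|E_2|} = 1 - 0.80 = 0.20 \\
\beta = 0:&\quad c(E_2, C) = 0.20 > 0 \ \text{and}\ c(E_2, \overline{C}) = 0.80 > 0
  \;\Rightarrow\; E_2 \text{ lies in the boundary} \\
\beta = 0.25:&\quad c(E_2, C) = 0.20 \le 0.25
  \;\Rightarrow\; E_2 \text{ lies in the positive region}
\end{align*}
```

Any β between 0.20 and the admissible maximum (below 0.5) would give the same result for this class, while any β below 0.20 would leave it in the boundary.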

VALIDATION OF RESULTS
The fundamental objective of the experiments in this study is to validate the proposed hypothesis that the incorporation of the precision variable into the proposed outlier detection algorithm improves the results. However, given the large amounts of data with which work is typically performed for this type of problem, another objective of the tests is to verify that the temporal complexity of the algorithm remains linear in practice.
We will still incorporate an additional objective into the proposed test, where the obtained results can be contrasted and compared to those of other methods, algorithms and strategies. To accomplish this goal, a dataset provided by the UCI Machine Learning Repository of the Center for Machine Learning and Intelligent Systems of the University of California, Irvine [79] was chosen. This dataset contains data from the Census Bureau Database of the United States, has already been used in more than 50 diverse scientific articles, and is therefore considered to be a good reference dataset. In [79], the most outstanding characteristics of this set and a detailed explanation of its attributes can be obtained.

Experiments to determine detection quality
To demonstrate that the proposed method is valid with regard to the detection capacity in real datasets, we have designed certain tests in which we define a concept and a series of equivalence relations [...]

$$r(x) = \begin{cases} 0 & \text{if Normal\_temperature}(x) \\ 1 & \text{if High\_temperature}(x) \\ 2 & \text{otherwise} \end{cases}$$
A concept is defined as those patients who suffer from the flu: [...]

Therefore, any element that satisfies the concept and belongs to the class $c_{x,1}$ (x = 1, 2, 3, 4) is contradictory with the relation $r_x$ because the individuals subjected to the analysis are children between 1 and 10 years of age.
Table 2 shows the set of outliers that were intentionally introduced into the dataset, showing only the attributes that are relevant for the analysis. Values that contradict the concept have been introduced.
When interpreting the results, it has to be noted that in all cases, within the set of outliers detected, there were always some outliers that had been intentionally introduced into the dataset. When the number of outliers detected was higher than the number of outliers introduced, all the introduced outliers were within the detected set. When the number of outliers detected was lower than the number introduced, those that were detected were always the most contradictory outliers.
- For small values of μ and β, the number of detected outliers can be high, and elements that are not actually outliers can be detected as such. For example, when μ = 0.2 and β = 0.0, 24 outliers were detected, which reaffirms an important aspect of the statistical view of the outlier detection problem for the final designation of a case as exceptional. When the considered candidate observations have been identified by a given detection method, the investigator must perform an analysis of these results and select those observations that demonstrate real contradictions with respect to the studied sample.
- When gradually increasing the value of the detection threshold (μ), a refinement in the detection is achieved. In general, when the value of this parameter increases, the number of outliers detected decreases. Given this decrease, it can be observed that those that remain in each case are those that are contradictory with a higher number of attributes. However, in certain cases and for certain variations in μ, such refinement is not achieved. For example, 24 outliers are detected when μ is varied from 0.2 to 0.4 and β = 0.0. The same lack of refinement is found when μ is varied from 0.6 to 1.0 and β = 0.0; across that range, the number of outliers detected was 9. Note that in the two examples, β = 0.0, which implies that no degree of declassification has been allowed; therefore, these results are indicative of the RSBM. Additionally, note that when a certain degree of declassification (i.e., β ≠ 0.0) is allowed for the same variations in μ as in the previous example, the number of detected outliers is different. After μ reaches its highest possible value (i.e., μ = 1.0), the number of detected outliers is 9; however, a higher detection refinement can be achieved if β is varied until the most contradictory outliers are identified. Thus, detection quality can be improved if a controlled degree of declassification (β) is allowed and increased gradually. However, we must be cautious with the variation of β because allowing a high degree of declassification can result in all elements that are near boundaries going into the positive or negative region, leaving the inner boundaries with no elements because all of them are removed. In the tests performed, for example, it is evident that this phenomenon occurs above β = 0.3 because no outliers are detected above this value.

Experiments to determine the algorithm's feasibility
To describe the behaviour of the proposed algorithm, we will analyse its behaviour when considering the variation of all the parameters that define the size of the algorithm input, which include the number of rows and columns of the dataset and the number of equivalence relations considered in the analysis. Additionally, the behaviour of the VPRSM algorithm was compared with that of the original RSBM model.
Additionally, the VPRSM-based version of the algorithm achieves better results in the detection of outliers by refining the candidates and focusing on detecting those outliers that are more contradictory. The proposed method achieves this result while maintaining the same temporal and spatial complexities as the RSBM-based algorithm. The proposed method is shown to provide a computationally efficient solution, offering the possibility of using quasi-linear algorithms, which is an advantage that any data analyst or engineer will value, given the typically elevated complexity of the procedures in the KDD-DM field and the typically large size of datasets.
In the long term, our investigation seeks a much more ambitious objective: to provide a tool that allows the probabilistic prediction of an outlier condition for all elements of a given dataset in a computationally feasible manner. To achieve this goal, the next step in this field of research should consist of creating an algorithm that can automatically calculate the β and μ thresholds involved in the proposed method, which currently must be defined by the user. Based on this algorithm, our investigation will be focused on the creation of a new method that allows the determination of the set of such thresholds under which a certain element of a dataset would be an outlier.
vol 34 no 3 May 2019