USING HIDDEN MARKOV MODEL BASED ON THE MODEL CONFIDENCE

Increasing the confidentiality and anonymity of users on the Internet is one of the most pressing issues of the day. One way to increase the privacy of using Internet services is to install the Tor software, which protects against "data flow analysis", a type of network surveillance that threatens the privacy of users and the confidentiality of business contacts and communications. Protection is implemented by routing network traffic over a distributed network of servers run by volunteers from around the world. This does not allow an external observer to monitor the user's Internet connection or find out which sites were visited, and it does not allow the site to learn the physical location of the user. However, the software in question has vulnerabilities that can result in the loss of the user's personal freedom. The author, through the application of general scientific methods such as analysis and synthesis, identified a list of vulnerabilities and their importance for the confidentiality of the Tor software. The author simulated the Tor software in an experimental environment and constructed experimental procedures based on the mathematical apparatus of Markov chains. The results of the experiment indicate the need to determine the validity of the model used for analysis of the anonymity protocol. In the course of this research, an algorithm for testing the anonymity of Tor software users was developed, which makes it possible to identify possible sources of leakage of users' personal information. The effectiveness of the proposed model confidence algorithm was demonstrated by calculating the size of the training data set needed to infer a protocol proxied through Tor.


Introduction
Tor Browser helps you to protect yourself from "data flow analysis" that threatens personal freedom and privacy, confidentiality of business contacts and connections.
This service provides protection by routing your network traffic across a distributed network of servers launched by volunteers from around the world: this does not allow an external observer to track your Internet connection to find out which sites you visit, and also does not allow the site to know your physical location.
This program works with many existing applications, including web browsers, instant messaging systems, remote access clients, and other applications using the TCP protocol.
Hundreds of thousands of people around the world use Tor for a variety of reasons: journalists and bloggers, human rights organizations, law enforcement officers, military personnel, corporations, people in countries with repressive regimes, and just ordinary citizens.
This article describes a scenario in which a client interacts with a server through Tor. Assuming that the communication protocol used by the client can be represented by a Hidden Markov Model (HMM), we can derive a model that accurately represents the underlying protocol using the timing information collected on the server side. The suggested model confidence approach is applied to the attack experiment to determine the amount of data required to construct a statistically significant representation of the protocol.
Therefore, the aim of this work is the traffic analysis of an anonymity protocol using a Hidden Markov Model (HMM) based on model confidence.
Let us take a sample $x_1, x_2, \ldots, x_n$. The null hypothesis is that the expected value $\mu$ equals a given value $\mu_0$. Then we can write the test statistic as
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}, \qquad (2)$$
where $\bar{x}$ is the sample mean and $\sigma^2$ is the variance. The conversion process (2) is called standardization or normalization, and the result is called the standard score or z-score. The value of z shows how many standard deviations below or above the population mean the sample mean lies under the null hypothesis. However, in most cases $\sigma^2$ is unknown and may be replaced by the sample variance $s^2$ if $n$ is big enough. Using the test statistic z for a given level of significance $\alpha$, we compute a one-sided or two-sided p-value. We reject the null hypothesis if the p-value is less than $\alpha$ and accept it otherwise. Alternatively, since the statistic z follows the standard normal distribution if the null hypothesis is correct, the decision to reject the null hypothesis can also be made by comparing z with a critical value, without converting it to a p-value.
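The z-test described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `z_test` and its parameters are hypothetical, and the sample standard deviation is substituted for $\sigma$ whenever the latter is not supplied, which is reasonable only for large $n$.

```python
import math
from statistics import mean, stdev


def z_test(sample, mu0, sigma=None, alpha=0.05, two_sided=True):
    """Return (z, p_value, reject_null) for H0: expected value == mu0.

    Hypothetical helper: if sigma is None, the sample standard deviation
    is used in its place (assumes the sample is large enough).
    """
    n = len(sample)
    s = sigma if sigma is not None else stdev(sample)
    z = (mean(sample) - mu0) / (s / math.sqrt(n))
    if two_sided:
        # P(|Z| > |z|) for a standard normal Z
        p_value = math.erfc(abs(z) / math.sqrt(2))
    else:
        # Upper-tail test: P(Z > z)
        p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value, p_value < alpha


z, p, reject = z_test([0.5, 0.5, 0.5, 0.5], mu0=0.0, sigma=1.0)
```

Comparing `z` directly with a critical value (about 1.96 for a two-sided test at $\alpha = 0.05$) leads to the same decision as comparing `p` with `alpha`.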

Hidden Markov Model
The standard Hidden Markov Model (HMM) is an N-state Markov chain observed at discrete points in time $t = 0, 1, 2, \ldots$. It is characterized by two probability distributions: the state transition probabilities and the probabilities of observations. As shown in fig. 1, these two HMM probability distributions generate two parallel stochastic processes [1]: a process of states and a process of observations. In this paper, we consider the problem of HMM inference, and the specific inference algorithm used is the causal splitting restoration algorithm (CSRA). This approach to the HMM [2] creates state machines that are deterministic in their outgoing transitions, i.e. each observation labels no more than one transition leaving a state. In addition, the underlying Markov chain of the HMM is generated using the Shalizi method with all transient states removed [2].
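To make the "deterministic in outgoing transitions" property concrete, the sketch below stores a model as a mapping from state to output symbol to a (probability, next state) pair, so each symbol can label at most one transition leaving a state. The two-state model, the symbol names and the function names are illustrative assumptions and are not taken from the paper.

```python
import random

# Hypothetical model: state -> {symbol: (probability, next_state)}.
# Keying the inner dict by symbol guarantees that each observation
# labels at most one transition leaving a state.
MODEL = {
    "s0": {"A": (0.7, "s0"), "B": (0.3, "s1")},
    "s1": {"A": (1.0, "s0")},
}


def is_valid(model):
    """Check that outgoing probabilities of every state sum to 1."""
    return all(
        abs(sum(p for p, _ in out.values()) - 1.0) < 1e-9
        for out in model.values()
    )


def generate(model, start, length, rng):
    """Walk the chain and return the list of emitted symbols."""
    state, symbols = start, []
    for _ in range(length):
        out = model[state]
        candidates = list(out)
        weights = [out[sym][0] for sym in candidates]
        sym = rng.choices(candidates, weights=weights)[0]
        symbols.append(sym)
        state = out[sym][1]  # the symbol determines the next state
    return symbols


sequence = generate(MODEL, "s0", 1000, random.Random(0))
```

Because the emitted symbol determines the next state, an observer who knows the model and the start state can follow the state sequence from the observations alone.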

Partially Observable Markov Decision Process
In the language of stochastic control, Partially Observable Markov Decision Processes (POMDPs) are control problems with partial observation. They usually model stochastic environments with hidden processes. By generalizing the Markov decision process (MDP) and allowing greater uncertainty, the POMDP provides a more powerful formalism for modeling realistic problems, especially for controlling systems with noisy data or limited sensing.
Formally, a POMDP is defined as a 6-tuple $(S, A, O, T, Z, r)$, where $T(s, a, s') = \Pr(s' \mid s, a)$ represents the probability of transition to state $s'$ after taking action $a$ in the state $s$.
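The role of the transition function in a POMDP can be illustrated with the standard belief update. The sketch below is a textbook construction, not code from the paper. It assumes the common convention that the observation function has the form $Z(s', a, o) = \Pr(o \mid s', a)$, and the state, action and observation names are invented for the example.

```python
def belief_update(belief, action, observation, T, Z):
    """Bayes update of a belief state.

    belief: {state: probability}
    T[(s, a, s2)] = Pr(s2 | s, a)     (transition function)
    Z[(s2, a, o)] = Pr(o | s2, a)     (assumed form of the observation function)
    """
    states = list(belief)
    unnormalized = {}
    for s2 in states:
        predicted = sum(T[(s, action, s2)] * belief[s] for s in states)
        unnormalized[s2] = Z[(s2, action, observation)] * predicted
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return {s: v / total for s, v in unnormalized.items()}


# Hypothetical two-state example.
T = {
    ("idle", "wait", "idle"): 0.9, ("idle", "wait", "busy"): 0.1,
    ("busy", "wait", "idle"): 0.5, ("busy", "wait", "busy"): 0.5,
}
Z = {
    ("idle", "wait", "short"): 0.8, ("idle", "wait", "long"): 0.2,
    ("busy", "wait", "short"): 0.3, ("busy", "wait", "long"): 0.7,
}
new_belief = belief_update({"idle": 0.5, "busy": 0.5}, "wait", "long", T, Z)
```

In this example a long delay shifts the belief toward the "busy" state, which is the kind of inference an observer with noisy measurements has to perform.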

Decentralized POMDP
When decision making becomes a collective task in which several agents need to coordinate without effective communication, and may even be unclear about their own local situation, the decentralized partially observable Markov decision process (DC-POMDP) is the main tool used in decision theory to solve this problem [3]. As an extension of the POMDP to the case of several agents, the DC-POMDP is a more general and more powerful modeling tool. However, solving a DC-POMDP usually leads to excessive computational overhead. A DC-POMDP can be formally defined by a tuple $(S, K, A_1, \ldots, A_K, O_1, \ldots, O_K, R, P, Q)$, where $S$ is the finite set of states and $K$ is the number of agents. Each agent has a local policy, but joint actions affect both the dynamics and the global reward.

Finite State Controller (FSC)
Although any POMDP policy may be represented by a policy graph, for some infinite-horizon policies infinite policy graphs may be required [4]. Therefore, most policy-based algorithms limit their search to finite policy graphs, i.e. FSCs, which can be defined as an extension of a probabilistic automaton $(N, A, y, b_0, E)$ [5] along with the probability distribution $x : N \times A \to [0, 1]$ and the output set $O$, where $N$ is the finite set of internal states of the controller that determines the course of action in the POMDP.
Sequential quadratic programming (SQP) is one of the most successful methods for solving nonlinearly constrained optimization problems. It is a family of algorithms rather than a single algorithm and rests on a deep theoretical foundation. SQP has demonstrated excellent performance in solving general large-scale nonlinear programming problems. In this section, we consider a general nonlinear programming (NLP) problem. The basic idea of SQP is similar to the methods of Newton and quasi-Newton. However, the presence of constraints makes the analysis and implementation of SQP methods much more difficult [6]. There are many classes of NLP for which individual SQP methods exist to solve them. These include unconstrained optimization, linearly constrained optimization and nonlinearly constrained optimization. We formulate the POMDP as an NLP and rely on SQP tools to find solutions.
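For orientation, the generic textbook form of an NLP and of the quadratic subproblem solved at an SQP iterate $x_k$ can be written as follows. The symbols $h$, $g$, $d$ and $B_k$ are assumptions introduced here for illustration (equality constraints, inequality constraints, the search step, and an approximation of the Hessian of the Lagrangian); the specific NLP used in the paper is not reproduced.

```latex
% Generic NLP (assumed notation)
\begin{aligned}
\min_{x} \quad & f(x) \\
\text{s.t.} \quad & h(x) = 0, \\
                  & g(x) \le 0 .
\end{aligned}

% QP subproblem at the iterate x_k: quadratic model of the objective,
% linearized constraints
\begin{aligned}
\min_{d} \quad & \nabla f(x_k)^{\top} d + \tfrac{1}{2}\, d^{\top} B_k\, d \\
\text{s.t.} \quad & h(x_k) + \nabla h(x_k)^{\top} d = 0, \\
                  & g(x_k) + \nabla g(x_k)^{\top} d \le 0 .
\end{aligned}
```

The step $d$ obtained from the subproblem is used to form the next iterate $x_{k+1} = x_k + d$, possibly scaled by a line search.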

Anonymity Protocol Analysis
To use the z-test, let us offer a simple algorithm for operational testing of the sequence of observations. The algorithm determines whether the built model statistically represents the data flow during the collection process. First, we collect a sequence of observational data y of some length D and build a model from the collected data. With the built model, we define the z-statistics and find out whether the experimental statistics provide confidence that a transition with the given minimum probability does not occur. If y is not long enough, we will not be able to build a model from the data, and it is necessary to collect additional data. The algorithm is presented below.

Algorithm
Let us denote the transition probability when the system is in the state s. A test z-statistic is defined for each state, and the model confidence is defined on the basis of these statistics.
Assuming this, our approach is limited to finding "known-unknowns" [7] at a given degree of statistical likelihood $\alpha$. If the observation is not in the alphabet, i.e. it is an "unknown-unknown" [7], the transition does not affect the confidence or the probability of an unknown transition. Also, if $K_s = O$ and $U_s = \emptyset$, then the state has no possible unknown outgoing transitions. No further transitions are available to exit the state, and testing the state does not change the confidence in the model.
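The paper's own test statistic is not reproduced here, so the sketch below substitutes a standard one-sided z-test for a binomial proportion to show the idea. If an unobserved transition had probability at least `min_prob` in a state, how many visits to that state without seeing it are needed before that hypothesis can be rejected at level `alpha`? The names `min_prob`, `visits_needed`, `state_is_confident` and `data_needed` are hypothetical, and dividing by the smallest asymptotic state probability is an assumed way of converting visits into total sequence length.

```python
import math
from statistics import NormalDist


def visits_needed(min_prob, alpha):
    """Visits to one state needed to reject H0: p >= min_prob
    when the transition has never been observed (assumed test)."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return math.ceil(z_crit ** 2 * (1 - min_prob) / min_prob)


def state_is_confident(visits, min_prob, alpha):
    """One-sided z-test with zero observed occurrences in `visits` trials."""
    z = (0.0 - min_prob) / math.sqrt(min_prob * (1 - min_prob) / visits)
    return z < -NormalDist().inv_cdf(1 - alpha)


def data_needed(state_probs, min_prob, alpha):
    """Total observations needed so that even the rarest state is
    expected to be visited often enough (assumed conversion)."""
    return math.ceil(visits_needed(min_prob, alpha) / min(state_probs.values()))


required = data_needed({"s0": 0.5, "s1": 0.25, "s2": 0.25}, 0.01, 0.05)
```

Under these assumptions the required amount of data grows as the smallest asymptotic state probability shrinks, which is why rarely visited states dominate the data requirement.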

POC
The experiments presented in this section are simple. The goal is to test the HMM construction algorithm [8] and provide readers with simple illustrations of the concept of model confidence. This section provides two examples. Below we consider the details of the application of the suggested algorithm for determining the model confidence in the detection of the Tor network protocol.

Example 1
The HMM used to generate the observation sequence of Example 1 is shown in fig. 3, a.

Example 2
The same steps were applied to the Markov chain in fig. 4, a, as was done in Example 1. The difference lies in the state structure and the transition probabilities. The probability of not seeing a transition decreases with increasing observation time. To create a model that represents the underlying process, there must be enough data in the training set to fully describe this process.
Thus, we have shown how to determine, at a given degree of statistical likelihood, whether an "unknown" transition does not occur, taking into account two user-defined threshold values. The first parameter sets the minimum probability: transitions with probabilities of at least this value should be included in the built model. The second parameter, $\alpha$, is the confidence level that shows the accuracy of the model result. In our demonstrations of the algorithm, we specifically looked at whether the built model corresponds to the model that acts as the underlying process.

Protocol Detection
Now we use the model confidence approach presented above to determine the protocol that the server uses when talking to a client over the Tor network, collecting time intervals between packets on the client. The time between sending each packet depends on the symbol associated with the transition. Each character is assigned a specified time delay in milliseconds, and the server waits for this amount of time before sending the packet to the client. This method links inter-packet delays with HMM transitions. In other words, the time delays between successive packets will be our observations of the underlying process. This is the behavior that we expect in actual protocols, where the packet timing is associated with the processing required by a particular task in the process.
Tor is a low-latency overlay network that allows applications to communicate anonymously and securely on the Internet. An overlay network is a logical network connected by virtual circuits on top of a physical network. Links that connect individual systems in the overlay network are implemented as "tunnels" through the core network. Sent packets are encrypted multiple times so that they remain logically separate from normal traffic. The stability and wide deployment of Tor can be explained by its practical design [9][10].
Tor basically consists of computers providing two types of services: relays and directory servers. There are several thousand relays, also known as onion routers, which are operated on a voluntary basis by individuals and organizations around the world. A path through Tor is built of relays. Relays and clients communicate according to the directory protocol [11] for the exchange of directory information. By default, relays listen on TCP port 9001 for incoming requests. Active relays publish their router descriptors to a list of predefined directory servers (authorities), reporting their current status. Directory servers store the router descriptors in the relay list and constantly check the availability of these relays. In addition, each relay is assigned flags in accordance with the servers' knowledge of the network status, that is, which relays should be listed as running, valid, stable, etc. Directory servers exchange their views with each other on a regular basis, for example, every hour. After all servers agree on the list of available relays, which is called the network consensus, the consensus is published on a TCP port (default 9030) and is available for download.
To use Tor, the client will need an HTTP proxy to retrieve the Tor directory and an HTTPS proxy to reach the relays. The current version of Tor allows the client to use any HTTPS or SOCKS proxy server to access the Tor network. Once installed, Tor can be run as an onion proxy (OP) if it processes only local requests. The SOCKS proxy listens on port 9050 by default for streams created by TCP-based applications, such as web browsing, SSH, instant messaging, etc. The traffic is then routed via Tor.
Tor starts building circuits as soon as it has enough directory information. When an application stream arrives, it is attached to a pre-built circuit, if one exists, or waits until a circuit is available. Before building the circuit, the client selects all relays (three by default) to use, starting with the exit node. The entry node of the circuit must be one of the entry guards, which is a set of nodes used by the client as long-term entry points to Tor. The connection between the client and the entry node is first established using TLS/SSLv3 for authentication and encryption [11]. After the first connection is created, the path is extended to the second and third nodes in a similar way. Using this incremental path-building design, the client sets up the session keys with each subsequent node independently [12]. The final node of the circuit, known as the exit node, is selected to ensure, as far as possible, support for connections to the destination.
Before attaching a stream to a built circuit that can support the client request, Tor sends a test request. If the request is not completed, Tor returns an error to the user.
All traffic going down the circuit is packed into 512-byte cells, which is an effective measure against leakage of packet size information through side channels. These cells are then iteratively encrypted using the key of each successive relay in the circuit. That is, the outermost layer of the packet is encrypted using the public key of the entry node.
And so on: the innermost layer of encryption is performed with the key of the exit node. When a cell moves down the circuit and arrives at each relay node, the node "unwraps" the cell with its private key to identify where it should send the decrypted cell, like peeling a layer of onion skin. Thus, each node in the circuit knows only the upstream node and the downstream node and cannot see the whole circuit. Consequently, the compromise of a single node does not violate anonymity.
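The layering idea can be illustrated with a toy example. The sketch below pads a payload to a fixed 512-byte cell and applies one layer per relay, with a SHA-256 based XOR keystream standing in for real ciphers. It is emphatically not Tor's cryptography, and the key values and function names are invented for the illustration. Note also that XOR layers commute, so unlike a real onion the toy does not enforce the order in which layers are removed.

```python
import hashlib

CELL_SIZE = 512  # fixed cell length, as described in the text


def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode (illustration only)."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def xor_layer(data: bytes, key: bytes) -> bytes:
    """Apply or remove one layer (XOR is its own inverse)."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


def pad_cell(payload: bytes) -> bytes:
    """Zero-pad a payload to exactly CELL_SIZE bytes."""
    if len(payload) > CELL_SIZE:
        raise ValueError("payload does not fit into one cell")
    return payload + b"\x00" * (CELL_SIZE - len(payload))


def wrap(payload: bytes, keys_entry_to_exit) -> bytes:
    """Client side: exit layer is applied first, entry layer last (outermost)."""
    cell = pad_cell(payload)
    for key in reversed(keys_entry_to_exit):
        cell = xor_layer(cell, key)
    return cell


keys = [b"entry-key", b"middle-key", b"exit-key"]  # hypothetical keys
cell = wrap(b"GET /index.html", keys)
for relay_key in keys:  # each relay removes its own layer in turn
    cell = xor_layer(cell, relay_key)
```

Every cell on the wire has the same length regardless of the payload, which is the property the text refers to as protection against size-based side channels.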
The procedure described is illustrated in fig. 5. When the addressee, Alice, responds to Bob's request, the same process is performed in the reverse order. There are many other details of the process, such as encryption schemes, integrity checking, congestion handling, path selection, etc. A detailed specification of the Tor protocol can be found in [12].
Here is a practical example of detecting a protocol tunneled through Tor to illustrate the usefulness of the suggested model confidence algorithm. We use the approach introduced in [13] to derive a model of the protocol that the server uses when talking to a client through the Tor network, by collecting time intervals between packets on the client. First, we have a valid HMM that represents the protocol used. The time between sending each packet depends on local processing and is represented by the symbol associated with the transition. Each character is assigned a time delay range in milliseconds, and the server waits for this amount of time before sending the packet to the client. This method links inter-packet delays with HMM transitions. In other words, the time delays between successive packets will be our observations of the underlying process. In actual protocols, the packet timing will be related to the processing required by a particular task in the process. After symbolizing the data that we record, the model building algorithm is used to reconstruct the model used by the server.
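On the observer's side, linking inter-packet delays to symbols amounts to binning time differences into ranges. A minimal sketch follows; the range boundaries (30 ms and 100 ms) and the function names are assumptions chosen for the example, not values from the experiment.

```python
from bisect import bisect_right


def inter_packet_delays(timestamps_ms):
    """Differences between successive packet arrival times."""
    return [later - earlier
            for earlier, later in zip(timestamps_ms, timestamps_ms[1:])]


def symbolize(delays_ms, boundaries_ms, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Map each delay to the symbol of the range it falls into.

    boundaries_ms must be sorted; k boundaries define k + 1 ranges.
    """
    if len(boundaries_ms) + 1 > len(alphabet):
        raise ValueError("not enough symbols for the requested ranges")
    return "".join(alphabet[bisect_right(boundaries_ms, d)] for d in delays_ms)


timestamps = [0, 10, 60, 75, 200]          # hypothetical capture times, ms
delays = inter_packet_delays(timestamps)   # [10, 50, 15, 125]
symbols = symbolize(delays, [30, 100])     # "ABAC"
```

A packet delayed by the network can cross a range boundary and be mapped to the wrong symbol, so the choice of boundaries affects how many spurious transitions appear in the reconstructed model.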

Model Building
The model used by the server in this experiment is shown in fig.6.

Fig. 6. Original five-state model for the pruning experiment
The server starts the process by randomly selecting a state in its model as the start state. To send each packet, a transition is taken from the current state, and the server waits for the corresponding time delay before sending the packet to the client. If there is more than one possible transition from a state, the transition is selected randomly, weighted by the probability of each transition. All data collection was performed on processes sent via Tor. As in the article [13], a program was used to capture packets within the network. We calculate the difference $\Delta t$ between each successive packet time. We then symbolize the data, grouping them into ranges and assigning anything in a given range to a unique character, such as A or B. We start with $L = 2$ and increase it as needed. We follow the process described in the flowchart in fig. 7 to create the models required by our attack.
Fig. 7. Flowchart summarizing the process of building a model
When the confidence test is run on the model, we find that it requires 20 624 750 data samples. This means we need to capture more data and rebuild. Because the amount of required data is so large, it has to be generated in lengths of 200 000 packets at a time. After each set of 200 000 we rebuild the model and run the confidence test again.
Oddly enough, the required amount of data keeps increasing with each set. In a Tor connection, there are times when a circuit fails or changes, or a relay gets too busy and delays a packet. There is some extra variable latency that affects roughly one out of every 200 packets. These glitches cause the packet to arrive later than it should have and, because of that, it is incorrectly symbolized. All of these new events have very low probability, which results in a lower minimum asymptotic state probability for each new set. This lower probability causes the confidence test to increase the amount of data required. To prune these unsubstantiated states and transitions from the model, we use the method of thresholding the asymptotic state probabilities.
After a model has been built with CSRA, it may have transitions that are taken very rarely and states that are visited very rarely. By setting a threshold on the asymptotic state probabilities, rare events are trimmed from the model. The pruning process is carried out mainly in three steps:
1. Any state with an asymptotic probability below the threshold is removed from the model.
2. Any transitions going to or leaving from that state are also removed.
3. Finally, any state or set of states that cannot be reached due to a removal are also deleted.
This leaves the model with only the states and transitions for which we have enough data.When we are unable to collect enough data to be confident in the full model, we leave out the parts where we would need more data to achieve confidence.
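The three pruning steps can be sketched as follows. The model representation (state to symbol to a (probability, next state) pair) and the example numbers are assumptions. Reachability is measured from the most probable surviving state, which is one possible reading of step 3, and the sketch does not renormalize the remaining transition probabilities, since the text does not say whether that is done.

```python
def prune(model, state_probs, threshold):
    """Prune rare states from a model.

    model: {state: {symbol: (probability, next_state)}}
    state_probs: {state: asymptotic probability}
    """
    # Step 1: drop states whose asymptotic probability is below the threshold.
    keep = {s for s in model if state_probs[s] >= threshold}
    # Step 2: drop transitions leaving or entering removed states.
    pruned = {
        s: {sym: (p, nxt) for sym, (p, nxt) in model[s].items() if nxt in keep}
        for s in keep
    }
    if not pruned:
        return {}
    # Step 3: drop states that can no longer be reached
    # (assumption: reachability from the most probable surviving state).
    root = max(keep, key=lambda s: state_probs[s])
    reachable, stack = {root}, [root]
    while stack:
        for _, nxt in pruned[stack.pop()].values():
            if nxt not in reachable:
                reachable.add(nxt)
                stack.append(nxt)
    return {s: out for s, out in pruned.items() if s in reachable}


# Hypothetical model: state "r" is rare, and "c" is reachable only through "r".
model = {
    "a": {"A": (0.9, "b"), "B": (0.1, "r")},
    "b": {"A": (1.0, "a")},
    "r": {"A": (1.0, "c")},
    "c": {"A": (1.0, "a")},
}
probs = {"a": 0.45, "b": 0.44, "r": 0.005, "c": 0.105}
small_model = prune(model, probs, threshold=0.01)
```

In the example, state "r" falls below the threshold, its incoming transition from "a" is removed, and "c" is deleted because nothing leads to it any longer.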
The value of the probability threshold reflects how often we should expect the process to deviate from the model. The smallest asymptotic state probability and the corresponding result from the confidence test are plotted against the number of packets captured in fig. 8. The steady increase suggests that we will not capture enough data to rebuild the model confidently. As for our experiment, analysis of the asymptotic state probabilities shows a large gap between 71 of the states and the other eight. The 71 have probabilities below 0.06%, while the other eight have probabilities above 8.2%. That is a gap of over two orders of magnitude. This division makes a good level for pruning. Following the pruning process, the model in fig. 9 results with a threshold of 0.01 (or 1%).
It is appropriate at this point to recall that the states of the putative HMM are characterized as having the same probability distribution over the next

Conclusion
This article analyzes the traffic of an anonymity protocol using a Hidden Markov Model based on model confidence and reveals its main features.
Thus, the work describes a timing side-channel attack to detect a communication protocol tunneled through Tor. The model confidence algorithm is applied to the implementation of the attack. A proof-of-concept experiment on our private Tor network showed that a model could successfully be reconstructed from inter-packet timings, and also demonstrated the practical application of the model confidence algorithm.
Further research should address the development of methods for increasing the confidentiality of traffic in public networks.

Fig. 1. The HMM process and its two stochastic processes: the process of states and the observation process
In the POMDP tuple, $r : S \times A \to \mathbb{R}$ is the immediate reward function.

Fig. 2. An influence diagram showing the relationships between the various elements of a POMDP, which illustrate the Markov property. Solid arrows represent dependencies.
In sequential quadratic programming, at each iteration the original NLP is reformulated as a quadratic programming (QP) subproblem by linearizing the constraints and replacing the objective function $f(x)$ with its local quadratic approximation.
Summing up, we make the following assumptions about the observation data and our knowledge of the underlying process. First, the process under consideration has a finite number of states and the transition probabilities are stationary. This assumption ensures that the training data set fully reflects the process. In addition, the alphabet $O$ is complete and contains all expected observations.
In Example 1 (fig. 3, a), we start with a random selection of the initial state in the model. At each step, an outgoing transition occurs with the appropriate probability and the corresponding symbol is observed. The initial process was set up to generate 10,000 data symbols, and the model built from them is shown in fig. 3, b. We can see that the reconstructed model has the same state structure and almost the same transition probabilities as the original model.

Fig. 3. Example 1: a - original model; b - model built of 10 000 packets; c - model built of 100 000 packets
The model built of the first 10 000 observations is shown in fig. 4, b. In this case, the state structure of the model is different from the original model, since state 4 was not visited even once in the first 10 000 observations. However, with a large amount of data collected, the state structure of the reconstructed model using 100 000 observations matches the original model. In addition, the transition probabilities are also closer to the actual values.
Fig. 4. Example 2: a - original model; b - model built of 10 000 characters; c - model built of 100 000 characters
The results of both examples illustrate a point made earlier. If an insufficient number of data samples is used, the algorithm creates a model that represents only the data used to create it and not the underlying process.

Fig. 5. The Tor cascade: traffic originating from Bob and intended for Alice is sent through a Tor circuit consisting of 3 relays

Fig. 8. The plot of model confidence results as more data is captured

Fig. 9. The result after pruning low-probability states and transitions
Nodes 6 and 7, although both have the same outpu