Methodology of stakeholders’ behaviour modelling based on time

The methods presented in this article were created to model and describe the behaviour of the users of a bank institution's web portal. The source dataset is a log file of the commercial bank's web server. The analysis focuses on the behaviour of visitors over an extended period (2009-2012): the years 2009-2010 represent the years of the financial crisis, and the years 2011-2012 the years after the financial crisis. The following method describes the sequence of steps necessary to pre-process the raw log file and model web user behaviour using the multinomial logit model. With appropriate data preparation, the introduced methods can also be used in other domains.

• Data preparation - data cleaning, user/session identification, path completion, variables determination;
• Data analysis - model definition, parameters estimation, logits estimation, probabilities estimation;
• Results evaluation - comparison of empirical and theoretical values in terms of counts, probabilities and logits.

Methods described in this article were created to analyse and model the behaviour of visitors to a bank web portal for a research article [4]. The data source was log files obtained from the web servers, containing the visitors' accesses to the parts of the web portal. A detailed description of the log files can be found in Data in Brief [3]. Data related to Pillar 3 were gathered from bank web server log files [5]. The research methodology was inspired by [6][7][8]. The model of bank visitors' behaviour was created based on a multinomial logit model [9]. The web usage analysis was based on a sample of 2 071 235 logged accesses obtained after data preparation. The investigated categorical dependent variable is the variable category, which represents a group of web parts dealing with a similar issue. The variable contains these categories: Pricing List, Reputation, Business Conditions, Pillar3 related, Pillar3 disclosure requirements and We support. The analysis focuses on the behaviour of visitors over an extended period of time (2009-2012). The years 2009-2010 represent the years of the financial crisis (variable crisis = 1), while the years 2011-2012 represent the years after the financial crisis (variable crisis = 0). As the time-related independent variable, the variable week was chosen; it was created from the date of the visitor's access according to the ISO 8601 standard and acquires the values 0-53. If the week number equals 0, the given date belongs to a week of the preceding year. The applied methodology is based on [10] and is as follows:

1. Obtaining log files from multiple servers.
2. Data preparation, involving the following tasks:
   a. Data cleaning - removing unnecessary data from the log files (requests for pictures, styles and so on, as well as accesses by search engine robots), which leaves only the raw accesses to the web portal.
   b. User/session identification - visitors were identified based on the variables IP address and user agent; sessions were identified based on the Reference Length method.
   c. Path completion - used to complete the records of the path the user followed using the Back button in the web browser (these visited pages are not recorded in the log file because they have already been stored on the client side during the previous steps).
   d. Variables determination - the log file contains the variables in the typical Extended Log Format (ELF), so a transformation and variable definition are needed for the user behaviour analysis of the examined web portal. A dependent variable category is created to represent the web parts of the web portal. If some web parts have low traffic, it is appropriate to merge them into wider categories based on the relevance of their content [9]. For the examined web portal of the bank institution, the variable category contains the following web categories: Pricing List, Reputation, Business Conditions, Pillar3 related, Pillar3 disclosure requirements, and We support. It is also necessary to identify the independent variables (predictors), i.e. the time variables created from the timestamp of the access to the web category. For the weeks of the year this is the variable week, created according to ISO 8601 with values 0-53; the variable equals 0 for a week that begins in the previous year. The next predictor is the dummy variable crisis, which identifies the period of years during the financial crisis versus after it. Finally, a dummy variable internal was created to identify accesses from inside and outside the organization's network (the variable was created based on the sets of IP addresses), so that the behaviour of users accessing from inside or outside (internal/external access) of the organization's network can be analysed.
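The derivation of the time variables described above can be sketched as follows. This is a minimal illustration, not the authors' code; the internal address prefixes are hypothetical placeholders, since the real internal/external split comes from the organization's own sets of IP addresses:

```python
from datetime import date

def iso_week(d: date) -> int:
    """Week number per ISO 8601; 0 when the week belongs to the preceding year."""
    iso_year, iso_week_no, _ = d.isocalendar()
    return iso_week_no if iso_year == d.year else 0

def make_variables(timestamp: date, ip: str,
                   internal_prefixes=("10.", "192.168.")):  # hypothetical internal ranges
    """Derive the predictors week, crisis and internal from one cleaned log record."""
    return {
        "week": iso_week(timestamp),
        "crisis": 1 if timestamp.year in (2009, 2010) else 0,
        "internal": 1 if ip.startswith(internal_prefixes) else 0,
    }
```

For example, 1 January 2010 falls into ISO week 53 of 2009, so `week` is 0 and, as a crisis-period date, `crisis` is 1.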
Let π_ij denote the probability that an access in week i falls into category j, j = 1, ..., J, and define the logits as η_ij = log(π_ij / π_iJ), where the last category is chosen as the reference category (η_iJ = 0). It is assumed that the logits η_ij are linear functions of the independent variables, η_ij = α_j + x_i^T β_j. Using the inverse transformation, π_ij = exp(η_ij) / Σ_{k=1}^{J} exp(η_ik). Based on the assumption that the expected counts ŷ_ij = n_i π̂_ij are large enough (none of them is zero and no more than 20% of the ŷ_ij are less than 5), the actual model can be compared with the saturated model, which predicts the probabilities independently for i = 0, 1, ..., 53, using the statistic G² (deviance, likelihood ratio):

G² = 2 Σ_i Σ_j y_ij log(y_ij / ŷ_ij).

The hypothesis H₀: π_ij = π̂_ij can be tested using the LR test, which makes it possible to compare the estimates ŷ_ij with the observed counts y_ij. The saturated model has 54(J − 1) free parameters and the current model k(J − 1), so the degrees of freedom are df = (54 − k)(J − 1). The statistic G² has approximately a χ²(df) distribution. The Pearson statistic can also be used to compare the estimates ŷ_ij with y_ij:

X² = Σ_i Σ_j (y_ij − ŷ_ij)² / ŷ_ij,

where (y_ij − ŷ_ij) / √ŷ_ij is the Pearson residual; X² also has the χ²(df) distribution.
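The inverse transformation and the two goodness-of-fit statistics can be sketched numerically. This is a minimal sketch under the notation above; the function names are ours, not the article's:

```python
import numpy as np

def probabilities(alpha, beta, x):
    """Inverse logit transformation for one covariate vector x.
    alpha: (J-1,), beta: (J-1, p); the reference category J has logit 0."""
    eta = alpha + beta @ x                 # eta_j = alpha_j + x^T beta_j
    eta = np.append(eta, 0.0)              # reference category logit fixed at 0
    e = np.exp(eta - eta.max())            # numerically stabilised softmax
    return e / e.sum()                     # pi_ij, summing to 1 over categories

def deviance_and_pearson(y, y_hat):
    """G^2 (likelihood ratio) and Pearson X^2 over an (I, J) count table."""
    mask = y > 0                           # terms with y_ij = 0 contribute 0 to G^2
    g2 = 2.0 * np.sum(y[mask] * np.log(y[mask] / y_hat[mask]))
    x2 = np.sum((y - y_hat) ** 2 / y_hat)
    return g2, x2
```

When the fitted counts equal the observed counts, both statistics are zero; larger values indicate a worse fit relative to the saturated model.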
In the given application field, the condition for using the LR test/Pearson statistic is often violated. Usually the examined variable has a considerable number of levels, which are web parts of the portal or system (pages, content categories, activities, etc.). As a result, the expected counts are not large enough for the LR test/Pearson statistic to be applicable. For this reason, alternative methods are used to evaluate the model [9,11]: visualization of the differences between empirical and theoretical counts, identification of extremes, comparison of the distribution of the empirical relative counts of accesses with the estimated probabilities of the examined web part j in time i, and visualization of the empirical and theoretical logits of each web part except the reference one.
The created model was evaluated based on the following steps:

a. Determination of the empirical counts y_ij.
b. Estimation of the theoretical counts ŷ_ij = π̂_ij Σ_j y_ij.
c. Visualization of the differences between the empirical and theoretical counts of accesses, d_ij = y_ij − ŷ_ij.
d. Identification of extreme values of d_ij, where d_ij > d̄_j + 2s_j represents an underestimated prediction and d_ij < d̄_j − 2s_j an overestimated prediction; s_j is the standard deviation and d̄_j the mean of the differences of category j.
e. Calculation of the relative empirical counts of accesses p_ij = y_ij / Σ_j y_ij and their comparison with the estimated probabilities π̂_ij.
f. Visualization of the empirical and theoretical logits for the individual web categories except the reference one.
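The evaluation steps above can be sketched for an I × J table of weekly counts and the corresponding estimated probabilities. This is a sketch under the notation of this section, not the authors' code:

```python
import numpy as np

def evaluate(y, pi_hat):
    """Evaluation steps a-f: y is the (I, J) table of empirical counts,
    pi_hat the (I, J) table of estimated probabilities."""
    n = y.sum(axis=1, keepdims=True)       # n_i: total accesses in week i
    y_hat = pi_hat * n                     # b. theoretical counts
    d = y - y_hat                          # c. differences d_ij
    mean, sd = d.mean(axis=0), d.std(axis=0, ddof=1)
    under = d > mean + 2 * sd              # d. underestimated predictions
    over = d < mean - 2 * sd               #    overestimated predictions
    p = y / n                              # e. relative empirical counts
    # f. empirical vs. theoretical logits against the reference category J
    logit_emp = np.log(p[:, :-1] / p[:, [-1]])
    logit_the = np.log(pi_hat[:, :-1] / pi_hat[:, [-1]])
    return y_hat, d, under, over, p, logit_emp, logit_the
```

In practice the differences, relative counts and logits returned here would be plotted per category, as described in the evaluation steps.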

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.