Application of ARIMA(1,1,0) Model for Predicting Time Delay of Search Engine Crawlers

The World Wide Web is growing at a tremendous rate in terms of both the number of visitors and the number of web pages. Search engine crawlers are highly automated programs that periodically visit the web and index web pages. The behavior of search engines can be used to analyze server load, the quality of search engines, the dynamics of search engine crawlers, the ethics of search engines, etc. The more often a crawler visits a web site, the more it contributes to the workload. The time delay between two consecutive visits of a crawler determines the dynamicity of the crawler. The ARIMA(1,1,0) model in time series analysis works well for forecasting the time delay between visits of search engine crawlers at web sites. We considered five search engine crawlers, all of which could be modeled using ARIMA(1,1,0). The results of this study are useful in analyzing server load.


Introduction
Crawlers, also known as 'bots', 'robots' or 'spiders', are highly automated programs that are seldom regulated manually [1] [2]. Crawlers form the basic building blocks of search engines: they periodically visit web sites, identify new web sites, update information that has changed, and index the web pages in search engine archives. The log files generated at web sites play a vital role in analyzing the behavior of users as well as of crawlers. Most work in web usage mining or web log mining relates to user behavior, as it has applications in targeted advertising, online sales and marketing, market basket analysis, personalization, etc. Free tools such as Google Analytics measure the number of visitors, the duration of visits, the demographic from which visitors come, etc. However, Google Analytics cannot identify search engine visits, because it tracks users with the help of JavaScript and search engine crawlers do not execute the JavaScript embedded in web pages when they visit web sites [3].

Search engine crawlers initially access the robots.txt file, which implements the Robot Exclusion Protocol. robots.txt is a text file kept at the root of the web site directory. Crawlers are supposed to access this file before they crawl any web pages. Crawlers that access this file first and then proceed to crawl are known as ethical crawlers; crawlers that do not access it are called unethical crawlers. The robots.txt file specifies which pages may be crawled and which folders and pages are denied access. Certain pages and folders are denied access because they contain sensitive information that is not intended to be publicly available. There may also be situations where two or more versions of a page are available, for example one as HTML and another as PDF. Crawlers can be made to skip the PDF version to avoid redundant crawling. Certain files such as JavaScript, images, style sheets, etc. can also be excluded to save time and bandwidth. There are two ways to do this: the first uses the robots meta tag and the second uses the robots.txt file. The robots.txt file contains the list of user agents and the folders or pages that are disallowed [30]. The structure of a robots.txt file is as follows.
User-agent:
Disallow:

"User-agent:" names the search engine crawler and "Disallow:" lists the files and directories to be excluded from indexing. In addition to "User-agent:" and "Disallow:" entries, comment lines can be included by putting the # sign at the beginning of the line. For example, the following entry disallows all user agents from accessing the folder /a.

# All user agents are disallowed to see the /a directory.
User-agent: *
Disallow: /a/

DOI: 10.12948/issn14531305/17.4.2013.03

As noted above, crawlers that access robots.txt before any other file or folder are known as ethical crawlers, whereas the others are known as unethical crawlers. Some crawlers, such as "Googlebot", "Yahoo! Slurp" and "MSNbot", cache the robots.txt file for a web site, and hence these robots may disobey the rules for a while after the robots.txt file is modified.

Roughly, a crawler starts off with the URL of an initial page p0. It retrieves p0, extracts any URLs in it, and adds them to a queue of URLs to be scanned. The crawler then takes URLs from the queue (in some order) and repeats the process. Every page that is scanned is given to a client that saves the pages, creates an index for them, or summarizes or analyzes their content [26]. Certain crawlers avoid placing too much load on a server by crawling at a low speed during peak hours of the day and at a high speed during late night and early morning [2].

A crawler for a large search engine has to address two issues. First, it needs a good crawling strategy, i.e., a strategy for deciding which pages to download next. Second, it needs a highly optimized system architecture that can download a large number of pages per second while being robust against crashes, manageable, and considerate of resources and web servers [24]. There are two important aspects in designing efficient web spiders, i.e. crawling strategy and crawling performance. Crawling strategy deals with the way the spider decides which pages should be downloaded next. Generally, a web spider cannot download all pages on the web because its resources are limited compared to the size of the web [28].

Mobile crawlers that always stay in the memory of the remote system occupy a considerable portion of it. This problem grows further when there are a number of mobile crawlers from different search engines: a) all these mobile crawlers stay in the memory of the remote system and consume a lot of memory that could otherwise have been used for other useful purposes; b) the remote system may not allow the mobile crawlers to reside permanently in its memory for security reasons; c) if a page changes very quickly, the mobile crawler immediately accesses the changed page and sends it to the search engine to keep the index up to date, which wastes network bandwidth and CPU cycles [30].

Recently, web crawlers have been used for focused crawling, shopbot implementation and value-added services on the web. As a result, more active robots are crawling the web and many more are expected to follow, which will increase search engine traffic and web server activity [4].

In this work, the Auto Regressive Integrated Moving Average (ARIMA) model is used to predict the time delay between two consecutive visits of a search engine crawler. Specifically, we used the differenced first-order autoregressive model, ARIMA(1,1,0), for forecasting this time delay.
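Returning to the Robot Exclusion Protocol described earlier in this section, the following is a minimal sketch of how a well-behaved crawler could check the /a example rules before requesting a page. It uses Python's standard urllib.robotparser module; the crawler name "ExampleBot" and the two page paths are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the text: every user agent is barred from /a/.
ROBOTS_TXT = """\
# All user agents are disallowed to see the /a directory.
User-agent: *
Disallow: /a/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler name; the paths are illustrative.
blocked = parser.can_fetch("ExampleBot", "/a/page.html")
allowed = parser.can_fetch("ExampleBot", "/b/page.html")
print(blocked, allowed)  # prints: False True
```

A crawler that runs such a check before every request, and re-reads robots.txt often enough to notice modifications, would count as ethical in the sense used here.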

Background Literature
Several works discuss the behavior of search engine crawlers. A forecasting model has been proposed for the number of pages crawled by search engine crawlers at a web site [3]. Sun et al. conducted a large-scale study of robots.txt [2]. A characterization study with metrics of search engine crawlers was carried out to analyze the qualitative features, the periodicity of visits and the pervasiveness of visits to a web site [4]. The working of a search engine crawler is explained in [5]. Nielsen NetRatings is one of the leading internet and digital media audience information and analysis services. NetRatings has provided a study on the usage statistics of search engines in the United States [6]. Commercial search engines play a lead role in information dissemination and access on the World Wide Web. The evidence for and possible causes of search engine bias have also been studied [7]. An empirical pilot study examined the relationship between JavaScript usage and web site usage. The intention was to establish whether JavaScript-based hyperlinks attract or repel crawlers, resulting in an increase or decrease in web site visibility [8]. The ethics of search engine crawlers have been identified using quantitative models [9]. The temporal behavior of search engine crawlers at web sites has also been analyzed [10]. There is a significant difference in the time delay between and among various search engine crawlers at web sites [11]. Search engines do not index sites equally, may not index new pages for months, and no engine indexes more than about 16% of the web [23]. A crawling technique that reduces the load on the network using mobile agents was developed by Bal and Nath [25]. The working of a comprehensive full-text search engine called WebCrawler has also been studied [27]. An optimal algorithm for distributed web crawling compresses the crawled web data before sending it to the central database of the search engine, thereby reducing the load and the processing bottleneck at the search engine database [29].

Methodology

Pre Processing
Web log files need a considerable amount of preprocessing. User traffic needs to be removed from the file, as this work focuses on search engine behavior. Improper preprocessing may bias the data mining tasks and lead to incorrect results. About 90% of the traffic generated at web sites is contributed by search engine crawlers [13]. The advantages of preprocessing are: a) the storage space is reduced, as only the data relevant to web mining is stored; b) the user visits and image files are removed, so that the precision of web mining is improved.

Web logs are an unstructured and unformatted raw source of data. Entries with unsuccessful status codes and entries pertaining to irrelevant data such as JavaScript, images, style sheets, etc., as well as user information, are removed. The most widely used log file formats are the Common Log File Format and the Extended Log File Format. The Common Log File Format contains the following information: a) IP address b) authentication name c) the date-time stamp of the access d) the HTTP request e) the URL requested f) the response status g) the size of the requested file. The Extended Log File Format contains additional fields such as a) the referrer URL b) the browser and its version and c) the operating system or the user agent [14] [15]. There are usually three types of HTTP request, namely GET, POST and HEAD. Most HTML files are served via the GET method, while most CGI functionality is served via POST or HEAD. Status code 200 indicates a successful request [14]. Search engines are identified from the IP addresses and user agents they use to access the web.

The log file of a business organization, www.nestgroup.net, covering the 31 days from May 1, 2011 to May 31, 2011, was used in this study. Table 1 shows the results of preprocessing.
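The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: it assumes log lines in the extended (combined) format, identifies crawlers only by a substring of the user agent (the paper also uses IP addresses), treats every successful page request as one visit, and uses made-up sample lines. The regular expression, the suffix list and the helper names are all assumptions.

```python
import re
from datetime import datetime

# Assumed layout: extended (combined) log format with referrer and user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)
# User-agent substrings for the five crawlers studied in the paper.
CRAWLER_TOKENS = ("Baiduspider", "bingbot", "Googlebot", "Feedfetcher-Google", "Slurp")
# Resources treated as irrelevant (scripts, style sheets, images).
SKIPPED_SUFFIXES = (".js", ".css", ".png", ".jpg", ".gif")


def crawler_visits(lines):
    """Return {crawler token: [visit timestamps]} for successful page requests."""
    visits = {}
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m is None or m["status"] != "200":
            continue  # malformed entry or unsuccessful status code
        if m["url"].lower().endswith(SKIPPED_SUFFIXES):
            continue  # script, style sheet or image
        agent = m["agent"].lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in agent:
                stamp = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
                visits.setdefault(token, []).append(stamp)
                break  # user traffic never matches and is dropped
    return visits


def time_delays(timestamps):
    """Seconds between consecutive visits, in chronological order."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


# Made-up sample lines for illustration only.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE = [
    f'66.249.66.1 - - [01/May/2011:00:00:10 +0000] "GET /index.html HTTP/1.1" 200 1234 "-" "{GOOGLEBOT}"',
    '10.0.0.7 - - [01/May/2011:00:00:40 +0000] "GET /index.html HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (Windows NT 6.1)"',
    f'66.249.66.1 - - [01/May/2011:00:01:00 +0000] "GET /logo.png HTTP/1.1" 200 900 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [01/May/2011:00:02:10 +0000] "GET /about.html HTTP/1.1" 200 800 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [01/May/2011:00:05:10 +0000] "GET /missing.html HTTP/1.1" 404 0 "-" "{GOOGLEBOT}"',
]

visits = crawler_visits(SAMPLE)
print(time_delays(visits["Googlebot"]))  # prints: [120.0]
```

The resulting list of delays in seconds, one list per crawler, is the kind of series that the ARIMA analysis in the following sections takes as input.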

Auto Regressive Integrated Moving Average Model (ARIMA)
Forecasting is an important aspect of statistical analysis that provides guidance for decisions in all areas. It is important to be able to make sound forecasts for variables such as sales, production, inventory, interest rates, exchange rates, and real and financial asset prices, for both short- and long-term business planning. Autoregressive Integrated Moving Average (ARIMA) models provide a unifying framework for forecasting. These models are aided by the abundance of high-quality data and by easy estimation and evaluation with statistical packages [16]. We found that the time delay between the visits of search engine crawlers could be predicted using the ARIMA model.

ARIMA(p,d,q): ARIMA models are, in theory, the most general class of models for forecasting a time series which can be made stationary by transformations such as differencing and logging. In fact, the easiest way to think of ARIMA models is as fine-tuned versions of random-walk and random-trend models. The fine-tuning consists of adding lags of the differenced series and/or lags of the forecast errors to the prediction equation, as needed to remove any last traces of autocorrelation from the forecast errors. Lags of the differenced series appearing in the forecasting equation are called "auto-regressive" terms, lags of the forecast errors are called "moving average" terms, and a time series which needs to be differenced to be made stationary is said to be an "integrated" version of a stationary series [20]. Lag 1 is the time period between two observations $y_t$ and $y_{t-1}$. A time series can also be lagged forward, $y_t$ and $y_{t+1}$.
A non-seasonal ARIMA model is classified as an ARIMA(p,d,q) model, where: a) p is the number of autoregressive terms, b) d is the number of non-seasonal differences, and c) q is the number of lagged forecast errors in the prediction equation. The autoregressive element, p, represents the lingering effects of preceding scores; the integrated element, d, represents trends in the data; and q represents the lingering effects of preceding random shocks. When the time series is long, measures also tend to vary periodically, which is called seasonality or periodicity in time series. Time series analysis is more appropriate for data with autocorrelation. If all patterns are accounted for in the model, the residuals are random. In many applications of time series, identifying and modeling the patterns in the data is sufficient to produce an equation, which is then used to predict the future of the process.
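For reference, the ARIMA(p,d,q) classification above can be written compactly as follows. This is a standard textbook form and not an equation taken from the paper; the backshift operator $B$, the error terms $\varepsilon_t$ and the moving average coefficients $\theta_j$ are notation introduced here, and some texts write the moving average terms with a minus sign.

```latex
% d-th difference of the observed series (B is the backshift operator, B y_t = y_{t-1})
w_t = (1 - B)^d \, y_t

% p autoregressive terms and q lagged forecast errors on the differenced series
w_t = \mu + \sum_{i=1}^{p} \phi_i \, w_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}

% Special case p = 1, d = 1, q = 0, i.e. ARIMA(1,1,0):
y_t - y_{t-1} = \mu + \phi \, (y_{t-1} - y_{t-2}) + \varepsilon_t
```

With $p = 1$, $d = 1$ and $q = 0$ only one autoregressive coefficient and the constant need to be estimated, which is the case used for the crawler data in this study.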

Model Identification
Let $y_1, y_2, y_3, \ldots, y_T$ represent a sample of $T$ observations of a variable of interest $y$, and let $\{y_t\}$ denote the time series. Since stationarity is essential for identifying an ARIMA model, the first step is always to test the underlying series for stationarity. Many real-world data sets, including the web data chosen for our study, are not stationary. Such a series can be made stationary by differencing, with or without pre-transformations. Formally, $\{y_t\}$ is said to be stationary if the mean $E(y_t) = \mu$, the variance $\mathrm{Var}(y_t) = E[(y_t - \mu)^2]$ and the covariance $\mathrm{Cov}(y_t, y_{t-s}) = E[(y_t - \mu)(y_{t-s} - \mu)] = \gamma_s$ are all stable over time. For the series to be stationary, it must not exhibit any stochastic trend (changing mean) or varying volatility (changing variance) [16] [21].
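As a small illustration of how differencing removes a changing mean, the sketch below differences a trending series and compares the mean and variance of its two halves before and after. Comparing halves is only an informal check chosen here for brevity, not a formal stationarity test, and the numbers are synthetic rather than taken from the crawler logs.

```python
from statistics import mean, pvariance


def difference(series, order=1):
    """Apply first differencing `order` times: w_t = y_t - y_{t-1}."""
    result = list(series)
    for _ in range(order):
        result = [b - a for a, b in zip(result, result[1:])]
    return result


def half_summary(series):
    """(mean, variance) of the first and second half of the series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (mean(first), pvariance(first)), (mean(second), pvariance(second))


# Synthetic series with an upward trend, so its mean changes over time.
trended = [10, 12, 15, 17, 20, 22, 25, 27]
diffed = difference(trended)

print(half_summary(trended))  # the two half means differ widely (13.5 vs 23.5)
print(half_summary(diffed))   # the two half means are close after differencing
```

In the ARIMA(p,d,q) notation, one pass of `difference` corresponds to d = 1.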

Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)
The principal way to determine which Auto-Regressive (AR) or Moving Average (MA) model is appropriate is to look at the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) of the time series. The plots of the autocorrelation function and partial autocorrelation function also serve as a visual test for stationarity [18] [19]. At lag $k$, the ACF is computed by

$$r_k = \frac{\sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2},$$

where $\bar{y}$ is the sample mean of the series.

In time series, we may want to measure the relationship between $y_t$ and $y_{t-k}$ when the effects of the other time lags $1, 2, \ldots, k-1$ have been removed. The autocorrelation does not measure this. However, the partial autocorrelation is a way to measure this effect. The partial autocorrelation of a time series at lag $k$ is denoted $\alpha_k$ and is found as follows. (1) Fit a linear regression of $y_t$ on the first $k$ lags (i.e. fit an AR($k$) model to the time series):

$$y_t = \phi_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_k y_{t-k} + \varepsilon_t.$$

(2) Then $\alpha_k = \hat{\phi}_k$, the fitted value of $\phi_k$ from the regression (least squares). The set of partial autocorrelations at different lags is called the partial autocorrelation function (PACF) and is plotted like the ACF.

The Box-Jenkins procedure is concerned with fitting an ARIMA model to data [17]. It has three parts: identification, estimation, and verification.
Figure 1 shows the Box-Jenkins model building process.

Fig. 1. Box-Jenkins Model Building Process
The Box-Jenkins approach suggests short and seasonal (long) differencing to achieve stationarity in the mean, and logarithmic or power transformation to achieve stationarity in the variance. In case the series is seasonal, the Box-Jenkins methodology proposes multiplicative seasonal models coupled with long-term differencing, if necessary, to achieve stationarity in the mean. The difficulty with such an approach is that there is practically never enough data available to determine the appropriate level of the seasonal ARMA model with any reasonable degree of confidence. Users therefore proceed through trial and error, both in identifying an appropriate seasonal model and in selecting the right long-term (seasonal) differencing. In addition, seasonality complicates the use of ARMA models, as it requires much more data while increasing the modelling options available and making the selection of an appropriate model more difficult [22].
We selected 100 time delays between consecutive visits for the crawlers Baiduspider, Bingbot, Googlebot, Feedfetcher-Google and Slurp. The time delays in seconds were plotted, and the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) were plotted as well. The plots obtained for Baiduspider are given in Figure 2 and Figure 3 respectively.
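The ACF and PACF values behind such plots can be computed with a few lines of code. The sketch below is an illustration under stated assumptions, not the software used in the study: the ACF follows the usual sample autocorrelation, while the PACF is obtained with the Durbin-Levinson recursion on the sample autocorrelations instead of the lag-by-lag least squares regressions described in the text, so the two approaches agree closely on long series but not exactly. The input series is synthetic.

```python
def acf(series, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag."""
    n = len(series)
    avg = sum(series) / n
    dev = [y - avg for y in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def pacf(series, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(series, max_lag)
    values = [1.0]  # lag 0 by convention
    prev = []       # prev[j] holds phi_{k-1, j+1}
    for k in range(1, max_lag + 1):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append phi_{k,k}.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        values.append(phi_kk)
    return values


# Synthetic series for illustration only.
sample = [1, 2, 3, 4, 5]
print(acf(sample, 2))
print(pacf(sample, 2))
```

For identification, an ACF that dies out gradually together with a PACF that drops to near zero after lag 1 on the differenced series is the usual signature of a single autoregressive term.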
The ARIMA(1,1,0) forecasting equation is $\hat{y}_t = \mu + y_{t-1} + \phi\,(y_{t-1} - y_{t-2})$, where $\mu$ represents the constant and $\phi$ is the autoregressive coefficient. The observed and forecasted values of the time delay between visits of the crawlers Baiduspider, Bingbot, Feedfetcher-Google, Googlebot and Slurp are shown in Figure 12, Figure 13, Figure 14, Figure 15 and Figure 16 respectively.
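A minimal sketch of fitting and forecasting with a differenced first-order autoregressive model is given below. It estimates the constant and the autoregressive coefficient by ordinary least squares on the first differences, which is an assumption made here for simplicity; statistical packages typically use maximum likelihood, so their estimates can differ. The function names and the synthetic series are illustrative and do not come from the study.

```python
def fit_arima_110(series):
    """Estimate (mu, phi) in w_t = mu + phi * w_{t-1}, where w_t = y_t - y_{t-1}."""
    if len(series) < 4:
        raise ValueError("need at least four observations")
    w = [b - a for a, b in zip(series, series[1:])]
    x, y = w[:-1], w[1:]  # lagged differences and current differences
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = sxy / sxx
    mu = mean_y - phi * mean_x
    return mu, phi


def forecast(series, mu, phi, steps=1):
    """Iterate y_hat_t = mu + y_{t-1} + phi * (y_{t-1} - y_{t-2})."""
    history = list(series[-2:])
    predictions = []
    for _ in range(steps):
        nxt = mu + history[-1] + phi * (history[-1] - history[-2])
        predictions.append(nxt)
        history.append(nxt)
    return predictions


# Synthetic delays whose differences follow w_t = 1 + 0.5 * w_{t-1} exactly.
delays = [0, 4, 7, 9.5, 11.75, 13.875]
mu, phi = fit_arima_110(delays)
print(mu, phi)
print(forecast(delays, mu, phi, steps=2))
```

On real crawler delays the fit will not be exact, and the residuals should be checked for remaining autocorrelation, as the verification step of the Box-Jenkins procedure requires.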

Conclusion
The results revealed that the Autoregressive Integrated Moving Average model ARIMA(1,1,0) is well suited to predicting the time delay between visits of search engine crawlers such as Baiduspider, Bingbot, Feedfetcher-Google, Googlebot and Slurp. The Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) suggested the ARIMA(1,1,0) model. The model was more accurate for Baiduspider, Bingbot and Feedfetcher-Google than for Googlebot and Slurp. This forecasting is helpful for calculating server load and traffic. The work can be extended to find the time delay between visits of crawlers on an hourly basis, in order to identify the crawlers that visit the web site during peak hours. The visits of such crawlers could then be regulated and assigned to off-peak hours so that the server load is minimized.

Table 1.

Results
Search engines with fewer than 5 visits in a month were eliminated before further analysis. There were 13 distinct search engine crawlers. Certain search engine crawlers made several visits in a single day, whereas some others made one or two visits