Attributing hacks with survival trend filtering

In this paper we describe an algorithm for estimating the provenance of hacks on websites. That is, given properties of sites and the temporal occurrence of attacks, we are able to attribute individual attacks to joint causes and vulnerabilities, as well as estimate the evolution of these vulnerabilities over time. Specifically, we use hazard regression with a time-varying additive hazard function parameterized in a generalized linear form. The activation coefficients on each feature are continuous-time functions over time. We formulate the problem of learning these functions as a constrained variational maximum likelihood estimation problem with total variation penalty and show that the optimal solution is a 0th order spline (a piecewise constant function) with a finite number of adaptively chosen knots. This allows the inference problem to be solved efficiently and at scale by solving a finite dimensional optimization problem. Extensive experiments on real data sets show that our method significantly outperforms Cox’s proportional hazard model. We also conduct case studies and verify that the fitted functions of the features respond to real-life campaigns.


Introduction
Websites get hacked whenever they are subject to a vulnerability that is known to the attacker, whenever they can be discovered efficiently, and, whenever the attacker has efficient means of hacking at his disposal. This combination of knowledge, opportunity, and tools is quite crucial in shaping the way a group of sites receives unwanted attention by hackers.
Unfortunately, as an observer we are not privy any of these three properties. In fact, we usually do not even know the exact time t s a site s was hacked. Instead, all we observe is that a compromised site will eventually be listed as such on one (or more) blacklists. That is, we know that by the time a site lands on the blacklist it definitely has been hacked. However, there is no guarantee that the blacklists are comprehensive nor is there any assurance that the blacklisting occurs expediently. Another shortcoming of blacklists is that they do not reveal which aspect of the website was to blame.
On the other hand, metadata exists for each website and it allows us to measure the potential vulnerability of the websites quantitatively. This includes specific string snippets on websites that are indicative of certain versions of software which might have been identified as exploitable or containing bugs that lead to possible security breach. An interesting method that uses these features to identify websites at risk was recently proposed by Soska and Christin [27]. However, it is unclear how each of these features contribute to the "hazard" of a particular website getting hacked at a given time.
In this paper, we propose a novel hazard regression model to address this problem. Specifically, the model provides a clear description of the probability of a site getting hacked conditioned on its time-varying features, therefore allowing prediction tasks such as finding websites at risk, or inferential tasks such as attributing attacks to certain features as well as identifying change points of the activations of certain features to be conducted with statistical rigor.

Related work
The primary strategy for identifying web-based malware has been to detect an active infection based on features such as small iFrames [16]. This approach has been pursued by both academia [e.g., 1,8] and industry [e.g., 6,17,18]. While intuitive, this approach suffers from being overly reactive, and defenders must compete against adversaries in an arms race to detect increasingly convoluted and obscure forms of malice.
Soska and Christin [27] propose a data driven (linear classification) approach to identify software packages that were being targeted by attackers to predict the security outcome of websites.
Compared with [27], our method is able to predict the time a site will be hacked in a survival analysis framework. Our method naturally handles censoring of observations (i.e. inconsistency of exact hacking time and the time listed on blacklists), automatically identifies a small number of features as exploits and allows the activation coefficients on each feature to be functions over continuous time.
Finally, our hazard regression model is quite generic and much more powerful than the widely-used Cox model [3] in our applications, therefore it can be viewed as a novel and alternative way of estimating nonparametric hazard functions at scale, and be used as a drop-in replacement in many other applications having similar structures. Towards the end of the paper, we provide one such application on studying the user dropout rate (churn rate) at Alipay -a major online payment services used by hundreds of millions of people -and illustrate some interesting insight.

Background
Our work is based on two key sets of insights: the specific way of how vulnerabilities are discovered, exploited and communicated in the community, and secondly, the mapping of these findings into a specific statistical model.

Attacks on websites
We start by describing the typical economics of hackers and websites.
Exploits are first discovered by highly skilled individuals (hackers) who tend to reserve these exploits for their own purposes for an extended period of time. Typically, a hacker will retain an exploit as long as there is an ample supply of vulnerable sites that can be discovered efficiently. Once the opportunity for such hacks diminishes due to an exhausted supply, the exploits are often sold or freely published to the benefit of the author's reputation.
Once this knowledge enters the public domain, the availability of available tools increases with it. It is added to the repertoire of popular kits, at the ready disposal of "script kiddies" who will attempt to attack the remaining sites. The increased availability of tools often offsets the reduced opportunity to yield a secondary wave of infections.
An important aspect in the above scenario is the way how sites are discovered. Quite frequently this is accomplished by web queries for specific strings in sites, indicative of a given vulnerability (e.g. database, content management system (CMS), server, scripting language). In other words, string matches are excellent features to determine the vulnerability of a site and are therefore quite indicative of the likelihood that such a site will be attacked. Unfortunately, we are not privy to the search strings a potential attacker might issue. However, we can use existing fingerprints to learn such sequences, e.g. the tags and attributes in the pages of a site.
In a nutshell, the above informs the following statistical assumptions on the nature of website vulnerabilities. Firstly, sites are only effectively vulnerable once an exploit has been discovered. Second, changes in attack behavior are discrete rather than gradual. In the following we design a statistical estimator capable of adapting to this specific profile.

Hazard regression
Hazard regression is commonly used in survival analysis of patients suffering from potentially fatal diseases. There, one aims to estimate the chances of survival of a particular patient with covariates (attributes) x, as a function of time, such as to better understand the effects of x. Unfortunately, each patient only has one life, and possibly different attributes x, hence, it is impossible to estimate the fatality rate directly.
Instead, one assumes that the hazard rate λ(x, t) governs the instantaneous rate of dying of any x at any given time t: where T is a random variable denoting the time of death. That is, the density of dying at time t is given by .
This leads to a differential equation for the survival probability with solution Here we assumed, without loss of generality, that time starts at 0. Note that a special case of the above is λ( This is the well-known nuclear decay equation (also an example of survival analysis). In our case, death amounts to a site being infected and λ(x, t) is the rate at which such an infection occurs. An extremely useful fact of hazard regression is that it is additive. That is, if there are two causes with rates λ and γ respectively, (2.2) allows us to add the rates. We tacitly assume here that once a site is infected, the attacker will take great care to keep further attackers out, or at least, it will remain blacklisted as long as it is infected in some manner. In terms of (2.3) we have and The reason why this is desirable in our case follows from the fact that we may now model λ as the sum of attacks and can treat them as if they were independent in the way they affect sites. One challenge in our analysis is the fact that we may not always immediately discover whether a site has been taken over. The probability that this happens in some time interval [t 1 , t 2 ] is given by F (t 1 |x) − F (t 2 |x), i.e. by the difference between the cumulative distribution functions.
Finally, the absence of evidence (of an infection) should not be mistaken as evidence of absence of such. In other words, all we know is that the site survived until time T . By construction, their probability is thus given by F (T |x). In summary, given intervals [t i , T i ] of likely infection for site i, at time T we have the following likelihood for the observed data: (2.5) Up to this point, the model is completely general and little assumptions are made. The key of statistical modeling boils down to specify a tractable parametric or nonparametric form of the hazard rate function λ(x, t), which by construction only needs to be nonnegative and obeys that ∞ 0 λ(x, t)dt = ∞ for all possible x.
Arguably the most commonly used hazard regression model is the Cox's proportional hazard model, where λ(t|x) = λ 0 (t) exp(w x) [3]. The linear dependence on feature vector x makes it very appealing for interpretability and inference and leaving λ 0 (t) unspecified allows the model to handle global variations over time that is not captured by covariates x.
There is a large body of work devoted to extending the Cox model by coming with useful specifications of the hazard rate as more generic functions of the covariate x and time t [26,2,20], and inventing techniques to address the curseof-dimensionality associated with nonparametric modeling [30,33,21,7]. We cannot possibly enumerate these work exhaustively, so we encourage curious readers to check out the wonderful textbooks [12,28], the documentation for the "survival" package in R [29] and the references therein. Part of our contribution in this paper is to connect various bits and pieces of the statistical literature for the task of modeling hacker activities.

Trend filtering
Trend filtering [11,32] is a class of nonparametric regression estimators that has precisely the required property. It is minimax optimal for the class of functions [0, 1] → R whose αth order derivative has bounded total variation. In particular, it has the distinctive feature that when α = 0, 1 it produces piecewise constant and piecewise linear estimates (splines of order 0 and 1) and when α ≥ 2 it gives piecewise smooth estimates. The local adaptivity stems from the sparsity inducing regularizers that choose a small but unspecified number of knots. When α = 0, trend filtering reduces to the fused lasso [31] which solves for a given loss function L. The advantage of this model is that each discrete change in the rate function effectively corresponds to the discovery or the increased (or decreased) exploitation of a vulnerability -after all, the rate of infection should not change unless new vulnerabilities are discovered or a patch is released.
There has been an increasingly popular body of work on extending of trend filtering to estimate functions with heterogeneous smoothness over graphs [35], in multiple dimensions (e.g., images, videos) [25] and with additive structures [24]. It will be clear later that our proposed method can be considered as an additive survival trend filtering model, which as far as we know, is the first time trend filtering (or additive trend filtering) is used for (time-varying) survival analysis.

Attributing hacks
We will now assemble the aforementioned tools into a joint model for attributing hacks.

Additive hazard function and variational maximum likelihood
Given the hazard function λ(t, x i ) of each website i ∈ {1, ..., n} with feature vector x i (t) ∈ R d at time t, we have the following survival problem: where τ i is the unobserved random variable indicating the exact time that website x i is being hacked. All we know for websites on the blacklist 1 is the time already been hacked T i and the last time it was alive t i . This is what we call an "interval-censored" observation. Time T denotes the end of the observation interval, e.g., now. Websites that were still alive at T are considered "right censored" because all we know is that their hack time will be beyond of T . Under the survival analysis framework, we have It remains to specify the hazard function. In our setting, x is a high-dimensional non-negative feature vector, so we need to impose further structures on the hazard function λ(x, t) to make it tractable. We thus make an additive assumption and expand the hazard function into an inner product This is still an extremely rich class of functions as x i (t) can be different over time and w i (t) is allowed to be any univariate nonnegative functions 2 . Leaving it completely unconstrained will inevitably overfit any finite data set. However, standard nonparametric assumptions on w i (t) would require the function to be Lipschitz continuous or even be higher order differentiable. These assumptions make it a poor fit for modeling the sharp changes of w i (t) in response to sparse events such as an release of an exploit on hacker forums.

Standard and monotone model with Total Variation penalty
To address this issue, we propose to penalize the total variation (TV) of w i for each i and optionally, impose a monotonicity constraint on these functions. This gives rise to a variational penalized maximum likelihood problem below: where is the negative log-likelihood functional; z i is the indicator of censoring type for observation x i , i.e. interval-censored or right-censored; pertise of the curator. Entries in a blacklist always contain the website in question, but are also furnished timestamps of the security event and information regarding the nature of the malice. 2 A natural alternative formulation would be to take λ(x, t) = exp(w 0 (t)+ d i=1 x i (t)w i (t)) as in Cox's model, so we do not have to impose any constraints on feature x and coefficient vector w. In this paper, since the features are all {0, 1} indicators of the existence of certain html snippets, so we naturally have nonnegative feature vectors and also it seems more reasonable to assume the contribution to the hazard from each feature adds up linearly, rather multiplied together exponentially.
is the associated censoring time;F are the set of all functions [0, T ] → R; and w j (t) is the evaluation of function w j at time t. We define total variation as follows: When f is differentiable almost everywhere, total variation reduces to the L 1 norm of the first derivative This should be compared to the squared L 2 norm penalty with degree 1, commonly used as a regularizer in the variational form of smoothing splines: By construction, a piecewise constant function f would have small total variation but unbounded L 2 norm of its first derivative. We now make two remarks on the constraints. First of all, the non-negativity constraints are necessary to ensure the hazard function λ(x) to be a valid hazard function for all x ≥ 0. Secondly, whether the monotonicity constraints are needed is completely application specific. We call the model with or without the monotonicity constraints the "monotone model" and "standard model" respectively. There could also be a "mixed" model where we impose monotonicity constraints only on a subset of the features where there are reasons to believe the corresponding hazard only increase over time. From the modeling point of view, when we impose the "monotone" constraints, it trivializes the total variation penalty as TV(w j ) = w j (T ) − w j (0); as a result, in cases where monotonicity of hazard makes sense, we often take γ = 0 which makes the model attractively parameter-free.

Non-convex variant of total variation
While the total variation penalty is convex and it often induces a sparse number of change points, it also results in substantial estimation bias due to the shrinkage of the magnitude of the changes. The issue is exacerbated when we use a monotonicity constraint. Since TV(w j ) = w j (T ) − w j (0), the total variation remains the same no matter there is only one change point or there are infinite number of change points. This is a nice feature as the class of functions contains functions that are smoothly changing, but on the other hand, it does not always give us a sparse number of change points that are useful for interpreting the results. This motivates us to consider a non-convex variant of the total variation penalty of the following form: (3.3) where 0 < ≤ 1 is a tuning parameter. Note that the functional is 0 only when f is a constant function if taking = 1. We call this variant of the model the "log" model or the "log + Monotone" when the monotonicity constraints are imposed. The new functional has a number of desirable properties: Proof. We use two elementary properties of log(1+x), which follow from concavity. Firstly, the tangent at 0 majorizes the function, i.e. log(1+x) ≤ x for all x > −1. Secondly, for any x, y ≥ 0 we have that log(1+x+y) ≤ log(1+x)+log(1+y).
The first property shows that for any partition of [a, b] the value of the supremum in (3.3) is smaller than that of its counterpart for TV. Hence the supremum is majorized, too.
The second property follows from the fact that for TV we can take the limit of infinitesimally small segments without decreasing its value. In other words, the limit of an infinite partition has the same value as TV(f ). On the other hand, Hence, any δ-partition of the interval [a, b] for a Lipschitz continuous function f with Lipschitz constant L will yield a value of the log(1 + x) penalty that is at most L 2 δ|b − a| smaller than T V (f ). Hence, for δ → 0 this converges to T V (f ).
A consequence of this is that TV log favors functions that are not Lipschitz, i.e. functions that have a small number of larger jumps, which is exactly what we want. The following example illustrates this.
Note that f has one change point, g has two and h has an infinite number of change points.
Assuming ≤ 1, then TV log (f ) = log( + 1) and the optimal partition is a single block at t 0 = 0, t 1 = 1. It is clear that since ≤ 1 the more block we partition it into, it can only make it smaller. Turns out that we can also calculate the expression for g and by Jensen's inequality and concavity of log, we can clearly see that

When = 1, it is not hard to check that the supremum is achieved at
For other ≤ 1, there exists an optimal δ > 0 that maximizes the expression. However, taking δ → 0 will provide a valid lower bound for all 0 < ≤ 1. By l'Hospital's rule, we get which is bigger than TV log (g) with two change points (and TV log (f ) with onechange point). We conjecture that it is also bigger than any of the TV log of any functions with a finite number of change points but the same total variation.
We will experiment with different variants of the model in simulation and real data experiments to illustrate the merits of the various approaches.
Comparison to the Cox model Note that our model (regardless of which variant) is a much richer representation comparing to Cox's proportional hazard model [3]. Our method handles time-varying coefficients and feature vectors while Cox model is static. Also, the semiparametric nature of Cox model by construction leaves out the baseline hazard λ 0 (t) such that it becomes nontrivial to produce a proper survival distribution. For example, a common trick is to parametrize the baseline hazard rate λ 0 (t) by a either a constant or a log-Weibull density. Our formulation does not require a parametric assumption and produces a nonparametric estimate of it to account for all the effects that are not explained by the given feature.
The only remaining issue is that (3.2) is an infinite dimensional functional optimization problem and could be very hard to solve. In the subsequent two sections, we address this problem by reducing the problem to finite dimension and deriving scalable optimization algorithms to solve it.

Variational characterization
The following theorem provides a finite set of simple basis functions that can always represent at least one of the solutions to (3.2).
for some set T that collects all censoring boundaries and time steps where feature The proof, given in the appendix, uses a variational characterization due to De Boor [4], Mammen et al. [15] and a trick that reparameterizes our problem using the cumulative function W i (t) = t 0 w i (t)dt. Extra care was taken to handle the non-negativity and monotone constraints. We remark that the above result also applies trivially to the case when γ = 0 (unpenalized version).
The direct consequence of Theorem 4 is that we can now represent piecewise constant functions by vectors in R |T | and solve (3.2) by solving a tractable finite dimensional fused lasso problem (with a nonnegativity constraint and possibly a monotonicity constraint) of the form: where we abuse the notation to denote w j as the vector of evaluations of function w j at the sorted time points in T ; and D ∈ R (|T |−1)×|T | is the discrete forward difference operator. Although the above result does not cover the cases when we replace the TV-penalty with the nonconvex the log penalty described in (3.3). we can still restrict our attention to the class of piecewise constant functions, and penalizẽ instead of (3.3). We claim that this is a sensible approximation because this is effectively choosing a specific partition according to the time points that comes from the data, therefore is always a lower bound of (3.3). In fact, when = 1 and we can check that the number of non-zero summand in (3.6) is exactly the same as the number of jumps in the fitted function w j and for the same total variation of w j , this penalty would be smaller when the number of change points are larger.

Remark 5 (Higher order Trend Filtering).
For k-th order trend filtering with k ≥ 1, we do not get the same variational characterization. Although it is still true that the optimal solution set contains a spline W * j , there is no guarantee that the knots of W * j are necessarily a subset of T . Fortunately, by Proposition 7 of Mammen et al. [15], restricting our attention to the class of splines with knots in T will yield a solution that is very close to the W * j at every τ ∈ T and it has total variation of its kth derivative on the same order as TV(W * j ). In addition, a spline is uniformly approximated by the class of functions that can be represented by the falling factorial basis [32,34], therefore, the function from kth order trend filtering defined on T will be a close approximation to the optimal solution of the original variational problem.
Remark 6 (A sparse and memory-efficient update scheme). The theorem suggests a memory-efficient scheme for optimization, as one can only keep track of the coefficients of the step functions rather than representing the dense vectors w 0 , ..., w d . Moreover, each stochastic gradient update will be sparse since each user has only a handful of changes in his feature vectors over time and at most two censoring brackets.
Remark 7 (Sparsity in the features as implicit regularization). On top of that, while the feature vectors are ultra high-dimensional, they are extremely sparse. When x j (t) = 0, the additive structure of the model ensures that changes in w j (t) do not affect the hazard rate, therefore w j (t) will be chosen such that the TV penalty is minimized. This suggests that sparsity structures in features also serve as an implicit "regularization" which helps to improve the generalization of the model.

Algorithms
The previous section reduces the optimization over functions to an optimization problem over vectors in finite dimensional Euclidean space. However, it does not imply that the problem is easy to solve. Due to the interval-censoring, the loss functions are not convex, and the penalty is either convex (TV) or nonconvex (log(|·|+ )) but non-smooth. In addition, there are non-negativity and possibly monotonicity constraints. In this section, we derive numerical optimization techniques for solving problem (3.5) and discuss the convergence rate to a stationary point.
Specifically, we propose to train all four variants of our models (standard and monotone / TV or TV-log) using the proximal gradient algorithm with a stochastic variance-reduced gradient approximation [10]. The proposed method is computationally efficient and system-friendly by design, which allows us to scale up the method to work with hundreds of thousands of data points. While the problem is nonconvex, we find the proposed techniques extremely effective in simulation and real data experiments.
In the subsequent discussions, we will first derive the gradient of the loss function and write down the variance-reduced unbiased estimate of the gradient as per [10]. Then we will describe the key steps of the algorithms. While the proximal gradient algorithm itself is standard, the main challenge for this specific problem is to be able to solve the proximal map efficiently, since there are several non-smooth terms. We show using a general technique due to Yu [37] that we can decompose the proximal map into two simpler ones which both admit lineartime algorithms.
Gradient calculations Note that the probability of each interval-censored data − log(p(t i ≤ τ i < T i )) in (3.1) can be decomposed as As a result we only need to calculate the integral between t i and T i for interval-censored data while evaluating the gradients: where g i (·) is the negative log-likelihood function on x i ; τ ∈ R d,T denotes sorted times in T for each feature j ∈ {1, ..., d}.
Here δ : R |T | → R ∪ {+∞} is the standard indicator function in convex analysis that evaluates to 0 when condition is true and +∞ otherwise.
and then solving for the standard model and monotone model respectively.
The proof makes use of Theorem 1 of Yu [37], and basic definition and properties of subdifferential in convex analysis (see, e.g., [23]). The full proof is presented in the appendix.
The above result ensures that we can solve the proximal map in two steps. The first step update is simply a projection to the first orthant, which involves trivially projecting each coordinate separately to R + . The second step can also be solved in linear time by a dynamic programming algorithm [9].
Using the same algorithm for the non-convex penalty We now show how to use the same algorithm for the log-model. The idea is that we decompose the non-convex penalty as follows TV log (w) = Dw 1 + ξ(w).
The following lemma describes the convenient property of ξ(w) that allows us to group it into the smooth loss function.

Lemma 9. ξ(w) is continuously differentiable. Also, the gradient of ξ is
Note that ξ(w) always exists so the only thing we need to prove is that ξ(w) is continuously differentiable. The proof uses the definition of multivariate differentiability by checking that at all point w, all directional derivatives exists. It is mostly elementary calculus but is somewhat long ans technical, so we defer the detailed arguments to the appendix.
The lemma implies that we can rewrite the objective function of the log-model into: which is of the same form as the TV-model, except for an additional smooth term γ d j=1 ξ(w j ). In other word, the same prox-TV algorithm with a slightly modified gradient can be used to find a stationary point for the log-model.

Convergence rate
Lastly, Reddi et al. [22] proved fast convergence rate of proximal SVRG to a stationary point for nonconvex loss functions, which guarantees that with an appropriately chosen learning rate, only O(1/ε) proximal operators calls and O(n + n 2/3 /ε) incremental gradient computation are needed to get to ε accuracy.
In our experiments, we find that using diagonal preconditioning with Adagrad [5] improves the convergence of both algorithms with little added system overhead; also we find that it is helpful to run a few data passes of prox-SGD before starting prox-SVRG and updating the full gradient only a few data passes (rather than every data pass). Understanding the convergence properties with these additional heuristics included is beyond the scope of this paper.

Experiments
In this section, we show the accuracy of the learned latent hazard function evaluated on synthetic data with ground truth and by conducting case studies on real data evaluated using domain expertise. We evaluate the out-of-sample predictive power measured by log-likelihood, which significantly outperforms Cox's models. The hope is that the hidden hazard function of each attack over time could reveal the underlying reason why and how some of the websites were hacked. Finally, we show the generality of the model by applying it to other applications.
The experimental code is publicly available 4 .

Synthetic experiments
To demonstrate the effectiveness of the model, we simulate two kinds of attacks. The first type of attack possess a monotonically increasing hazard rate. This corresponds to our statistical model with monotone constraint on the hazard rate which is easy to understand since once a exploit is known it will become easier and easier for hackers to attack as more tools are available. The second type of simulated attack does not have a monotonic hazard rate. This leads to our "standard" or "non-monotone" model. It's a practical assumption because in reality the attack campaigns could be quite complex. We will talk about both the pros and cons of these two schemes in the analysis of the real-world data. Data. In both cases, we simulate the data as follows: 3. Given the ground truth hazard rate we got in step 2, we sample the exact hacked times for each of 1000 data points. 4. Independently, we assume another uniformly sampled checking points served as censoring times. Finally we obtain our experimental intervalcensored times by finding the nearest censoring times around each exact hacked time.
Comparison methods. The purpose of this section is to verify the ability of our models, other time-varying coefficient hazard regression models and their abilities to recover hazard curves with sharp changes. First, most of the existing state-of-art methods rely on splines where the change points are pre-fixed or automatically estimated based on model selection. Second, to best of our knowl-edge, we are not aware of any existing work allowing both time-varying coefficients and interval censored data. We study the popular "polspline" [13] which is a time-varying coefficient hazard regression tool archived in the R repository 5 . The "polspline" automatically learn splines including constant functions, linear functions of time and so on. Note that "polspline" only works on uncensored data and right censored data.
We denote " 1 " as 1 penalized Total Variation, and "log" as log penalty defined in Eq. (3.6). For convenient, we use the term "non-monotone" as standard model in Eq.(3.7) without Total Variation or log penalty. Similarly we use the term "monotone" as monotone model in Eq.(3.7) without Total Variation or log penalty. In our experiments, we will see the effects of placing " 1 " and "log" onto "monotone" and "non-monotone" models. The results for monotonic hazard rates are reported in Figure 1 and 2. The convergence in Figure 2 shows that compared with " 1 " and monotone, "log" penalty works a bit better. The reason for this can be seen from Figure 1  out of 4 exploit) where the "log" penalty produces a much sharper hazard curve and approximates the ground truth quite well. Figures 3 and 4 and show the results on data generated without the monotonic hazard rate constraint. In this case both the " 1 " and "log" penalties work well. It is expected that the non-monotone model without any regularizer will overfit the data quite aggressively. The minor difference between " 1 " and " 1 + log" is that " 1 + log" produces sharper curves but tends to ignore weak signals, e.g. the second knot, when the signal-to-noise ratio is relative small, i.e. it prefers significant signals. The convergence in Figure 4 shows that both " 1 " and "log" penalty significantly outperform the plain non-monotone model. Note that our model consistently assigns all features that are not exploits to zero.
In order to compare with our models, we build the data for "polspline" as follows. First, we start with the interval and right censored data simulated in the beginning of this section. We leave right censored data unchanged. Second, we generate two data points (uncensored and right censored) for every interval censored data point, i.e., the left end point of interval censored time serves as right censored time, and the exact hacking time serves as uncensored time.
We show the results of "polspline" in Figures 1 and 3. It can be seen that spline based models which allow relatively smooth changes struggle to model the sharp hazard rate well enough. It is expected from the shape of the hazard rate curve that the negative log-likelihood of "polspline" compared with our models is inferior. Due to the algorithm of "polspline" is not iterative-like, we are not able to obtain the results of estimated hazard rate in each iteration, therefore we instead report the final results after the algorithm is done in Table 1.

Real-world data
The data used for evaluation was sourced from the work of Soska and Christin [27] and was comprised as a collection of interval censored sites from blacklists   The other blacklist that was used contains websites that perform search redirection attacks [14] and was sampled from October 20, 2011 to September 16, 2013. In total the sample contained 738,479 unique links, from 16,173 unique domains. The Wayback Machine contained archives in the acceptable range for 14,425 (89%) of these sites.
These two blacklists are particularly well suited for providing labeled samples of attacked websites as manual inspection has shown, an overwhelmingly large proportion of these sites were compromised by a hacker. This is contrary to other websites which may simply have been maliciously hosted or contained controversial content without even being vulnerable or hacked.
Lastly, the .com zone file from January 14th, 2014 was randomly sampled, ignoring cases where an image of the site was not available in The Wayback Machine. In total 336,671 archives distributed uniformly between February 20th, 2010 and September 31st, 2013 were collected. These samples were checked against our blacklists as well as Google Safe Browsing to ensure that as few compromised sites remained in the sample as possible.
We automatically extracted raw tags and attributes from webpages, that served as features (a total of 159,000 features). Examples of these tags and attributes include <br>, and <meta> WordPress 2.9.2</meta>. They are useful for indicating the presence of code that is vulnerable or may be the target of adversaries. Our corpus of features corresponds to a very large set of distinct and diverse software packages or content management systems.

Real-world numeric results
There are a total of 120 million websites registered in .com zone file at the end of our observation. According to a rough estimates of the distribution of hacked and non-hacked sites, we reweigh the non-hacked websites by 200 times. To report the results, we randomly select 80% for training and validation, and the rest as test data.
Comparison methods. The baseline method is the classic Cox Proportional model [3] which has been extensively used for hazard regression and survival analysis. Ever since its invention, has been considered a "gold standard" in epidemiology, clinical trials and biomedical studies [see e.g., 36]. The Cox model is parametrized based on the features just as our model is, but is not time-varying. As has been discussed in section 3.1, to estimate the survival probabilities we specify a uniform distribution for the baseline hazard function. We are not able to compare with other time-varying models in R repository due to the scale of the data.
An experimental comparison between our models and the Cox model on the aforementioned dataset are shown in Figure 5. By comparison, the Cox model underfits the data quite a bit. Our "monotone" model which allows only a nondecreasing hazard rate also underfits the data slightly but still significantly outperforms Cox. Moreover due to much smaller parameter space need to search, we find that it converges faster than the " 1 +non-monotone" model. Additionally, our "log+monotone" model performs nearly the same on convergence (overlapped). Again it is well expected that "non-monotone" model without any constraint overfits the data severely. As a consequence, the " 1 +non-monotone" model which is well-regularized performs the best. These results clearly show that the latent hazard rate recovered by our models improves upon the classic Cox model in this setting.
Due to the sparsity of our models, table 2 shows that we require only about three times the storage to give significantly better estimates compared with the Cox model. Most importantly, identifying the changes of each feature's susceptibility over time can help people understand the latent hacking campaigns and leverage these insights to take appropriate action. We will discuss more in section 4.3.1. Table 2 Empirical model size (active breakpoints for our methods, number of parameters for Cox) estimated by different statistic models.

Methods
Empirical model size non-monotone 2 · 10 6 monotone 4.04 · 10 5 1 +nonmonotone 5.16 · 10 5 Cox 1.59 · 10 5 Finally it is imperative that the model does not assign non-zero hazard rate to features that are uncorrelated with the security outcome of a website. The hazard curve for 200 random features believed to be uncorrelated with security (such as code for custom font colors, styles, and links to unique images) were manually studied, 182 (9% false positive rate) of which generated a hazard value of 0 for the entire duration of the experiment. Of the 18 features that were assigned a non-zero hazard curve, all of them reported a value of less than 0.04 which can be ignored.

Real-world case study
In this section, we manually inspect the model's ability to automatically discover known security events. To this end, the model was trained on the aforementioned dataset and λ i (t) and was measured for features i that corresponded directly to websites that were known to be the victim of attacks. Figure 6 (left) demonstrates some of the differences between the well penalized monotone and non-monotone models by following the hazard assigned to features that correspond to Wordpress 3.5.1. In early 2013, our dataset recorded a few malicious instances of Wordpress 3.5.1 sites (among some benign ones). These initial samples appeared to be part of a small scale test or proof of concept by the adversary to demonstrate their ability to exploit the platform. Both models detect these security events and respond by assigning a non-zero hazard.
Following the small scale test was a lack of activity for a few weeks, during which the non-monotone model relaxes its hazard rate back down to zero, just before an attack campaign on a much larger scale is launched. This example illustrates the importance of not letting a guard down in the context of security. Once a vulnerability for a software package is known, that package is always at risk, even if it is not actively being exploited.
Despite not taking the most prudent approach to security, the non-monotone model captures the notion that adversaries tend to work in batches or attack campaigns. Previous work [27] has shown that it is economically efficient for adversaries to compromise similar sites in large batches, and after a few attack campaigns, most vulnerable websites tend to be ignored. This phenomena is shown in Figure 6 where Wordpress 3.2.1 was attacked in late 2011 and then subsequently ignored with the exception of a few small attacks that were likely the work of amateurs or password guessing attacks which are orthogonal to the underlying software and any observable content features. The monotone model in this case is very prudent while the non-monotone model captures the notion that the software is not being targeted.
It can be observed from Figure 6  This type of correlation between the hazard of features corresponding to different versions of a software package is expected. This correlation often occurs when adversaries exploit vulnerabilities which are present in multiple versions of a package, or plugins and third party add-ons that share compatibility across the different packages.
Manual investigation revealed that a number of impactful CVEs 9 such as remote file inclusion and privilege escalation were found for these versions of Wordpress as well as a particular plugin around the time of July 2011. While it is impossible to attribute with certainty any particular vulnerability, the observed behavior is consistent with vulnerabilities that impact large number of consecutive iterations of software.
Another spot check for the model is the ability to corroborate existing literature on malicious web deteciton. Figure 7

Studies on dropout rate from Alipay.com
It is worth noting that our time-varying hazard regression model with capacity of adaptively learning local estimates of hazard rate could be potentially useful in other scenarios but not limited in analyzing hacking procedures. In this section, we study another dataset from Alipay 10 , a third-party online payment platform that serves 400 million users in China and control of half of China's online payment market in October 2016. Basic services provided by "Alipay app" include the digital wallet and personalized investment through various commercial funds. Here, the dropout rate of users and how does the rate respond to campaigns promoted by Alipay is the object of interest.
We define one "dropout" as a user will not login in seven consecutive days. The Alipay service would simulate or promote new campaigns for the users who are apt to dropout.
We randomly select 3,760,455 users who were active during 20 and 26 Feb as long as the user login Alipay app in any of these days. There are around 8 million features representing users' demographic background and past behaviors. We observe the users' dropout behaviors during 20 Feb and 30 Apr. The ratio of users who survive until 30 Apr is a confidential number due to commercial privay.
We analyze the data with our "non-monotone" models to study the dropout rate and how does it change as new campaigns are promoted. The convergence rate on training data and test data are shown in 8. No surprisingly, the "nonmonotone" model without any constraint overfit the data severely. The "l1+nonmonotone" model with Dw 1 generalized well.
It is interesting to study the curves of hazard rate associated with each feature. We show examples in Figure 9 that how the estimated hazard rates change over time. The overall trend of users who are older than 55 and youth were declining. We align some change points with important campaigns launched at Alipay. On 10 Mar, Alipay announced a brand new program that every proper  payment is eligible for a cashback. On 18 Apr, Alipay announced another brand new healthcare program that each payment can accumulate users' own health insurance amount. The two change points along "red" line in figure 9 respond to such two campaigns quite clearly. On the other hand, we observed a sharp decline around 5 March among youth users, this is probably because of the consecutive warm-up campaigns for 8 March shopping festival. One explaination of this phenomenon may be that the youth are interested in consuming rather than saving compared with the habits of older users. Such analyses would be useful for understanding the promotion strategies in the near future.

Conclusion
In this paper, we propose a novel survival analysis-based approach to model the latent process of websites getting hacked over time. The proposed model attempts to solve a variational total variation penalized optimization problem, and we show that the optimal function can be linearly represented by a set of step functions with the jump points adaptively estimated. This allows us to solve the problem by either Lasso or fused lasso efficiently using proximal stochastic variance-reduced gradient algorithm. The results suggest that the model significantly outperforms the classic Cox model and is highly interpretable. Through a careful case study, we found that at least some of the active features and jump points we discovered by fitting the model to data are indeed important components of known vulnerability, and major jump points often clearly marks out the life cycles of these exploits. Finally, we applied the model to estimate user churn rate with Alipay.com data. This demonstrates the effective usage of the proposed model to other applications of survival analysis.
We will now show thatW j also defines a set of optimal solution using these properties.
Note that the loss function L({τ , Ψ, Z}, W ) can be decomposed into the sum of negative log-probability of form as described in (3.1), and when there are no uncensored data, the value of the loss function is completely determined by the survival function S(t) evaluated at t ∈ T . There is a one-to-one mapping between survival functions and the cumulative hazard functions through S(t) = exp(−Λ(t)). It follows from (A. By TV( ∂ ∂tW j ) ≤ TV ∂ ∂t (W * j ), we know thatW has a smaller overall objective function than the optimal solution.
It remains to show thatW is feasible. First note that the only spline of order 1 that satisfy the first and second condition is the piecewise linear interpolation of W * j (τ ) the knots in T . For each j, the constraints require that W * j obeys that W * j is non-negative, non-decreasing. This ensures that the piecewise linear interpolation of any subset of points in the domain of W * j to be also nonnegative, monotonically nondecreasing, which ensures the feasibility ofW j .
In the monotone model, the monotonicity constraints on w j translates into a condition that says W j is convex. Since W * j is feasible then it is convex. A piecewise linear interpolation of points on a convex function is also convex. This follows directly by checking that all the sublevel sets are convex sets.
Finally,W j can be represented by a nonnegative linear combination of truncated power basis functions defined on T and the corresponding hazard function w j can be represented by the same nonnegative combination of step functions defined at T . This completes the proof.
Proof of Proposition 8.. Define proximal operator Prox f (x) := argmin y 0.5 x − y 2 + f (y) for function f . Theorem 1 of Yu [37] states that Prox f +g = Prox f • Prox g if for any x, ∂g(Prox f (x)) ⊇ ∂g(x). (A.4) We will verify this condition for the "Standard model", namely, the case when f (x) = δ(x ≥ 0) and g(x) = γ Dx 1 or g(x) = γ Dx 1 + δ(Dx ≥ 0). In the first case [Dx] i = x i+1 − x i . Prox f (x) is simply the projection to the nonnegative cone. There are only for cases of x i+1 and x i . When at least one of them is positive, the projection does not change the sign of x i+1 − x i , so the condition easily checks out. When both of them are not positive, [DProx f (x)] i = 0 which only enlarges the subdifferential set. Therefore, Yu's condition (A.4) is true for f (x) = δ(x ≥ 0) and g(x) = γ Dx 1 and the proximal operator decomposes.
n − 1 and the subdifferential of indicator function is When x is feasible, the subdifferential is known as the "normal cone". First of all, we have already shown that if u ∈ S 1 (x) then u ∈ S 1 (Prox f (x)). It remains to check that if S 2 (x) ⊆ S 2 (prox f (x)). First of all if x is not feasible, then the inclusion holds trivially. If x is feasible, namely, Dx ≥ 0, then we need to show that if v ∈ S 2 (x), then v ∈ S(prox f (x)).
The condition for v ∈ S 2 (x) also decomposes to every coordinate, since Proof of Lemma 9. We prove by a direct calculation.
always exists. It remains to prove differentiability. For any h ∈ R |T | , for convenience in notation, we do a change of variable and define u := Dw and v := Dh. By Taylor's theorem, there exists 0 ≤ η i ≤ |u i + v i | − |v i | such that For any w, we decompose the coordinate of Dw into S = {i ∈ [|T | − 1] | [Dw] i = 0} and its complement S c , and we look at the above summation. If i ∈ S, we get |v i | − |v i | + (|v i |) 2 2( + η i ) 2 = (|v i |) 2 2( + η i ) 2 = O(|v i | 2 ). as claimed. Since the gradient is a Lipschitz function in w, we conclude that ξ is continuously differentiable.