Modeling and Characterizing Software Vulnerabilities

As software security assurance becomes integral to the development of code-based systems, software developers rely on vulnerability discovery models to mitigate breaches by estimating the total number of vulnerabilities before intruders exploit them. Vulnerability Discovery Models (VDMs) quantify the flaws that exist in a software product and will be discovered after its release. In this paper, we develop a vulnerability discovery model in which the accumulation of vulnerabilities is influenced by previously discovered vulnerabilities. We further evaluate the proportion of previously discovered vulnerabilities along with the fraction of additional vulnerabilities detected. The quantification methodology presented in this article is accompanied by an empirical illustration on vulnerability data for popular operating systems.


Introduction
Despite the progress made in computer programming and software engineering practice, almost all the software programs we use in day-to-day life still contain numerous bugs. After a software product is released, some of the defects encountered are clearly more hazardous than others. These flaws may affect the safety of the software system and are hence termed software vulnerabilities. A software vulnerability can be defined as "an instance of a mistake in the requirement, development, or implementation of a software such that its execution may violate the security policy" (Krsul, 1998). Discovering such flaws, and mitigating the risk by quickly distributing patches, has always been a top priority for software engineers. During the development of a software system, developers unintentionally inject some vulnerabilities into the source code, which are later noticed and resolved. Not all potential vulnerabilities in a software product are discovered at the same time. Consequently, on the basis of the degree to which an individual vulnerability is discovered in the software, the developer can categorize it using the Common Vulnerability Scoring System (CVSS). The categorization procedure is suggested by FIRST (www.first.org) as an effort to offer a vendor-independent scoring system, and it reports a CVSS-based vulnerability distribution that catalogs vulnerabilities by type. The National Vulnerability Database (NVD), maintained by the National Institute of Standards and Technology (NIST), provides the score report and distribution of each vulnerability. CVSS is an open framework for assessing the characteristics and severity of software vulnerabilities. Scoring systems are important because they help in investigating the intrinsic qualities of a vulnerability and the penetration capability needed to breach a weak spot.
CVSS offers the software developer an approach for capturing both the quantitative and the qualitative characteristics of a vulnerability. The numerical score allows a developer to rank a vulnerability by its severity and helps the organization assess the risk and prioritize the patching process.
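As a minimal sketch of how a numerical score can drive ranking and patch prioritization, the snippet below maps base scores to qualitative bands and sorts a patch queue. It assumes the qualitative severity rating scale of CVSS v3.x (older CVSS v2 scores are banded differently); the function name `cvss_severity`, the identifiers and the scores are all hypothetical and used only for illustration.

```python
def cvss_severity(score: float) -> str:
    """Map a base score to the CVSS v3.x qualitative rating (assumed scale)."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS scores lie in [0.0, 10.0]")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


# Hypothetical findings: identifiers and scores are made up for illustration.
findings = {"VULN-A": 9.8, "VULN-B": 5.3, "VULN-C": 7.5, "VULN-D": 3.1}

# Patch queue ordered by descending base score.
patch_queue = sorted(findings, key=findings.get, reverse=True)

for vuln_id in patch_queue:
    print(vuln_id, findings[vuln_id], cvss_severity(findings[vuln_id]))
```

Sorting on the base score alone is a simplification; an organization would typically also weigh temporal and environmental considerations.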
The extent of the impact on confidentiality, integrity and availability caused by the exploitation of a vulnerability affects the security of the whole system. When a vulnerability is discovered, base, temporal and environmental metrics are calculated; these capture, respectively, the intrinsic characteristics of the vulnerability, its change over time, and the environment in which it occurs. Vulnerability discovery refers to examining and locating possible bugs, flaws or weaknesses of the software system using various statistical tools and techniques. After release, both testers and users attempt to discover the vulnerable points in the software, and a certain proportion of the users are attackers trying to breach it. Software testers therefore have to monitor the vulnerability discovery process effectively and evaluate the threat level corresponding to each vulnerability. Further, quantifying the vulnerabilities in a software system is similar to detecting the underlying faults in software. Just as the categorization of software faults helps software engineers check reliability, the categorization of vulnerabilities helps the developer take countermeasures against the threat of a potential breach. Such measures include assigning resources for security testing and developing and scheduling security patches.
In the context of vulnerability discovery, a flaw present in the software is a type of defect that can pose a high degree of risk to the software system. Because of this analogy, various researchers have adopted the concepts of software reliability growth modeling (SRGM) to quantify trends in the vulnerability discovery process. With proper modeling of the vulnerability discovery process, the developer can become aware of the dormant flaws present in the software and can allocate adequate resources to contain the threats. As a brief review of related research, a substantial number of Vulnerability Discovery Models (VDMs) have been developed recently. These models cover various vulnerability scenarios, ranging from exponential to s-shaped vulnerability discovery curves. The major VDMs can be divided into two groups: time-based and effort-based models. Time-based VDMs are parametric functions that predict the total number of vulnerabilities discovered by a given point in time. Most VDMs in the literature take time as the governing factor, since vulnerability repositories use calendar time intervals for vulnerability disclosure. Anderson (2002) was the first to introduce a VDM, and the model was explicitly based on the SRGM outline. Anderson (2002) applied the Brady et al. (1999) model to capture the trend of vulnerability discovery, yet the empirical results showed a poor fit to the data. Needham (2002) and Alhazmi and Malaiya (2005) argued that the poor fit of the Anderson Thermodynamic (AT) model is due to sociological factors: for example, a decrease in the vulnerability discovery rate can be explained by a software version losing its attractiveness over time rather than by the growing difficulty of discovering vulnerabilities (Massacci and Nguyen, 2014).
Later, Rescorla (2005) attempted to classify the trends in vulnerability discovery data by using linear and exponential models to predict the number of vulnerabilities. Alhazmi and Malaiya (2005) proposed a logistic, s-shaped model that captures how the vulnerability detection rate changes across three phases. The model also holds that the discovery of security defects differs distinctly from that of normal software bugs. At first, users need to understand the target system before they can infiltrate it, so they discover few vulnerabilities. As the attractiveness of the software increases over time, a significant number of users begin targeting the system, resulting in amplified growth. Finally, the discovery process saturates as a substantial number of users switch to a newer technology. Furthermore, Alhazmi and Malaiya (2005) also studied effort-based modeling, where the discovery of vulnerabilities depends on the effort applied rather than on time alone. In other work, Joh et al. (2008) observed that in some situations the discovery growth curve can be asymmetric and suggested using the Weibull distribution for vulnerability discovery because of the skewness of its pdf. Along similar lines, Younis et al. (2011) inspected the applicability of the Folded VDM. Moreover, Kapur et al. (2015) examined a logistic detection rate in vulnerability discovery. Recently, Anand and Bhatt (2016) proposed a hump-shaped model to capture the vulnerability exposure pattern that results from the attractiveness of a software product in the market. They used a weighted-criteria ranking approach to judge the performance of the proposed model against existing VDMs.
Compared with single-version VDMs, multi-version VDMs have received less attention, and only a few authors have considered the discovery pattern in multi-version software. Kim et al. (2007) examined the influence of shared source code on the vulnerability discovery process across multiple versions of a software product. Lately, Anand et al. (2017) formulated a multi-version VDM to quantify the number of vulnerabilities discovered. The model is based on feature enhancement and shared code: because of code sharing, part of the vulnerability discovery rate attributed to the latest release is also accounted for in previous versions of the software.
Our work is motivated by the observation that most past studies treated the discovery of a loophole as a single vulnerability when estimating the potential security defects. They did not explain how many additional flaws are discovered when a conceptual flaw is counted as one vulnerability. Several researchers have shown that many vulnerabilities reported for the latest release were also present in its preceding releases because of shared code. Moreover, a software product can be part of a product family, and the vulnerabilities discovered in any version may also be present in other versions of the same family. For example, the CVE entry CVE-2011-5046, discovered in the Microsoft Windows family, affected several products, namely Microsoft Windows XP, Windows Server 2003, Windows Vista, Windows Server 2008, and Windows 7. Hence, a single vulnerability discovered in any version may affect the discovery process of other versions and consequently increase the flaw count. Additionally, it has been observed that one counted vulnerability can expose different types of vulnerabilities that exist in the software because of one conceptual flaw. In the case of the CVE entry CVE-2016-3375, a single flaw in the OLE Automation mechanism and VBScript scripting engine causes four different vulnerabilities, viz. Denial of Service, Execute Code, Overflow, and Memory Corruption. Therefore, a single vulnerability can also be responsible for the discovery of additional vulnerabilities during the vulnerability discovery process (Windows 7, 2016).
In this paper, we first describe how a vulnerability discovery model can capture the above-mentioned phenomenon, and then present the analytical derivations needed to compute the fraction of additional vulnerabilities discovered during the vulnerability discovery process. Next, in Section 3, we illustrate the predictability of the proposed model and separate the vulnerabilities lying dormant in the software into two classes. Section 4 gives the conclusion, followed by the references.

Model Development
The vulnerability discovery model suggested by Alhazmi and Malaiya (2005) considers two factors that govern the rate of change of the number of vulnerabilities discovered. The first factor reflects the growing installed base due to the rising popularity of the software, and the second captures the decline in the number of undetected vulnerabilities over time. Because of its logistic behavior, the model yields an s-shaped discovery curve. The vulnerability discovery process modeled here associates the conceptual flaws being discovered with the flaws that trigger the detection of supplementary flaws during the discovery process. We assume that the discovery process is initiated by a certain number of vulnerabilities discovered during testing, and that the rate of change of the number of vulnerabilities discovered at a given time comprises two components that govern the discovery process: the first represents the vulnerabilities discovered with a detection rate $r$, and the second represents the additional vulnerabilities detected under the influence of the vulnerabilities discovered by time $t$. The differential equation describing the discovery process can be modeled as:

$$\frac{d\Omega(t)}{dt} = r\,[N - \Omega(t)] + s\,\frac{\Omega(t)}{N}\,[N - \Omega(t)] \tag{1}$$

where $\Omega(t)$ is the cumulative number of vulnerabilities discovered by time $t$, $N$ is the total number of vulnerabilities in the software, $r$ represents the vulnerability detection rate, and $s$ is the rate at which additional vulnerabilities are discovered under the influence of the vulnerabilities already discovered. Solving equation (1) with $\Omega(0) = 0$ gives:

$$\Omega(t) = N\,\frac{1 - e^{-(r+s)t}}{1 + \frac{s}{r}\,e^{-(r+s)t}} \tag{2}$$

It is interesting to note that the behavior of the proposed model is comparable to that of the model proposed by Kapur and Garg (1992) in software reliability studies. Moreover, if we take $b = r + s$ and $\beta = s/r$, then equation (2) takes the form

$$\Omega(t) = N\,\frac{1 - e^{-bt}}{1 + \beta\,e^{-bt}} \tag{3}$$

so that the proposed model reduces to the model given by Kapur et al. (2015).
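The two-component rate described above can be explored numerically. The following is a minimal sketch, assuming the rate takes the form r(N − Ω) + s(Ω/N)(N − Ω) with Ω(0) = 0, where Ω is the cumulative number of vulnerabilities discovered and N is the total number of vulnerabilities; the parameter values and function names are hypothetical and not taken from the data sets in this paper. It uses a classical fourth-order Runge-Kutta scheme.

```python
def discovery_rate(omega, N, r, s):
    """Assumed right-hand side: direct discovery plus influenced discovery."""
    return r * (N - omega) + s * (omega / N) * (N - omega)


def integrate(N, r, s, t_end, steps):
    """Classical fourth-order Runge-Kutta integration starting from Omega(0) = 0."""
    h = t_end / steps
    omega = 0.0
    path = [omega]
    for _ in range(steps):
        k1 = discovery_rate(omega, N, r, s)
        k2 = discovery_rate(omega + 0.5 * h * k1, N, r, s)
        k3 = discovery_rate(omega + 0.5 * h * k2, N, r, s)
        k4 = discovery_rate(omega + h * k3, N, r, s)
        omega += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        path.append(omega)
    return path


# Hypothetical parameter values chosen only to display the s-shape.
N, r, s = 500.0, 0.01, 0.15
path = integrate(N, r, s, t_end=100.0, steps=1000)

# Print the cumulative count every 10 time units.
for k in range(0, 1001, 100):
    print(f"t = {k * 0.1:5.1f}   Omega = {path[k]:8.3f}")
```

With s larger than r, the printed values grow slowly at first, accelerate, and then saturate near N, which is the s-shaped behavior the model is meant to capture.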
If we define $f(t)$ as the probability of a vulnerability being discovered at time $t$ and $F(t) = \Omega(t)/N$ as the fraction of vulnerabilities discovered by time $t$, then the likelihood of vulnerability discovery at a given time $t$, that is, equation (1), can be expressed as:

$$f(t) = \frac{dF(t)}{dt} = r\,[1 - F(t)] + s\,F(t)\,[1 - F(t)] \tag{4}$$

The solution of equation (4) yields the s-shaped cumulative vulnerability distribution:

$$F(t) = \frac{1 - e^{-(r+s)t}}{1 + \frac{s}{r}\,e^{-(r+s)t}} \tag{5}$$

Further, differentiating $F(t)$ gives the noncumulative vulnerability distribution representing the stated discovery process:

$$f(t) = \frac{\frac{(r+s)^2}{r}\,e^{-(r+s)t}}{\left[1 + \frac{s}{r}\,e^{-(r+s)t}\right]^2} \tag{6}$$

Hence, if $N$ is the total number of vulnerabilities in the software, then the cumulative number of vulnerabilities discovered by time $t$, $\Omega(t)$, given in equation (2), can be rewritten as:

$$\Omega(t) = N\,F(t) \tag{7}$$

The noncumulative vulnerability distribution given in equation (6) is illustrated in Fig. 1. The curve in Fig. 1 is symmetric in time about its peak, which for $s > r$ occurs at $T^* = \ln(s/r)/(r+s)$. It can be shown that $f(t=0) = f(t=2T^*) = r$; that is, the proportion of noncumulative vulnerabilities discovered is symmetric around the peak time $T^*$ up to $2T^*$, confirming the symmetric behavior of the vulnerability discovery rate for the proposed s-shaped vulnerability discovery model. Younis et al. (2011) noted that there is no compelling reason why the rise and fall should be symmetric in the case of the Alhazmi and Malaiya Logistic model. However, Anand and Bhatt (2016) claimed that the vulnerability discovery rate follows a hump-shaped curve with symmetric behavior, a claim that equation (6) supports.
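The symmetry claim can be checked numerically. The sketch below assumes the logistic-type closed forms F(t) = (1 − e^{−(r+s)t}) / (1 + (s/r) e^{−(r+s)t}) and f(t) = dF/dt that follow from the two-component rate r[1 − F] + sF[1 − F], with peak time T* = ln(s/r)/(r + s) for s > r. The parameter values and function names are hypothetical.

```python
import math


def F(t, r, s):
    """Cumulative fraction of vulnerabilities discovered by time t (assumed form)."""
    e = math.exp(-(r + s) * t)
    return (1.0 - e) / (1.0 + (s / r) * e)


def f(t, r, s):
    """Noncumulative discovery density, the time derivative of F."""
    e = math.exp(-(r + s) * t)
    return ((r + s) ** 2 / r) * e / (1.0 + (s / r) * e) ** 2


def peak_time(r, s):
    """Time at which f is largest; meaningful only when s > r."""
    return math.log(s / r) / (r + s)


r, s = 0.01, 0.15  # hypothetical rates
T = peak_time(r, s)

print("peak time T*:", T)
print("f(0):", f(0.0, r, s), " f(2T*):", f(2.0 * T, r, s), " r:", r)

# Values at equal distances before and after the peak should coincide.
for d in (1.0, 5.0, 10.0):
    print(d, f(T - d, r, s), f(T + d, r, s))
```

The density starts at r, rises to (r + s)^2 / (4s) at T*, and returns to r at 2T*; beyond 2T* it keeps decaying, so the symmetry concerns the shape around the peak rather than the whole time axis.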
Note that the term $r\,[1 - F(t)]$ in equation (4) represents the proportion of vulnerabilities discovered by the developer with a vulnerability discovery rate $r$. The vulnerability count represented by this proportion is influenced by the advisory reports issued by the software vendor. In contrast, the second term, $s\,F(t)\,[1 - F(t)]$, represents the fraction of additional vulnerabilities discovered under the influence of previously disclosed vulnerabilities. The proportion of vulnerability discovery is depicted in Fig. 2, which shows the vulnerabilities discovered through the security bulletins of the software vendor and the additional vulnerabilities attributable to previously disclosed vulnerabilities. In this work we try to capture the proportion of discovered vulnerabilities due to each of these two factors. As assumed, $F(t)$ is the cumulative fraction of vulnerabilities discovered by time $t$. Therefore, under the proposed model, $F(t)$ comprises two components: $F_1(t)$, the proportion of vulnerabilities discovered that are mentioned in the advisory reports of the software vendor (leading vulnerabilities), and $F_2(t)$, corresponding to the additional vulnerabilities discovered. Because $r\,[1 - F(t)]$ gives the fraction of vulnerabilities discovered by the developer at time $t$, the total fraction of discovery, $F_1$, between any two points in time, say $t_0$ and $t_F$, is given by:

$$F_1 = \int_{t_0}^{t_F} r\,[1 - F(t)]\,dt \tag{8}$$

Because $F(t)$ is given by equation (5), $F_1$ can be written as

$$F_1 = \int_{t_0}^{t_F} \frac{(r+s)\,e^{-(r+s)t}}{1 + \frac{s}{r}\,e^{-(r+s)t}}\,dt \tag{9}$$

Integrating gives

$$F_1 = \frac{r}{s}\,\ln\!\left[\frac{1 + \frac{s}{r}\,e^{-(r+s)t_0}}{1 + \frac{s}{r}\,e^{-(r+s)t_F}}\right] \tag{10}$$

Substituting $t_0 = 0$ and $t_F = t$ in the above equation yields:

$$F_1(t) = \frac{r}{s}\,\ln\!\left[\frac{1 + \frac{s}{r}}{1 + \frac{s}{r}\,e^{-(r+s)t}}\right] \tag{11}$$

Therefore, the proportion of additional vulnerabilities discovered is given by

$$F_2(t) = F(t) - F_1(t) \tag{12}$$
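The split into leading and additional vulnerabilities can be sketched in code. The example assumes the closed form F1(t) = (r/s) ln[(1 + s/r) / (1 + (s/r) e^{−(r+s)t})], obtained by integrating r[1 − F(u)] from 0 to t, and F2(t) = F(t) − F1(t). A Simpson's-rule integration is included only as a cross-check on that closed form; the rates and function names are hypothetical.

```python
import math


def F(t, r, s):
    """Cumulative fraction discovered by time t (assumed logistic-type form)."""
    e = math.exp(-(r + s) * t)
    return (1.0 - e) / (1.0 + (s / r) * e)


def F1(t, r, s):
    """Fraction discovered directly (leading vulnerabilities) by time t."""
    e = math.exp(-(r + s) * t)
    return (r / s) * math.log((1.0 + s / r) / (1.0 + (s / r) * e))


def F2(t, r, s):
    """Fraction of additional vulnerabilities discovered by time t."""
    return F(t, r, s) - F1(t, r, s)


def F1_numeric(t, r, s, steps=2000):
    """Composite Simpson's rule on r * (1 - F(u)) over [0, t]; steps must be even."""
    g = lambda u: r * (1.0 - F(u, r, s))
    h = t / steps
    total = g(0.0) + g(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(i * h)
    return total * h / 3.0


r, s = 0.01, 0.15  # hypothetical rates

for t in (10.0, 20.0, 30.0, 60.0):
    print(t, F(t, r, s), F1(t, r, s), F2(t, r, s))

# Long-run share of leading vulnerabilities under these assumptions.
print("limit of F1:", (r / s) * math.log(1.0 + s / r))
```

As t grows, F1 approaches (r/s) ln(1 + s/r) and F2 approaches the remainder, so the ratio s/r alone determines the long-run split between the two classes.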

Data Analysis
This section addresses the practical relevance of the proposed VDM by predicting future trends in vulnerabilities. We assess the predictability of the proposed model by fitting the VDM to an observed sample and evaluating goodness-of-fit criteria for the fitted model. We use non-linear least squares to estimate the parameters on security vulnerability data sets for four operating systems from two product families, namely Microsoft Windows and Apple Macintosh (Mac Os X Server, 2016; Windows Xp, 2016; Windows Server, 2016). The proposed VDM is useful only if it closely fits the historical data and forecasts future trends accurately. Here, we compare the proposed model with the Alhazmi and Malaiya Logistic (AML) model. The parameter estimates and comparison criteria of the two models are given in Table 1. Table 2 reports the comparison criteria for the proposed VDM on each data set. We observe that the proposed VDM fits the observed samples well: all comparison criteria take lower values for the proposed model than for the AML model. Further, the R-square values indicate a close fit, as shown in Figs. 3 to 6.
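The fitting step can be illustrated with a small, dependency-free sketch. It assumes the model Ω(t) = N·F(t) with F(t) = (1 − e^{−(r+s)t}) / (1 + (s/r) e^{−(r+s)t}); the observed counts below are made up for illustration and are not the operating system data used in this paper. A zooming grid search over (r, s) stands in for a proper non-linear least squares solver, and SSE and R-square are shown as example criteria only, since the exact criteria of Tables 1 and 2 are not reproduced here.

```python
import math


def F(t, r, s):
    """Assumed cumulative fraction discovered by time t."""
    e = math.exp(-(r + s) * t)
    return (1.0 - e) / (1.0 + (s / r) * e)


def best_N(ts, ys, r, s):
    """For fixed (r, s) the model N * F(t) is linear in N: closed-form least squares."""
    fs = [F(t, r, s) for t in ts]
    return sum(y * v for y, v in zip(ys, fs)) / sum(v * v for v in fs)


def sse(ts, ys, N, r, s):
    """Sum of squared errors between observed and predicted cumulative counts."""
    return sum((y - N * F(t, r, s)) ** 2 for t, y in zip(ts, ys))


def fit(ts, ys, r_range=(1e-3, 1.0), s_range=(1e-2, 5.0), grid=15, rounds=8):
    """Zooming grid search on (log r, log s); a crude stand-in for a real NLS solver."""
    lo_r, hi_r = math.log(r_range[0]), math.log(r_range[1])
    lo_s, hi_s = math.log(s_range[0]), math.log(s_range[1])
    best = None
    for _ in range(rounds):
        for i in range(grid):
            r = math.exp(lo_r + (hi_r - lo_r) * i / (grid - 1))
            for j in range(grid):
                s = math.exp(lo_s + (hi_s - lo_s) * j / (grid - 1))
                N = best_N(ts, ys, r, s)
                err = sse(ts, ys, N, r, s)
                if best is None or err < best[0]:
                    best = (err, N, r, s)
        # Halve the search window and centre it on the best point found so far.
        half_r, half_s = (hi_r - lo_r) / 4.0, (hi_s - lo_s) / 4.0
        lo_r, hi_r = math.log(best[2]) - half_r, math.log(best[2]) + half_r
        lo_s, hi_s = math.log(best[3]) - half_s, math.log(best[3]) + half_s
    return best


def r_squared(ys, err):
    """Coefficient of determination computed from the SSE of the fitted model."""
    mean = sum(ys) / len(ys)
    return 1.0 - err / sum((y - mean) ** 2 for y in ys)


# Hypothetical cumulative vulnerability counts over 12 equal time periods.
ts = list(range(1, 13))
ys = [6, 14, 31, 50, 83, 112, 143, 163, 180, 187, 194, 196]

sse_hat, N_hat, r_hat, s_hat = fit(ts, ys)
r2 = r_squared(ys, sse_hat)
print(f"N = {N_hat:.1f}, r = {r_hat:.4f}, s = {s_hat:.4f}, SSE = {sse_hat:.2f}, R2 = {r2:.4f}")
```

Profiling out N reduces the search to two dimensions, which keeps the grid search cheap. In practice a gradient-based least squares routine would normally replace the grid, and the same criteria would be computed for the AML model to make the comparison.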

Conclusion
VDMs have the potential to help developers allocate resources, predict future vulnerability trends, optimize test effectiveness, and schedule updates and patches so that the software operates free of exploitation. The work presented here uses an empirical methodology to capture the contribution of the additional vulnerabilities discovered during the discovery process. The quality and predictability of the proposed VDM are evaluated through a parametric function that forecasts future vulnerabilities as a function of time. To validate the methodology, we assessed the proposed VDM on four major operating systems from two prominent product families. The results give better insight into the vulnerability discovery process and indicate that, on these data sets, the proposed s-shaped model estimates the vulnerabilities better than the AML model.