BMC Medical Informatics and Decision Making
Construction of an Odds Model of Coronary Heart Disease Using Published Information: the Cardiovascular Health Improvement Model (CHIME)

Background: There is a need for a new cardiovascular disease model that includes a wider range of relevant risk factors, in particular lifestyle factors, to aid targeting of interventions and improve population models of the impact of cardiovascular disease and preventive strategies. The model needs to be applicable to a wider population including different ethnic groups, different countries and to those with and without cardiovascular disease. This paper describes the construction of the Cardiovascular Health Improvement Model that aims to meet these requirements.


Background
There are several reasons for calculating the risk of cardiovascular disease in an individual or a population. Health care providers need to model future patterns of need for health services, and to identify the cost effectiveness of different intervention strategies. [1,2] Insurance companies and pension funds must evaluate risk in both individuals and populations when assessing portfolio risks. In clinical medicine, cardiovascular risk is increasingly accepted as the appropriate criterion for identifying those who will benefit most from interventions designed to prevent cardiovascular disease and death. [3,4] Another, perhaps overlooked, requirement is to inform shared decision making with patients. [5] This paper describes a cardiovascular disease (CVD) model which has been developed specifically for use in consultations with patients as an aid to risk communication and to shared decision making. Most CVD models focus on coronary heart disease (CHD) events, such as myocardial infarction. However, it is sometimes difficult to categorize an individual as either having or not having experienced a CHD event, since the collection of data on such events varies according to the methods and definitions used. Consequently, an evaluation of all CHD or CVD events will be less reliable than one with a more concrete outcome measure such as CHD or CVD death. [6] The model we propose therefore estimates death from CHD.
A variety of CVD risk estimators are available; the best known are summarized in Table 1. Each has strengths and weaknesses. [6-12] The principal problems include limited applicability to different geographic areas or ethnic groups, application to men but not women, and the omission of important risk factors. [13,14] The best known estimators are the Framingham equations. These have been criticized for their inaccuracy in some countries, particularly in Southern Europe, where they tend to over-estimate risk significantly. [14] This variation is an inevitable consequence of the exclusion of significant risk factors from the model. If a model is derived in a particular population, the prevalence and impact of any missing risk factors are tacitly embedded in the coefficients of the risk equations. When applied to a population with different prevalences, or one in which risk factors have different impacts, the model's predictions will be less accurate. Attempts have been made to recalibrate the Framingham equations for different ethnic groups in the United States and the United Kingdom. [11,15] However, the recalibrated equations have not been validated and questions about their applicability to other geographic areas remain unanswered.
The models in Table 1 all include age, gender, blood pressure, cholesterol, cigarette consumption and diabetes as risk factors. All omit some important independent risk factors, such as family history, existing CVD and obesity, as well as diet, alcohol consumption and exercise. We are particularly interested in risk factors related to lifestyle: if an estimate of risk is to be used in consultations as part of discussions with patients about lifestyle modification, it is important that the estimate should include the fullest possible range of risk factors relating to lifestyle.

Table 1 (fragment). Framingham (Anderson) [7]: age, gender, smoking, blood pressure (BP), total cholesterol (TC)/high density lipoprotein (HDL) ratio, diabetes, left ventricular hypertrophy (LVH).
To improve CVD risk equations, it is necessary both to expand the number of risk factors used and to devise a method of calibrating the results to different populations. Including additional risk factors should improve accuracy at the level of the individual and increase the portability of any risk equation to different populations; however, there will always be some residual variability not accounted for by the included risk factors. National mortality statistics can be regarded as containing all possible information about risk, both known and unknown. Recalibrating such national mortality statistics according to the mean values for a broad set of known risk factors will leave a residual value for the remaining variability due to unknown factors. The 2003 Health Survey for England collected information on cardiovascular disease risk factors and prevalence which can be used to recalibrate national mortality statistics in this way. [16,17] This paper describes how to take publicly available information on CHD prevalence, CHD death rates and CHD risk factors, and use it to calculate the risk of coronary heart disease for individuals, using an approach that should be applicable in different geographical areas and different ethnic groups.

Method
In this section, we first explain the mathematics underlying our approach and then describe how the data items required by the model were obtained. The approach uses an odds model. The odds of dying of cardiovascular disease in time t are:

$$O_t = \frac{P_t}{1 - P_t}$$

where $P_t$ is the probability of dying of cardiovascular disease in time t.

If we know the average odds of death in time t for the given population ($PopO_t$), we can calculate an odds ratio adjustment for any individual based on known risk factors, and use it to estimate the odds for the individual as:

$$IndO_t = PopO_t \times IndOR$$

The odds ratio for the individual (IndOR) is the product of the odds ratios for each of n risk factors:

$$IndOR = \prod_{i=1}^{n} OR_i$$

This is often expressed as:

$$IndOR = e^{\lambda}$$

where $\lambda$ is the sum of terms corresponding to each risk factor, each term consisting of a coefficient $\beta$ (a measure of the contribution of the risk factor) adjusted according to the extent to which the risk factor is present in an individual compared to the average for the population in question. Thus:

$$\lambda = \sum_{i=1}^{n} \beta_i \,(s_i - \bar{s}_i) \tag{1}$$

where $\beta_i$ is the coefficient associated with the ith risk factor (equal to the log of the odds ratio), $s_i$ is the value for the individual of the risk factor and $\bar{s}_i$ is the average population level. This is a well established method of adjusting models to different populations, used in SCORE and ETHRISK. [11,18,19] In logistic regression, the $\beta_i$ are constants representing a linear relationship between the log odds and the level of the risk factor. This approach can be applied whether the risk factor, $s_i$, is a continuous, categorical or binary variable. However, in the literature, continuous risk factors are frequently treated as categorical variables: for example, a study might give odds ratios for each quintile of waist-to-hip ratio (using the first quintile as the reference category). While these values could be used directly, that would produce artefacts in the model near quintile boundaries, so it is sensible to convert back to a continuous variable by applying smoothing. However, with some risk factors, the resultant relationship is not linear. Our approach here is to calculate an interpolated and smoothed function for how the odds ratio varies with $s_i$ (which is equivalent to considering $\beta_i$ not as a constant but as a function of $s_i$).

In these cases, instead of calculating a term $\beta_i \,(s_i - \bar{s}_i)$ we calculate a term:

$$\xi_i = \ln\left(\frac{OR_i(s_i)}{OR_i(\bar{s}_i)}\right) \tag{2}$$

which is the log of the ratio of the odds ratio for the individual for the ith risk factor (i.e. associated with the measured value $s_i$) to the odds ratio for the mean level in the population of the same risk factor (i.e. associated with $\bar{s}_i$).

These terms are referred to as the log of the normalized odds ratio (LNOR) and are represented by $\xi_i$. So our $\lambda$ is calculated as:

$$\lambda = \sum_{i=1}^{n} \xi_i$$

The model therefore requires:
• estimates of the baseline mortality from CHD ($PopO_t$);
• a set of risk factors with known odds ratios;
• LNORs for each risk factor.
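The chain from population probability to individual probability can be sketched in a few lines of Python. This is a minimal illustration of the odds model as described above, not the authors' Matlab or Excel implementation; all function names are hypothetical.

```python
import math


def probability_to_odds(p):
    """Convert a probability P_t into odds P_t / (1 - P_t)."""
    return p / (1.0 - p)


def odds_to_probability(odds):
    """Convert odds back into a probability."""
    return odds / (1.0 + odds)


def linear_term(beta, individual_value, population_mean):
    """Term beta_i * (s_i - s_bar_i), used when the log odds are linear in s_i."""
    return beta * (individual_value - population_mean)


def lnor(individual_or, population_or):
    """Log of the normalized odds ratio: ln(OR(s_i) / OR(s_bar_i))."""
    return math.log(individual_or / population_or)


def individual_probability(population_probability, terms):
    """Scale the population odds by exp(lambda), where lambda is the sum of terms."""
    lam = sum(terms)
    individual_odds = probability_to_odds(population_probability) * math.exp(lam)
    return odds_to_probability(individual_odds)
```

An individual who is exactly average on every risk factor has all terms equal to zero, so the function returns the population probability unchanged.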
In the following three subsections of the paper we explain (1) how estimates were derived for baseline CHD mortality, (2) choice of a set of risk factors, and (3) how adjusted LNORs were determined for each risk factor. We then go through a worked example for an individual patient.

Baseline Mortality
Mortality from CHD was extracted from the UK national mortality statistics for 2003. [20] The ICD-10 codes included for CHD were I20-I25 inclusive. A probability of death from CHD for each age band was calculated for each gender by dividing the number of deaths in the age band by the number of individuals in the population in that age band. The annual death rates for each age from 35 years old upwards were then smoothly interpolated using the methods described below. The probability of death was set to zero below the age of 35 as the death rates in this group were negligible.
National mortality statistics include all CHD deaths in the population. This includes CHD deaths in those with pre-existing CVD as well as in those who were free of CVD. If we know the proportion of the population who had pre-existing CVD, the number of CHD deaths and the relative risk of CHD in those with CVD as compared to those without, then separate estimates can be made of the baseline CHD mortality in the two groups. Let M be the mortality from CHD for a given age and gender, PR the prevalence of pre-existing CVD for that age and gender, and RR the relative risk of CHD death in those with CVD compared to those without. From these we calculate baseline estimates of CHD mortality for an individual of given age, gender and CVD status. M is calculated from national mortality statistics and PR from the Health Survey for England 2003, both interpolated using the approach described below. A figure of 3.3 was used for the RR for CHD death or sudden death in those with existing CHD, taken from the Framingham study. [21]
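One way to make this split explicit is sketched below. It assumes that overall mortality is a prevalence-weighted mixture of the two groups, M = PR × M_with + (1 − PR) × M_without, with M_with = RR × M_without. That mixture form is an assumption made for illustration, not a formula quoted from the text, and the function name is hypothetical.

```python
def split_baseline_mortality(m, pr, rr=3.3):
    """Return (mortality without CVD, mortality with CVD).

    Assumes m = pr * m_with + (1 - pr) * m_without and m_with = rr * m_without.
    m  : overall CHD mortality for the age and gender
    pr : prevalence of pre-existing CVD (0 to 1)
    rr : relative risk of CHD death with versus without CVD
    """
    if not 0.0 <= pr <= 1.0:
        raise ValueError("prevalence must lie between 0 and 1")
    m_without = m / (1.0 - pr + pr * rr)
    m_with = rr * m_without
    return m_without, m_with
```

With PR = 0 the function returns the overall mortality for those without CVD, as expected when nobody in the population has pre-existing disease.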

Smooth interpolation of mortality and prevalence rates
The prevalence rates for CVD are given in the Health Survey for England 2003 in 10-year age bands. The mortality statistics are given in 5-year age bands. To obtain accurate annual estimates of baseline mortality rates it is necessary to interpolate from these totals. A number of different methods were explored, including simple linear interpolation, cubic splines and fractional polynomials, but all proved unsatisfactory. [22,23] Linear interpolation using the mid points of the 5-year age bands fails to preserve the area under the curve within the age bands where there is a high rate of change of risk. Also, the effect of the sharp changes in risk at the mid-point inflections is magnified in subsequent calculations to give artefactual 'edge effects'.
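The failure of mid-point linear interpolation to preserve band totals can be seen in a small numerical sketch. The band rates below are hypothetical values chosen to show a rapidly rising risk; they are not taken from the mortality statistics.

```python
def linear_interpolate(x, xs, ys):
    """Piecewise-linear interpolation through (xs, ys); xs must be increasing."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside interpolation range")


# Hypothetical deaths per 1,000 for age bands 35-39, 40-44 and 45-49.
band_rates = [1.0, 2.0, 5.0]
band_midpoints = [37, 42, 47]

# Interpolate annual values for the middle band and average them.
interpolated = [linear_interpolate(age, band_midpoints, band_rates)
                for age in range(40, 45)]
band_mean = sum(interpolated) / len(interpolated)
# The original band value is 2.0, but the interpolated mean is about 2.24.
```

Because the curve is steeper above the mid-point than below it, the interpolated annual values average to more than the band value they are meant to represent.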
Interpolating with a spline function would generate a polynomial for each age band, requiring thirty or forty coefficients to describe a mortality curve from age fifteen to ninety. In addition, ensuring that the average value of the curve matches the average value for the age band can result in values below zero at very low risks. Fractional polynomials can be fitted for narrow intervals, but as the polynomial functions may tend towards plus or minus infinity, it is difficult to fit one fractional polynomial over the wide age ranges needed without experiencing what is called Runge's phenomenon, where the included data points are fitted very well, but with dramatic error between them. [24] A key problem is that the area under the cumulative mortality curve needs to be conserved. A two-step process was developed in which a smoothing algorithm generates a curve which is then modeled as a weighted sum of sixteen Normal distribution curves.
An interpolated curve is first generated by redistributing the area under the stepped curve obtained from the initial data. The sharpest angle in each age band is identified by finding the biggest change in angle and dividing it by the absolute value of the curve at that point. That data point is shifted towards the further of the two adjacent data points. The amount by which the data point is increased or decreased is then redistributed to the other data points in the age band. The process is repeated, iteratively reducing the maximum angle and resulting in a smooth curve that does not fall below zero and preserves the area under the curve in each age band.
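The redistribution idea can be sketched as follows. This is a loose illustration under several assumptions, not the algorithm as implemented by the authors: it uses the second difference as a stand-in for the change in angle, it picks the sharpest point across the whole curve rather than per band, the step size is arbitrary, and it does not enforce non-negativity. What it does show is how moving one point and spreading the opposite change over the rest of its band keeps each band total fixed.

```python
def redistribute_step(values, band_size, step=0.1):
    """Apply one smoothing step and return a new list of annual values."""
    vals = list(values)
    best_index, best_score = None, 0.0
    # Score interior points by |second difference| relative to their own value.
    for i in range(1, len(vals) - 1):
        curvature = abs(vals[i - 1] - 2.0 * vals[i] + vals[i + 1])
        score = curvature / max(abs(vals[i]), 1e-12)
        if score > best_score:
            best_index, best_score = i, score
    if best_index is None:
        return vals  # no curvature left to reduce
    i = best_index
    left, right = vals[i - 1], vals[i + 1]
    # Shift towards whichever neighbour is further away in value.
    further = left if abs(left - vals[i]) > abs(right - vals[i]) else right
    delta = step * (further - vals[i])
    band_start = (i // band_size) * band_size
    band_end = min(band_start + band_size, len(vals))
    others = [j for j in range(band_start, band_end) if j != i]
    if not others:
        return vals
    vals[i] += delta
    # Spread the opposite change over the rest of the band to keep its total.
    for j in others:
        vals[j] -= delta / len(others)
    return vals


def smooth(values, band_size, iterations=200, step=0.1):
    """Repeat the redistribution step a fixed number of times."""
    vals = list(values)
    for _ in range(iterations):
        vals = redistribute_step(vals, band_size, step)
    return vals
```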
Below the age of 35, the prevalence and mortality are set to zero. For ages 35 and above, a function of a set of Normal distribution curves was generated from the points in the smoothed curve. This produces a more tractable equation. Generating all data points prior to finding the best fit function prevents Runge's phenomenon. The parameters and weightings of the Normal distributions, determined by the least squares method, are shown in Table 2. Figure 1 shows the result of the interpolation of coronary heart disease results against the original stepwise mortality for the five-year age bands. This curve can be generated using no more than 16 numbers, does not violate the normal bounds of probability, is not affected by Runge's phenomenon, and preserves the total risk in each age band.
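Evaluating such a curve is straightforward once the weights, means and standard deviations are known. The sketch below assumes the curve is a plain weighted sum of Normal density functions; the component values shown are hypothetical placeholders, not the fitted values from Table 2, and the function names are illustrative.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a Normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def mortality_at_age(age, components, min_age=35):
    """Weighted sum of Normal curves; zero below min_age.

    components: iterable of (weight, mean, sd) triples.
    """
    if age < min_age:
        return 0.0
    return sum(weight * normal_pdf(age, mean, sd)
               for weight, mean, sd in components)


# Hypothetical (weight, mean, sd) triples, NOT the values in Table 2.
example_components = [(0.5, 80.0, 10.0), (0.2, 95.0, 8.0)]
annual_rate_at_57 = mortality_at_age(57, example_components)
```

Because every component is non-negative when its weight is non-negative, a curve of this form cannot fall below zero.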

Risk factors
The most comprehensive data on the odds ratios associated with a set of risk factors has come from the INTERHEART study, which collected data from 15,152 patients admitted for a first MI at 262 centres in 52 countries across the world, and 14,820 matched controls. [25] The INTERHEART study identified nine risk factors in addition to age and gender, which accounted for 90% of population attributable risk (PAR) in men and 94% in women for first myocardial infarction. We assume that the odds ratios for the risk of MI will be very similar to the odds ratios for CHD in general. The nine risk factors identified in addition to age and gender were: smoking status, a diagnosis of hypertension, apolipoprotein B/apolipoprotein A1 ratio, diabetes mellitus, waist/hip circumference ratio, alcohol consumption, consumption of fresh fruit and vegetables, exercise and psychosocial stress. A tenth factor, a family history of CHD, is also given in the paper but was omitted from the list of nine as it had minimal impact on the PAR. It has been included here as it enhances the individualization of the calculation regardless of the overall impact on the calculated population mortality.
The nine risk factors identified in the INTERHEART study are shown in Table 3 along with the unadjusted odds ratios.

Estimating the adjustments for each risk factor
We use an odds model in which the impact of risk factors on an individual's risk of CHD death is determined as the product of a set of normalized odds ratios, one for each risk factor. The logs of these normalized odds ratios (LNORs) are summed to give λ, and each provides a measure of the influence of the measured risk factor for that individual.
The population mean values were derived from the Health Survey for England 2003 for different gender and age groups, with the values interpolated using polynomials; details are given in tables 4. The following subsections detail the calculations of the LNORs for each of the risk factors used.
Figure 1: The resulting curve generated for the risk of death from coronary heart disease in men, compared to the original stepwise 5-year age band mortality.

ApoB/ApoA1 ratio
Total cholesterol (TC) and high density lipoprotein (HDL) values are the most common measures of lipid level used in calculating CVD risk. However, the INTERHEART study found that the ratio of apolipoprotein B (ApoB) to apolipoprotein A1 (ApoA1) is a more sensitive measure of risk than the TC/HDL ratio.
The INTERHEART study explored the relationship between the deciles of ApoB/A1 ratio and the odds ratio for MI compared to the first decile. The relationship is plotted on a doubling scale in the original paper, but it would appear that the relationship between the odds ratio and the ApoB/A1 ratio is linear from the second decile upwards, but with the odds ratio having a floor of 1 from just below the second decile. This can be seen in Figure 2.

Linear regression gives the equation:
where x is the ApoB/ApoA1 ratio.
The LNOR $\xi_i$ for the ApoB/A1 ratio is the log of the normalized odds ratio, that is, the odds ratio for an individual's ApoB/A1 ratio ($IndOR_{Apo}$) divided by the odds ratio for the population average ($PopOR_{Apo}$), each calculated from the above equation.
From equation (2), the LNOR $\xi_{Apo}$ is then:

$$\xi_{Apo} = \ln\left(\frac{IndOR_{Apo}}{PopOR_{Apo}}\right)$$

The ApoB/A1 ratio is often not known, as it is more customary to use TC/HDL clinically. An approximate conversion factor is applied: [26]
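The shape of this calculation, a linear odds ratio with a floor of 1 followed by normalization against the population average, can be sketched as below. The slope and intercept are passed in as arguments because the regression coefficients are not reproduced here; the values used in the usage lines are hypothetical and chosen only to exercise the code.

```python
import math


def floored_linear_or(x, slope, intercept, floor=1.0):
    """Linear odds ratio in x that is not allowed to fall below the floor."""
    return max(floor, intercept + slope * x)


def lnor(individual_or, population_or):
    """Log of the normalized odds ratio, ln(IndOR / PopOR)."""
    return math.log(individual_or / population_or)


# Hypothetical coefficients and ratios, for illustration only.
ind_or = floored_linear_or(1.0, slope=2.0, intercept=0.0)
pop_or = floored_linear_or(0.75, slope=2.0, intercept=0.0)
xi_apo = lnor(ind_or, pop_or)
```

Applying the floor before taking the ratio means that individuals in the lowest part of the range are treated as having the reference odds ratio of 1.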

Smoking
The relationship between the odds ratio for first MI and smoking and cigarette consumption with respect to nonsmokers appears non-linear in the original INTERHEART paper. [25] However, as can be seen from Figure 3, if it is assumed that the odds ratio for 20 cigarettes a day is an outlier -there may have been some rounding down in the cigarette consumption to the standard packet size of 20 by either the subjects or observers -then this too is a linear relationship. Linear regression gives an equation: where N is the daily cigarette consumption.
To calculate the odds ratio corrected for the population, the interim odds for the population average cigarette consumption needs to be calculated. The population average cigarette consumption is the cigarette consumption for the whole population, not just smokers. This can be calculated:

Av. cigarette consumption = av. consumption for smokers × proportion that are smokers

The LNOR is then:

$$\xi_{Smoking} = \ln\left(\frac{IndOR_{Smoking}}{PopOR_{Smoking}}\right)$$

Table 3 (fragment). Risk factor: definition; odds ratio
[...] [25]: an unreported algorithm; 2.67
Family history of premature CVD: a first degree relative <55 for men and <60 for women; 1.45
Consumption of fruit and vegetables [25]: daily consumption versus not; 0.70
Regular alcohol consumption [25]: at least three days a week; 0.91
Regular physical activity [25]: at least four hours a week; 0.86

Systolic blood pressure
The odds ratios given in the INTERHEART study are for self-reported hypertension only. We can make an estimate of the odds ratio by systolic blood pressure versus the average systolic in a non-hypertensive if we assume that the odds are proportional to changes in the systolic blood pressure, and if we know the average values for the hypertensive and non-hypertensive groups. This information was not available to us, so an estimate needed to be made from another source. The average systolic in the ASCOT study was 164 mmHg, which was also the value in a study of home monitoring of Danish hypertensives. [27,28] This seems to be a reasonable estimate for the hypertensive group. Estimating the average value in the non-hypertensive group is more difficult, as this is highly dependent on age and gender. However, a value of 130 mmHg was used as this would be a typical value in the 35 to 64 year old age group in the Health Survey for England. [16] If we assume that the OR for a hypertensive with a systolic of 164 mmHg is 1.91 (Table 3), and that the OR at the reference value of 130 mmHg is 1, then the gradient of the function relating the odds ratio to the systolic BP can be calculated as:

$$\text{gradient} = \frac{1.91 - 1}{164 - 130} = 0.0268$$

Then using the intercept -2.4794 derived from the INTERHEART data we can use the equation to determine the odds ratio for any systolic blood pressure with reference to the average normal systolic blood pressure of 130 mmHg:

$$OR_{Syst} = -2.4794 + 0.0268 \times \text{Systolic}$$

The individual odds ratio, $IndOR_{Syst}$, is calculated using the individual's systolic blood pressure, and the population odds ratio, $PopOR_{Syst}$, is calculated using the average systolic blood pressure for that age. The LNOR ξ for systolic blood pressure can then be calculated using equation (2).
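The gradient and intercept can be recovered from the two anchor points quoted above, as the following sketch shows. It assumes an odds ratio of exactly 1 at 130 mmHg and 1.91 at 164 mmHg; the constant and function names are illustrative. Unrounded values are used in the code, so results differ slightly from those obtained with the rounded coefficients.

```python
import math

REFERENCE_SYSTOLIC = 130.0     # mmHg, assumed non-hypertensive average
HYPERTENSIVE_SYSTOLIC = 164.0  # mmHg, assumed hypertensive average
HYPERTENSION_OR = 1.91         # odds ratio for hypertension from Table 3

# A straight line through (130, 1.0) and (164, 1.91).
GRADIENT = (HYPERTENSION_OR - 1.0) / (HYPERTENSIVE_SYSTOLIC - REFERENCE_SYSTOLIC)
INTERCEPT = 1.0 - GRADIENT * REFERENCE_SYSTOLIC


def or_systolic(systolic):
    """Odds ratio relative to a systolic blood pressure of 130 mmHg."""
    return INTERCEPT + GRADIENT * systolic


def lnor_systolic(individual_systolic, population_mean_systolic):
    """LNOR for systolic blood pressure: ln(IndOR_Syst / PopOR_Syst)."""
    ind_or = or_systolic(individual_systolic)
    pop_or = or_systolic(population_mean_systolic)
    if ind_or <= 0.0 or pop_or <= 0.0:
        # The straight line crosses zero at roughly 93 mmHg.
        raise ValueError("linear odds ratio is not positive at this pressure")
    return math.log(ind_or / pop_or)
```

The guard reflects a limit of the linear assumption: at very low pressures the line gives a non-positive odds ratio, for which the log is undefined.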

Obesity
The INTERHEART study found that waist-hip circumference ratio (WHR) was a better measure of the contribution of obesity to the risk of first myocardial infarction than body mass index (BMI). However, since data are more readily available for BMI than WHR, a conversion function from BMI to WHR was derived.

Figure 2: The relationship between the odds ratio for first MI and deciles of ApoB/A1 ratio with respect to the first decile.

The function used to estimate waist-hip ratio from BMI and age was derived by linear regression using data from the Health Survey for England:

For men: WHR = 0.409665 + 0.000945 × Age + 0.017275 × BMI
For women:

The INTERHEART team published odds ratios for each of the upper four quintiles of WHR compared to the lowest. [29] The odds ratio is not a linear function of the WHR, so a fractional polynomial was fitted to interpolate the data with reference to the mean WHR of the lowest quintile (Tables 4). The odds ratio for the individual ($IndOR_{WHR}$) and the population ($PopOR_{WHR}$) can then be calculated to give the LNOR:

$$\xi_{WHR} = \ln\left(\frac{IndOR_{WHR}}{PopOR_{WHR}}\right)$$
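The regression for men can be applied directly, as in this short sketch. Only the coefficients for men are given above, so no function for women is shown; the function name is illustrative.

```python
def estimate_whr_men(age, bmi):
    """Estimate waist-hip ratio for men from age (years) and BMI (kg/m^2)."""
    return 0.409665 + 0.000945 * age + 0.017275 * bmi


# A 57-year-old man with a BMI of 21 gives an estimated WHR of about 0.826.
whr_example = estimate_whr_men(57, 21)
```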

Implementation
The model was implemented first in Matlab and then in Microsoft Excel to ensure freedom from errors.

Results
Here we describe a worked example. We will find the 10-year coronary heart disease risk for a 57-year-old non-diabetic male who smokes 30 cigarettes a day, with no personal history but a positive family history of cardiovascular disease, a systolic blood pressure of 137 mmHg, a total cholesterol (TC) of 6.2 mmol/l, a high density lipoprotein (HDL) of 1.3 mmol/l, and a body mass index of 21. He neither drinks regularly nor exercises. He can give no reliable information about his mental health or fruit and vegetable intake.

Figure 3: The relationship between the odds ratio for first MI and cigarette consumption with respect to non-smoking.

ApoB/ApoA1 ratio
The ratio is not known in this case, so must be estimated from the TC/HDL ratio using equation (6). The log odds ratio for the individual (OR I ) is calculated from equations (5) and (2):

Smoking
The average cigarette consumption in the population for this age and gender is calculated using the coefficients found in the table in Additional file 1. The product of this and the proportion of smokers for any given age is the population average cigarette consumption. In this case: Using equation (7) the odds ratio (PopOR) is thus: The odds ratio (OR) for the individual is calculated in the same way.

Waist hip ratio
The WHR is estimated from the age and the BMI using the equation (10): As this is less than 1.0, the individual WHR is taken as 1.
And for the population average WHR is: PopOR WHR = 1.3207 using the same formula.
The LNOR ξ for WHR using equation (2) is thus:

Psychosocial stress and fruit and vegetable consumption
There is no information available for either of these risk factors, and so an assumption is made that the individual is exactly average for the population. As that means the difference between the individual risk factor value and the mean population value is zero, the LNOR will be zero.

Family history of CVD
The baseline probability of having a family history of CVD at this age (P FH ) is calculated from the prevalence of CVD in men at age 57 (CVD 42 ): And the LNOR ξ FH using equation (1) is: ξ FH = ln(1.45)*(0-0.24) = -0.0892
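For a binary risk factor, the term from equation (1) reduces to the log odds ratio multiplied by the difference between the individual's value (0 or 1) and the population prevalence. The sketch below reproduces the arithmetic of the figure quoted above, taking the individual value of 0 and the baseline probability of 0.24 directly from that calculation; the function name is illustrative.

```python
import math


def binary_lnor(odds_ratio, individual_value, population_prevalence):
    """Term ln(OR) * (s - s_bar) for a risk factor coded 0 or 1."""
    return math.log(odds_ratio) * (individual_value - population_prevalence)


# Inputs as quoted in the worked example: OR 1.45, value 0, prevalence 0.24.
xi_fh = binary_lnor(1.45, 0, 0.24)  # approximately -0.0892
```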

Calculating mortality
The 10 year mortality rate BM 10 can be calculated as: The baseline mortality odds (BMO) is therefore: and the baseline mortality at a given age i (BM i ) is found using the set of Normal distribution curves with the means and standard deviations in set A: The prevalence of CHD is calculated in a similar manner using the coefficients in the Table 2  We can then correct the mortality for the presence or absence of CVD, using equation: Converting this probability to odds, the value remains at 0.0048.
We then calculate the λ as the sum of all the LNORs using equation (

Discussion
This exercise demonstrates how published information can be used to construct a mathematical model of cardiovascular risk. The same methods should be applicable to other disease groups where there is sufficient information available. The method requires: the odds ratios for each of the risk factors when controlled for all other risk factors; mortality rates and prevalences for the diseases of interest; and prevalence rates and mean values for the risk factors in the relevant population.
Before use, this model must be tested in different populations to assess its accuracy. The results of the INTERHEART study would suggest that it should be applicable in different geographical locations and to different ethnic groups without adjustment, since the predictions are anchored in a dataset for which there is a great deal of information on mortality rates and mean values. The INTERHEART study would suggest the residual variation at a population level is significantly less than ten percent. However, it should still be possible to apply the same principles by substituting the mortality data and the prevalence data for any population where that information is available to improve accuracy.
A major advantage of this model is the comprehensive set of independent risk factors used. It is likely that other risk factors have very little residual independence once all these factors are taken into account. For example, social and economic deprivation is included in other CVD and CHD models such as QRisk and Assign. [9,10] The INTERHEART study was conducted in 52 countries including low and middle income countries, and yet this factor did not emerge as significant when all nine risk factors were included. Equally, country and ethnicity did not remain as independent risk factors, suggesting that the odds ratios derived are applicable in all 52 countries. It would seem plausible that the odds ratios are also applicable in countries not included in this study.

Assumptions
A large number of assumptions were made in the construction of this model. The more important assumptions that might limit the accuracy of the model are described below.

That the underlying pathological processes and aetiological factors are the same for atheromatous disease, whether it is myocardial infarction, cerebrovascular disease or angina pectoris. This excludes death from haemorrhagic stroke.
The odds ratios for the different cardiovascular pathologies should be highly correlated because there is a common underlying process at work, the formation of atheroma. However, there may be variations that are specific to certain pathologies, such as atrial fibrillation and stroke. In the INTERHEART study the subjects had experienced a first MI. Factors such as blood viscosity have a greater impact on MI than chronic ischaemia. It is possible that some of the modeled risk factors, such as psychosocial stress, may affect MI and chronic ischaemia in different ways.
We assume that the risk factor profiles and odds ratios for those risk factors are similar in those who die from an MI before reaching hospital and those who survive. In a cross-sectional study like INTERHEART, the outcomes are not entirely equivalent to the prospective predictions of death from MI or CHD. In the INTERHEART study, subjects were identified on presentation with a first MI. Many potential subjects will not have survived to be recruited into the study. If there are significant differences between those who survive to hospital and those who do not, then some error will be generated in this model.
The systolic blood pressure was modeled here with an assumption of a linear relationship with the odds ratio. However, the results in Lewington et al 2002 would suggest that the age-adjusted absolute risk varies on a doubling scale with systolic blood pressure, [30] and previous work would suggest that this linear relationship does not hold for very severe hypertensives. [31] The INTERHEART study was unable to determine the relationship between the systolic blood pressure and the odds ratio adjusted for all nine risk factors, and so we felt an assumption of a linear relationship was reasonable.

That the relative risk for coronary heart disease death in those with pre-existing CVD is 3.3, regardless of the type of pre-existing CVD
This is a weak assumption, based on a figure from Kannel. [21] Different types of existing cardiovascular disease will have different degrees of impact on risk. [31] The figure found by Kannel may be an average of these differing values. The validity of this assumption will need to be tested in an evaluation of the model on external data.

That the odds ratios for the different risk factors remain constant over time and at different ages
The odds ratios given in the INTERHEART study relate to the occurrence of first MI, and the risk factor data was collected at that time. It is unclear how those odds ratios differ over time and with the age of subjects. Also, older subjects will have higher risks of competing causes of death, and this may in turn affect the odds ratios for the risk factors predicting CVD.

Other limitations
Using a wider range of risk factors can reduce the accuracy of the model if the available data on the additional risk factors are poor. Models developed using fewer risk factors embed information pertinent to the missing risk factors within the regression coefficients for factors that interact with the missing risk factors. With the larger models, the coefficients will have been regressed in the presence of those risk factors, and so if that information is missing (for example, if patients' fruit and vegetable consumption is not recorded in a dataset) their effect is lost to the model. Body mass index is used as an approximation to waist-hip ratio, and TC/HDL ratio as a proxy for ApoB/ApoA1 ratio. Use of these proxy measures will reduce the accuracy of the model, and waist-hip ratio and the ApoB/A1 ratio should be used in preference when available.
Individuals at high risk of CHD are often at high risk from competing causes of death. Consequently, some individuals at high risk may die from another cause prior to a predicted CHD event. This could lead to overestimation of risk from CHD in those at highest risk.
Our method takes a population mortality rate and prevalence rates for CVD and adjusts them using the mean values for risk factors in the given population. This is valid provided the distribution of risk factor values is not heavily skewed and the relationship between the risk factor values and mortality rates obeys the assumptions described above.

Conclusion
This paper demonstrates how a comprehensive, mixed odds model can be constructed using widely available information and without access to training data sets. The method could be useful in modeling a broad range of disease areas. Further research needs to be done to evaluate the accuracy of the model in different population groups using historical cohort data.

Competing interests
Dr Martin is the author and owner of the intellectual property rights of the Laindon Survival Model. He also works for RMS Ltd, a risk modeling company.

Authors' contributions
CJM conceived of and designed the model, and is the principal author of the paper. PT supervises CJM's PhD and contributed to the paper. HWWP gave statistical advice and contributed to the paper.