Don’t use quotients to calculate performance

Abstract
Quotients, or ratios, are among the most applied tools for measuring performance of mutual funds and investment portfolios, the Jensen index being an exception to the general rule. In this paper, we show some problems that arise when quotients are applied, closely related to their statistical meaning, which is too often forgotten. We also raise some advantages of the use of linear penalization, introducing a little known methodology for performance measurement. With this purpose, this paper’s approach is comprehensive: we conceptually analyze performance indexes’ geometric and statistical meaning, complementing this with a numerical example and empirical testing that confirm our view. This paper’s main contribution is to demonstrate and empirically test how the use of quotients to measure performance may create problems due to their denominators, which may be solved by applying linear penalization.
© 2015 The Author(s). This open access article is distributed under a Creative Commons Attribution (CC-BY) 4.0 license.


PUBLIC INTEREST STATEMENT
Academics, practitioners, and the general public very often apply quotients, like the ratio between return and risk, to measure the performance of their investments (e.g. the Sharpe and Treynor indexes or the Information Ratio). In this paper, we show how the application of these quotients may be problematic and could lead analysts to make wrong investment decisions, partially due to the statistical implications of these quotients' use. We propose linear penalization by risk as an interesting alternative and introduce a little known methodology for performance measurement. With this purpose, this paper's approach is comprehensive, including conceptual, geometric, and statistical analyses that we complement with a numerical example and empirical testing. We think that the explained ideas are especially interesting under the current economic environment, in which optimal prioritization of investments and accurate analysis of the balance between return and risk are a must.

Introduction
In order to calculate the performance of investment portfolios and to evaluate whether a mutual fund outperformed the market or its peer funds, we use several ex-post performance measures, usually based on the average return obtained and the risk assumed. As we consider that investors prefer more wealth to less, and that they are risk averse, we understand that performance improves as return increases and worsens as risk grows (although this general statement admits some nuances). Many performance indexes have been developed since the classical ones proposed in the 1960s by Sharpe, Treynor, and Jensen. A frequently used idea is to calculate the excess of average return above the risk-free rate and divide it by a risk measure, just as the Sharpe and Treynor ratios do. There are many other options to measure performance depending on the method applied to calculate risk (some quite well-known ones can be seen in Eling, 2008).
Despite their appearance being somewhat different, the $M^2$ index (proposed by Modigliani & Modigliani, 1997) and the $M^2$ for beta (proposed by Modigliani, 1997) are equivalent to the Sharpe and Treynor indexes, respectively. On the other hand, the information ratio (IR), which can be defined as the average tracking error divided by the standard deviation of tracking error,¹ will have similar characteristics to other indexes that are calculated through a quotient. In short, many of the indexes we apply are "ratios," that is, quotients (the Jensen index is an exception).
The problem is that quotients, or ratios, have some peculiarities that make them less attractive for performance measurement, and this is what we want to show in the current paper. Some studies have affirmed that the choice of one or another performance index has little relevance (e.g. Eling, 2008, after analyzing several indexes and obtaining similar results, tends to prefer the Sharpe ratio), but we disagree and will provide empirical evidence that supports our view. Many studies (Eling's among them) affirm that different indexes rank investment performance almost identically, with minor differences in ordering; but we think that quantitative differences must be considered apart from the strict rank ordering. On the other hand, there could be few but dramatic changes in rank ordering that can be very interesting for an analyst.
There is a wide literature facing this topic of performance measuring and we have cited a few authors already. Only to mention some of the many who use the three classical indexes, we have Chua and Koh (2007) who use the Sharpe ratio, Hodges, Taylor, and Yoder (2003) or Hübner (2007) who use the Treynor ratio, and Sainz, Grau, and Doncel (2006), Lin (2006), Fama and French (2010), or Miralles, Miralles, and Lisboa (2012) who use the Jensen index. On the other hand, practitioners use them continuously, and there exist empirical studies analyzing which are the most commonly applied indexes, such as the one by Amenc, Goltz, and Lioui (2011).
Literature has analyzed, among other topics, what happens when the numerator of the Sharpe ratio is negative (Ferruz Agudo & Sarto Marzal, 2004; Israelsen, 2005, 2009, with different solutions to this situation). However, in this paper, we want to study a different matter, as we focus on denominators of quotient-based indexes and find advantages in applying linear indexes. This paper contributes by giving an integrated comparative view, both from a geometric and from a statistical perspective, of the Sharpe, Treynor, Jensen, $M^2$, $M^2$ for beta, and IR indexes. The paper also adds an original proposal to measure performance that was developed by the authors and allows seeing advantages of linear penalization. All this is corroborated with a numerical example and empirical testing. Such popular indexes as Sharpe, Treynor, or the IR may cause relevant problems if applied to compare performance of funds and their denominators take certain values. At the same time, linear indexes show higher consistency. Practitioners must be aware of this.
We will now proceed to review the mentioned indexes, introducing a complementary and less known one: the penalized internal rate of return (PIRR). We will see the indexes' geometric and statistical meaning, as well as their main weaknesses. We will then analyze some empirical results and reach our final conclusions.

Indexes review and proposal of the PIRR
We begin by reviewing the three classical indexes, including their geometric explanation. Their formulae are the following:

$$S = \frac{\mu - r_0}{\sigma} \qquad (1)$$

$$T = \frac{\mu - r_0}{\beta} \qquad (2)$$

$$J = (\mu - r_0) - (\mu_m - r_0)\,\beta \qquad (3)$$

where S, T, and J are the corresponding values of the Sharpe, Treynor, and Jensen indexes; μ is the average return of the investment in a certain period; $r_0$ is the risk-free rate of return; σ is the standard deviation of the investment return in the period of analysis; β is the respective systematic risk; and $\mu_m$ is the average return of the market portfolio during the period.
Consequently, we can observe that we have two indexes defined as ratios (Sharpe and Treynor) and one defined as a difference, and therefore linear (Jensen). At the same time, we have one index (the Sharpe ratio) that penalizes return with total risk (σ), while the other two (Treynor and Jensen) apply systematic risk (β). Focusing on the type of risk considered, it would seem more adequate to use Sharpe when we analyze the performance of a mutual fund with a vocation to diversify or when we analyze an investor's total portfolio (in these cases, diversifiable risk should have been eliminated, and if not, performance will be negatively affected by penalization with total risk). Treynor and Jensen would be more adequate to analyze a specialized investment fund or a specific investment, understanding that systematic risk is the only one to be observed in these cases, as the diversifiable risk will be eliminated at another level.
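As a minimal sketch of the three classical indexes, the following Python functions take the averages and risk measures as plain numbers. The input values are hypothetical and chosen only for illustration; they are not taken from the paper's sample.

```python
def sharpe(mu, r0, sigma):
    """Excess return per unit of total risk (a quotient)."""
    return (mu - r0) / sigma


def treynor(mu, r0, beta):
    """Excess return per unit of systematic risk (a quotient)."""
    return (mu - r0) / beta


def jensen(mu, r0, beta, mu_m):
    """Excess return minus the market premium scaled by beta (a difference)."""
    return (mu - r0) - (mu_m - r0) * beta


# Hypothetical values in percent, for illustration only.
r0, mu_m, sigma_m = 2.0, 8.0, 12.0
mu, sigma, beta = 14.0, 15.0, 1.2

print("Sharpe :", sharpe(mu, r0, sigma))
print("Treynor:", treynor(mu, r0, beta))
print("Jensen :", jensen(mu, r0, beta, mu_m))
# For the market portfolio itself beta equals one, so its Jensen index is zero.
print("Market Jensen:", jensen(mu_m, r0, 1.0, mu_m))
```

The two quotients divide by a risk measure, whereas the Jensen index only subtracts a risk charge, which is the distinction the rest of the discussion relies on.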
If we observe Figure 1, we may geometrically see the Sharpe ratio: on a μ-σ map of ex-post returns, we can draw the result of the market portfolio ($R_m$) with its average return ($\mu_m$) and risk ($\sigma_m$); starting from the risk-free rate ($r_0$) and passing through $R_m$, we have the capital market line (CML), whose slope (tangent of the angle $\alpha_m$) is the market Sharpe (Formula 1 applied to the market case). For the case of portfolios $R_a$ and $R_b$ (with their respective values of μ and σ), their corresponding Sharpe ratios will be the tangents of $\alpha_a$ and $\alpha_b$ (as obtained from the application of Formula 1). We see how the Sharpe ratio, being a quotient, is a tangent.
If we now want, with the same Figure 1, to reason as Modigliani and Modigliani (1997) did in order to obtain their $M^2$ index, we can do the following: starting from the result of one portfolio ($R_b$), we can leverage the investment (by borrowing) at the risk-free rate, until obtaining the same risk as the market ($\sigma_m$); we get to this by moving up along the straight line that starts from the risk-free rate ($r_0$) and passes through $R_b$ until we position ourselves on the vertical of $\sigma_m$, reaching an average return equal to $M^2_b$. We can do the same with the result of the other portfolio ($R_a$) by deleveraging the investment (lending) at the risk-free rate until we obtain a risk level of $\sigma_m$; we get to this by moving down from $R_a$ along the straight line that links $R_a$ with $r_0$, until we position ourselves on the vertical of $\sigma_m$, reaching an average return of $M^2_a$. Modigliani and Modigliani propose these $M^2$ values as performance measures, and it is clear that their ranking must coincide exactly with Sharpe's ordering (in our case: $\alpha_b > \alpha_a > \alpha_m$ and $M^2_b > M^2_a > M^2_m = \mu_m$). Therefore, those problems associated with the Sharpe ratio, which we will explain later, will also be applicable to $M^2$. This index has the advantage of being measured in points of return, while at the same time, its logic endorses the Sharpe ratio's validity: the best portfolio or investment fund for $M^2$ (which coincides with the best one for Sharpe) will be the one that, after matching its risk with the market, obtains a better return (thanks to the process of leveraging/deleveraging).
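The leveraging argument can be sketched numerically. Moving along the line from the risk-free rate through a portfolio until the market's risk is reached gives a return of $r_0$ plus the portfolio's Sharpe ratio times $\sigma_m$. The portfolios below are hypothetical and were chosen only so that their ordering resembles the one described for Figure 1.

```python
def sharpe(mu, sigma, r0):
    """Slope of the line from the risk-free rate through the portfolio."""
    return (mu - r0) / sigma


def m2(mu, sigma, r0, sigma_m):
    """Return reached on that line at the market's risk level sigma_m."""
    return r0 + sharpe(mu, sigma, r0) * sigma_m


# Hypothetical values in percent: (average return, standard deviation).
r0 = 2.0
portfolios = {"a": (14.0, 15.0), "b": (6.0, 4.0), "m": (8.0, 12.0)}
sigma_m = portfolios["m"][1]

by_sharpe = sorted(portfolios, key=lambda k: sharpe(*portfolios[k], r0), reverse=True)
by_m2 = sorted(portfolios, key=lambda k: m2(*portfolios[k], r0, sigma_m), reverse=True)

print("Ranking by Sharpe:", by_sharpe)
print("Ranking by M2    :", by_m2)
```

Because $M^2$ is an increasing linear function of the Sharpe ratio for a fixed market, the two rankings coincide, and the market's own $M^2$ equals its average return.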
We can do a very similar reasoning with Figure 2 to geometrically see the Treynor ratio and its equivalent, the $M^2$ for beta index (proposed by Modigliani, 1997); consequently, we will give a much shorter explanation. The reader can see how, by applying Formula 2, the Treynor ratios of portfolios with results $R_a$, $R_b$, and $R_m$ will, respectively, be the tangents of angles $\alpha_a$, $\alpha_b$, and $\alpha_m$ on a μ-β map, where betas of different portfolios and the security market line (SML) appear.²
Starting now from $R_b$ and leveraging (or deleveraging from $R_a$) until we reach the vertical of $\beta_m$, the market beta (equal to one), we will obtain the $M^2$ for beta values of these portfolios. This index is equivalent to the Treynor ratio (it leads to an identical performance ranking), and it will therefore suffer the same problems we will mention on the use of quotients, while at the same time, it has the advantages already explained when commenting on the $M^2$ and Sharpe indexes.
From the review of these indexes, it is easily deduced that one is missing: an index that applies linear penalization (as Jensen does) and refers to the total risk (as Sharpe does). This is what Gómez-Bezares, Madariaga, and Santibáñez (2004) did, arriving at the PIRR. The formulation could be:

$$\text{PIRR} = \mu - t\,\sigma \qquad (4)$$

where we need to obtain the t value. With that purpose, we can reason this way: the PIRR would be like a certainty equivalent of a risky investment (characterized by μ-σ values); considering that the risk-free rate ($r_0$) is the certainty equivalent of the market portfolio ($R_m$) with its average return ($\mu_m$) and risk ($\sigma_m$), we could write, by applying Formula 4, $r_0 = \mu_m - t\,\sigma_m$, so:

$$t = \frac{\mu_m - r_0}{\sigma_m} \qquad (5)$$

which coincides with the market Sharpe ratio, that is, the tangent of angle $\alpha_m$ in Figure 1 and the slope of the CML. In consequence, applying the PIRR is geometrically equivalent to moving along straight lines parallel to the CML, giving as certainty equivalent (the PIRR value) the crossing point with the y-axis of the straight line parallel to the CML that passes through the corresponding portfolio. The PIRR values for portfolios with results $R_a$ and $R_b$ can be seen in Figure 1. Substituting the t value of Formula 5 in the formula of the PIRR index (Formula 4), we have:

$$\text{PIRR} = \mu - \frac{\mu_m - r_0}{\sigma_m}\,\sigma \qquad (6)$$

The reader can confirm (geometrically or by applying Formula 6) that the PIRR of the market portfolio is the risk-free rate ($r_0$): we assume that the market as a whole is indifferent between the market portfolio and the risk-free rate, and this is why it combines both investments.³
If we understand that the CML slope represents the price for risk, the return increase that the market demands per each unit of risk growth, the functioning of the PIRR (in Formula 6) is absolutely logical. The reader may also confirm that the ranking derived from the PIRR in Figure 1 is different to the one obtained from the Sharpe ratio.
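A minimal sketch of the PIRR follows, taking the price of risk t as the market Sharpe ratio, as the text describes. The three portfolios are hypothetical and only mimic the qualitative layout described for Figure 1 (a low-risk portfolio b and a higher-risk portfolio a, both above the CML).

```python
def sharpe(mu, sigma, r0):
    """Excess return divided by total risk."""
    return (mu - r0) / sigma


def pirr(mu, sigma, r0, mu_m, sigma_m):
    """Average return penalized linearly by total risk.

    The price of risk t is the slope of the CML (the market Sharpe ratio).
    """
    t = (mu_m - r0) / sigma_m
    return mu - t * sigma


# Hypothetical values in percent, for illustration only.
r0, mu_m, sigma_m = 2.0, 8.0, 12.0
mu_a, sigma_a = 14.0, 15.0
mu_b, sigma_b = 6.0, 4.0

print("Sharpe a, b:", sharpe(mu_a, sigma_a, r0), sharpe(mu_b, sigma_b, r0))
print("PIRR   a, b:", pirr(mu_a, sigma_a, r0, mu_m, sigma_m),
      pirr(mu_b, sigma_b, r0, mu_m, sigma_m))
print("PIRR market:", pirr(mu_m, sigma_m, r0, mu_m, sigma_m))
```

With these assumed numbers the Sharpe ratio prefers b while the PIRR prefers a, and the PIRR of the market portfolio equals the risk-free rate.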
If we applied the PIRR methodology to the systematic risk, subtracting t betas from μ and giving to t the market Treynor value (which can be reasoned in an identical manner as we did before to use the market Sharpe ratio as t value), we would get to:

$$\text{PIRR}_{\beta} = \mu - (\mu_m - r_0)\,\beta \qquad (7)$$

On this formula, we could make very similar comments to those mentioned above about the PIRR, and the corresponding results may be seen in Figure 2. Analyzing Formulae 3 and 7, it is evident that the PIRR for beta exactly coincides in its ranking with the Jensen index (this is reasonable, as both apply linear penalization using systematic risk), but applying the PIRR logic (concept of certainty equivalent, price for risk, etc.) allows us to understand the classical Jensen index in a new way. Additionally, it is clear that for many individual investors, who cannot consider diversifiable risk elimination as something given and are therefore interested in the trade-off between obtained returns and total assumed risk (including both systematic and diversifiable risk), applying an index like the PIRR (using σ as the relevant risk measure) can be more realistic than applying the Jensen index (which ignores diversifiable risk behavior).
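The coincidence in ranking can be checked with a short derivation, assuming the usual definition of the Jensen index, $J = (\mu - r_0) - (\mu_m - r_0)\beta$, and a PIRR for beta that subtracts t betas from μ with t equal to the market Treynor ratio (the market beta being one):

$$t = \frac{\mu_m - r_0}{\beta_m} = \mu_m - r_0$$

$$\text{PIRR}_{\beta} = \mu - (\mu_m - r_0)\,\beta = \left[(\mu - r_0) - (\mu_m - r_0)\,\beta\right] + r_0 = J + r_0$$

Under these assumptions, both indexes differ only by the constant $r_0$, which is the same for every fund in a given period, so the ordering of funds is identical.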

Some problems of quotients
The geometric view of the Sharpe ratio shows us that it is a slope in a μ-σ map, more specifically the slope of the straight line starting from $r_0$ and passing through the corresponding portfolio. If we assume a normal distribution of returns, the value of that slope has a clear statistical meaning: it indicates the probability of the portfolio return falling below $r_0$.⁴ Therefore, when we rank mutual funds by their Sharpe ratio, we are actually ranking them by the probability of their return falling below $r_0$ in a period; mutual funds will be more attractive if they have a lower probability of having a return below $r_0$. This seems very reasonable at first sight, but it is not obvious that a mutual fund with a probability, let's suppose, of falling below $r_0$ of one ten-thousandth, is necessarily worse than a mutual fund with a probability of one hundred-thousandth. It is clearly better to have a low probability of falling below $r_0$, but if both investments show small probabilities, it is likely that the investor may also consider other aspects. Even when the probabilities are not so low, we do not find it logical that investors exclusively consider this point. This is precisely what happens in Figure 1: for the Sharpe ratio, $R_b$ is better than $R_a$ (the tangent of $\alpha_b$ is bigger than the tangent of $\alpha_a$), but many investors will prefer $R_a$ to $R_b$ (this is what the PIRR sustains: $\text{PIRR}_a > \text{PIRR}_b$, which may help to intuitively see the advantages of linear penalization).
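The statistical reading can be written out in one line. Assuming the period return R is normally distributed with mean μ and standard deviation σ, and writing Φ for the standard normal distribution function:

$$P(R < r_0) = \Phi\!\left(\frac{r_0 - \mu}{\sigma}\right) = \Phi(-S)$$

Since Φ is increasing, a higher Sharpe ratio S always corresponds to a lower probability of falling below $r_0$, and nothing else about the investment enters the ranking.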
In any case, this is more clearly seen with small probabilities. Assume two mutual funds with the following data (in percentage):

Mutual fund C: $\mu_C = 5$; $\sigma_C = 1$
Mutual fund D: $\mu_D = 1.9$; $\sigma_D = 0.1$

If we assume $r_0 = 1.5$, the Sharpe ratios will be $S_C = (5 - 1.5)/1 = 3.5$ and $S_D = (1.9 - 1.5)/0.1 = 4$, so mutual fund D is better than C according to the Sharpe ratio, although we think that C would definitely be preferred by most investors.⁵
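The example of funds C and D can be reproduced with a few lines of Python. The sketch assumes normally distributed returns and uses the figures implied by the Sharpe calculations above (average returns of 5 and 1.9, standard deviations of 1 and 0.1, and a risk-free rate of 1.5, all in percent).

```python
from math import erfc, sqrt


def normal_cdf(z):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * erfc(-z / sqrt(2.0))


def sharpe(mu, sigma, r0):
    """Excess return divided by total risk."""
    return (mu - r0) / sigma


def shortfall_probability(mu, sigma, r0):
    """Probability that a normally distributed return falls below r0."""
    return normal_cdf((r0 - mu) / sigma)


r0 = 1.5
funds = {"C": (5.0, 1.0), "D": (1.9, 0.1)}

for name, (mu, sigma) in funds.items():
    print(name,
          "Sharpe:", round(sharpe(mu, sigma, r0), 2),
          "P(return < r0):", shortfall_probability(mu, sigma, r0))
```

Both shortfall probabilities are tiny, so the Sharpe ranking rests on a difference that most investors would regard as secondary to the gap between an average return of 5 and one of 1.9.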
In consequence, the statistical meaning of the Sharpe ratio warns us about its correctness when ranking mutual funds by their performance, and we can intuitively see that very small standard deviations may push the Sharpe ratio to exaggerated values, assigning excellent Sharpe ratios to mediocre mutual funds.
We could reason in a similar way with the Treynor ratio, as we have seen in Figure 2 that it is a slope as well.⁶ However, in this case, the situation is even more serious, as it is relatively easy to find betas close to zero (which would increase the ratio values extremely) or even with negative values (which would change the sign of the ratio).⁷
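The sensitivity to the denominator is easy to see numerically. In this sketch the average return is held fixed and only beta changes; all values are hypothetical and serve only to contrast the quotient (Treynor) with the difference (Jensen).

```python
def treynor(mu, r0, beta):
    """Excess return divided by systematic risk."""
    return (mu - r0) / beta


def jensen(mu, r0, beta, mu_m):
    """Excess return minus the market premium scaled by beta."""
    return (mu - r0) - (mu_m - r0) * beta


# Hypothetical values in percent, for illustration only.
r0, mu_m, mu = 2.0, 2.4, 2.5

for beta in (1.0, 0.5, 0.01, -0.01):
    print(f"beta={beta:6.2f}  Treynor={treynor(mu, r0, beta):8.2f}  "
          f"Jensen={jensen(mu, r0, beta, mu_m):6.3f}")
```

Moving beta from 0.01 to -0.01 barely changes the Jensen index, while the Treynor ratio jumps from a very large positive value to a very large negative one.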
We can also reason in a similar way with the IR, which we have defined as the average tracking error divided by the standard deviation of tracking error. We could draw a tracking error μ-σ map (similar to the one in Figure 1) and we would see that the IR of a portfolio is the slope of the straight line starting from the origin of coordinates and passing through the corresponding point for that portfolio. If we assume that the tracking error follows the normal distribution, we can make a statistical interpretation similar to the one we made for the Sharpe index: the IR shows us the probability of the tracking error being below zero for a period, and ranks portfolios by their probability of the tracking error being positive (the best one is the portfolio with the smallest probability of falling below zero). We could repeat here the criticism we made of the Sharpe ratio, which could mostly be extended to other ratios applied to measure performance.
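As a sketch of the IR under the definition used here, the following code treats the tracking error as the period-by-period difference between the fund's return and the benchmark's return. That reading, the use of the population standard deviation, and the return series themselves are assumptions made for illustration.

```python
from statistics import mean, pstdev


def information_ratio(fund_returns, benchmark_returns):
    """Average tracking error divided by its standard deviation."""
    tracking_error = [f - b for f, b in zip(fund_returns, benchmark_returns)]
    return mean(tracking_error) / pstdev(tracking_error)


# Hypothetical period returns in percent, for illustration only.
fund = [1.2, -0.5, 2.1, 0.3]
benchmark = [1.0, -0.8, 1.9, 0.1]

print("IR:", round(information_ratio(fund, benchmark), 3))
```

A fund that beats its benchmark by a small but very stable margin has a tiny denominator, so its IR can be very large even though the average excess over the benchmark is modest.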
One argued advantage of the Sharpe ratio is that, just as with the $M^2$, we can leverage or deleverage our portfolios so that, for a given level of risk, the portfolio with the higher Sharpe ratio outperforms the others. However, borrowing or lending at the risk-free rate is not easy for individual investors, nor is it possible to borrow without limit; moreover, performance results are obtained ex-post, when leverage can no longer be modified. It can also be argued that linear penalization (as in the case of the PIRR), by treating as indifferent two mutual funds at the same vertical distance from the CML, does not treat risk adequately. Exceeding the CML by 2% with a low risk is not the same as doing so with a high risk; in the second case, for instance, this is more likely to happen by chance. However, the Jensen index has the same problem and is widely used.
One advantage of the PIRR indexes is that, as occurs with the $M^2$ indexes, they are measured in points of return, which makes them easier to understand for the standard investor. And above all, they avoid the evident problems of quotient-based measures.

Empirical results
In order to test whether any of the above-mentioned problems occur in practice, we took the 413 largest equity mutual funds in the USA, with the data series running from August 2006 to the same month in 2011 and Bloomberg being the utilized source.⁸ We calculated the monthly returns of the mutual funds,⁹ their averages, standard deviations, and betas (using the S&P 500 total returns as market portfolio, which was also used to obtain $\mu_m$ and $\sigma_m$); the Treasury Bill 1M was used as risk-free rate. We calculated the S, T, J, and PIRR indexes for each mutual fund, and the Pearson and Spearman correlations. Results can be seen in Table 1 (and a summary including key statistical parameters of the analyzed data is included in Appendix A).
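The inputs to the four indexes can be derived from price series along the following lines. This is only a sketch: it assumes simple (not logarithmic) monthly returns, population moments, and a beta estimated as the covariance with the market divided by the market variance, none of which is specified in this section. The price series are hypothetical.

```python
from statistics import mean, pstdev


def simple_returns(prices):
    """Period-over-period percentage returns from a price (or index) series."""
    return [100.0 * (p1 / p0 - 1.0) for p0, p1 in zip(prices, prices[1:])]


def beta(fund_returns, market_returns):
    """Covariance with the market divided by the market variance."""
    mu_f, mu_m = mean(fund_returns), mean(market_returns)
    cov = mean([(f - mu_f) * (m - mu_m)
                for f, m in zip(fund_returns, market_returns)])
    var_m = mean([(m - mu_m) ** 2 for m in market_returns])
    return cov / var_m


def summarize(fund_prices, market_prices):
    """Average return, standard deviation, and beta of a fund."""
    fund = simple_returns(fund_prices)
    market = simple_returns(market_prices)
    return mean(fund), pstdev(fund), beta(fund, market)


# Hypothetical month-end values, for illustration only.
fund_prices = [100.0, 102.0, 101.0, 104.0, 103.0]
market_prices = [1000.0, 1030.0, 1010.0, 1050.0, 1040.0]

mu, sigma, b = summarize(fund_prices, market_prices)
print("mu:", mu, "sigma:", sigma, "beta:", b)
```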
We see that in Table 1, there appears one negative Pearson correlation coefficient, more specifically between the Treynor and Sharpe ratios; in addition, the values of the Pearson correlations between Treynor and the other indexes are extremely low, which is surprising. However, if we focus on the Spearman coefficients, they are all quite high, above 0.97. From this we can deduce that rank correlation may hide very serious problems, and above all, that something unexplained must be happening to produce such strange coefficients. We analyzed the data and found a mutual fund with a negative beta value that was very small in absolute terms; this was fund number 53 by order of capitalization. This is the case of a fund with an objective of long-term value creation, focused on capital preservation under adverse market conditions. As the analyzed period has a negative market Sharpe, this fund's negative beta is reasonable (see Figure 3).
This very small and negative value of the beta for fund number 53 is precisely what is causing a very high and positive value of the Treynor ratio and what is totally distorting the Pearson correlations. We then thought of repeating the exercise excluding fund number 53, and we reached Table 2. Results are surprising. By excluding one single fund, all the Pearson coefficients become close to or higher than 0.9 and Spearman coefficients exceed 0.98. We repeat the exercise with the largest 100 funds, reaching Table 3, and if we exclude the fund with negative beta, we reach Table 4.
If fund number 53 is not excluded (Table 3), results are even worse for the Treynor ratio, as all its Pearson coefficients become negative, which is reasonable having one fund with an illogical Treynor ratio within a smaller sample. The Spearman coefficients are high, while smaller than with the 413 funds sample. Repeating the same exercise excluding fund number 53 (Table 4), problems get solved.
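The contrast between the two coefficients can be reproduced on synthetic data. In the sketch below, 50 hypothetical funds have a quotient-type index that is exactly proportional to a linear index, except for one fund whose quotient is inflated by a tiny denominator. The data are invented for illustration and are unrelated to the sample in the tables.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """Ranks from 1 to n; this simple version assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Synthetic indexes for 50 hypothetical funds.
linear_index = [float(i) for i in range(1, 51)]
ratio_index = [2.0 * v for v in linear_index]
ratio_index[9] = 5000.0  # one fund with a near-zero denominator

print("Pearson :", pearson(linear_index, ratio_index))
print("Spearman:", spearman(linear_index, ratio_index))

# Same comparison after excluding the distorted fund.
clean_linear = linear_index[:9] + linear_index[10:]
clean_ratio = ratio_index[:9] + ratio_index[10:]
print("Pearson without outlier:", pearson(clean_linear, clean_ratio))
```

In this synthetic sample a single extreme value is enough to turn the Pearson coefficient negative while the Spearman coefficient stays above 0.9, and removing that one observation restores a Pearson coefficient of essentially one.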
The analyzed period (August 2006-August 2011) gives a negative market Sharpe (−0.01265); so, we considered it convenient to extend the time period, taking August 2003-August 2011, and we obtained a positive market Sharpe (0.03952). We focused on the largest 100 funds (same ones as before) and we calculated their performance indexes in the same way. Fund number 53 that previously had a negative beta now has a small but positive value (β = 0.00348), which is consistent with the goals of the fund. Based on our previous exercise's experience, we first analyzed whether there was any fund with a negative or very small beta. This was only the case of fund 53 (positive but very small beta), and we also analyzed the standard deviations. We can see the σ values of the 100 funds in Figure 4, and we observe there are four funds with smaller values, which may cause problems in the Sharpe ratio; as they have a very small denominator, the index value could be abnormally high.
If we ignore those four funds, we also eliminate the fund with a very small beta; so, we decided to make the analysis both with all the 100 funds (Table 5) and with 96 funds (excluding the four funds with the smallest σ; Table 6). The Pearson coefficient, in the analysis with 100 funds, gives low values for Treynor (despite being higher than before, they are still unusually low); the Spearman coefficient keeps hiding the problem. Making the analysis with 96 funds, the coefficients clearly improve, as the problems caused by abnormally small denominators have been avoided.
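The screening step described above (looking for negative or very small betas and unusually small standard deviations before trusting the ratios) could be sketched as follows. The function name, the thresholds, and the fund data are hypothetical; only the beta of 0.00348 is taken from the text, and the choice of thresholds is left to the analyst.

```python
def flag_unstable_denominators(funds, min_sigma, min_abs_beta):
    """Map each problematic fund to the reasons its ratios may be unreliable.

    `funds` maps a fund name to a (sigma, beta) pair.
    """
    flagged = {}
    for name, (sigma, beta) in funds.items():
        reasons = []
        if sigma < min_sigma:
            reasons.append("small sigma")
        if beta < 0:
            reasons.append("negative beta")
        if abs(beta) < min_abs_beta:
            reasons.append("beta near zero")
        if reasons:
            flagged[name] = reasons
    return flagged


# Hypothetical (sigma, beta) pairs; thresholds are illustrative assumptions.
funds = {
    "fund_1": (4.8, 1.02),
    "fund_2": (5.3, 0.95),
    "fund_53": (1.1, 0.00348),
    "fund_x": (4.0, -0.02),
}

print(flag_unstable_denominators(funds, min_sigma=2.0, min_abs_beta=0.05))
```

Funds flagged this way need not be discarded; the point is to compare them with a linear index as well, rather than relying on a quotient whose denominator is close to zero.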
Summarizing, in this empirical section, we took a sample of the most important equity mutual funds in the USA and calculated their performance indexes (S, T, J, and PIRR). Expectations would be, according to most of the literature, that relations among these indexes were high. However, Pearson correlation coefficient has abnormally low values for the Treynor index. The problem is mainly caused by a fund with a negative beta value; by excluding this fund, the problem is solved. We have repeated the exercise with the 100 most important funds, reaching substantially similar results. Considering that the time period of the selected sample had a negative market Sharpe, and in order to get more robust results, we have repeated the same analyses with a longer time period and with a positive market Sharpe, concluding that problems of applying quotients persist.
All these calculations support the theoretical argumentation made at the beginning, summarized in the problems caused by applying quotients to measure performance. It is clear that a negative beta may lead us to error if we use the Treynor ratio, but we have seen that we can also make a mistake in case of a very low beta. This is the problem of using ratios. In the sample we have used, there were no cases with very small σ, which is reasonable in a complicated investment period analyzing equity mutual funds; this is why the Sharpe ratio did not show major problems. Even with this, when excluding funds with lower σ, its Pearson coefficients improve a little, and it is clear that we could have searched for funds with σ close to zero, which would have caused unusual Sharpe indexes and therefore lower Pearson coefficients. We preferred not to do this in order to avoid falling in a data mining exercise (one of the major problems of current financial research; see Gómez-Bezares & Gómez-Bezares, 2006); and this is why we have described in detail the followed path to execute the different analyses while looking for robustness in our results. In any case, we consider we have shown clearly enough that the use of ratios may cause problems, which is what we wanted to prove.
On the other hand, it is important to emphasize the high correlations that usually exist between the Jensen and PIRR indexes, despite them using different risk measures (β and σ, respectively), which proves the great importance of the penalization methodology.

Conclusion
Given that many authors show how the Sharpe ratio is the most commonly applied (Amenc et al., 2011; Eling, 2008), and that IR or Treynor and other quotient-based ratios are also widely used (Amenc et al., 2011), this paper's title may be considered provocative. Should we not use these ratios anymore?
We have tried to warn the analyst, the fund manager, about some important failures these ratios may have, which does not imply that these ratios must not be used or that other indexes do not have any problems of their own. They must simply be used cautiously, especially under certain circumstances, as we have explained along this paper.
Contrary to Eling's view (2008), we think that the way in which we measure performance does matter, at least in some cases, and that it is not enough to find high Spearman correlations among the indexes, because these may hide what is really happening.
Our empirical results show problems with the Treynor ratio in some cases, but not with the Sharpe ratio, as we have not used funds with too small σ values. However, this does not mean that the Sharpe ratio cannot have problems in other samples, as we have explained with a numerical example. It has been said that the Sharpe ratio is not valid if returns do not follow the normal distribution; but, even following the normal distribution, the statistical explanation that we have given forces us to be cautious with the interpretation of the Sharpe ratio.
In general, we must be careful when applying quotients for performance measurement, especially when denominators have very low values; in these cases, it is advisable to use additional performance measures, at least to complete the analysis. In case of having negative beta values, the most prudent practice is to disregard the Treynor index for those funds. The PIRR index, as it is not a quotient but a difference, does not have these problems.
Finally, we want to emphasize that we have worked with indexes that are easy to calculate and understand; among these, linear penalization, in the way we have presented it, is an interesting option.