Misleading Performance Reporting in the Supercomputing Field

In a previous humorous note, I outlined twelve ways in which performance figures for scientific supercomputers can be distorted. In this paper, the problem of potentially misleading performance reporting is discussed in detail. Included are some examples that have appeared in recently published scientific papers. This paper also includes some proposed guidelines for reporting performance, the adoption of which would raise the level of professionalism and reduce the level of confusion in the field of supercomputing.

2. Present inner kernel performance figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs, and compare your assembly-coded results with others' Fortran or C implementations.
4. Scale up the problem size with the number of processors, but don't clearly disclose this fact.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.
Some readers of the "Twelve Ways" have inquired whether the article was intended to be a criticism of parallel supercomputer vendors or of the technology of parallel computing in general. It was not. This misunderstanding may have arisen from the fact that in the copy published in Supercomputing Review, the word "scientists" in an introductory paragraph was changed by the editors to "vendors" for unknown reasons.

Left unchecked, exaggerated performance reporting could breed skepticism toward parallel computing, much in the same way that the expansive claims and promises of the artificial intelligence field in early years have led to the present widespread skepticism.

There is another reason to uphold a high level of forthrightness and clarity in performance reporting. In order for highly parallel computers to achieve widespread acceptance in the scientific computing marketplace, it is essential that they deliver superior performance on a broad range of important scientific and engineering applications. Thus it is my opinion, shared by others in the field, that the best way to ensure that future parallel scientific computers will be successful is to provide early feedback to manufacturers regarding their weaknesses. Once the reasons for less-than-expected performance rates on certain problems are identified, vendors can incorporate the required improvements, both software and hardware, in the next generation.
In this paper, I will present a number of examples of questionable performance reporting that have appeared in published papers during the past few years. I have restricted this study to articles that have recently appeared in refereed scientific journals and conference proceedings. I have not included press releases, marketing literature, technical reports, verbal presentations or papers I have read only as a referee.
My only purpose in citing these examples is to provide concrete instances of the performance issues in question. I do not wish for these citations to be misconstrued as criticisms of individual scientists, laboratories or supercomputer manufacturers. I personally do not believe this problem is restricted to a handful of authors, laboratories and vendors; rather, many of us in the field must share blame (see section 8). For this reason I have decided to take the unusual step of not including detailed references for these papers, in order not to cause the authors undue embarrassment. These references are available, however, to reviewers or others with a legitimate need to know. I suggest that readers who learn the identities of these authors use this information with discretion.

Plots
Plots can be effective vehicles to present technical information, particularly in verbal presentations. Further, plots are in many cases all that high-level managers (such as those with authority for computer acquisitions) have time to digest. Unfortunately, plots can also mislead an audience, especially if prepared carelessly or if presented without important qualifying information.

Figure 1 is a reconstruction of the final performance plot from a paper describing a defense application [13]. The plot compares timings of the authors' nCUBE/10 code with timings of a comparable code running on a Cray X-MP/416. The plot appears to indicate an impressive performance advantage for the nCUBE system on all problem sizes except a small region at the far left.
However, examination of the raw data used for this plot, which is shown in Table 1, gives a different picture. First of all, except for the largest problem size (i.e., object count), all data points lie in the small region at the far left. In other words, most of the two curves shown are merely the linear connections of the next-to-last data points with the final points. Further, the Cray X-MP is actually faster than the nCUBE for all sizes except the largest problem size. My personal view, shared by several colleagues who have seen this graph, is that a logarithmic scale would have been more appropriate for this data.
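To see why the choice of scale matters so much, consider a small sketch. The problem sizes below are invented for illustration (the actual Table 1 data is not reproduced here); they merely mimic the situation described, in which all data points but the last are clustered at small sizes. The sketch computes what fraction of the horizontal axis those clustered points occupy under a linear scale versus a logarithmic one.

```python
import math

# Hypothetical problem sizes (invented): all but the final point are small,
# and the final point is far larger, as in the plot described in the text.
sizes = [10, 20, 40, 80, 160, 10000]

lo, hi = sizes[0], sizes[-1]

# Fraction of the horizontal axis spanned by every point except the last one.
linear_span = (sizes[-2] - lo) / (hi - lo)
log_span = (math.log(sizes[-2]) - math.log(lo)) / (math.log(hi) - math.log(lo))

# On a linear axis the first five points crowd into roughly 1.5% of the axis;
# on a logarithmic axis they spread over roughly 40% of it.
```

Under these assumed numbers, the linear axis compresses five of the six data points into a sliver at the far left, so the "curves" are dominated by the single line segment to the last point; the logarithmic axis would let a reader actually see the region where, in the example from the text, the Cray is faster.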
Other difficulties are encountered when one reads the text accompanying this graph and table. First of all, the authors concede that the runs on the Cray X-MP/416 (a four-processor system) were made on a single processor, and that "the Cray version of the code has not been optimized for the X-MP". The authors assert, however, that tuning "would not be expected to make a substantial difference".
Secondly, for the largest problem listed, the only one where the Cray fails to outperform the nCUBE, the Cray X-MP timing is by the authors' admission an estimate, an extrapolation based on a smaller run. In the paper, as in Figure 1, the Cray curve leading out to the last point is dashed, possibly intending to indicate that this is an estimate, but this feature is not explained in either the caption or the text. Some readers interpret this feature as merely an indication of the region where the nCUBE is faster.
To the authors' credit, they did include the raw data, and they did clearly acknowledge the fact that the Cray code is not fully optimized and that the last Cray timing is extrapolated. Thus it appears that the authors were entirely professional in the text of their paper. But the reader is left to wonder what fraction of the audience that has seen this plot fully appreciates the details behind it.
Figure 2 is a reconstruction of another performance plot from a paper describing a fluid dynamics application [14]. This plot compares timings of the author's codes running on a 64K CM-2 with those of comparable codes running on a one-processor Cray X-MP. The two curves shown for each computer system represent a structured and an unstructured grid version of the code, respectively. As before, the plot appears to indicate a substantial performance advantage for the CM-2 for all problem sizes and both types of grids.
Once again, careful examination of the text accompanying this plot places these results in a different light. First of all, the author admits that his CM-2 results have been linearly extrapolated to a 64K system from a smaller system. The author then explains that the Cray version of the unstructured grid code is "unvectorized".
An additional difficulty with this plot can be seen by carefully examining the two Cray curves. In the original, as in Figure 2, these are precisely straight lines. Needless to say, it is exceedingly unlikely that a Cray code, scaled over nearly three orders of magnitude in problem size, exhibits precisely linear timings. Thus one has to suspect that the two Cray "curves" are simply linear extrapolations from single data points. In summary, it appears that of all the points on the four curves in this plot, at most two represent real timings.

The author of this article does not mention the precision of the data used in the CM-2 version, but from his description of the CM-2 as having 32-bit floating point hardware, it appears that the author is comparing a 32-bit CM-2 code with a 64-bit X-MP code.

Tuning
A common problem in papers of this sort is that the parallel code is compared against an implementation on another system that has received less than comparable tuning effort. In some cases this may happen because the implementors of parallel codes are experts on a particular parallel system, but do not have a great deal of experience programming the other system (usually a vector system) that they are comparing against. One such code is shown in Figure 3. The performance of this code on the Cray-2 at NASA Ames is not as low as the author of this article reported -- evidently the author's timing was based on an earlier version of the Cray-2 compiler. But it is not hard to see that without special compiler trickery, the performance of this code will be quite poor, since the inner loop vector length is only 16.

Admittedly, it may be a challenge to determine the minimal (i.e., best practical serial algorithm, or BPSA) flop count for a given problem.
However, at the least a scientist should be expected to analyze the source code of an efficient implementation on a serial or vector computer. Those with access to a NEC SX system or to a Cray X-MP/Y-MP system can take advantage of the hardware performance monitors present on these computers to obtain accurate flop counts. However, one must still be careful to ensure that the code being measured by the hardware performance monitor employs the best available algorithms and is well optimized.

Along this line, perhaps those of us performing research in the area of numerical linear algebra should at some point reconsider the usage of classical formulas for flop counts in favor of flop counts based on implementations that employ Strassen's algorithm [2]. Strassen's algorithm is a scheme to multiply matrices that requires fewer floating point operations than the conventional scheme. It has been demonstrated that Strassen's algorithm is now practical and in fact produces real speedups for matrices with dimensions larger than about 128 [2]. Further, Strassen's algorithm can be employed to accelerate a variety of linear algebra calculations [3,9] by substituting a Strassen-based matrix multiply routine for the conventional matrix multiply routine in a LAPACK [1] implementation. If a Strassen-based flop count were adopted for computing the megaflops rate in the solution of a 16,000 x 16,000 linear system, the resulting rate would have to be cut by roughly one-third from the usual reckoning.
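The gap between classical and Strassen-based flop counts can be sketched with a simple recurrence. The sketch below is an illustration, not a reconstruction of any count used in the cited work: it assumes the textbook Strassen formulation (seven half-size multiplies plus 18 half-size matrix additions/subtractions), a crossover to conventional multiplication at dimension 128 as suggested by the text, and a power-of-two dimension near 16,000. Note that it counts flops for a matrix multiplication alone; the "roughly one-third" figure in the text refers to a full linear system solve, where only part of the work is Strassen-accelerated.

```python
def classical_flops(n):
    # Conventional n x n matrix multiply: n^3 multiplications and about
    # n^3 additions, conventionally counted as 2n^3 flops.
    return 2 * n**3

def strassen_flops(n, crossover=128):
    # Textbook Strassen step: 7 recursive half-size multiplies plus
    # 18 add/subtract operations on (n/2) x (n/2) blocks.  Below the
    # crossover dimension, fall back to the conventional count.
    if n <= crossover:
        return classical_flops(n)
    half = n // 2
    return 7 * strassen_flops(half, crossover) + 18 * half * half

n = 16384  # power of two near the 16,000 mentioned in the text (assumption)
ratio = strassen_flops(n) / classical_flops(n)
# Under these assumptions the Strassen flop count is roughly 40% of 2n^3
# for the multiply itself, so a "Strassen-honest" megaflops rate would be
# substantially lower than one computed from the classical formula.
```

The exact ratio depends on the crossover dimension and on how the additions are counted, but the qualitative point stands: quoting megaflops against the classical 2n^3 count inflates the rate whenever a faster algorithm is actually used.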
In this vein, I myself must confess to citing potentially misleading performance figures in [3]. These articles include one-processor Cray-2 and Y-MP performance rates for some Strassen matrix routines. Following established custom, my co-authors and I computed megaflops rates based on the classical flop count for matrix multiplication (2n^3). But since the Strassen routines can produce the matrix product in fewer flops, it could be argued that these megaflops figures are inflated.

When performance results are based on scaling the problem size with the processor count, it is essential that authors clearly disclose this fact. It is also important that authors provide details of exactly how this scaling was done.

Another aspect of performance reporting that needs to be carefully analyzed is how the authors measure run time. Most of the scientists I have queried about this issue feel that elapsed wall clock time is the most reliable measure of run time, and that if possible it should be measured in a dedicated environment. By contrast, CPU time figures, such as those frequently quoted by users of Cray systems, may mask extra elapsed time required for input and output. Also, it is known that on the CM-2, "CM Busy Time" and "CM Elapsed Time" are quite different for some codes, even with no I/O and no other users sharing the partition. This can be seen in [4,6].
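The distinction between CPU time and elapsed wall clock time is easy to demonstrate on any modern system. The sketch below is a generic illustration, not a reconstruction of any cited measurement: it uses Python's standard timers, with `time.sleep` standing in for I/O or other waiting that a CPU-time figure would silently omit.

```python
import time

def busy(n):
    # Purely computational work: accrues both CPU time and wall time.
    s = 0
    for i in range(n):
        s += i * i
    return s

def simulated_io(seconds):
    # Stand-in for I/O or waiting: accrues wall time but almost no CPU time.
    time.sleep(seconds)

wall_start = time.perf_counter()
cpu_start = time.process_time()
busy(200_000)
simulated_io(0.2)
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
# The CPU-time figure misses the 0.2 s spent waiting, just as quoted CPU
# times can mask the I/O portion of a supercomputer run's true elapsed time.
```

A reported rate computed from `cpu` rather than `wall` would look better, which is precisely why the scientists quoted in the text prefer dedicated-environment wall clock measurements.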
In my view, quoting 32-bit performance rates is permissible so long as (1) this data type is clearly disclosed and (2) a brief statement is included explaining why this precision is sufficient. Along this line, it should be kept in mind that with new computer systems it is now possible to attempt much larger problems than before. As a result, numerical issues that previously were not significant now are significant, and some programmers are discovering to their dismay that higher precision is necessary to obtain meaningful results.
I suspect that in the majority of cases where the authors do not clearly state the data type, the results are indeed for 32-bit data. One example of this is [21], where in an otherwise excellent ten-page paper, the authors never state whether their impressive performance rates are for 32-bit or 64-bit calculations, at least not in any place where a reader would normally look for such information. That their results are indeed for 32-bit data can however be deduced by a careful reading of their section on memory bandwidth, where we read that operands are four bytes long.
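Why the 32-bit versus 64-bit distinction matters numerically, and not just for speed, can be shown with a classic accumulation experiment. The sketch below is a generic illustration (not drawn from any cited code): it repeatedly adds 0.1, once with every intermediate result rounded to IEEE 754 single precision via the standard `struct` module, and once in Python's native double precision.

```python
import struct

def to_f32(x):
    # Round a Python float (64-bit) to IEEE 754 single precision.
    return struct.unpack('f', struct.pack('f', x))[0]

N = 10**6
total32 = 0.0
total64 = 0.0
for _ in range(N):
    total32 = to_f32(total32 + to_f32(0.1))  # 32-bit accumulation
    total64 += 0.1                           # 64-bit accumulation

exact = 100000.0
err32 = abs(total32 - exact)
err64 = abs(total64 - exact)
# The single-precision total drifts visibly from 100000, while the
# double-precision error remains far below anything a reader would notice.
```

For a large enough problem, that drift is the difference between a meaningful answer and a meaningless one, which is why guideline (2) above, a brief justification of why 32-bit precision suffices, is worth demanding.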

Responsibility
It is most likely true that none of the authors cited above deliberately intended to mislead their audiences. After all, in most cases the potentially misleading aspects of these papers were evident only because of detailed information included in the text of the paper. It is also likely that in at least some cases, the authors' performance claims might be largely upheld if the full facts were known. Nonetheless, the overriding impression of these examples is that whatever the motives and actual facts may be, the material as presented generally gives the appearance of inflating the authors' performance results in comparison to other systems. Such material certainly has the potential to mislead an audience. And, at the very least, one can argue that these papers represent sloppy science.

5. Performance rates should be based on flop counts derived from efficient implementations of the best practical serial algorithms. One intent here is to discourage the usage of numerically inefficient algorithms, which may exhibit artificially high performance rates on a particular parallel system.
6. If speedup figures are presented, the single processor rate should be based on a reasonably well tuned program without multiprocessing constructs.

I am not presenting these guidelines as mandatory, inflexible requirements. Clearly, in a fast-moving field such as ours, it would be unwise to do so. However, if a paper or presentation does present results that significantly deviate from these guidelines, I suggest that the author has an obligation to clearly state and justify these deviations.
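The stakes behind guideline 6 are simple arithmetic. The sketch below uses invented timings (no figures from any cited paper) to show how a poorly tuned single-processor baseline inflates a speedup claim.

```python
def speedup(serial_time, parallel_time):
    # Conventional speedup: one-processor time divided by parallel time.
    return serial_time / parallel_time

# Hypothetical timings, invented for illustration only (seconds):
untuned_serial = 100.0  # naive one-processor run, little tuning effort
tuned_serial = 40.0     # the same computation after reasonable tuning
parallel_16 = 5.0       # a 16-processor run of the parallel code

inflated = speedup(untuned_serial, parallel_16)  # 20.0 on 16 processors
honest = speedup(tuned_serial, parallel_16)      # 8.0 on 16 processors
```

Against the untuned baseline the code appears to achieve superlinear speedup on 16 processors; against a reasonably tuned baseline the speedup is half the processor count. Both numbers describe the same parallel run, which is exactly why the baseline must be disclosed and defensible.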

Conclusions
The examples I have cited above are somewhat isolated in the literature, and I see no evidence that the problem of inflated performance reporting is out of control. However, clearly those of us in the parallel supercomputing field would be wise to arrest any tendency in this direction before we are faced with a significant credibility problem.
As was mentioned above, scientists in many other disciplines have found it necessary to adopt rigorous standards for reporting experimental results, and ours should be no exception. It is my hope that this article, with the proposed guidelines in section 9, will stimulate awareness and dialogue on the subject and will eventually lead to consensus and formal standards in the field.