Opinion Article
Revised

Rampant software errors may undermine scientific results

[version 2; peer review: 2 approved]
PUBLISHED 29 Jul 2015

Abstract

The opportunities for both subtle and profound errors in software and data management are boundless, yet they remain surprisingly underappreciated. Here I estimate that any reported scientific result could very well be wrong if data have passed through a computer, and that these errors may remain largely undetected.  It is therefore necessary to greatly expand our efforts to validate scientific software and computed results.

Keywords

data management, software error

Amendments from Version 1

I agree with both reviewers that my claims were too strongly worded.  I have softened the language throughout (including simply adding "may" to the title), and revised the abstract accordingly.  I believe it is now clear that I am expressing a justifiable anxiety about computational errors affecting scientific results, but that I do not provide empirical evidence as to how often results really are invalid for this reason.
 
I added the entire section: "Popular software is not necessarily less bug-prone."
 
In the conclusion, I clarified the relationship between correct results (our ultimate goal), software verification, and shared workflow systems.

See the author's detailed response to the review by Daniel S. Katz
See the author's detailed response to the review by C. Titus Brown

Computational results are particularly prone to misplaced trust

Perhaps because of ingrained cultural beliefs about the infallibility of computation1, people show a level of trust in computed outputs that is completely at odds with the reality that nearly zero provably error-free computer programs have ever been written2,3.

It has been estimated that the industry average rate of programming errors is “about 15 – 50 errors per 1000 lines of delivered code”4. That estimate describes the work of professional software engineers, not of the graduate students who write most scientific data analysis programs, usually without the benefit of training in software engineering and testing5,6. The recent increase in attention to such training is a welcome and essential development7–11. Nonetheless, even the most careful software engineering practices in industry rarely achieve an error rate better than 1 per 1000 lines. Since software programs commonly have many thousands of lines of code (Table 1), it follows that many defects remain in delivered code, even after all testing and debugging is complete.

Table 1. Number of lines of code in typical classes of computer programs (via informationisbeautiful.net).

Software Type | Lines of Code
Research code supporting a typical bioinformatics study, e.g. one graduate student-year | O(1,000) – O(10,000)
Core scientific software (e.g. Matlab and R, not including add-on libraries) | O(100,000)
Large scientific collaborations (e.g. LHC, Hubble, climate models) | O(1,000,000)
Major software infrastructure (e.g. the Linux kernel, MS Office, etc.) | O(10,000,000)

Software errors and error-prone designs are compounded across levels of design abstraction. Defects occur not only in the top-level program being run but also in compilers, system libraries, and even firmware and hardware, and errors in such underlying components are extremely difficult to detect12.

How frequently are published results wrong due to software bugs?

Of course, not every error in a program will affect the outcome of a specific analysis. For a simple single-purpose program, it is entirely possible that every line executes on every run. In general, however, the code path taken for a given run of a program executes only a subset of the lines in it, because there may be command-line options that enable or disable certain features, blocks of code that execute conditionally depending on the input data, etc. Furthermore, even if an erroneous line executes, it may not in fact manifest the error (i.e., it may give the correct output for some inputs but not others). Finally: many errors may cause a program to simply crash or to report an obviously implausible result, but we are really only concerned with errors that propagate downstream and are reported.
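
As a toy illustration (my own hypothetical example, not drawn from any particular study), consider a short Python function containing an erroneous line that executes on every call yet yields correct output for some inputs and silently wrong output for others:

    def rescale(values, lo=0.0, hi=1.0):
        """Linearly rescale values into the range [lo, hi]."""
        vmax = max(values)
        vmin = min(values)
        # Bug: the numerator should be (v - vmin), not v.  The erroneous line
        # runs on every call, but its output is only wrong when vmin != 0.
        return [lo + (hi - lo) * v / (vmax - vmin) for v in values]

    rescale([0.0, 5.0, 10.0])    # [0.0, 0.5, 1.0]  -- correct, because vmin happens to be 0
    rescale([10.0, 15.0, 20.0])  # [1.0, 1.5, 2.0]  -- wrong, and easy to miss downstream

A test suite built only from data whose minimum happens to be zero would never catch this, which is one reason the error rate in “delivered code” stays as high as it does.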

In combination, then, we can estimate the number of errors that actually affect the result of a single run of a program, as follows:

Number of errors per program execution =
    total lines of code (LOC)
    * proportion executed
    * probability of error per line
    * probability that the error
        meaningfully affects the result
    * probability that an erroneous result
        appears plausible to the scientist.

For these purposes, using a formula to compute a value in Excel counts as a “line of code”, and a spreadsheet as a whole counts as a “program”—so many scientists who may not consider themselves coders may still suffer from bugs13.

All of these values may vary widely depending on the field and the source of the software. Consider the following two scenarios, in which the values are nothing more than educated guesses (informed, at least, by deep experience in software engineering).

Scenario 1: A typical medium-scale bioinformatics analysis

  • 100,000 total LOC (neglecting trusted components such as the Linux kernel).

  • 20% executed

  • 10 errors per 1000 lines

  • 10% chance that a given error meaningfully changes the outcome

  • 10% chance that a consequent erroneous result is plausible

Multiplying these, we expect that two errors changed the output of this program run, so the probability of a wrong output is effectively 100%. All bets are off regarding scientific conclusions drawn from such an analysis.

Scenario 2: A small focused analysis, rigorously executed

Let’s imagine a more optimistic scenario, in which we write a simple, short program, and we go to great lengths to test and debug it. In such a case, any output that is produced is in fact more likely to be plausible, because bugs producing implausible outputs are more likely to have been eliminated in testing.

  • 1000 total LOC

  • 100% executed

  • 1 error per 1000 lines

  • 10% chance that a given error meaningfully changes the outcome

  • 50% chance that a consequent erroneous result is plausible

Here the probability of a wrong output is 5%.
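
The arithmetic behind both scenarios is simple enough to check directly. The Python sketch below (my own illustration) multiplies the five factors to get the expected number of result-affecting errors per run, and then, under the additional assumption (not made in the text) that errors occur independently, converts that expectation into a probability of at least one wrong output.

    import math

    def expected_errors(loc, frac_executed, errors_per_line, p_meaningful, p_plausible):
        """Expected number of errors that meaningfully and plausibly affect one run."""
        return loc * frac_executed * errors_per_line * p_meaningful * p_plausible

    def p_wrong_output(expected):
        """Probability of at least one such error, assuming errors occur independently."""
        return 1 - math.exp(-expected)

    # Scenario 1: typical medium-scale bioinformatics analysis
    e1 = expected_errors(100_000, 0.20, 10 / 1000, 0.10, 0.10)   # 2.0 expected errors
    # Scenario 2: small focused analysis, rigorously executed
    e2 = expected_errors(1_000, 1.00, 1 / 1000, 0.10, 0.50)      # 0.05 expected errors

    print(p_wrong_output(e1))   # ~0.86, i.e. "effectively 100%"
    print(p_wrong_output(e2))   # ~0.05, i.e. about 5%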

The factors going into the above estimates are rank speculation, and the conclusion varies widely depending on the guessed values. Measuring such values rigorously in different contexts would be valuable but also tremendously difficult. Nonetheless, it is sobering that some plausible values can produce high total error rates, and that even conservative values suggest that an appreciable proportion of results may be erroneous due to software defects, above and beyond those that are erroneous for more widely appreciated reasons.

Put another way: publishing a computed result amounts to asserting that the likelihood of error is acceptably low, and thus that the various factors contributing to the total error rate are low. In the context of a specific program, the first three factors (# LOC, % executed, and errors/line) can be measured or estimated. However, the last two (“meaningful change” and “plausible result”) remain completely unknown in most cases. In the following two sections I argue that these two factors are likely large enough to have a real impact. It is therefore incumbent on scientists to validate computational procedures, just as they already validate laboratory reagents, devices, and procedures, in order to convince readers of the absence of serious bugs.

Software is exceptionally brittle

A response to concerns about software quality that I have heard frequently, particularly from wet-lab biologists, is that errors may occur but have little impact on the outcome. This may be because only a few data points are affected, or because values are altered by a small amount (so the error is “in the noise”). The above estimates account for this by including a term for “meaningful changes to the result”. Nonetheless, in the context of physical experiments, it is tempting to believe that small errors tend to reduce precision but have less effect on accuracy: i.e., if the concentration of some reagent is a bit off then the results will also be just a bit off, but not completely unrelated to the correct result.

But software is different. We cannot apply our physical intuitions, because software is profoundly brittle: “small” bugs commonly have unbounded error propagation. A sign error, a missing semicolon, an off-by-one error in matching up two columns of data, etc. will render the results complete noise16. It is rare that a software bug would alter a small proportion of the data by a small amount. More likely, it systematically alters every data point, or occurs in some downstream aggregate step with effectively global consequences. In general, software errors produce outcomes that are inaccurate, not merely imprecise.
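
To make the brittleness concrete, here is a hypothetical sketch of my own (not taken from the cited work) of the off-by-one case: a single-row misalignment between two columns of strongly related data reduces their correlation to essentially zero.

    import random

    random.seed(0)
    x = [random.gauss(0, 1) for _ in range(1000)]
    y = [2 * xi + random.gauss(0, 0.1) for xi in x]   # y is tightly coupled to x

    def pearson(a, b):
        """Pearson correlation of two equal-length sequences."""
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
        va = sum((ai - ma) ** 2 for ai in a)
        vb = sum((bi - mb) ** 2 for bi in b)
        return cov / (va * vb) ** 0.5

    print(pearson(x, y))            # ~0.999: the true relationship
    print(pearson(x[:-1], y[1:]))   # ~0.0: one misaligned row and the signal is gone

The misaligned result is not “a bit off”; it is unrelated to the true relationship.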

Many erroneous results are plausible

Bugs that produce program crashes or completely implausible results are more likely to be discovered during development, before a program becomes “delivered code” (the state of code on which the above errors-per-line estimates are based). Consequently, published scientific code often has the property that nearly every possible output is plausible. When the code is a black box, situations such as these may easily produce outputs that are simply accepted at face value:

  • An indexing off-by-one error or other data management mistake associates the wrong pairs of X’s and Y’s14,15.

  • A correlation is found between two variables where in fact none exists, or vice versa.

  • A sequence aligner reports the “best” match to a sequence in a genome, but actually provides a lower-scoring match.

  • A protein structure produced from x-ray crystallography is wrong, but it still looks like a protein16.

  • A classifier reports that only 60% of the data points are classifiable, when in fact 90% of the points should have been classified (and worse, there is a bias in which points were classified, so those 60% are not representative).

  • All measured values are multiplied by a constant factor, but remain within a reasonable range.

Software errors and statistical significance are orthogonal issues

A software error may produce a spurious result that appears significant, or may mask a significant result.

If the error occurs early in an analysis pipeline, then it may be considered a form of measurement error (i.e., if it systematically or randomly alters the values of individual measurements), and so may be taken into account by common statistical methods.

However, the computed portion of a study typically comes after data collection, so its contribution to wrongness may easily be independent of sample size, replication of earlier steps, and other techniques for improving significance. For instance, a software error may occur near the end of the pipeline, e.g. in the computation of a significance value or of other statistics, or in the preparation of summary tables and plots.
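
As one hypothetical example of such a late-stage error (my own sketch, with illustrative numbers only): using the population standard deviation where the sample standard deviation belongs inflates a one-sample t statistic, and the inflated value still looks entirely plausible.

    import math
    import statistics

    def t_statistic(sample, mu0, buggy=False):
        """One-sample t statistic against a null mean mu0.

        With buggy=True, pstdev (population SD) is used where stdev (sample SD)
        belongs: a small, plausible-looking error in the final statistical step.
        """
        n = len(sample)
        sd = statistics.pstdev(sample) if buggy else statistics.stdev(sample)
        return (statistics.mean(sample) - mu0) / (sd / math.sqrt(n))

    sample = [0.8, 1.4, 0.2, 1.1, 0.9, 1.6, 0.3, 1.2, 0.7, 1.0]
    print(t_statistic(sample, mu0=0.5))               # ~2.99
    print(t_statistic(sample, mu0=0.5, buggy=True))   # ~3.15, silently inflated

Neither value would raise an eyebrow in a results table; only an independent recomputation would reveal the discrepancy.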

The diversity of the types and magnitudes of errors that may occur17–21 makes it difficult to make a general statement about the effects of such errors on apparent significance. However, it seems clear that a substantial proportion of the time (based on the above scenarios, anywhere from 5% to 100%) a result is simply wrong, rendering moot any claims about its significance.

Popular software is not necessarily less bug-prone

The dangers posed by bugs should be obvious to anyone working with niche or custom software, such as one-off scripts written by a graduate student for a specific project. Still, it is tempting to think that “standard” software is less subject to these concerns: if everyone in a given scientific field uses a certain package and has done so for years, then surely it must be trustworthy by now, right? Sadly, this is not the case.

In the open-source software community this view is known as “Linus’s Law”: “Given enough eyeballs, all bugs are shallow”. The law may in fact hold when many eyeballs really are reading and testing the code; however, widespread usage of the code does not produce the same effect. This has recently been demonstrated by the discovery of major security flaws in two extremely widely used open-source programs: the “Shellshock” bug in the bash command-line shell and the “Heartbleed” bug in the OpenSSL encryption library. In both cases, code that runs on a substantial fraction of the world’s computers is maintained by a very small number of developers. Despite the code being open source, “Linus’s Law” did not take effect simply because not enough people read it, even over the course of 25 years in the case of Shellshock.

This principle applies not only to the software itself, but also to computed results that are reused as static artifacts. For instance, it took 15 years for anyone to notice errors in the ubiquitous BLOSUM62 amino acid substitution matrix used for protein sequence alignment22.

Furthermore, even popular software is updated over time, and is run in different environments that may affect its behavior. Consequently, even if a specific version of a package running on a specific computer is considered reliable, that trust cannot necessarily be extended to other versions of the same software, or to the software when run on a different CPU or on a different operating system23.

What can be done?

All hope is not lost; we must simply take the opportunity to use technology to bring about a new era of collaborative, reproducible science24–26. Open availability of all data and source code used to produce scientific results is an incontestable foundation27–31. A culture of comprehensive code review (both within and between labs) can certainly help reduce the error rate, but is not a panacea. Beyond that, we must redouble our commitment to replicating and reproducing results, and in particular we must insist that a result can be trusted only when it has been observed on multiple occasions using completely different software packages and methods.
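
What “observed on multiple occasions using completely different software” might look like in practice is sketched below. This is my own minimal framing (run_pipeline_v1 and run_pipeline_v2 are placeholder names), and the choice of tolerance and comparison metric would need domain judgement.

    import math

    def results_agree(result_a, result_b, rel_tol=1e-6):
        """Check that two result vectors from independent implementations
        match element-wise within a relative tolerance."""
        if len(result_a) != len(result_b):
            return False
        return all(math.isclose(a, b, rel_tol=rel_tol)
                   for a, b in zip(result_a, result_b))

    # Hypothetical usage: only trust the result if two independently written
    # pipelines agree on the same input data.
    # assert results_agree(run_pipeline_v1(data), run_pipeline_v2(data))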

A flexible and open system for describing and sharing computational workflows32 would allow researchers to more easily examine the provenance of computational results, and to determine whether results are robust to swapping purportedly equivalent implementations of computational steps. A shared workflow system may thereby facilitate distributed verification of individual software components. Projects such as Galaxy33, Kepler34, and Taverna35 have made inroads towards this goal, but much more work is needed to provide widespread access to comprehensive provenance of computational results. Perhaps ironically, a shared workflow system must itself qualify as a “trusted component” (like the Linux kernel) in order to provide a neutral platform for comparisons, and so must be held to the very highest standards of software quality. Crucially, any shared workflow system must be widely used to be effective, and gaining adoption is more a sociological and economic problem than a technical one36. The first step is for all scientists to recognize the urgent need to verify computational results, a goal which goes hand in hand with open availability of comprehensive provenance via workflow systems, and with verification of the individual components of those workflows.
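
One small, concrete ingredient of such provenance tracking (a sketch of my own, not a description of Galaxy, Kepler, or Taverna) is to record, alongside every computed result, cryptographic hashes of the inputs plus the tool version and execution environment, so that later readers can at least ask whether two runs were genuinely comparable:

    import hashlib
    import platform
    import sys
    import time

    def provenance_record(input_paths, tool_name, tool_version):
        """Return a minimal provenance record for one computational step.

        A real workflow system would also capture parameters, intermediate
        outputs, and the full dependency graph.
        """
        def sha256(path):
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            return digest.hexdigest()

        return {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "tool": {"name": tool_name, "version": tool_version},
            "inputs": {path: sha256(path) for path in input_paths},
            "python": sys.version,
            "platform": platform.platform(),
        }

    # Hypothetical usage: provenance_record(["reads.fastq"], "my_aligner", "1.2.3")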

How to cite this article: Soergel DAW. Rampant software errors may undermine scientific results [version 2; peer review: 2 approved]. F1000Research 2015, 3:303 (https://doi.org/10.12688/f1000research.5930.2)