Abstract
When basic or descriptive summary statistics are reported, the entire sample of observations may be inadvertently disclosed, or members of the sample may be able to work out the responses of others. Three sets of frequently reported univariate summary statistics are considered: the mean and standard deviation; the median with the lower and upper quartiles; and the median with the minimum and maximum. The methodology assesses how often the full sample can be reverse engineered from the summary statistics. The R package uwedragon is recommended for assessing this risk for a given data set before the mean and standard deviation are reported. The disclosure risk is shown to be particularly high for small samples on a highly discrete scale, and it is reduced when alternatives to the mean and standard deviation are reported. An example is given to prompt discussion on the appropriate reporting of summary statistics, with attention also given to the box and whiskers plot, which is frequently used to visualise some of these statistics. Six variations of the box and whiskers plot are discussed to illustrate the disclosure issues that may arise. It is concluded that the safest summary to report is the three-number summary of the median and the lower and upper quartiles, which can be displayed graphically as a literal ‘boxplot’ with no whiskers.
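The reverse-engineering idea can be illustrated with a brute-force sketch. It is not the uwedragon implementation and does not reproduce the paper's methodology. It rests on several assumptions made here for illustration only: responses lie on a known discrete scale, the sample size is known, the reported standard deviation uses the n - 1 denominator, and both statistics are rounded to a known number of decimals. The function name candidate_samples and its parameters are hypothetical.

```python
from itertools import combinations_with_replacement
from statistics import mean, stdev


def candidate_samples(n, scale, reported_mean, reported_sd, decimals=2):
    """List every unordered sample of size n on `scale` that is consistent
    with the reported mean and sample standard deviation.

    A candidate is "consistent" when its mean and SD each lie within half a
    unit of the last reported decimal place (an assumed rounding rule).
    """
    tolerance = 0.5 * 10 ** (-decimals)
    matches = []
    # Order does not matter for summary statistics, so enumerate multisets.
    for sample in combinations_with_replacement(scale, n):
        if (abs(mean(sample) - reported_mean) < tolerance
                and abs(stdev(sample) - reported_sd) < tolerance):
            matches.append(sample)
    return matches


# Five responses on a 1-5 scale with reported mean 1.80 and SD 1.79:
# only one sample fits, so the full set of responses is disclosed.
print(candidate_samples(5, range(1, 6), 1.8, 1.79))  # [(1, 1, 1, 1, 5)]

# The same design with mean 3.00 and SD 1.41 fits more than one sample,
# so the statistics alone do not pin down the responses.
print(len(candidate_samples(5, range(1, 6), 3.0, 1.41)) > 1)  # True
```

A single match means that publishing the two statistics publishes the sample. The exhaustive search is only practical for small samples on short scales, which is also where the abstract reports the risk to be highest.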
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Derrick, B., Green, E., Ritchie, F., White, P. (2022). The Risk of Disclosure When Reporting Commonly Used Univariate Statistics. In: Domingo-Ferrer, J., Laurent, M. (eds) Privacy in Statistical Databases. PSD 2022. Lecture Notes in Computer Science, vol 13463. Springer, Cham. https://doi.org/10.1007/978-3-031-13945-1_9
DOI: https://doi.org/10.1007/978-3-031-13945-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13944-4
Online ISBN: 978-3-031-13945-1
eBook Packages: Computer Science (R0)