ABSTRACT
Surveys conducted by human interviewers are one of the principal means of gathering data from all over the world, but the quality of this data can be threatened by interviewer fabrication. In this paper, we investigate a new approach to detecting interviewer fabrication automatically. We instrument electronic data collection software to record logs of low-level behavioral data and show that supervised classification, when applied to features extracted from these logs, can identify interviewer fabrication with an accuracy of up to 96%. We show that even when interviewers know that our approach is being used, have some knowledge of how it works, and are incentivized to avoid detection, it can still achieve an accuracy of 86%. We also demonstrate the robustness of our approach to a moderate amount of label noise and provide practical recommendations, based on empirical evidence, on how much data is needed for our approach to be effective.
Index Terms
- Using behavioral data to identify interviewer fabrication in surveys