Curating for Accessibility
Anderson, Colón, Goben & Karcher

Accessibility of research data to disabled users has received scant attention in literature and practice. In this paper we briefly survey the current state of accessibility for research data and suggest some first steps that repositories should take to make their holdings more accessible. We then describe in depth how those steps were implemented at the Qualitative Data Repository (QDR), a domain repository for qualitative social-science data. The paper discusses accessibility testing and improvements on the repository and its underlying software, changes to the curation process to improve accessibility, as well as efforts to retroactively improve the accessibility of existing collections. We conclude by describing key lessons learned during this process as well as next steps.


Introduction
Digital technologies offer unique opportunities for making information widely accessible to people with disabilities. These opportunities extend to the accessibility of research data. In a 2015 article, Walker and Keenan coined the phrase "truly accessible research data": research data that are not merely available, but accessible to all users, including users with disabilities (Walker & Keenan, 2015). In the 6 years since, the conversation about accessible research data that Walker and Keenan hoped to start has, for the most part, not occurred. In this paper, we briefly survey the current state of accessibility for research data and research curation and then describe efforts at the Qualitative Data Repository (QDR) to adjust curation processes to mainstream accessibility. This work is rooted in a collaboration between three of its authors, whose work on broader questions of accessibility for research data (Colón, Goben, & Karcher, 2022) led to a review of, and significant updates to, processes at QDR.

Background
There are an estimated 1 billion people with disabilities worldwide. Until recently, the needs of this population to access digital resources have not been met, and they continue to be underserved (Goggin, 2021; Kazuye Kimura, 2018). As we show elsewhere, this is particularly true for research data (Colón et al., 2022). There is nothing inherently inaccessible about digital data itself; in fact, many people with disabilities are proficient with the technologies needed to access data (Beyene, 2018). Instead, decisions regarding the organization, formation, and portrayal of data can determine whether the content will be accessible to such a large population (Kazuye Kimura, 2018). By designing data repositories and their content with the needs of people with disabilities in mind, we can better include an often-overlooked population.
Moreover, disability advocates are calling for large institutions, such as universities and libraries, to work only with vendors who adhere to accessibility guidelines (Kazuye Kimura, 2018). This, coupled with changing accessibility policies worldwide and the potential for litigation when accessibility guidelines are not met, makes it pragmatically beneficial to work proactively toward greater accessibility (Kazuye Kimura, 2018). Often, it is more difficult and more expensive to retrofit content for accessibility than it is to include accessibility from the beginning (Kazuye Kimura, 2018), making it advantageous to start considering accessibility concerns as early as possible.
Additionally, while accessibility is generally thought of as benefitting the disability community, things created with disability needs in mind can also benefit the public at large (Goggin, 2021; Vollenwyder, Iten, Brühlmann, Opwis, & Mekler, 2019). For instance, making digital repositories easier to access visually can increase the usability of the platform overall, not just for those with visual disabilities. Many people using screen readers prefer content to be as streamlined as possible, making it easier for them to access material using keyboard (rather than mouse) navigation. At the same time, many users, especially users with learning disabilities, report that online resources are overly complex and difficult to navigate (Beyene, 2018). Thus, thinking through how to streamline content for screen reader users may not only benefit those users themselves but also improve the overall functionality of the content.

Defining Accessibility
As any literature search for "accessibility" and "research data" shows, the terms "accessible" and "accessibility" can have multiple definitions when related to research data.
1. Accessible data are in a format and location where they can be retrieved.
2. Accessible data are discoverable. For instance, can a researcher locate a record of the data in a repository, on a website, or through a publication? Is sufficiently rich metadata available for such discovery?
3. The FAIR (Findable, Accessible, Interoperable, Reusable) Principles (Wilkinson et al., 2016) add to this an emphasis on not just human, but machine access. Here, "accessible" means that either the data can be retrieved using a standard automated protocol (such as HTTP or SFTP) or that instructions for access, e.g., for restricted data, are specified in machine-readable metadata.
4. Accessibility can also signify socio-economic access. In other words, do only those with the ability to pay get access to data? This definition is most familiar in the context of open access to scholarly literature, but it extends equally to data, access to which may require subscriptions or licenses.
We focus here on a fifth definition of access: making data easy to locate, obtain, interpret, use, share, and analyze for everybody, deliberately including disabled people. This includes compliance with relevant laws and conventions such as the Americans with Disabilities Act (ADA) in the US, or the United Nations' Convention on the Rights of Persons with Disabilities (CRPD). Active inclusion, however, is a more radical concept that may require significant rethinking across many sectors. Compliance with applicable legislation is a minimal threshold, not the end goal of efforts for accessibility and inclusion.

Current Efforts for Accessible Curation
To date, accessibility of research data has received surprisingly little attention. A review of the literature did not find any publication directly addressing accessibility of research data for people with disabilities or the role of data curation for accessibility, beyond the above-mentioned Walker and Keenan (2015). Academic works on data curation that we consulted do not mention data curation for accessibility (Hudson-Vitale et al., 2017; Johnston, 2016). We also reviewed several documents on data preparation and/or curation published by social science data repositories (including one co-authored by one of the authors of this paper); none of them discusses how curation can and should facilitate access to research data for disabled people (Demgenski, Karcher, Kirilova, & Weber, 2021b; Eynden, Corti, Woollard, Bishop, & Horton, 2011; ICPSR, 2021).
We were similarly unsuccessful in finding references to accessibility for disabled researchers in widely used standards for data repositories and curation. Neither the FAIR Principles (Wilkinson et al., 2016), nor the US National Institutes of Health's new guidance for data repositories, nor the certification requirements for the CoreTrustSeal for Data Repositories include guidance about accessibility for all (CoreTrustSeal Standards and Certification Board, 2019; NIH, 2020).
As we start to think about practices for curation for accessibility, it is worthwhile to highlight some of the existing exceptions: For example, the curation primer for R, part of the data curation primers developed under the auspices of the Data Curation Network, contains specific sections on data accessibility (Kellam, Koziar, & Pejša, 2019). However, at the time of this writing, no other curation primer contains sections related to accessibility.

Curating for Accessibility in Practice
The amount of research data deposited in repositories is growing rapidly, and making all of these data accessible at the highest standards of accessibility is a daunting task, so daunting that we worry it may deter repositories from even taking meaningful first steps. We suggest three pragmatic steps to get started and to make the majority of our resources accessible to most of our users. To be clear, partial accessibility is not sufficient, but given the task at hand, we believe identifying and prioritizing high-impact efforts is nonetheless useful.
1. Ensuring the (web) accessibility of the data repositories themselves. In spite of existing standards for web accessibility and the legal requirement to implement these in many countries (e.g., under Section 508 of the Rehabilitation Act in the US), the websites of many academic institutions (Acosta-Vargas, Luján-Mora, & Salvador-Ullauri, 2017), libraries (Spina, 2019), and databases provided through academic libraries (Falloon & O'Reilly, 2020; Rysavy & Michalak, 2020) remain inaccessible. Initial testing indicates that this extends to many popular repositories and repository software solutions including, e.g., Harvard Dataverse (and the Dataverse software), Zenodo (and the Invenio software), and the GESIS data catalog.
2. Ensuring the accessibility of common data formats. While many traditional data formats (such as CSV) may be accessible by default, common formats for documentation and qualitative data (PDF, .docx) as well as domain-specific formats can vary significantly with respect to accessibility. Frequently, a set of relatively easy steps can ensure that they meet basic accessibility requirements.
3. Providing supplementary information for multimedia files. Such files are increasingly common in data repositories, both as qualitative data and as documentation. They pose particular challenges for users who are blind or have low vision as well as those who are deaf or hard of hearing. Wherever possible, repositories should include auxiliary information that makes these files more accessible.

Accessibility Efforts at the Qualitative Data Repository (QDR)
Starting in Fall 2021, QDR undertook a multi-step internal accessibility audit. In line with the steps outlined above, we initially focused on improving website accessibility, and then addressed our curation procedures. First, we modified the internal version of our curation handbooks (Demgenski, Karcher, Kirilova, & Weber, 2021a), ensuring that curated files are accessible. We then reviewed, and started to mitigate, accessibility issues for existing deposits.

Website Accessibility
Our initial process for determining the accessibility of our website was to use AXE DevTools, a free website accessibility checker, and to catalog all the errors that the checker found as GitHub issues. We performed such checks on all pages of our Drupal 9-based website, as well as every user-facing view of our Dataverse-based data repository. Broadly speaking, issues could be classified in four main categories:
1. Issues with information display that could quickly, and in bulk, be addressed by CSS changes. This is typically true, for example, for problems around insufficient font contrast.
2. Issues with specific page content that required individual attention by content creators, such as alt text, accessible tables, and proper link text.
3. Issues that required individual attention by a developer but could be addressed relatively quickly by adding to and improving the site's HTML. Such additions included page landmarks, button and form labels, and ARIA labels more generally.
4. Issues that required more significant technical adjustments. Those included the behavior of menus and forms (in particular, keyboard accessibility) as well as visual focus for keyboard/tab behavior. Issues in this category were the only ones that at times remained unresolved due to upstream dependencies.
We verified changes using the AXE tool and cross-checked with a second accessibility checker, WAVE, and against the general accessibility guidelines developed by Syracuse University and by the A11Y project (n.d.). Automated tools do not catch everything: manual checks are required, for example, to ensure full keyboard navigation with sufficient visual cues. Where our accessibility improvements applied to the upstream Dataverse software, QDR's development team submitted pull requests, leading to marked improvement of accessibility in the Dataverse software and its large userbase of data repositories worldwide (see especially Myers, 2021 as well as the Dataverse release notes for versions 5.8, 5.9, and 5.10).
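To illustrate the kind of markup-level problems in categories 2 and 3 above, the following is a minimal Python sketch, not the checks that AXE or WAVE actually perform. The class name BasicA11yChecker and the function check_html are hypothetical, and the three rules (images without an alt attribute, buttons without a text or ARIA label, a missing main landmark) are a small illustrative subset chosen for this example.

```python
from html.parser import HTMLParser


class BasicA11yChecker(HTMLParser):
    """Collects a few accessibility problems detectable from markup alone."""

    def __init__(self):
        super().__init__()
        self.issues = []        # human-readable problem descriptions
        self.has_main = False   # whether a main landmark was seen
        self._button = None     # state for the <button> currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            if "alt" not in attrs:
                # alt="" is permitted: it marks an image as decorative
                self.issues.append(f"img without alt attribute: {attrs.get('src', '?')}")
            elif self._button is not None and attrs["alt"]:
                # an image with alt text can serve as the button's label
                self._button["text"] += attrs["alt"]
        if tag == "main" or attrs.get("role") == "main":
            self.has_main = True
        if tag == "button":
            labelled = bool(attrs.get("aria-label") or attrs.get("aria-labelledby"))
            self._button = {"labelled": labelled, "text": ""}

    def handle_data(self, data):
        if self._button is not None:
            self._button["text"] += data

    def handle_endtag(self, tag):
        if tag == "button" and self._button is not None:
            if not self._button["labelled"] and not self._button["text"].strip():
                self.issues.append("button without text or ARIA label")
            self._button = None


def check_html(html):
    """Return a list of issues found in an HTML string (illustrative rules only)."""
    checker = BasicA11yChecker()
    checker.feed(html)
    checker.close()
    issues = list(checker.issues)
    if not checker.has_main:
        issues.append("no main landmark")
    return issues


if __name__ == "__main__":
    page = '<main><img src="chart.png"><button></button><button>Save</button></main>'
    for issue in check_html(page):
        print(issue)
```

A script of this kind can only flag what is visible in static markup. Keyboard behavior, visual focus, and the quality of alt text still require the manual checks described above.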

Accessible Curation Guidelines
The most commonly used software for our data project files (Microsoft Office/Adobe Acrobat) includes internal accessibility checkers, which are helpful for a rudimentary check of the accessibility of a given file. However, the results of these checkers do not always transfer from one software to another, and they often do not detect all accessibility issues. For example, a Microsoft Word file that is deemed accessible by its internal accessibility checker does not necessarily convert into an accessible PDF; it is therefore necessary to check the accessibility of the final data file even if other accessibility checks were made along the way. As noted above, there is no specific guidance for the accessibility of research data stored in repositories. However, most of QDR's data projects consist of a relatively small number of different file formats (PDF, .docx, .xlsx, CSV, JPG, and .mp4). By exploring accessibility guidance for each of the file formats, we created a step-by-step guide for the curation of each file type.
CSV: By the nature of a CSV file, as long as the data contain a single row of headings in the first row and contain no formulas, the file is considered accessible.
MS Excel (.xlsx) files require more checking depending on what is contained within the Excel file. As a general rule, data should be flush with the top left corner and start in cell A1. Additionally, using multiple blank cells for spacing reasons should be avoided. Instead use a single blank row or column resized to the appropriate width and length. The MS Excel Accessibility checker can also be used but it does not detect all accessibility issues, so it should not be used as the only accessibility check for an Excel file.
Docx: When source files are MS Word (.docx) files there are some basic accessibility checks that are easier to implement in Word rather than in Adobe PDF. For example, ensuring meaningful hyperlink text (instead of URL), visually accessible font, and bulleted lists are all easier to edit in Word.
PDF: The majority of our data files are PDF files, and Adobe Acrobat includes not only an accessibility checker but also a "Make Accessible" tool that helps ensure that the document is tagged appropriately and flags any issues that need to be fixed.
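The CSV criteria above (headings in the first row, no formulas) lend themselves to an automated pre-check. The following Python sketch is one possible way to do this and is not part of QDR's published workflow. The function name check_csv_accessibility is hypothetical, and two rules are assumptions made for this example: a cell is treated as formula-like if it starts with "=", and every data row is expected to have as many cells as the heading row.

```python
import csv
import io


def check_csv_accessibility(text):
    """Return a list of problems found in CSV text (illustrative heuristics only)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    # every column should have a non-empty heading in the first row
    if any(not cell.strip() for cell in header):
        problems.append("first row has an empty heading")
    for row_no, row in enumerate(rows[1:], start=2):
        # assumption: a regular table has the same number of cells in each row
        if len(row) != len(header):
            problems.append(f"row {row_no} has {len(row)} cells, expected {len(header)}")
        for cell in row:
            # assumption: a leading "=" indicates a spreadsheet formula
            if cell.lstrip().startswith("="):
                problems.append(f"row {row_no} contains a formula-like value: {cell!r}")
    return problems


if __name__ == "__main__":
    sample = "id,name,total\n1,Ann,=SUM(A1:A2)\n2,Bob\n"
    for problem in check_csv_accessibility(sample):
        print(problem)
```

Such a check cannot tell whether the first row actually contains headings rather than data, so a curator still needs to look at the file.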
Typically, internal accessibility checkers do not have the ability to detect certain visual issues, such as color contrast, which must be checked by a user. While black text on a white background is considered accessible, use of color to convey information may require a more in-depth review. A simple way to check color contrast is to view the document in black and white and see whether it is still readable. For more complicated projects, external color contrast software can be used to determine the ratio between the background color and foreground color. WebAIM has developed a tool that determines the contrast ratio and whether it is considered compliant with the W3C Web Content Accessibility Guidelines (WCAG). The tool also incorporates an eye dropper that allows the user to click on a color on the screen, and the tool will determine the exact color specifications.
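For readers who want to see what such a contrast tool computes, the following Python sketch implements the contrast ratio as defined in WCAG 2.x: the relative luminance of each color is computed from its sRGB channels, and the ratio is (L_lighter + 0.05) / (L_darker + 0.05). The function names are our own, and the thresholds used in meets_wcag_aa (4.5:1 for normal text, 3:1 for large text) are the WCAG level AA values; which conformance level a repository targets is a separate decision that this sketch does not settle.

```python
def _linearize(channel):
    """Convert an 8-bit sRGB channel value to its linear-light value (WCAG 2.x)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a color given as '#RRGGBB' or 'RRGGBB'."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground, background):
    """WCAG contrast ratio between two colors, from 1 (identical) to 21."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def meets_wcag_aa(foreground, background, large_text=False):
    """Check the WCAG level AA minimum: 4.5:1 for normal text, 3:1 for large text."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)


if __name__ == "__main__":
    for fg in ("#000000", "#767676", "#777777"):
        ratio = contrast_ratio(fg, "#FFFFFF")
        print(fg, round(ratio, 2), meets_wcag_aa(fg, "#FFFFFF"))
```

The two gray values in the usage example sit on either side of the 4.5:1 threshold against a white background, which shows how small a color difference can decide compliance.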
We consulted various resources for each common data type and then developed efficient, step-by-step workflows to ensure accessibility, which we incorporated into our standard curation guidelines and checklists (see Demgenski et al., 2021a, though note that the accessibility workflows have not yet been added to the public version). For example, Adobe Acrobat's "Make Accessible" tool does not solve all accessibility/compliance issues and requires a number of issues to be corrected by hand. Moreover, some features, such as color contrast, reading order, and alt text for figures and images, must always be reviewed by hand. We include step-by-step instructions on how to review and correct such issues.

Retroactive Curation for Accessibility
We began looking at retroactive curation for accessibility by creating an inventory of data projects and the expected time needed to address most accessibility issues. For most projects, this ranged between 10 minutes and 10 hours, depending on their size and complexity. For a small subset, consisting of large collections either of scanned archival documents with poor-quality or no optical character recognition (OCR) or of videos without subtitles, remediating accessibility would exceed 100 hours per project and may prove impossible.
The length of time it takes to retroactively curate data projects depends heavily on the content of the projects in addition to the length and number of files. For example, videos should have captions, a transcript, and, ideally, an audio description of visual events in order to be considered accessible. For a project that consists of multiple lengthy video files, such work would take considerable time. In contrast, Adobe Acrobat's built-in tools can considerably reduce the amount of time spent making PDFs accessible. The tools are fast, relatively hands-off (except for errors that might require correction by hand), and can be run in bulk using Acrobat's "Action Wizard" functionality. Data projects that consist primarily of transcripts usually require minimal correction by hand, and most can be completed within 20 to 30 minutes. Some PDF features add significant labor, however. For example, any PDF with redactions (using the built-in redaction tool rather than, as recommended by QDR, replacements in square brackets in the original text) requires that every redaction have alternate text, since Adobe Acrobat treats redactions as "figures." Auto-tagging also does not work reliably with PDF redactions.
As the next step, we worked from the curated data files as they existed before any file-format conversion for preservation purposes (e.g., to PDF/A) was conducted. Files are stored at multiple stages of QDR's curation process, facilitating this step and allowing us to work with the most suitable version. A majority of our files were PDFs, so we ran the Make Accessible process in Adobe Acrobat to ensure that all files were tagged appropriately, the correct reading language was set, and all figures were given alt text. After the Make Accessible process, we reconverted the files to the appropriate preservation format for re-archiving.

Preserving source integrity
Previously published data projects require a different approach for accessibility. Including accessibility in the curation steps for a newly submitted data project allows the researcher to review any changes made before publication, including important changes to the visual display of files that may enhance transparency. Retroactively curating published data projects does not allow for this approval. We therefore focused primarily on the non-visual aspects of curation for accessibility. For example, a sans serif font is widely recommended for users with dyslexia (Rello & Baeza-Yates, 2013). While curating a current data project could incorporate a font change into the documents, retroactive curation does not allow the researcher to approve this change before publication. However, alternative steps can be taken to ensure that readers with varying levels of vision can still access the data project, such as ensuring that the text, headers, and tables are appropriately tagged for assistive screen readers. Such changes do not alter the visual appearance of the document and can thus be applied even to documents used as part of the research process, such as study information sheets read by participants.
In addition, any file change on QDR triggers an automatic version update of the data project. In the notes of the new version, we record the reason for the version update. Old versions and associated files remain accessible, should any concerns about the impact of accessibility transformation surface.

Initial Lessons
A number of important lessons emerged from working to improve the accessibility of our collections. These lessons have informed our own processes, but may also facilitate similar efforts at other repositories.
Seek the collaboration and feedback of disabled users. Working towards better accessibility should not just be an exercise in compliance; it should be rooted in the necessity to make research data truly accessible to all users. It is therefore indispensable to work with disabled users, testers, and, if possible, developers to ensure that any measures taken are successful. For example, one of the authors of this paper (Colón), who uses assistive technology, found that an early iteration of QDR's improved accessibility workflow for PDFs produced only modest improvements, as tagged headings were not recognized by the screen reader. More generally, working with disabled users and developers can turn up shortcomings of compliance-oriented solutions and offer important insights into how users interact with your holdings and your site.
Built-in accessibility checkers are convenient but not sufficient. Many programs (e.g., Microsoft Office, Adobe Acrobat) have begun incorporating accessibility checkers. However, these checkers are not exhaustive: they can falsely claim compliance with accessibility criteria, and they can show compliance where actual accessibility remains poor. Although these built-in checkers are a great tool for assisting in accessibility curation, they cannot be the only guidelines used.
Balance between perfect and good enough. Some compliance features are of limited importance for disabled users, and given the time and budget constraints of curating data, it can be worthwhile to weigh costs and benefits for particularly labor-intensive steps. This is not always possible, however. For example, good-quality captioning is essential for the accessibility of videos, yet is a very time-consuming task. Many repositories will meet situations where data cannot feasibly be made accessible (even in the compliance sense of the term) and should have a policy for such situations.
Re-curating past projects for accessibility is an opportunity. Revisiting older projects to ensure accessibility offers the opportunity to audit them for other inconsistencies. Throughout QDR's re-curation process, we have found documents that do not follow current curation best practices, e.g., missing recommended metadata. In addition to addressing accessibility issues, revisiting existing projects has thus improved the overall quality of our curated collections.
Subject expertise matters for accessibility. Adding alternate text for images is an important part of making formatted documents accessible. A key to writing effective alt text is understanding what the content and most salient features of each image are. Where curators are not subject-matter experts, however, this can be difficult. This is one of the areas in which building accessibility into workflows early, and, for example, working with depositors to have them supply alt text that requires subject expertise, is essential.

Conclusion: Next Steps
Making research data fully accessible to all users is an ideal. Much of our initial work on accessibility has still been focused on compliance with applicable standards (and laws). This is not just because we are subject to those laws, but also because standards such as WCAG do reflect many good practices in improving accessibility. We view this as the beginning of our efforts, though, not as their end. As an immediate next step, we hope to integrate accessibility testing more systematically into standard software deployment routines, both for QDR and for the Dataverse software at large: updates that break accessibility should no more pass QA tests than updates that mis-render graphical elements or disable clickable buttons.
As discussed above, improving accessibility design and testing requires collaborating with disabled users. An important part of ensuring accessible data in the long term will be to build and sustain relationships. This includes working with (and paying) disabled users working with different accessibility features and technologies to regularly test repository software and subsamples of curated data. It also includes building lasting relationships with disabled researchers to encourage them to report any remaining issues they encounter, knowing that reported accessibility issues will be properly understood and treated as serious usability bugs.
Finally, taking accessibility seriously may (and should) lead to questioning fundamental assumptions of our curation and preservation work. One example for QDR is the dominant role of PDF files for large textual data. PDFs have a range of attractive properties, such as their consistent display of information and the well-defined PDF/A archival standard. In spite of the existence of an accessibility standard for PDFs, however, they are poorly suited for use with assistive technologies, even compared to MS Word files: structural tagging works unreliably with screen readers, and options to enlarge fonts then lead to uncomfortable side-to-side scrolling. PDFs do not (with reasonable effort) allow users to change fonts or colors, and, unlike HTML (and the HTML-based EPUB format), they do not readily allow for including semantic information, such as the language of a foreign-language term to facilitate accurate pronunciation by screen readers. As repositories, we need to consider offering alternative formats to PDF files wherever possible.
Curating for accessibility is an on-going, multifaceted process that touches almost every component of a repository's operations from user guidance to curation, from software development to testing. We hope that this account of QDR's first steps in building more accessible practices will encourage other repositories to develop or improve upon their approach, so that the research data we curate and preserve are truly accessible.