Digitizing dissertations for an institutional repository: a process and cost analysis

Objective: This paper describes the Lamar Soutter Library’s process and costs associated with digitizing 300 doctoral dissertations for a newly implemented institutional repository at the University of Massachusetts Medical School.
Methodology: Project tasks included identifying metadata elements, obtaining and tracking permissions, converting the dissertations to an electronic format, and coordinating workflow between library departments. Each dissertation was scanned, reviewed for quality control, enhanced with a table of contents, processed through an optical character recognition function, and added to the institutional repository.
Results: Three hundred and twenty dissertations were digitized and added to the repository for a cost of $23,562, or $0.28 per page. Seventy-four percent of the authors who were contacted (n = 282) granted permission to digitize their dissertations. Processing time per title was 170 minutes, for a total processing time of 906 hours. In the first 17 months, full-text dissertations in the collection were downloaded 17,555 times.
Conclusion: Locally digitizing dissertations or other scholarly works for inclusion in institutional repositories can be cost effective, especially if small, defined projects are chosen. A successful project serves as an excellent recruitment strategy for the institutional repository and helps libraries build new relationships. Challenges include workflow, cost, policy development, and copyright permissions.


INTRODUCTION
Digitization projects in libraries seem ubiquitous as libraries become increasingly involved in the acquisition, development, and management of digital information [1]. Libraries typically target archival and special collections materials such as historical documents and photographs [2]. Projects to digitize vast collections of books began as early as 1971 with Project Gutenberg and are now getting widespread media attention with the launch of Google Book Search, the Internet Archive, and others [3]. In an April 2007 list of ten assumptions about the future that would significantly impact academic libraries and librarians, the Association of College & Research Libraries Research Committee placed digitization at the top of the list, stating, "There will be an increased emphasis on digitizing collections, preserving digital archives, and improving methods of data storage and retrieval" [4].
A related emergent trend in academic libraries is the implementation of institutional repositories (IRs), digital collections that capture and preserve the intellectual output of university communities [5]. A search of OpenDOAR, the Directory of Open Access Repositories, lists 298 academic repositories in North America ... and 70 are planning to add or are considering offering a repository [7]. According to Foster and Gibbons, libraries build IRs because they "provide an institution with a mechanism to showcase its scholarly output, centralize and introduce efficiencies to the stewardship of digital documents of value, and respond proactively to the escalating crisis in scholarly communication" [8].

Highlights
- Seventy-four percent of dissertation authors (209/282) gave permission for the digitization. The cost to process the entire dissertation collection in-house was $23,562, only $1,062 more than the estimate to outsource.
- Digitizing the dissertation collection increased access: the print collection was used 723 times in the past 5 years, while the electronic collection was used 17,555 times in 17 months.

Implications
- Digitizing student works is an effective way to begin populating an institutional repository.
- In-house digitization projects can be cost-competitive with outsourced alternatives.
- A repository can be a catalyst for developing relationships in the institution by providing the library with a new avenue for outreach.
- Skills and experience gained from a small project can be applied to larger-scale projects.

Medical librarians are just beginning to report their experiences with institutional repositories in the professional literature [9][10][11][12][13]. In one case study, Krevit and Crays [13] [15] confirmed that the challenges experienced at the Texas Medical Center were not unique. Libraries are the drivers of IRs at their institutions, as few faculty members identify and self-archive their own materials. Libraries struggle to recruit content and employ a variety of strategies to enlist submissions [16][17][18][19]. Content may vary, but a recent study by McDowell reports that student works account for the largest percentage of documents in institutional repositories, approximately 41.5% [19].
The following case study describes a nexus of these two trends: digitization of student scholarly works and institutional repositories. The first digitization project for the Lamar Soutter Library at the University of Massachusetts Medical School (UMMS) was to digitize 300 doctoral dissertations and add the full text to the school's new IR.

BACKGROUND
Founded in 1970, UMMS encompasses the graduate schools of medicine, nursing, and biomedical sciences. The Lamar Soutter Library holds 175,000 print volumes and provides access to 316 databases, 4,650 electronic journals, and 359 electronic books. The IR is the library's first comprehensive digital initiative.
In early 2006, the library purchased a license for ProQuest Digital Commons, a hosted institutional repository system, and named the repository "eScholarship@UMMS" <http://escholarship.umassmed.edu>. The team implementing the repository, a previously reported process [12], consisted of representatives from the library's systems (project management and technical support), cataloging (metadata support), and reference (outreach) departments. In March 2006, the dean of the UMMS Graduate School of Biomedical Sciences (GSBS) expressed interest in digitizing the school's dissertations. The GSBS had produced 300 dissertations, most of which were available only in print format. The team thought this would be an excellent demonstration project: it was supported by the dean, it was a manageable size, metadata could be reutilized from the library's online public access catalog (OPAC), and the dissertation authors held the copyright. In May 2006, the library and GSBS partnered to make the dissertations fully searchable on the web.

Outsourcing versus insourcing
The team investigated 2 options for digitizing the dissertations: outsourcing to UMI or performing the work in-house. UMI estimated the cost to be $75 per title ($22,500 total) and 8-12 weeks processing time. The library based its own estimate on staff scanning and locally preparing 3 sample dissertations. Table 1 shows the library's cost estimate of $27,750 (for staffing, project management, equipment, and software) and 725 hours of processing time (or 18 weeks when represented as a 40-hour work week). In all instances, except for project management, the team assumed the work would be performed by temporary help. The team had 2 concerns about outsourcing: at the time, electronic files created by UMI were not full-text searchable, and the graduate school would need to commit to sending all future dissertations to UMI to keep the database current.
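The comparison in this paragraph reduces to a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted above; the variable names are illustrative and do not come from the project's own records.

```python
# Figures quoted in the text; all variable names are hypothetical.
TITLES = 300                     # dissertations in the original collection
UMI_COST_PER_TITLE = 75          # dollars per title, UMI quote
LIBRARY_COST_ESTIMATE = 27_750   # dollars, in-house estimate (Table 1)
LIBRARY_HOURS_ESTIMATE = 725     # hours, in-house estimate (Table 1)
HOURS_PER_WORK_WEEK = 40

umi_total = TITLES * UMI_COST_PER_TITLE
extra_cost_in_house = LIBRARY_COST_ESTIMATE - umi_total
weeks_in_house = LIBRARY_HOURS_ESTIMATE / HOURS_PER_WORK_WEEK
minutes_per_title = LIBRARY_HOURS_ESTIMATE * 60 / TITLES

print(f"UMI total:               ${umi_total:,}")
print(f"In-house premium:        ${extra_cost_in_house:,}")
print(f"In-house duration:       {weeks_in_house:.1f} work weeks")
print(f"Estimated time per title: {minutes_per_title:.0f} minutes")
```

On these inputs the in-house estimate exceeds the UMI quote by $5,250, and the 725 hours correspond to roughly 18 work weeks, or 145 minutes per title.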
The project team recommended that the library process the dissertations in-house, despite longer time to process and higher cost, in order to gain experience, retain access to materials throughout the project, and have tighter control over scanning quality. Library administration accepted the recommendation to do the digitization locally, citing "gaining experience" as the major benefit; however, $27,750 was not available to fund the project. Ten thousand dollars was allotted to hire temporary staff, with the understanding that circulation staff and interlibrary loan equipment would be utilized for scanning and team catalogers would add dissertations to the repository. It was also recognized that the project could not be completed in 18 weeks as staff assigned to the project would need to incorporate the dissertation tasks into their daily workload.

Metadata
To fully utilize metadata from the library's integrated library system, team catalogers customized default templates in the Digital Commons software designed to control the indexing and display of a collection of records. Customizations were necessary to fully describe the dissertations and incorporated features such as the activation of live link functionality in fields where uniform resource locators (URLs) might be included, the addition of a field to record authors' UMMS departmental affiliations, and the accommodation of Medical Subject Headings by changing the field delimiter from a comma to a semicolon.

Piorun and Palmer

For instance, "Libraries, Medical" and "Library Technical Services" previously displayed as "Libraries" and "Medical; Library Technical Services." Catalogers copied and pasted title and subject data from the OPAC into the repository manually, using macros when possible. Although the Digital Commons software included a batch loader, it was not used in the submission process because the loader relied on a separate extensible markup language (XML) schema that, at the time, could not be configured to match the customized dissertation templates.
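The effect of the delimiter change can be illustrated with a short sketch. It assumes the subject field is held as a single string and split on one delimiter character; the function name and this storage model are hypothetical and are not taken from the Digital Commons software.

```python
def split_subjects(field: str, delimiter: str) -> list[str]:
    """Split a subject field into headings on the given delimiter."""
    return [part.strip() for part in field.split(delimiter) if part.strip()]


# One field holding two Medical Subject Headings, the first with an internal comma.
raw_field = "Libraries, Medical; Library Technical Services"

# Comma as delimiter: the heading "Libraries, Medical" is broken apart.
print(split_subjects(raw_field, ","))

# Semicolon as delimiter: both headings survive intact.
print(split_subjects(raw_field, ";"))
```

Splitting on a comma reproduces the faulty display described above ("Libraries" and "Medical; Library Technical Services"), while splitting on a semicolon yields the two intended headings.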

Digitization and submission process
Using alumni contact data provided by the graduate school, library staff wrote to the dissertation authors to request copyright and digitization permissions.
Alumni were asked to grant permission immediately, while current graduates were given the option to add only an abstract and delay adding the full text for one year to allow for publishing opportunities. Initially, only dissertations for which the library secured permissions were scanned and processed. Once those were completed, the project team decided to scan the remaining dissertations, add records with the abstracts to the repository, and store the full-text files until permission was obtained. Dissertations averaged 250 pages in length and were single-sided, with a mix of text, tables, graphs, and images. They were scanned using a Canon Image Runner 3300 with eCopy version 3.1, a software program used for scanning, optical character recognition (OCR), and portable document format (PDF) creation. Figures 1 and 2 illustrate the digitization process. Figure 3 shows a typical dissertation record in eScholarship@UMMS.

An unexpected step to alleviate privacy concerns
As the project neared completion, the dean of the graduate school expressed concern about the signature pages of the dissertations being made public. The team asked ProQuest's UMI Dissertation Publishing about its policy on this issue and learned UMI stopped scanning signature pages in 2005. The team concluded it was worth the additional time and cost to re-create a "blank" signature page for each dissertation, which would retain the names of the advisor and review committee without their signatures (this information was not stored elsewhere). The new signature pages were created and reinserted into the PDF files. Cataloging staff then substituted the revised PDFs in eScholarship@UMMS.

RESULTS
A total of 320 documents were processed: 300 dissertations from previous graduates and 20 submitted by an additional 20 students over the course of the project. The project team successfully contacted 282 of the 320 authors, and 209 (74%, 209/282) granted permission to digitize their dissertations. The dissertations (or records providing abstracts only in cases where permission was not granted) were all available online by March 2007.
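The permission figures can be recomputed from the counts in this paragraph. This is a sketch only; the variable names are illustrative, and the share of all 320 documents is a derived figure that the text does not report.

```python
# Counts quoted in the text; variable names are hypothetical.
total_documents = 320
authors_contacted = 282
permissions_granted = 209

authors_not_reached = total_documents - authors_contacted
rate_among_contacted = permissions_granted / authors_contacted
share_of_all_documents = permissions_granted / total_documents  # derived, not reported

print(f"Authors not reached:       {authors_not_reached}")
print(f"Permission rate (reached): {rate_among_contacted:.1%}")
print(f"Share of all documents:    {share_of_all_documents:.1%}")
```

The rate among contacted authors rounds to the 74% reported above; measured against all 320 documents, the share with permission is lower because 38 authors could not be reached.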

Processing time
Actual processing times are summarized in Table 1. Processing the materials took 906 hours in total, exceeding the original estimate of 725 hours by 181 hours. One hundred fifty-nine hours of this difference can be attributed to the unexpected need to replace the signature pages in each dissertation. The total duration of the project was 12 months, as the circulation staff members who scanned the dissertations were not assigned to the project full time. They scanned on average 2 dissertations per night and 5 on weekends. Spreading the work over the course of 1 year allowed for multiple attempts to contact alumni for permission.
Closer analysis of the estimated and actual time needed per dissertation shows 2 important factors. First, the initial time estimate to process a dissertation was low (145 minutes vs. 170 minutes); however, if the additional step of replacing the signature pages had not been required, the original estimate would have been accurate. Second, regardless of the difference in the total time needed per dissertation, some important areas were underestimated, such as the time to OCR the abstract and overall project management. Issues that contributed to this miscalculation include the extra time to correct the scientific notation in the OCR process and the total project management time required to obtain permissions from authors to digitize their work.
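The per-title figures discussed above follow from the reported totals. The sketch below recomputes them, assuming the estimate applied to the original 300 titles and the actual hours to all 320 documents; the variable names are illustrative.

```python
# Totals quoted in the text; variable names are hypothetical.
estimated_hours, estimated_titles = 725, 300
actual_hours, actual_titles = 906, 320
signature_page_hours = 159  # unplanned work to replace signature pages

overrun_hours = actual_hours - estimated_hours
estimated_minutes = estimated_hours * 60 / estimated_titles
actual_minutes = actual_hours * 60 / actual_titles
minutes_without_signature_work = (actual_hours - signature_page_hours) * 60 / actual_titles

print(f"Overrun:                      {overrun_hours} hours")
print(f"Estimated time per title:     {estimated_minutes:.0f} minutes")
print(f"Actual time per title:        {actual_minutes:.0f} minutes")
print(f"Actual without signature work: {minutes_without_signature_work:.0f} minutes")
```

With the signature-page hours removed, the actual time per title comes to about 140 minutes, slightly under the 145-minute estimate, which is consistent with the statement that the original estimate would otherwise have been accurate.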

Equipment and software
The work was accomplished using existing library scanning equipment. The library already owned copies of the software used throughout the process: Microsoft Access, eCopy, Adobe Acrobat, and Adobe Illustrator. Because eCopy came with a scaled-down version of the Readiris OCR software, the library purchased 3 copies of the full Readiris program for a total of $990; however, these were not used in the project because the 2 versions conflicted. Thus, the original estimate of $10,000 for equipment and software was too high.

Labor
Actual labor costs, as shown in Table 1

Digitization process
- Quality control: Page orientation, page order, completeness, and image quality were reviewed and corrected. A searchable PDF was created from the eCopy file to be used throughout the rest of the process.
- Table of contents: Each major section of the dissertation received an entry in the table of contents, including title page, list of figures, and chapters.
- Optical character recognition (OCR) abstract: An image-based version of the abstract was created using eCopy and processed through the OCR function, then exported as a text file. The text file was then examined for capitalization, punctuation, spacing, and scientific notations; marked up in HTML to preserve original layout; and used by the cataloger adding the metadata for the dissertation.
- Addition to eScholarship: Catalogers reutilized data from the library's online public access catalog to create the repository record. Information pertaining to copyright, quality of the scanned images, and missing pages was added as applicable. The HTML version of the abstract was copied into the record, as well as a link to the library's holdings information for the dissertation. The full-text PDF was subsequently added if the author granted permission. For dissertations with an abstract only, the cataloger added a comment to the record stating the library was in the process of obtaining permission from the author. When permission was obtained, the record in eScholarship@UMMS was revised to add the full-text PDF and edit the Comments field. The catalogers also added a link from the library catalog to the repository.

Figure 2
Process to convert and add dissertations to repository
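The step in which the OCR text of an abstract is marked up in HTML to preserve its layout could look something like the following. This is a sketch under stated assumptions, not the library's actual procedure: it assumes paragraphs in the OCR text file are separated by blank lines, and the function name is hypothetical.

```python
import html


def abstract_to_html(text: str) -> str:
    """Wrap blank-line-separated paragraphs in <p> tags, keeping line breaks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    rendered = []
    for paragraph in paragraphs:
        # Collapse runs of spaces left by OCR, then escape &, <, > and quotes.
        lines = [html.escape(" ".join(line.split())) for line in paragraph.splitlines()]
        rendered.append("<p>" + "<br>\n".join(lines) + "</p>")
    return "\n".join(rendered)


sample = "CDK  activity < 5%\nsecond line\n\nNext paragraph"
print(abstract_to_html(sample))
```

Checks on capitalization, punctuation, and scientific notation, which the process above describes as a manual review, are deliberately left out of the sketch.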

Staff development
Team members became more familiar with the repository software, metadata standards, scanning, and OCR technologies and developed closer working relationships on the team and between departments. The team developed a greater awareness of the importance of copyright compliance.

CONCLUSION
Many libraries have viewed digitizing collections as too expensive an undertaking in this time of tight budgets [20]. Chapman of Harvard University states the costs for scanning, OCR, and quality control work can be as much as 48% of a project's total costs [21]. Equivalent costs throughout the Lamar Soutter Library's dissertation project match this estimate (47.79%). Using Chapman's group of activities (scanning, OCR, and quality control), the per-page cost to process black-and-white text in a bound volume can range from $0.10 to $1.40 [22,23]. Both these figures are based on outsourcing the work. The Lamar Soutter Library's internal costs were competitive with these estimates, at $0.28 per page. This suggests the cost to digitize may be within the reach of many medical libraries and a viable option to populate institutional repositories.
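The unit costs can be cross-checked against the totals reported elsewhere in the paper. The sketch below is illustrative only: the exact page count is not given in the text, so it is inferred from the reported $0.28 per page and compared with the approximate 250-page average; all variable names are hypothetical.

```python
# Totals quoted in the paper; variable names are hypothetical.
total_cost = 23_562            # dollars, actual in-house cost
titles = 320                   # dissertations processed
reported_cost_per_page = 0.28  # dollars
approx_pages_per_title = 250   # stated average, an approximation
outsourcing_estimate = 22_500  # dollars, UMI quote for 300 titles

cost_per_title = total_cost / titles
implied_pages = total_cost / reported_cost_per_page       # inferred, not reported
implied_pages_per_title = implied_pages / titles
cost_per_page_at_250 = total_cost / (titles * approx_pages_per_title)
difference_from_outsourcing = total_cost - outsourcing_estimate

print(f"Cost per title:            ${cost_per_title:.2f}")
print(f"Implied pages per title:   {implied_pages_per_title:.0f}")
print(f"Cost per page at 250 pp.:  ${cost_per_page_at_250:.2f}")
print(f"Above outsourcing estimate: ${difference_from_outsourcing:,}")
```

The reported $0.28 per page implies an average of about 263 pages per dissertation, close to the rounded 250-page average; using 250 pages instead gives roughly $0.29 per page, still well inside the $0.10 to $1.40 range cited above.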
The usage statistics for this collection indicate that by disseminating the dissertations through eScholarship@UMMS, which is indexed by Google, access and use increased substantially. Studies indicate that individuals who publish their research online in addition to publishing in traditional scholarly venues are cited more often than those who rely solely on paper publications [24][25][26][27]. In digitizing the GSBS dissertations, the library has assisted in making the school's research more widely available.
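The scale of the increase can be made concrete by converting the usage counts reported in the paper (723 uses of the print collection in 5 years, 17,555 full-text downloads in 17 months) to monthly rates. This is a rough sketch: print uses and downloads are not the same measure, so the ratio is indicative only, and the variable names are illustrative.

```python
# Usage counts reported in the paper; variable names are hypothetical.
print_uses, print_months = 723, 5 * 12
downloads, download_months = 17_555, 17

print_uses_per_month = print_uses / print_months
downloads_per_month = downloads / download_months
ratio = downloads_per_month / print_uses_per_month

print(f"Print uses per month: {print_uses_per_month:.2f}")
print(f"Downloads per month:  {downloads_per_month:.0f}")
print(f"Ratio:                {ratio:.0f}x")
```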
The team faced challenges such as workflow, cost concerns, policy development, and permissions. Communication and coordination between internal and external departments were vital and minimized errors. As the team learned, regardless of the amount of planning and thought that goes into a project, there is always the possibility that each record or file will need to be reworked. Decisions made in processing the dissertations set a precedent for future collections, such as adding documents without the full text if permission has not been obtained. The team acknowledges this could frustrate users who cannot get access to the full text. The team has worked hard to contact as many dissertation authors as possible to keep incomplete records to a minimum.
Nolen and Costanza described their experience in populating the repository at Trinity University, which also focused on student works, by noting, "it's important to start small, choosing projects that have usefulness to our constituents" [28]. The Lamar Soutter Library also found having a small, defined project had many benefits. It allowed the team to experience an early success and manage staff and resources by gradually incorporating the work. The team also gained experience with Digital Commons, metadata standards, and copyright. Additionally, this project served as a recruitment strategy to other campus departments through coordinated promotion by GSBS and the library for further population of the institutional repository. New materials recruited include student works, nursing dissertations, and faculty publications, a small portion of which required digitization.

Figure 3
Typical dissertation record in eScholarship@UMMS
ABSTRACT: In order to generate healthy daughter cells, nuclear division and cytokinesis need to be coordinated. Premature division of the cytoplasm in the absence of chromosome segregation or nuclear proliferation without cytokinesis might lead to aneuploidy and cancer. Cdc14-family phosphatases are highly conserved from yeast to humans, but were only characterized in Saccharomyces cerevisiae at the time this thesis was initiated. Cdc14 had been identified as the effector of a signaling cascade homologous to the SIN, called the mitotic exit network (MEN), which is required for exit from mitosis. This thesis describes the identification of the S. pombe Cdc14-like phosphatase Clp1p as a component of the cytokinesis checkpoint. Clp1p opposes CDK activity, and Clp1p and the SIN activate each other in a positive feedback loop. This maintains an active cytokinesis checkpoint and delays mitotic entry. We further found that Clp1p regulates chromosome segregation. Concluding, this thesis describes discoveries adding to the characterization of the cytokinesis checkpoint and the function of Clp1p. While others found that Cdc14-family phosphatases, including Clp1p, have similar catalytic functions, we show that their biological function may be quite different between organisms, possibly due to different biological challenges.
COMMENTS: Chapter 5 not included in digitized version, per author's request.
RELATED RESOURCES: Link to record for print version in Library Catalog
This abstract has been cut short for display purposes. This particular dissertation was accompanied by a disc containing four videos that were linked as supplementary files. The Comments field notes that one chapter was not digitized at the author's request, because she was in the process of publishing an article based on that chapter. The Related Resources field links to the OPAC record describing the print version of this dissertation and displaying its availability.

For UMMS, digitizing dissertations proved to be a successful and cost-effective recruitment strategy and helped the library build stronger relationships at the medical school to secure future content. The team's quick response to the dean's privacy concerns built a foundation of trust for future work. Currently, all dissertations are submitted to the library in both print and electronic format along with a signed permission form to digitize the work. The library anticipates that building this relationship with students will make it easier to recruit future scholarly works over the life of a researcher's career at the medical school.