Near-Infrared Spectroscopy and Machine Learning for Accurate Dating of Historical Books

Non-destructive, fast, and accurate methods of dating are highly desirable for many heritage objects. Here, we present and critically evaluate the use of near-infrared (NIR) spectroscopic data combined with three supervised machine learning methods to predict the publication year of paper books dated between 1851 and 2000. These methods provide different accuracies; however, we demonstrate that the underlying processes refer to common spectral features. Regardless of the machine learning method used, the most informative wavelength ranges can be associated with C–H and O–H stretching first overtone, typical of the cellulose structure, and N–H stretching first overtone from amide/protein structures. We find that the expected influence of degradation on the accuracy of prediction is not meaningful. The variance-bias decomposition of the reducible error reveals some differences among the three machine learning methods. Our results show that two out of the three methods allow predictions of publication dates in the period 1851–2000 from NIR spectroscopic data with an unprecedented accuracy of up to 2 years, better than any other non-destructive method applied to a real heritage collection.


Table of Contents
reports the results of previous studies to date paper using mid-IR and NIR spectral data combined with PLS. In order to easily compare the reported results, since different date ranges were analyzed, the Normalized Root Mean Square Error of Prediction (NRMSEP) is also reported, and computed as follows: NRMSEP = RMSEP y max −y min , where RMSEP is the Root Mean Square Error of Prediction, y max and y min are, respectively, the maximum and minimum reference values in the property of interest (i.e., date). Table S1. Summary of the data reported in the literature on the use of mid-IR and NIR spectral data to date paper by PLS. The spectral preprocessing algorithms used are reported, i.e., Savitzky

S2.1 Samples and Sampling Strategy
The analyses were designed to explore the underlying process in dating models provided by SML methods using NIR spectroscopic data, as well as the possible sources of uncertainty associated with the publication dates of the books and the sampling method.
The books analyzed are from the general collection of the National and University Library of Slovenia To have a sample set representative of the period 1851-2000 a stratified sampling strategy was designed with the decade of publication as the criterion for stratification. A total of 100 books was analyzed. Table   S2 reports the number of books analyzed per decade. Table S2. Sample sizes as the number of items (books) analyzed in each stratum (decade) of publication date. Sample size  1851-1860  6  1861-1870  6  1871-1880  7  1881-1890  7  1891-1900  7 1901-1910 7 1911-1920 7 1921-1930 6 1931-1940 6 1941-1950 7 1951-1960 7 1961-1970 7 1971-1980 7 1981-1990 7 1991-2000 6 The books were randomly selected from within a decade, and Figure S1 shows the number of randomly selected samples per publication year. Metadata including bibliographic information (e.g., title, author and publication year) were recorded and are reported in Supporting Information file (Dataset_S1).

S2.2 NIR Spectroscopy
Diffuse reflectance spectra were acquired in the range of 350-2500 nm using a portable UV-VIS-NIR ASD LabSpec® 5000 spectrometer (Malvern Panalytical Ltd, UK), equipped with a built-in light source, and three separate detectors: a 512-element silicon photo-diode array detector for the spectral interval 350-1000 nm, and two TE-cooled, extended range InGaAs photo-diodes for spectral intervals 1000-1800 nm and 1800-2500 nm. The sampling interval was 1 nm, while the spectral resolutions were 3 and 10 nm in the interval 350-1000 and 1000-2500 nm, respectively. Spectra, each an average of 200 scans, were acquired using a fiber-optic probe (Malvern Panalytical Ltd, UK) with spot diameter of approximately 2 mm in close contact with the samples, using Indico™ Pro software, version 6.0.3 (Analytical Spectral Device, USA). Splice correction for the light source was used to achieve a continuous spectrum. An ASD Spectralon® reference target (Malvern Panalytical Ltd, UK) was used for baseline measurement. In each book, 10 different pages were measured: 3 pages in the front, 4 pages of the middle and 3 pages at the back of the book block. On each page, 3 different points were measured: gutter, center and outer margin of the page (see Figure S2). Thus, a total of 30 spectra were taken for each book (taking about 30 min) in an area without ink and visible signs of localized degradation (e.g., foxing). The stack of paper below the measured page was used as the background for the spectra acquired in the inner (gutter) and outer margins, while a Spectralon® reference target was used for the spectra acquired in the center of the page to avoid interferences due to the ink from the pages below. The penetration depth of NIR radiation in a matrix of organic substances is typically 1-3 mm [5] , and it has been previously estimated that information is returned from up to 4-5 layers of purified cellulose sheets (approximately 0.5 mm) [1] . Therefore, the spectra can be considered as representative of both surface and bulk properties. All raw spectra were visually inspected, no outlier was detected. Some spectra exhibit different spectral features in some books, they were not considered to be outlier as they can express variability associated with different kinds of paper in the same book block or with extent of degradation.
To see if there is any influence associated with the compositional changes and sampling method, thus pages and points on the page analyzed, each spectrum was treated as an independent observation as it is representative of a unique combination of page and point where the measurement was made.
Spectroscopic analyses were conducted in a repository room of the NUK Library, where all the books were assembled at least two weeks before the analyses to let the paper acclimatize to the well-controlled environmental conditions. Temperature (19.0 ± 0.5 °C) and relative humidity (52 ± 1%) were measured using a Hobo MX100 datalogger (Onset Computer Corporation, USA).
The raw spectra are reported in Supporting Information files (Dataset_S2-S4), including the naming convention adopted for the spectra.

S2.3 Data Analysis
The workflow shown in Figure S3 illustrates the main steps of data analysis. Figure S3. Data analysis workflow.

S6
Preprocessing steps and SML models were conducted using R (vers. 4.2.1).  [6] Data manipulation and plot prospectr 0.2.5 [7] Spectral preprocessing GA 3.2.2 [8] Variable selection -GA Boruta 7.0.0 [9] Variable selection -RF pls 2.8-1 [10] PLS ranger 0.14.1 [11] RF mlr 2.19.0 [12] kNN Table S4 reports the seven combinations we tested of two commonly used algorithms: Standard Normal Variate (SNV), [13] which performs a normalization of the spectra by subtracting each spectrum by its own mean and dividing it by its own standard deviation, and Savitzky-Golay (SG) algorithm, [14] a smoothingbased derivatization method.  Figure S4 shows the truncated reflectance spectra (1000 -1899 nm and 2001 -2300 nm) of two books as collected by the LabSpec 5000 (raw spectra), and as preprocessed by the preprocessing algorithms. As variable selection method we tested GA employing PLS, RF and kNN to compute the fitness function (see Table S5). is the population size, is the probability of crossover between pairs of individuals, is the probability of mutation in a parent individual, is the maximum number of iterations to run before the GA search is halted. To calculate the fitness value for PLS-GA and kNN-GA, the normalized values of , i.e. , was computed; while for RF-GA the normalized value of , i.e. , was used.

S2.3.1 Preprocessing
is the number of attributes tried at each split, and is the number of trees in the forest.

Abbreviation
Variable selection method and parameters settings PLS -GA Genetic algorithm (size pop = 50, p cross = 0.8, p mut = 0.1, iter max = 1000) with NRMSE CV of PLS using 100 PLS components to calculate the fitness value RF -GA Genetic algorithm (size pop = 50, p cross = 0.8, p mut = 0.1, iter max = 1000) and NRMSE OOB of RF using m try = √p and n trees = 500 to calculate the fitness value kNN -GA Genetic algorithm (size pop = 50, p cross = 0.8, p mut = 0.1, iter max = 1000) and NRMSE CV of kNN with previously tuned k values to calculate the fitness value. Euclidean distance. Kernel optimal. Boruta Variable selection using random forest using m try = p/3 and n trees = 500 Preliminary analyses (Figure S5-S7 and Tables S6-S8) were carried out using all variables (wavelengths).
As result, for computation of fitness values in GA-based selection methods, we considered: 100 PLS components and chose the error at the minimum to calculate the fitness function using PLS (Table S6 and Figure S5); the number of neighbors (k) corresponding to the minimum value of error for each spectral preprocessing algorithm (Table S7 and Figure S6) to estimate the fitness function using kNN; 500 trees were chosen to estimate the fitness function using RF. At each node of the trees, a given number of randomly selected input variables (m try ) are chosen for all the trees in the forest. In general, m try can be chosen as some function of the number of variable (p), usually √p and p/3 are the default values for classification and regression, respectively. However, performance of RF is affected very little over a wide range of m try values, except near the extremes (i.e., m try = 1 or p) [15] . Based on preliminary tests ( Figure S7), due to computational times of the GA implementation, we considered a random subsample of √p of all available predictors (selected wavelengths) to determine the best split.
S9 Figure S5. as a function of number of components for PLS. The minimum is reached before 100 components for all spectral preprocessing algorithms, as reported in Table S6.  Figure S6. as a function of k values. The minimum corresponds to values of k ranging from 2 to 4, as reported in Table S7.  Figure S7. sufficiently converge to a constant level after 400 trees for all spectral preprocessing algorithms.
The results of GA-based wavelength selection are shown in Figure S8-S10, demonstrating that a smooth convergence to a maximum fitness values of selected variables (wavelengths) is reached.

S2.3.2 Simulation Study
To investigate the trend of model performance according to sample size (i.e., number of spectra), we design a simulation study where 100 random samples for each of the ten sizes (i.e. 50, 100, 200, 300, 400, 500, 600, 700, 800, 900) are generated for the groups identified, by sampling without replacement from the original data. We stop at 900 as it is the maximum sample size of the smallest subset (i.e., "Front" and "Back"  Tables   Table S8. Summary of the results of the PLS models with different spectral preprocessing and variable selection methods in terms of 100 , corresponding SD and CI95%.    Table S15. Summary of the results in terms of 100 , corresponding SD and CI95% of the RF models built using different number of spectra grouped by the publication date of the books (1851-1900, 1901-1950 and 1951-2000). The 100 is also reported.  -1900, 1901-1950 and 1951-2000). The 100 is also reported. S20 Table S17. Summary of the results in terms of 100 , corresponding SD and CI95% of PLS, RF, and kNN built using different number of spectra without grouping. The 100 is also reported.  Table S19. Summary of the results in terms of 100 , corresponding SD and CI95% of the RF models built using different number of spectra grouped by the page (front, middle, back pages of the book block) where the measurement was made. The 100 is also reported.  Table S21. Summary of the results in terms of 100 , corresponding SD and CI95% of the PLS models built using different number of spectra grouped by the point (gutter, centre, margin of the page) where the measurement was made. The 100 is also reported.  Table S23. Summary of the results in terms of 100 , corresponding SD and CI95% of the kNN models built using different number of spectra grouped by the point (gutter, centre, margin of the page) where the measurement was made. The 100 is also reported.