1 Introduction

Recent years have witnessed a large number of research articles dedicated to the use of Artificial Intelligence (AI) in the humanities, focusing on Natural Language Processing (NLP) and Computer Vision (CV) approaches dealing with historical texts and artifacts.

Such approaches tackled numerous problems such as historical document layout analysis and information extraction, printed and handwritten text recognition, as well as text reconstruction and restoration in historical documents.

Document Layout Analysis (DLA) is an active field of research with numerous competitions (Gao et al., 2017; Simistira et al., 2017; Clausner et al., 2019; Yepes et al., 2021) regularly evaluating new approaches. DLA has relied heavily on AI methods, with transformers now dominating the field (Huang et al., 2022). In historical document layout analysis in particular, Xu et al. (2018) relied on a Multi-Task Fully Convolutional Network (FCN) to segment highly unstructured manuscript and printed-text pages into multiple semantically relevant groups (e.g., marginalia, main text, and comments), while Ravichandra et al. (2022) opt for an object-detection-based approach relying on the YOLO model (Redmon et al., 2015). Others have recognized the value of extracting images from historical documents due to their importance in transmitting the information and ideas contained in the texts, leading to approaches such as the FCN networks presented in Monnier and Aubry (2020) and the object-detection-based methodologies applied to specific corpora by Dutta et al. (2021) and Büttner et al. (2022), which build on techniques like YOLO (Redmon et al., 2015), U-Net (Ronneberger et al., 2015), or Faster R-CNN (Ren et al., 2016). Moving closer to the textual content of these documents, numerous AI-based approaches for optical character recognition (OCR) and handwritten text recognition (HTR) have been proposed, with deep learning-based approaches (Jaderberg et al., 2016) setting new standards. More advanced deep learning techniques rely on Recurrent Neural Networks (RNN) (Tsochatzidis et al., 2021; Fischer, 2020; Puigcerver, 2017) and Gated-CNNs (Kang et al., 2020; de Sousa Neto et al., 2020; Bluche & Messina, 2017); most recently, transformer-based architectures have set new benchmarks (Wick et al., 2021; Li et al., 2021; Ströbel et al., 2022).
Beyond OCR and HTR tasks, AI approaches are emerging as a leading method in text restoration and reconstruction, which is vital when working with often fragmentary historical data. Assael et al. (2019) focused on Greek inscriptions and proposed a sequence-to-sequence RNN model called Pythia, which was later followed by a transformer-based architecture called Ithaca (Assael et al., 2022) that is able to restore, date, and attribute Greek inscriptions. Continuing with the subject of ancient languages, Latin was addressed by Bamman and Burns (2020), who proposed Latin-BERT, a BERT model (Devlin et al., 2018) pre-trained on a large corpus of Latin texts and aimed at text restoration tasks. While numerous approaches for OCR/HTR and text reconstruction tackle different languages (e.g., Akkadian (Lazar et al., 2021; Fetaya et al., 2020), hieroglyphs (Barucci et al., 2021), etc.), the analysis of historical texts is heavily biased towards Ancient Greek and Latin (Sommerschield et al., 2023).

While the above only scratches the surface of AI approaches in the humanities, it offers a representative overview of the different research tracks in the field, which appear to be adopting, and adapting, AI approaches. However, the use of Explainable AI (XAI) for insight generation remains in its infancy despite calls for a closer integration of Digital Humanities (DH) and XAI approaches (Berry, 2020; Díaz-Rodríguez & Pisoni, 2020; Huggett, 2021). The term XAI refers to the field of AI research dedicated to generating explanations of increasingly complex machine learning models (Montavon et al., 2018; Samek et al., 2019, 2021; Holzinger et al., 2022) and is crucial in numerous domains to ensure model safety, robustness, and resilience to data drift. Such explanations may also reveal useful correlations in the data and ensure that the model results are understood by domain experts (Samek et al., 2021; Lapuschkin et al., 2019; Kamath & Liu, 2021). This field has opened the door to numerous impactful contributions across an array of knowledge areas. Such contributions have left a mark, for instance, on medical imaging (Holzinger et al., 2019; Binder et al., 2021; van der Velden et al., 2022). Zhang et al. (2019) proposed an explainable model producing human-like pathological diagnostics, while Hofmann et al. (2022) and Müller and Hofmann (2023) used XAI methods to highlight the most relevant brain features that contribute to “brain age”, which is considered a brain health biomarker. XAI methods are also heavily used in meteorological studies, where their purpose is to better understand radar images (Ebert-Uphoff & Hilburn, 2020), as well as to validate and interpret models and generate insight into specific phenomena such as tornadoes (McGovern et al., 2019). In chemistry, XAI is often applied to graph structures (Schütt et al., 2019; Schnake et al., 2022; Jiménez-Luna et al., 2020), as in Preuer et al. (2019), who applied XAI methods to highlight the specific substructures of molecules relevant for novel drug discovery. The above represent a small subset of the fields where XAI has had an impact on model validation and insight generation; for a more detailed review of XAI methods and applications see Samek et al. (2021) and Samek (2023).

In contrast to the extensive work on XAI in the above-mentioned fields, XAI applications in the humanities remain limited to a few isolated cases. In Pawlowicz and Downum (2021), the authors trained a classifier to distinguish between multiple pottery types of the Tusayan White Ware in use around northern Arizona between AD 825–1300. The different types of pottery feature similar designs, which prompted the authors to rely on Grad-CAM (Selvaraju et al., 2020) to generate interpretable explanatory outputs and investigate which areas of these images had the highest saliency in assigning a pottery type to a particular artifact (Pawlowicz & Downum, 2021). In the domain of art history, Offert (2018) highlights the benefits of using feature visualizations generated by a trained machine learning algorithm for digital art historians. Through the use of these features, the assessment of an artwork by an art historian would combine both the original artwork and its model representations, enriching the available data for the interpretative process (Offert, 2018). Similarly, Bell and Offert (2021) showcase a workflow that integrates explanations–using Grad-CAM–to help domain experts identify different painting styles under difficult conditions (e.g., similar painting styles). In this case, by training a convolutional neural network to distinguish between the different styles of painting, the authors were able to home in on a specific region that was relevant for the classification based on the explanatory heatmap: the region of the painting where the hands are displayed (Bell & Offert, 2021). In a similar manner, XAI methods are now being used in art forgery forensics (Ji et al., 2021).

The above applications of XAI in the humanities rely mostly on region activation maps, often through the application of Grad-CAM, which returns a relatively broad region heatmap describing the model decision. While such approaches open the door to general model explanations, they often miss the mark in providing detailed explanations that can be interpreted by a domain expert. We argue that a domain expert needs a fine-grained level of explanation to formulate and test hypotheses, and we show that pixel-level relevance scores generated by Layer-wise Relevance Propagation (LRP) (Montavon et al., 2019) can be used to this end.

At this stage, a more theoretical definition of explanation is warranted. In this paper, we refer to explanation in a teleological, post-hoc mode that has become common in the computational realm (Berry, 2021). A teleological mode of explanation aims to understand the neural network model’s behavior and provide us with clues to interpret the features and correlations within the input data that in turn lead to a certain output. This means that our explanation is directly dependent on the goal that the system (in this case the neural network) is trying to achieve and indirectly on the data used to train the model.

Fig. 1

Diagram representing a simplified overview of our proposed workflow for a historian XAI research companion (see Appendices A and B for a technical rundown of the workflow). The workflow starts with the collection of curated and annotated data by the domain expert (a), which is then used to train a neural network (b), in this case a VGG-16 network. Once trained, the model is able to provide an accurate prediction (c). We rely on this trained network to generate pixel-level heatmaps (d) showing which pixels contributed to the classification prediction result in (b) and (c). This heatmap (d) is then read by the domain expert in order to generate, validate, and investigate historical hypotheses based on the curated data (a)

The explanatory outputs of such an explanation system need to satisfy two important criteria in order to serve as the foundation for a sound human-machine interaction: “explanations should be faithful and sufficient” and “explanations should be humanly interpretable” (Samek et al., 2021).

The first criterion is satisfied when the explanation accurately describes the model behavior (Lopes et al., 2023), which can be assessed by removing highly relevant features (highlighted by the XAI model) from the input and observing whether this leads to a steep decay in the network predictions. A rapid decay of the network prediction score after removing highly relevant features indicates that the explanation is a faithful representation of the model’s processing (Samek et al., 2017). Satisfying the second criterion is often challenging, as a concise and clear definition of human interpretability is difficult to achieve. Human interpretations are often contrastive, selective, influenced by social contexts, and commonly do not rely on probabilities (Miller, 2019). Additionally, these interpretations might vary based on the downstream task, depending on the social sample, domain knowledge, and the output type (Subramanian et al., 1992; Huysmans et al., 2011; Miller, 2019). We pay special attention to the type of explanatory output produced, as it is the basis of the historical hypotheses formulated by domain experts. As such, explanations intended to be analyzed by domain experts need to be more detailed, and thus more complex, than those provided to laypersons, who might only be comfortable with simpler explanatory output. The complexity of an interpretation can be quantified by the file size of the information carried by its heatmap. In this case, a smaller heatmap file (providing image-region-scale heatmaps) is likely to be easily interpretable by the layperson (Narayanan et al., 2018), while a complex explanatory output (providing pixel-scale heatmaps) is likely more adequate for a domain expert. A comparison of different explanatory output complexity is presented in Samek et al. (2021), where LRP (Montavon et al., 2019) returns faithful and complex explanatory outputs on a pixel level, highly suitable for a domain expert.
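
The faithfulness check described above can be illustrated with a minimal sketch. It is not the evaluation code used in this work: the linear scorer, the four-element input, and the function name `pixel_flipping_curve` are hypothetical, and "removal" is assumed to mean replacing a feature with a baseline value of zero.

```python
def pixel_flipping_curve(weights, pixels, relevance, baseline=0.0):
    """Return the model score after removing 0, 1, 2, ... of the most relevant features."""
    x = list(pixels)

    def score(values):
        # Hypothetical stand-in for the network: a plain linear scorer.
        return sum(w * p for w, p in zip(weights, values))

    # Features are removed in order of decreasing relevance.
    order = sorted(range(len(x)), key=lambda i: relevance[i], reverse=True)
    curve = [score(x)]
    for i in order:
        x[i] = baseline  # "remove" the feature by replacing it with the baseline
        curve.append(score(x))
    return curve


# Toy input with four "pixels" (illustrative values only).
weights = [2.0, -1.0, 0.5, 3.0]
pixels = [1.0, 1.0, 1.0, 1.0]

# For a linear model, w_i * x_i attributes the score exactly.
relevance = [w * p for w, p in zip(weights, pixels)]

faithful = pixel_flipping_curve(weights, pixels, relevance)
# A deliberately poor explanation: the relevance ranking is reversed.
unfaithful = pixel_flipping_curve(weights, pixels, [-r for r in relevance])
```

With the faithful ranking the score drops sharply after the first removal, whereas the reversed ranking leaves it high for longer; comparing the two curves is the kind of evidence the criterion relies on.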

To drive our point home, we present a novel approach to historical image analysis that harnesses the explanatory outputs provided by our XAI model (see Section 3), and thus enables correlations to be revealed within a curated dataset of early modern printed illustrations (see Fig. 1). Such correlations enable domain experts, in this case historians, to better understand and analyze the content of the dataset at a pixel level. By looking at the insights generated by the XAI method, we instrumentally choose a typical question in the frame of the material history of science and technology of the early modern period, namely to provide a definition of early modern mathematical instruments. This research question, which is developed along the lines of three case studies, is instrumentally used to display the effectiveness of our approach. In this process, we treat the AI model as a helping hand or research companion–in line with similar approaches in medical AI (Klauschen et al., 2018; Ratti, 2022) and cyber analysis (Holder & Wang, 2021)–that provides suggestions and reveals interesting data correlations that might have been overlooked by the domain expert.

This approach breaks with a general trend in the humanities to “simply” use AI to classify and, generally speaking, to automatically assign labels to humanities dataset elements. It fills a research gap by elevating the human-machine interaction from one that is mainly operationally driven–such as the use of the machine as a supporting tool for domain experts–to one in which machine and human expert interact via a common visual interpretation channel to produce interpretative historical results.

2 Dataset

The robust application of machine learning approaches typically requires a large amount of labeled data to ensure that our explanations are trustworthy, as well as generalizable. To meet this criterion, we rely on the Sphaera dataset, created in the frame of the project “The Sphere: Knowledge System Evolution and the Shared Identity of Europe” (https://sphaera.mpiwg-berlin.mpg.de), which contains data and metadata on over 350 early modern textbooks based on the Tractatus de Sphaera by Johannes de Sacrobosco (–1256). Electronic copies of these books are available via the project’s database, comprising over 70,000 pages, 23,000 of which contain visual elements. These visual elements were collected both manually and with the help of neural networks (Büttner et al., 2022). The Sphaera dataset is stored in a large knowledge graph modeled according to the CIDOC-CRM standards (Bekiari et al., 2021), where information about the editions, as well as fine-grained information about their content is stored (Kräutli & Valleriani, 2018; El-Hajj et al., 2022).

For the purpose of this paper, and to accomplish our aim of training a neural network capable of correctly classifying pages containing illustrations of a mathematical instrument, we initially collected a total of 2,879 pages containing such illustrations, whose labels were carefully studied as part of a PhD dissertation within the Sphere project by Shlomi (2023). In addition to the pages containing illustrations of mathematical instruments, we collected 3,000 pages that do not contain any illustrations and which serve as a contrast signal to guide the model to learn class-specific instrument features. With such a binary dataset, our neural network, which is based on VGG-16 (Simonyan & Zisserman, 2015) and further described in Section 3.1, could successfully distinguish pages containing illustrations of a mathematical instrument from those with no illustrations. We applied the method described in Section 3 to this model; however, the generated explanations were not constructive in helping us understand the underlying features that represent a mathematical instrument, because the model was simply learning the features associated with the presence or absence of an illustration. To gain further insights into what really defines a mathematical instrument, we created a richer dataset that encouraged our model to learn more discriminative features within these illustrations, which consequently led to more specific insights.

To accomplish this, we added two additional classes to our dataset. The first includes images of pages containing scientific illustrations that do not directly denote any material object, such as an instrument, but rather have a descriptive or explanatory function in reference to the subject matter, in this case astronomical phenomena. These scientific illustrations, similarly to those representing mathematical instruments, were recovered from the Sphaera dataset. The second added class refers to illustrations of material objects, namely those which one can refer to as machines. This data was collected from Branca (1629); Zonca (1607); Ramelli (1588); Besson (1595) using CorDeep (https://cordeep.mpiwg-berlin.mpg.de), a web service designed to extract and classify visual elements from historical documents (Büttner et al., 2022). In total, the dataset contains 5,879 pages distributed across the four classes, as shown in Table 1; each class is represented by a single sample in Fig. 2.

Table 1 Distribution of image samples used across the four different classes

3 Methods

In this section, we introduce the methods used for data processing and model training. Assuming a successfully trained model, we then describe how the layer-wise relevance propagation (LRP) (Bach et al., 2015; Montavon et al., 2019) approach is used to extract explanations for each data point. A detailed explanation of the LRP rules is provided in Appendix A.

Fig. 2

For each of the four classes - mathematical instruments, scientific illustrations, machines, and other - one example has been chosen to describe the general features of each respective class. Figure a) displays a typical early modern machine, in this case a structure that holds a gear-wheel, activated by a vertical gear-drum that turns the big wheel on the left. Figure b) is a scientific illustration that demonstrates the sphericity of the planet Earth by showing the reader its empirical proof, namely the fact that two observers on a ship–one at the top of the mast and one below on the gangway of the hull–would sight the castle toward which they are navigating at different times, the one on the mast earlier than the other. Figure c) displays a typical page with no illustration that still contains other graphical layout features of an early modern textbook. Figure d) displays a common mathematical instrument, namely an armillary sphere, which is a mechanical miniaturized reproduction of the geocentric cosmos. Figure a) from Branca (1629, p. 30), courtesy of the Library of the Max Planck Institute for the History of Science, Berlin. Figure b) from Sacrobosco (1547, Sign. B-2), Bayerische Staatsbibliothek, urn:nbn:de:bvb:12-bsb10173470-0. Figure c) from Piccolomini (1553, p. 17), courtesy of the Library of the Max Planck Institute for the History of Science, Berlin. Figure d) from Barozzi (1607, p. 104), Biblioteca Digital Hispánica, PID bdh0000001287

3.1 Neural network training

Given the comparatively limited amount of annotated training data available, as well as the large heterogeneity of the historical pages, we use the pretrained VGG-16 (Simonyan & Zisserman, 2015) convolutional neural network architecture to extract feature representations. This encoder consists of five convolutional blocks, each followed by a max-pooling layer with kernel size \(2\times 2\). Convolutional layers use \(3\times 3\) filter kernel sizes and ReLU activation functions. The resulting representations from the pretrained VGG-16 model are then used for the classification block, which consists of fully connected convolutional layers and a final softmax layer that predicts the class probability for each of the four classes.
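
As a rough illustration of what the five pooling stages imply for page-sized inputs, the sketch below computes the spatial size of the encoder output. It assumes the standard VGG-16 configuration, in which the \(3\times 3\) convolutions are padded so that they preserve spatial size and each \(2\times 2\) max-pooling uses stride 2; the text above states only the kernel sizes. The function name and the example page size are hypothetical.

```python
def vgg16_feature_map_size(height, width, num_blocks=5):
    """Spatial size after `num_blocks` stride-2, 2x2 max-pooling layers.

    Assumes size-preserving convolutions, so only the pooling layers
    change the spatial dimensions (floor division at each stage).
    """
    for _ in range(num_blocks):
        height, width = height // 2, width // 2
    return height, width


# Hypothetical page of 800 x 600 pixels.
feature_size = vgg16_feature_map_size(800, 600)
```

Under these assumptions each side shrinks by a factor of about 32, which is one reason why the encoder can accept pages of different sizes and orientations.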

To address the heterogeneity of the input data, we standardize the pages using min-max normalization, apply thresholding using the 10% and 90% quantiles of the pixel value distribution, and scale each image in proportion to a reference height or width of 800 pixels, depending on its orientation, using bilinear interpolation. These steps ensure that sufficiently high image quality is maintained while reducing variation in scan resolution, background texture, as well as colorization, brightness, and contrast (see Appendix B for details).
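
The preprocessing steps can be sketched as follows. This is one possible reading, not the pipeline documented in Appendix B: the order of operations (quantile clipping first, then min-max scaling), the choice of the longer side as the one mapped to 800 pixels, and all function names are assumptions. The bilinear resampling itself is left to an image library and is not shown.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def normalize_page(pixels, q_low=0.10, q_high=0.90):
    """Clip values to the [q_low, q_high] quantile range, then min-max scale to [0, 1]."""
    lo, hi = quantile(pixels, q_low), quantile(pixels, q_high)
    if hi == lo:
        # A constant page carries no contrast information.
        return [0.0 for _ in pixels]
    return [(min(max(p, lo), hi) - lo) / (hi - lo) for p in pixels]


def target_size(width, height, reference=800):
    """Target (width, height) with the longer side set to `reference` (assumed reading)."""
    scale = reference / max(width, height)
    return round(width * scale), round(height * scale)


# Hypothetical flattened grayscale page with values from 0 to 100.
normalized = normalize_page(list(range(0, 101, 10)))
portrait = target_size(1200, 1600)
landscape = target_size(1600, 1200)
```

Clipping to the quantile range before rescaling makes the result less sensitive to isolated very dark or very bright pixels, such as stains or scanner borders.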

During optimization, we use 80% of the data to train the classification head parameters with the Adam optimizer (Kingma & Ba, 2015). The initial learning rate of 0.001 decays every seven epochs with gamma set to 0.1, and the batch size is 1 to allow for different page sizes and orientations. The resulting test set accuracy is 0.96, with class-wise F1 scores ranging from 0.89 for instruments to 1.0 for machines.
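
The learning rate schedule can be written out explicitly. The sketch below assumes a step decay in which the rate is multiplied by gamma once every seven epochs, with epochs counted from zero; this matches the usual meaning of these parameters (for example in PyTorch's `StepLR` scheduler), but the framework and the exact epoch indexing are not stated in the text. The function name is hypothetical.

```python
def step_lr(epoch, initial_lr=0.001, step_size=7, gamma=0.1):
    """Learning rate in effect during a given 0-indexed epoch under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


# Learning rate for the first 15 epochs: 0.001 for epochs 0-6,
# then one tenth of the previous value for each further block of seven.
schedule = [step_lr(epoch) for epoch in range(15)]
```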

3.2 Layer-wise relevance propagation

We apply LRP (Bach et al., 2015) to attribute the predictions of our trained neural network to the input features (i.e., pixels). More specifically, by feeding a given input image to the network and denoting by y the resulting value of the output neuron for a given class, say instrument, LRP generates a collection of scores \(R_1,R_2,\dots ,R_{\#\,\text {pixels}}\) identifying the contribution of each pixel to the output value y. This collection of scores can be represented as a heatmap in which the blue color indicates pixels that contribute negatively to the given class (i.e. are in contradiction with it), and red color indicates pixels that contribute positively (i.e. support it). Because there are four classes in our dataset (other, mathematical instruments, machines, scientific illustrations), we generate four heatmaps for each image indicating pixels that support/contradict the respective class (see Fig. 4).
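
A minimal sketch of the color convention follows. Only the red/blue meaning is taken from the text; rendering zero relevance as white and scaling symmetrically around zero are assumptions, as is the function name.

```python
def relevance_to_rgb(scores):
    """Map signed relevance scores to (r, g, b) tuples with components in [0, 1].

    Positive scores shade toward red, negative scores toward blue,
    and zero is rendered as white (an assumed convention).
    """
    # Symmetric scaling keeps zero relevance at the neutral color.
    peak = max(abs(s) for s in scores) or 1.0
    colors = []
    for s in scores:
        t = s / peak  # normalized score in [-1, 1]
        if t >= 0:
            colors.append((1.0, 1.0 - t, 1.0 - t))
        else:
            colors.append((1.0 + t, 1.0 + t, 1.0))
    return colors


# Three hypothetical pixel scores: supporting, neutral, contradicting.
colors = relevance_to_rgb([2.0, 0.0, -1.0])
```

Because the scaling is symmetric, a pale blue pixel next to a saturated red one indicates that the evidence against the class is weaker in magnitude than the evidence for it.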

Technically, LRP pixel-wise scores are computed in an iterative fashion aiming to “invert” the nonlinear function implemented by the deep neural network. LRP starts at the output of the network with the predicted value y. The value y is then redistributed backwards in the network, layer after layer, by means of propagation rules (see Appendix A). LRP propagation rules are designed so that (1) neurons that are locally relevant (contribute strongly to the next layer) must receive more relevance than locally irrelevant neurons, and (2) quantities being redistributed must be conserved locally in the network, similar to water flowing through a network of pipes or a current traversing an electrical circuit. A variety of LRP rules implementing these two requirements have been proposed, and they address different types of neurons and/or levels of nonlinearity at each layer (Montavon et al., 2019). In practice, these rules are selected in such a way that the resulting explanation faithfully reflects the true decision strategy of the neural network model and at the same time remains easily readable for the human expert. In our experiments, we apply at each layer the same LRP rules as in Eberle et al. (2022).
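
The two requirements can be made concrete with a small sketch of the basic LRP-ε rule on a toy network. This is not the rule set used in our experiments (see Appendix A and Eberle et al., 2022, for that): the two-layer bias-free ReLU network, its weights, and the function name `lrp_linear` are hypothetical and chosen only so that conservation can be checked by hand.

```python
def lrp_linear(activations, weights, relevance_out, eps=1e-9):
    """Redistribute relevance from a linear layer's outputs to its inputs (LRP-epsilon).

    weights[j][i] connects input i to output j; biases are omitted in this sketch.
    """
    relevance_in = [0.0] * len(activations)
    for j, r_j in enumerate(relevance_out):
        z_j = sum(a * w for a, w in zip(activations, weights[j]))
        # Small stabilizer that avoids division by zero.
        denom = z_j + (eps if z_j >= 0 else -eps)
        for i, a in enumerate(activations):
            # Each input receives a share proportional to its contribution a_i * w_ij.
            relevance_in[i] += a * weights[j][i] / denom * r_j
    return relevance_in


# Hypothetical network: 3 inputs -> 2 ReLU hidden units -> 1 output.
W1 = [[1.0, -2.0, 0.5], [0.5, 1.0, -1.0]]
W2 = [[2.0, -1.0]]
x = [1.0, 0.5, 2.0]

# Forward pass.
hidden = [max(0.0, sum(a * w for a, w in zip(x, row))) for row in W1]
y = sum(h * w for h, w in zip(hidden, W2[0]))

# Backward relevance pass: start from y and propagate layer by layer.
R_hidden = lrp_linear(hidden, W2, [y])
R_input = lrp_linear(x, W1, R_hidden)
```

In this example the input relevances sum to the output value y up to the stabilizer, which is the conservation property, and the second input receives negative relevance because it pushes the active hidden unit down.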

4 Three case studies and the quest for early modern mathematical instruments

By applying the method from Section 3 to the dataset described in Section 2, we obtain a treasure trove of information about the image pixels with the highest contributions to the learning task. In the following section, the model explanations are discussed in reference to the predefined classes in Section 2 (excluding the category “other”), thereby subdividing the argument into three case studies. Due to the complexity of the LRP explanations, interpretations are usually formulated by domain experts, who in our case are historians of early modern science.

4.1 The historical research question

On an abstract level, mathematical instruments of the early modern period are defined as measuring, computational, or demonstrating objects that embody mathematical knowledge. A historical overview can be found in Bennett (1987) and Bennett (2011). In the framework of the Sphaera corpus, mathematical instruments are those ordinarily used to measure time, as well as terrestrial or celestial angular distances. However, this abstract definition does not help us generate a concrete definition of a mathematical object that aims to answer how these objects were built and operated in the early modern period. The vagueness of the abstract definition becomes very evident when acknowledging that all of the illustrations in Fig. 3 show devices belonging to the same broad class of mathematical instruments.

Fig. 3

Paradigmatic selection of early modern mathematical instruments. a) armillary sphere, b) globe, c) Jacob’s Staff, d) universal sundial, e) universal meridian, f) quadrant. a) from Sacrobosco and Clavius (1585, Title page), courtesy of the Library of the Max Planck Institute for the History of Science, Berlin. b) from da Firenze (1537, Sign. H-II-7), Österreichische Nationalbibliothek, http://data.onb.ac.at/rep/10B4373E. c) from Schreckenfuchs et al. (1569, p. 285), Bayerische Staatsbibliothek, urn:nbn:de:bvb:12-bsb10141204-0. d) from Finé (1551, p. 18), courtesy of the Library of the Max Planck Institute for the History of Science, Berlin. e) from Cortés (1556, fol. XLVII v), Biblioteca Digital Hispánica, PID bdh0000254979. f) from Finé (1587, p. 32), courtesy of the Library of the Max Planck Institute for the History of Science, Berlin

Some of the instruments in Fig. 3, like the armillary sphere (Fig. 3a) and the globe (Fig. 3b), are mechanical reproductions of objects or layouts of objects that existed (or were believed to exist) in reality. The armillary sphere represents the geocentric cosmos, while the globe represents the Earth with additional features. In the first case, the armillary sphere includes a series of scales on its movable elements, thus allowing its user to determine the length of the solar day at any latitude throughout the year. Armillary spheres can also be more complex, for instance, when a planetary model is built-in so as to allow the determination of the position of all main celestial bodies at any given day and time. Armillary spheres are multipurpose mathematical instruments and, in addition, they also have a high pedagogical value as they intend to mechanically represent and visualize the constitution of the cosmos. The globe has very similar features. It is usually equipped with scales (though this example does not show them); it has movable parts; and, in this case, it clearly displays a geographic coordinate system, which at that time was considered the projection of the cosmos coordinate system onto the planet (cosmography). This last feature finally enabled the calculation of terrestrial distances. Adding the distribution of landmass and water surfaces allowed the globe to become a representation of Earth, and because of this, this instrument also had high pedagogical value. Figure 3b shows a relatively poor example of such representation, as the continents are simply shown by placing their names at the appropriate locations.

The Jacob’s staff is a distinct instrument (Fig. 3c). It does not represent anything found in nature, but is rather a purely functional device. Its function is to measure terrestrial distances from the point at which the observer (instrument’s user) is positioned. It works on the basis of simple triangulation techniques where the two pivots at its base are movable in order to change the length of the triangle’s base. The user would look through the center of that base and direct the staff toward the point whose distance from the observer is then measured. No scales and numerical values are added to this instrument. The length of the base between the movable pivots could be measured after observation, for instance by means of a ruler. It is hard to state what the qualitative difference is between the Jacob’s staff on one hand and the globe and the armillary sphere on the other, but both groups of objects are considered mathematical instruments.

When it comes to time measurement, the universal sundial (Fig. 3d) and the universal meridian (Fig. 3e), which is also a sundial, are considered to be purely time measurement instruments. The diffusion of the mechanical clock and the corresponding division of the day into 24 equal hours began in the thirteenth century. However, the traditional division of the day according to the variable length of solar day and night, and therefore to the variable length of hours during the year as depending on the latitude, was still highly relevant in the early modern period for multiple reasons. The increased mobility during the early modern period increased the demand for sundials that could be operated at different latitudes. While the second case (Fig. 3e) is visually closer to our current understanding of a clock since the values are displayed in circular form, the first case (Fig. 3d) has a different purpose. In fact, this illustration does not really represent an instrument, but rather a display cabinet for different types of sundials.

Finally, the quadrant shown in Fig. 3f is common, especially within the Sphaera corpus, due to its function in the frame of astronomic observations. This type of instrument was mostly used to measure the angular height of celestial bodies over the horizon by placing it close to one’s eye and pointing it towards the celestial body. The plumb line would then show the angle in degrees or hours and therefore the angular distance between the celestial body and the horizon. It shows scales with values and therefore is a kind of mathematical instrument that, qualitatively, belongs to the group of measuring instruments to which the universal meridian (Fig. 3e) also belongs and, to a lesser extent, the globe and the armillary sphere.

It is clear from these short descriptions that a global and unique definition of mathematical instruments able to describe the different functions that such instruments perform and embody is difficult to establish. This situation in turn highlights the need to take a step back and think of an overarching definition, if only within the field of early modern astronomy.

In order to reach this new definition of mathematical instrument, we propose a novel approach that combines a) the insights revealed by using XAI methods applied to an AI model trained with these illustrations and b) the knowledge of expert historians.

Fig. 4

Heatmaps of Fig. 3f detecting the relevant elements of a quadrant (red pixels are those that contribute positively to the class in question; blue pixels are those that contribute negatively to the class in question) and classifying it as a mathematical instrument

4.2 Case study 1: mathematical instruments

Looking at the correctly classified instances of mathematical instruments (95%), we are immediately drawn by the LRP results to the finely graduated elements (scales and their values) of the mathematical instruments whose pixels appear to be most relevant for the correct classification. These finely graduated elements appear to be key elements in guiding our model to identify this class: they dominate the majority of the classified image results and are invariant to the numerous shapes, designs, and representations of these instruments within the pages of the different editions of textbooks. Examples of this finding are numerous. The quadrants displayed above (Fig. 3f) and the universal meridian (Fig. 3e), for instance, show this feature very clearly (Figs. 4 and 5).

In the cases where our model fails to recognize a mathematical instrument or falsely classifies something else as a mathematical instrument, the LRP result reveals insights into the causes of this misclassification. For example, the false positive in Fig. 6 shows the classification of the image as a mathematical instrument, while in fact this is a scientific illustration. This particular scientific illustration is designed to explain that the earth has a round shape, which is shown by the fact that sunrise is earlier in eastern locations on earth than in western locations. While this scientific illustration reproduces natural phenomena, it is also enriched by a densely graduated ring along the orbit of the sun in order to show the hourly divisions. In this case, it is clear that the model focused on the finely graduated orbital rings using this as the basis for its classification of this illustration as a mathematical instrument.

Fig. 5

Heatmap of Fig. 3e detecting the relevant elements of a universal meridian (red pixels are those that contribute positively to the class in question; blue pixels are those that contribute negatively to the class in question) and classifying it as a mathematical instrument

Fig. 6

Scientific illustration showing the dependence between the position of the observer on the spherical Earth and the times of sunset and sunrise. The illustration is misclassified as an instrument and the heatmap shows that this is due to the presence of a scale around the illustration which displays the hours (red pixels are those that contribute positively to the class in question; blue pixels are those that contribute negatively to the class in question)

Considering the already discussed Jacob’s staff in Fig. 3c, it becomes clear why this instrument is also misclassified as a machine. The heatmap shows that pixels representing mechanical elements (e.g. the pivots) are highly relevant for the classification results (Fig. 7) and, by comparison, it is possible to infer that this misclassification is due to the absence of detectable scales and/or numerical values, i.e. the omission of the most relevant pixels for the mathematical instrument class. As mentioned above, it is known that a graduated ruler is needed to measure the distance between the two pivots at the base of the instrument. Had this ruler been present in this illustration, perhaps the classification result would have been correct (i.e., as part of the mathematical instrument class). The lack of a graduated element has led to a false negative for this mathematical instrument. This result emphasizes that within the Sphaera corpus, the representation of an instrument, at least from the perspective of the model, is highly related to it having a finely graduated element.

Fig. 7

Heatmap of Fig. 3c in which the Jacob’s staff is misclassified as a machine. The heatmap highlights the mechanical feature of the instrument related to its movable parts and, by comparing it with other cases, it is possible to infer that the misclassification is also due to the lack of a scale with values (red pixels are those that contribute positively to the class in question; blue pixels are those that contribute negatively to the class in question)

In addition to the graduated scales, some elements of the object’s materiality and morphology, such as the bases on which these instruments stand, played an important role in guiding the model toward a correct classification result as shown in Fig. 5. The displayed materiality is not a feature that concerns solely mathematical instruments and its meaning is best understood after considering the machine class in the following section.

4.3 Case study 2: machines

The Jacob’s staff case demonstrated that the absence of the most relevant features, in this case, graduated scales, can quickly lead to a misclassification. This is one of the main reasons why the training set was enriched to include objects (machines) that are functionally and semantically distinct from mathematical instruments and clearly represent a distinct category.

The classic definition of an early modern machine is that of an object that enables the accomplishment of a task by means of the efficient performance of a mechanical device, thereby lowering the need for human resources and/or reducing the time required for its performance. This definition is too general, and does not allow us to home in on what constitutes a machine in our training set.

Our trained model was able to perfectly classify the machines in our dataset, and by looking at the LRP explanatory outputs, we were able to deduce that the most influential elements driving the correct classification of a machine-class within our dataset are the pixels representing aspects of the mechanical apparatus on the one hand, and the presence of a structured environment on the other. If a machine is analyzed in a representation that depicts it in its natural or semi-natural context, then the mechanical apparatus is activated. Figure 8a shows a machine to raise water from wells, operated by a man who interacts with a series of mechanical cranks and gear-wheels, which in turn drive the hydraulic apparatus with its dual pump. The LRP heatmaps (Figs. 8b and c) show that the pixels representing the underground section of this machine illustration are the most relevant for our model to generate a correct classification; more specifically, these pixels are those that represent the hydraulic apparatus.

Fig. 8

a) Machine to raise water from wells. From Ramelli (1588, p. 9v), courtesy of the Library of the Max Planck Institute for the History of Science, Berlin; b) and c) heatmaps that highlight the hydraulic apparatus of the machine as the reason for its correct classification (red pixels are those that contribute positively to the class in question; blue pixels are those that contribute negatively to the class in question)

Fig. 9

a) Machine to strike gold medals. From Branca (1629, p. 2), courtesy of the Library of the Max Planck Institute for the History of Science, Berlin; b) and c) heatmaps of a) that highlight the scaffolding and the poles that hold the mechanical elements together, rather than the mechanical elements themselves (red pixels are those that contribute positively to the class in question; blue pixels are those that contribute negatively to the class in question)

Further investigation, however, shows that the proper mechanical apparatus (pulleys, gear wheels, handles, etc.) is less determinant than one would expect. For example, a machine for striking gold medals is examined in Fig. 9a. This machine is activated by a pneumatic device (labeled G and M on the right side of Fig. 9a) that channels hot air and fumes from a fire to activate a mechanical contrivance (K at the top right). This in turn moves three further mechanical elements, each comprising a gear-drum and a gear-wheel. This drivetrain finally activates a press made of two drums on whose surface the forms for the gold (in the shape of medals) are engraved (A, E, D, and the operator V). This machine produces medals in series and is more efficient than the traditional method, also represented here by the blacksmith (T at the bottom-center). Unlike Fig. 8, the most relevant pixels in Figs. 9b and c do not strictly refer to the mechanical elements that play a direct role in the transmission of forces, but rather to the scaffolding and poles that hold these mechanical elements together. This feature, which is often observed in the heatmaps, is difficult to interpret, but a plausible reason is that most of the mechanical elements are circular; that is, they share a characteristic with many instances of other classes, be they instruments or scientific illustrations, as will be discussed in detail in the next case study.

There are also misclassifications with respect to the early modern machine class such as the case presented in Fig. 10a. This is a lathe equipped with a semicircular blade which is rotated by means of a handle. Material (wood or stone) roughly spherical in shape is placed inside this machine, which then refines the object’s shape (by cutting the excess) to produce an almost perfect spherical shape. The example of the lathe machine is inserted in the historical sources analyzed here because it was used to furnish an operational, almost material definition of what a sphere is: the fundamental geometric concept to understand the cosmos and the working of the machina mundi. For this reason, this machine is printed numerous times in the Sphaera corpus. The multiple instances of this machine appearing in our dataset were consistently classified as mathematical instruments. This misclassification has several causes, which can be explained as follows. While this machine shows strong “materiality” aspects, highlighted by the realistic footstands, it does not present any cogs, wheels, levers, or poles, which are typical of the machine class, as in Fig. 9a. This leads the model to incorrectly classify this machine as a mathematical instrument. However, looking at Section 4.2, we can see that the main characteristic of mathematical instruments appears to be graduated elements, which this particular image lacks. We can conclude here that while graduated and mechanical elements play a major role in the classification of mathematical instruments and machines, respectively, the materiality of the represented object plays a secondary yet important role in defining these two classes. As will be shown in Section 5, the issue at stake here is the concept of “materiality,” which needs further analytical qualification.

Fig. 10

a) Lathe machine. From Sacrobosco and Melanchthon (1545, Sign. B-ii-1), courtesy of the Libraries of the University of Oklahoma; b) and c) heatmaps of a) that misclassify the lathe as an instrument due to the presence of a footstand, which is also typical for instruments, and, upon further comparison, also because of the absence of an evident mechanical apparatus (red pixels are those that contribute positively to the class in question; blue pixels are those that contribute negatively to the class in question)

4.4 Case study 3: scientific illustrations

The books used in this study feature many illustrations, but only a fraction of these represent a mathematical instrument or a machine; the other images vary between diagrammatic representations of mathematical concepts and schematic representations of the movements and geometrical constellations of celestial bodies, such as the orbit of the moon or the positions of the planets during a solar eclipse. Together, these images constitute the category referred to here as scientific illustrations. However, due to the heterogeneity of these illustrations, the LRP results rarely offer any insight into what constitutes a scientific illustration. By looking at the LRP results of scientific illustrations with respect to other classes, it is possible to gain insights into what does not constitute the illustration of a mathematical instrument or a machine.

In the example shown in Fig. 11a, the model correctly predicted that this is a scientific illustration. By looking at its prediction with respect to the machine class shown in Figs. 11b and c, it is possible to infer that the presence of multiple circles negatively contributes to the classification of a machine. This observation is validated by multiple other scientific illustrations and might be helpful in understanding why the often circular mechanical elements of the drivetrains of machines are not particularly relevant for the classification of machines and, above all, why graduated scales are relevant for the classification of instruments. The reason probably lies in the common feature of “circularity” of elements found in these images.
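Explaining a prediction "with respect to" a class other than the predicted one can be illustrated with a toy example. The sketch below applies the LRP-epsilon rule to a two-layer network with invented weights; the three input features (scale, circles, scaffolding), the two classes, and the function names are assumptions made only for illustration and do not correspond to the model, the LRP rules, or the pixel-level inputs used in this study.

```python
def lrp_epsilon_layer(activations, weights, relevance_out, eps=1e-6):
    """Redistribute relevance from a dense layer's outputs to its inputs
    using the LRP-epsilon rule (biases are omitted in this sketch)."""
    n_in, n_out = len(activations), len(relevance_out)
    # Pre-activations z_j = sum_i a_i * w_ij
    z = [sum(activations[i] * weights[i][j] for i in range(n_in))
         for j in range(n_out)]
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        # Small stabiliser with the sign of z_j avoids division by zero.
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n_in):
            relevance_in[i] += activations[i] * weights[i][j] / denom * relevance_out[j]
    return relevance_in


def explain_class(x, w1, w2, target_class):
    """Forward pass through a ReLU network, then LRP for one chosen class."""
    hidden = [max(0.0, sum(x[i] * w1[i][j] for i in range(len(x))))
              for j in range(len(w1[0]))]
    scores = [sum(hidden[j] * w2[j][k] for j in range(len(hidden)))
              for k in range(len(w2[0]))]
    # Start the backward pass from the score of the chosen class only,
    # whether or not it is the predicted class.
    r_out = [scores[k] if k == target_class else 0.0 for k in range(len(scores))]
    r_hidden = lrp_epsilon_layer(hidden, w2, r_out)
    return scores, lrp_epsilon_layer(x, w1, r_hidden)


# Hypothetical features: [graduated scale, concentric circles, scaffolding]
x = [0.0, 1.0, 0.2]
# Hypothetical weights: hidden units "roundness" and "framework"
w1 = [[0.5, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Hypothetical classes: [machine, scientific illustration]
w2 = [[-1.0, 1.0], [1.0, -0.5]]

MACHINE = 0
scores, relevance = explain_class(x, w1, w2, MACHINE)
```

In this toy setting the input is scored higher as a scientific illustration than as a machine, and the relevance of the circles feature with respect to the machine class is negative while that of the scaffolding feature is positive. The relevance scores also sum (up to the stabiliser) to the machine score, which is the conservation property that allows red and blue regions of a heatmap to be read as evidence for and against a class.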

Fig. 11

a) Scientific illustration representing, from outside to inside, the empyrean, the spheres of the prime mobile, the firmament, the seven planets, the elements and the Earth at the center. From Sacrobosco and Glogów (1513, Sign. A-iiii-7), Regensburg, Staatliche Bibliothek, urn:nbn:de:bvb:12-bsb11110894-9; b) and c) heatmaps of a) that allow us to infer why circular shapes cannot be relevant for the correct classification of machines (red pixels are those that contribute positively to the class in question; blue pixels are those that contribute negatively to the class in question)

Finally, there are many misclassifications also among the scientific illustrations. A typical example is the illustration frequently used to demonstrate the sphericity of the Earth (Fig. 2b). This, like many other cases, shows that, when an illustration is particularly rich and moves toward artistic expression, the model misclassifies it as an other page, namely a page that is not supposed to contain visual elements (Fig. 12). The example shows that the lack of linearity in the drawing is associated with other features that are typical of a page with no visual elements, an issue for which there is currently no feasible explanation.

Fig. 12

Heatmap showing how particularly rich scientific illustrations, as in the case of Fig. 2b, are misclassified as other pages (red pixels are those that contribute positively to the class in question; blue pixels are those that contribute negatively to the class in question)

5 Discussion

The take-home definitions so far regarding illustrations within the studied corpus read as follows: a mathematical instrument is an object with a graduated scale; a machine is an object with frames and scaffolding that hold together mechanical elements; scientific illustrations are characterized by circular elements. These XAI-assisted definitions are heavily influenced by the subject of the collection of historical sources analyzed here, namely university textbooks used to teach geocentric astronomy.

A number of cases, shown in Figs. 7, 10 and 12, demonstrate that the results achieved are not entirely satisfactory. To express this positively, our interaction with the explanatory output of the XAI model enabled us to identify a different interpretative layer or, more precisely, it revealed that an additional interpretative layer, which was initially ignored, is more relevant than expected, as described in the following.

The goal of the presented historical research was to provide a definition of early modern mathematical instruments. These illustrations within our corpus often represented real objects; however, only a small fraction of these objects survives today. Museum collections contain some of these objects, but their number is limited, no complete dataset has been created, and, most importantly, even if such a dataset existed, it would not cover the myriad of instruments that populated the early modern period. Faced with this situation, historians of science and technology rely, as in the case of this work, on representations of such objects. These are abundant in the numerous preserved historical textual sources. The massive digitization efforts of the last twenty years, moreover, have made them largely and easily accessible.

Historians of science and technology interested in research questions similar to the ones proposed in this paper (see Section 4.1) often encounter the same issue of ambiguous definitions. What are the defining features of a specific class of images within a corpus? Are general class definitions sufficient criteria to distinguish between class instances? What are the most important features that describe a class?

As shown in Section 4, the explanatory output of our XAI model can act as an intermediate layer, distorting our classical perception of the images in question, and guiding us toward specific pixel groups within images to help redefine or reshape our definitions of the selected classes. Of course, explanatory output ambiguity is present in this case too, as demonstrated in Fig. 10, highlighting the need for a continuous human-machine interaction to reach the desired objective. In other words, the heatmaps show what can be considered as the machine’s “definition.” Thus, the heatmaps enable us to compare the human definitions with those of the machine, and to accordingly reconsider and potentially redefine them.
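One conceivable way to make such a comparison between a human definition and the machine's "definition" more systematic is to measure how much of the positive relevance falls inside a region that a historian has marked as defining, for instance a graduated scale. The following is a minimal sketch under that assumption; the annotation mask, the function name `positive_relevance_share`, and the toy values are hypothetical, and this measure is not part of the workflow reported here.

```python
def positive_relevance_share(relevance, mask):
    """Fraction of the total positive relevance that lies inside the masked region."""
    total = 0.0
    inside = 0.0
    for rel_row, mask_row in zip(relevance, mask):
        for value, in_region in zip(rel_row, mask_row):
            if value > 0:  # negative (blue) relevance is ignored here
                total += value
                if in_region:
                    inside += value
    return inside / total if total > 0 else 0.0


# Hypothetical 2x3 relevance map and a historian's mask marking the scale
# (1 = pixel belongs to the annotated element, 0 = it does not).
toy_heatmap = [[0.25, 0.5, 0.0],
               [-0.5, 0.25, 0.0]]
scale_mask = [[0, 1, 0],
              [0, 1, 0]]
share = positive_relevance_share(toy_heatmap, scale_mask)
```

A share close to 1 would indicate that the model's evidence coincides with the annotated element, whereas a low share would point to other features, such as a footstand or the surrounding environment, that merit closer inspection by the domain expert.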

In fact, the definitions we–and other researchers–often try to seek are not those of the machines and instruments themselves, but those of their representations: the conceptual ideas in the minds of the actors of the early modern period who designed, drew, and cut the woodblocks used for their printing. In other terms, what we are investigating here is the result of a reality abstraction exercise by the authors, publishers, printers, and woodcutters, which eventually resulted in these preserved illustrations.

This process offered a degree of freedom but also a restraining condition that the material object itself cannot offer and does not provide, respectively. First of all, the represented object, such as a machine or an instrument, did not have to exist. It could be the illustrated design of a new instrument, namely just a mental exercise. From this perspective, the analysis looks at what elements made a printed illustration a mathematical instrument illustration. Second, even in the cases where the represented object existed, there may no longer be a physical counterpart. In other words, in many cases, there is no extant artifact that enables us to compare the real physical instrument with its representation in the Sphaera dataset, meaning that abstractions are often not verifiable. This condition, combined with the high variability of instrument types and the consequent low number of images per specimen, makes working with this kind of dataset quite challenging. The degree of abstraction used in the illustrations of the studied corpus often required using a specific visual language, or visual conventions, to represent the objects in question. In this case, the use of the XAI explanatory output allowed us to better investigate these abstractions and grasp some of the visual conventions, which revealed interesting historical insights about the thought process of the early modern actors.

In this respect, the analysis highlights the most relevant image features as viewed by our trained model while considering not only the technical knowledge of the early modern actors in representing machines, mathematical instruments and scientific illustrations, as well as diagrammatic and decorative illustrations, but also how they imagined them. We refer to this feature as “visual conventions.”

The first discoverable convention is related to materiality. As shown by the heatmap of the Universal meridian (Fig. 5c), the foot-stand of the instrument is an important element for this classification. This aspect is not only demonstrated by many similar heatmaps but also by the misclassification of the lathe as an instrument (Fig. 10). The feature of materiality in these cases conveys the message that these instruments either existed or could have existed.

The concept of “materiality” as used until now needs to be distinguished from a closely related category, namely a second relevant visual convention, here called environment. This term refers to elements inserted in the image that show the surrounding context in which an instrument or machine could be found. Instruments such as the globe (Fig. 3b) or the cabinet (Fig. 3d) display a naturalistic environment. Heatmaps show an activation of the environment especially in reference to the representations of machines. These are often represented within a rich environment, which often includes individuals operating these machines, as well as animals (e.g. oxen) and surrounding architecture.

A close look at the heatmap of the hydraulic machine (Fig. 8), for instance, shows that some tiles on the left play a relevant role. This feature is particularly evident in most of the machine representations, as shown also in the mill displayed in Fig. 13. The heatmap shows that, besides those machine components that display the actual mechanical elements, stairs, windows, and, partially, even human beings act as determinants for its correct classification (Fig. 13).

Fig. 13

a) Mechanical mill represented in its architectural context. From Besson (1595, Sign. H-iii-1), courtesy of the Library of the Max Planck Institute for the History of Science, Berlin; b) and c): heatmaps of a) showing that besides those machine components that display the actual mechanical elements, stairs, windows, and, partially, even human beings act as determinants for its classification as a machine (red pixels are those that contribute positively to the class in question; blue pixels are those that contribute negatively to the class in question)

The numerous cases displaying this feature show that the model is guided toward a machine class prediction based on the presence of regularly spaced lines, often denoting floor or roof tiles or blocks of stone in wall construction, among others. This aspect reveals important insights into the thought process of the historical actors involved, who clearly chose to represent machines, not only in a rich environment, but also in one where the reader of these books could estimate the size of the machine by easily comparing it to familiar objects such as tiles or stone blocks.

The final convention, which in this case is not highlighted by the activation displayed in the heatmaps, could be referred to as proportion. Many images, especially of the scientific illustrations class, and to some extent those of the instruments and machines, such as the globe (Fig. 3b) and the lathe (Fig. 10a), show a lack of proportionality among the represented elements. Examples are the lack of proportion between the globe and the tree at the bottom of the image, or between the lathe and the soldiers below it. Such over-proportionality of elements is probably due to pedagogical intentions of the historical actors.

Misclassifications and the examination of their LRP heatmaps can shed further light on visual language and its evolution. For example, Fig. 6 shows a scientific illustration that was misclassified as an instrument because it included a scale. The visual structure of this image, without the scale, is typical for explaining the sphericity of Earth with different sunrise and sunset times at multiple locations. The “addition” of a scale to the typical illustration shows a conscious employment of the visual convention of mathematical instruments in a scientific illustration. This addition makes the illustration look more like an instrument, and thus relates the represented content with the practice of exact measurements. It can be understood as a visual statement reflecting larger processes of mathematization of scientific and cosmological knowledge as well as changes in the practical and pedagogical orientation of the content (Oosterhoff, 2018).

5.1 Definitions, visual language, and training set

Taking into account the additional layer of visual language conventions of the historical actors encourages a reconsideration of the initial definitions. The examination of the LRP heatmaps showed that the full image was taken into consideration by the model and the results were based on different, distinct, visual aspects within the image. We were thus faced with the need to strongly consider the similarity between instruments and their contemporary illustrations. We understood that we could benefit from art-historical knowledge about visual language and in turn contribute to that field by revealing and defining visual conventions through large corpora.

The examination of the heatmaps points to elements that identify the different kinds of illustrations. The illustration of a mathematical instrument can be considered to be an object with a graduated scale that is represented by elements indicating its materiality and denoting that they exist or could exist. A machine illustration depicts an object with a structure that holds together mechanical elements and is represented within a realistic environment to convey additional information, such as its size or how it works. Unlike scientific illustrations, the representations of machines do not contain simple symmetric or concentric shapes, which implies that machines were portrayed as complex systems, without resorting to abstraction and simplification.

If we consider the results of this project to be the “discovery” of visual language conventions, and not the definition of what a mathematical instrument is, we can use LRP heatmaps to study both the evolution of visual language itself as well as the history of mathematical instruments. Visual conventions are structures or symbols that are widely used, and can thus be better “exposed” with the use of large corpora rather than a more traditional and manual examination of single sources. Although they do not necessarily represent objects that in fact existed, such conventions may reflect common knowledge used in printed books of the period and are thus meaningful in the study of the knowledge tradition.

Beyond the study of visual language for itself, the analysis of the XAI explanatory output for the study of mathematical instruments highlighted that the analysis of historical sources requires the consideration of all interpretational layers or, in other terms, that both the form and substance of the images need to be considered at the same time for this kind of research. Considering the visual language can be a tool for better defining and studying mathematical instruments through the database.

The initial training set conceived for this investigation was not designed to cover all aspects the XAI analysis can highlight. Our research has shown that the introduction of XAI in the frame of historical research requires the generation of a transparent workflow dictating the principles to create training sets able to cover all interpretational layers and all their specific aspects and potentialities. In this sense, working with the explanatory output of an AI model can require multiple iterations, refining the data selection criteria and research questions at every step.

Attempts have been made to discuss the different aspects of “data modeling” in the field of computational history. This term can be used in a very broad and general sense to address ways in which historical information is organized in order to apply systematic, computational, or quantitative methods. Some of these attempts emphasize a semiotic approach that highlights the variety of types of similarities between objects (historical source or idea) and their representations (either representations of single objects or of relations between historical objects or ideas) (Kralemann & Lattmann, 2023; Ciula & Eide, 2016; Flanders & Jannidis, 2023). In the context of this research, we have seen that it is crucial to consider and strictly define whether the object falls under the definition of mathematical instruments or of scientific illustrations and the visual conventions they include. Furthermore, we had to also consider the nature of the relationship and similarity between the two. Thus, this research demonstrates the importance of a semiotic discussion in the process of building a training set, database, and research questions based on them. In other terms, the construction of the dataset itself should have included a discussion of the semiotic relation between the instrument and image and, therefore, an interpretation of the statistical results of such classification of images must include a consideration of visual language factors that inherently influence the activation of statistical tools.

6 Conclusion

We applied an XAI approach, more precisely an LRP approach, to our classification model in order to explain how and why it classifies the illustrations of our corpus. We intentionally trained our model based on a specific, curated dataset with carefully chosen classes in order to help us gain insights concerning our initial research question, namely what is an early modern mathematical instrument and, as our research shows, what differentiates it from an early modern machine and an early modern abstract scientific illustration. While we were able to achieve interpretable and useful results, we were also faced with unexpected outputs that obliged us to reevaluate our class definitions. As we are using illustrations of instruments and machines, XAI has shown us that we need to consider not only the proper content of those illustrations but also the visual conventions used by the historical actors to produce them. We have learned, through the interaction between the domain-experts’ analysis and the explanatory model, which visual conventions were actually used. To put it emphatically, the XAI companion, our new team member, assumed the role of an art historian.

Many features of this research are characteristic of approaches that are common in the field of digital humanities, or more specifically digital history. The contribution is clearly the result of a group effort that relies on combining historical and machine learning expertise. It provides a case study in an open and exploratory manner, describing the research undertaken in terms of a journey (including steps that did not yield the intended results), rather than presenting definitive findings. Moreover, the paper is informed by the idea of advancing a relatively new computational method and contributing to an evolving field. In terms of the historical research, it constitutes a proposal for a novel approach for analyzing historical sources, in particular scientific images.

Less characteristic, perhaps, is the fact that the focus is neither on the results of the applied computational method (images classified in a certain way), nor on a discussion of the applicability of the method, as is often the case in the digital humanities and digital history respectively. Instead, XAI, the method for analyzing the corpus’ images, is used to create an interactive dialogue between algorithmic classification and the historians involved. It is this dialogue that finally provides the basis for a nuanced discussion of the different types of scientific images contained in the corpus.

As shown, XAI makes transparent what specific features the ML model based its classification decisions on; the black box is thus, at least partially, opened. On the one hand, this means that the classification results are not taken at face value (in order to move on to the next step of the process). Rather, their very purpose is to be examined, reviewed, and questioned. Which part of the image led the model to choose the classification “instrument” and not “machine”? From the very beginning, the researchers’ attitudes toward the model’s decisions are therefore interrogative, perhaps even critical. On the other hand, the close observation made possible by applying XAI may provide further insights into the historical material. The larger question behind the endeavor is thus what historians can learn from engaging with the logic behind the model’s decisions. In many cases, following and studying this logic will mean confronting a different kind of logic. In this particular case, historians using an XAI companion (who were trained in a certain way) were led to areas or elements of the images that they would not necessarily have considered relevant for classifying beforehand. The model’s classification mechanisms thus become an element of discovery and, in combination with the historians’ knowledge, a potential source for knowledge production. One could thus argue that XAI has the potential to alter the conventions of observation and in doing so may provide a new readability of the images, a readability that constitutes an interplay between machine logic and the historian’s rationale.

However, XAI’s potential for discovery is limited and remains, as pointed out earlier, teleological. It will only ever provide explanations for decisions it has been instructed to make. Data remain capta, taken not given, to use Johanna Drucker’s definition, whether XAI is used to make its decisions transparent or not (Drucker, 2011). As always in data-driven research, the data must therefore be chosen carefully, and its selection and effect on the application of XAI must be taken into consideration when studying its decisions.

Nevertheless, this contribution shows that engaging with the interrogative approach engendered by XAI and taking its classification decisions as a productive “derivation of the eye,” can produce a different way of seeing and thus provide a new perspective on the classification of historical sources. In fact, the presented approach is not limited to early modern illustrations but can provide relevant insights (depending on the quality of the dataset used) in numerous humanities fields. One can envision that such an approach may help identify minute stylistic changes in artifacts such as statues or pottery, identify variations in motifs or ornaments transmitted over centuries, or precisely highlight artist styles as initially proposed by Bell and Offert (2021). Finally, the interrogative approach to machine-generated results may shift the overall perception of decisions made by AI, or, on a broader scale, change the way digital humanities scholars think about computational methods and infrastructures in general. This holds great potential for the discipline and would enable a criticality that is already visible in the many self-reflexive, theory-driven digital humanities approaches developed in recent years.