Towards a Better Understanding of Tarājim: Creating Topological Networks for Arabic Biographical Dictionaries

Biographical writing is one of the earliest and most extensive forms of Arabic literature. Some scholars assume that classical Arabic biographies, widely known as Tarāǧim, arose in conjunction with the study of the reliability of the Hadith transmitters (the reciters of the Prophet Mohammad's sayings), which led to a proliferation of biographical material collected and used to assess the transmitters' trustworthiness Hourani [2013], Perlmann [1964]. However, scrutiny of well-known classical Arabic biographical dictionaries such as Siyaru 'A'lāmi an-Nubalā' 'The Lives of the Noble Figures' by Adh-Dhahabī shows that they extend their entries to other classes of persons important to the development of particular fields, such as Islamic jurisprudents, rulers, poets, philosophers or physicians. The main contribution of Arabic biographical dictionaries is the cumulative value of the thousands of life histories, which construct a picture of Islamic society in different eras. An Arabic biographical dictionary, therefore, is predominantly used by scholars to look up an eminent person's achievements and historical background. In this project, however, we explore Arabic biographies as a prosopography, rather than a biography in the strict sense. We introduce a novel method for a better understanding of Arabic biographical dictionaries by creating a network of relations among different persons. We utilise Natural Language Processing (NLP) tools to create a topological network from the unstructured data of 45,500 biographical entries collected from different dictionaries. We aim to illustrate how network analysis leveraged by NLP tools can provide scholars with innovative methods for discovering complex constellation of

The basic characteristics of an Arabic biographical entry are a special interest in the descent of each individual as well as an emphasis on the outer events, rather than the mental development, of a person's life Young et al. [1990]. Although Arabic has no single term for biography, the most widely used term Tarǧamah (pl. Tarāǧim) refers to a short biographical notice that typically starts with a genealogical record of the individual to validate the biographical authenticity, followed by the exact date and place of birth and death, if available. The content of the biography itself differs extensively from one biographical book to another. There are, for example, laudatory biographies or hagiographies, widely known as Manāqib (virtues, feats, exploits), which are intended to present a portrait of a morally admirable person, together with a recital of his outstanding actions and achievements (e.g. Siyaru 'A'lāmi an-Nubalā' 'The Lives of the Noble Figures' by Adh-Dhahabī in 1374 A.D.). There are also dictionaries devoted to eminent persons resident in a particular city or country, which usually contain a topographical and cultural description of that place (e.g. Tārīḫu Baġdād, 'History of Baghdad' by Abū Bakr Aḥmad ibn Alī in 1071 A.D.). In modern times, the Arabic biographical genre is still present. For example, we have twentieth-century biographies such as 'The Arab Scientists Biography' Awad [1986], 'The Modern Arab Scholars Biography' Al-Alawnaa [2011] and 'The Prominent Innovators' Abdul-Fattah [2010]. However, most of the modern Arabic biographical dictionaries are collections of classical and medieval biographical dictionaries with additional entries for prominent politicians, scientists or literary men and women of modern times. The peculiar structure of classical Arabic biographies is still adopted in entries for modern figures, but with less focus on a person's pedigree.
The information contained in Arabic biographies is considered by scholars as "the greatest untapped source of information on the medieval Middle East" Bulliet [1970]. In this study, therefore, we utilise the Arabic biographical material as a vital source of registered data, irrespective of its literary value. We seek to provide the means to identify relationships between individuals across a vast number of famous Arabic biographical dictionaries, relationships that would otherwise require extensive time and effort for researchers to determine manually. We compile a dataset of biographical entries from famous classical biographies, such as those by Al-Dhahabī (1374 A.D.) and Ibnul-atheer (1305 A.D.), by scraping an online Encyclopedia of Arabic biographies. We illustrate the diversity of insight and utility of analysis that researchers can achieve by applying a variety of NLP tools to unearth relevant historical relations in a bafflingly large mass of biographical entries. Moreover, we demonstrate how graphically visualising these cross-connections in as simple and robust a form as possible can help scholars interpret ties in terms of both the geographical and temporal attributes of the data.
The contributions of our research, therefore, can be summarised as follows:
1. create a large dataset of 45,500 biographical entries compiled from different biographical dictionaries spanning from the pre-Islamic to the modern era.
2. use NLP tools to structure the collected dataset by extracting the attributes needed for building relational networks between persons in the biographical entries.
3. propose the use of semantic similarity metrics to create topological networks between an eminent individual and his contemporaries, as well as to cluster a group of individuals around a thematic node.
4. provide visualisation methods for graphically interpreting the constructed biographical networks from a spatio-temporal perspective.
In order to present our networks, in Section II we review previous related research on network analysis for Digital Humanities (DH) purposes and how information in Arabic biographical dictionaries has been tackled by other scholars. Then, in Section III we describe the methodology followed for constructing the topological networks. In this section, we explain how the biographical data is compiled, the challenges we faced to extract the network attributes and the NLP tools we utilised to extract information from the data for the purpose of network analysis. In Section IV we illustrate our experimental results for the construction and visualisation of representative cross-sections of the constructed networks with the objectives mentioned in 3 and 4 above. Finally, in Section V we discuss our conclusions on the overall experiment and suggestions for future work.

II RELATED WORKS
In this section we review the existing related works and organise them in the following two sections. In Section 2.1 we review how network analysis tools are used in DH research. In

Study of Arabic Biographical Dictionaries
The study of Arabic biographical dictionaries in the humanities has focused primarily on two themes: the development of the genre and the classes of persons covered on the one hand Wilkens [2015], Young et al. [1990], Cooperson [2000], and the Arabic biographical dictio- sions that tag descriptive names, place names, and dates in each biographical entry. He uses this tagged data to discover spatial and chronological patterns among the different persons included in this dictionary. In our research, however, we tackle a collection of biographical dictionaries with a wider range of themes rather than one dictionary in particular.

Data Compilation and Challenges Involved
There is a huge number of Arabic biographical dictionaries, which differ not only in the classes of persons they document but also in their length and structure. The time span covered by these dictionaries also extends back to quite an early period in Islamic history, as researchers point out that the Prophet Muhammad inspired the first Arabic biographies, known as Al- Quran readers and jurisprudents are as old as the ones on Hadith transmitters Cooperson [2000].
Faced with this abundant biographical material, we searched for a source that has a collection of Arabic biographical dictionaries rather than analysing individual ones dedicated to one field of knowledge.
To the best of the authors' knowledge, the largest compilation of Arabic biographical dictionaries is provided by "Mawsūat al- Tarājim  Since our aim is to provide an illustration of how relational networks can unearth significant connections between individuals in different biographical dictionaries, we opted for scraping a sample dataset of 45,500 entries extracted from the biographical dictionaries on the Encyclopedia website. We used the Beautiful Soup Python library Beautiful Soup [2022] to pull the data from the Encyclopedia website into a machine-readable format so that NLP tools could be applied to build relational networks. We did not select the entries from one particular biographical dictionary; rather, we selected biographies of both men and women.
In our analysis, however, we targeted specific categories in our dataset that we thought would bring about informative relations in our network. Thus, we extracted what is known in the Arabic biographical genre as "the Unknown", which refers to individuals who are not eminent in any particular field but are notable only because of their relation to a well-known person. Those biographical notices of unknown men or women were traditionally collected by classical biographies to document the authenticity of a Hadith by the Prophet or an interesting anecdote recorded about an outstanding person in which the unknown man or woman was involved.
We used this category to reveal any unexpected cross-connections between networks of the other groups, mainly prominent men and women extracted from the other biographical entries included in the Encyclopedia website.
The second category of interest is women's biographical notices. The Encyclopedia of dictionaries included separate entries for women notable in different fields, which were collected from classical and modern Arabic biographical dictionaries such as those by al-Zuhari (785 A.D.) and al-Zarakli (1927 A.D.). We focused on this category to bring out any cultural, religious or historical commonalities between prominent female figures in classical and modern times.
After collecting the data, however, we were faced with two challenges. First, since the biographical entries were collected from different dictionaries, they lacked structural consistency.
Thus, the length of notices varied widely. Some were very brief (e.g.
The second challenge is related to extracting the information essential for creating networks with NLP tools. To build relational networks, we needed to hard-code attributes that are common among different persons, for example, common era, geographical proximity, common places of birth, death or residence, common historical events, or a combination of any of these attributes. Although Arabic biographical dictionaries typically start with genealogical and geo-temporal information about a person, the style of presenting this information differs from one dictionary to another.
For example, one dictionary would incorporate birth and death information thus: "Ibn Al-Naqīb: Egyptian Shafii (school of jurisprudence) his birth and death in Egypt..." Al-Zarakli [1927].
In others, however, the geo-temporal information is not as clearly presented. As seen from the quote above, information about the particular historic period can only be inferred from the fact that Ibn Mājid and Vasco da Gama were contemporaries, i.e. the 15th century AD. These inconsistencies in the textual structure of the scraped biographical data also apply to other information such as the birth/death place and profession. It is also not uncommon for a biographical notice to state that the exact date of birth or death is not available or cannot be determined.
It is considerably difficult to extract network data from such unstructured text, since the network nodes and edges are primarily determined by having clearly defined features for each individual, such as place of birth, place of death, position, important dates or events, and historical era. Extracting commonalities among individuals in terms of these attributes is essential for easing access to the biographical information in the data through relational networks. The NLP tools used for this task are explained in the following section.

Data Structuring by NLP Tools
Creating a successful network from unstructured data requires clearly defined ties between the nodes in the network, in our case the individuals in the biographical entries. The ties that we were seeking are mainly the attributes that are typical of an Arabic biographical dictionary. These attributes are: date of birth, date of death, place of birth, place of death, era, and the specialised field of knowledge in which the person is prominent. We also added gender as one of the distinguishing attributes. As mentioned in the previous section, information on the attributes is not textually expressed in a consistent style. We therefore utilised a number of NLP methods to extract these seven attributes.
We extracted places of birth and death in two steps. First, we used regular expressions to extract all trigrams (three-word phrases) that can precede or follow the place of birth or death. After examining the data, we extracted phrases that are normally used in Modern Standard Arabic (e.g. 'his birth place was Mecca', 'he died in Aleppo', 'his birth and death were in Damascus' etc.) to match the Arabic version used in modern biographies in the dataset. To capture phrases used in the old biographies, we extracted Classical Arabic phrases used to denote the place of birth (e.g. 'he is from the people of Baghdad', 'Cordoba is the place where his head has fallen (i.e. where he was born)'). The output of this stage was truncated phrases that have the word indicating the location of birth or death somewhere in the middle. Second, in order to have the place as an attribute, we needed to extract only the name of the city or the country from these phrases. We experimented with two methods.
The second category of attributes was the dates of birth and death. The scraped data from the Encyclopedia website contained dates in a special dual format typical of modern Arabic biographical dictionaries: the Hijri (i.e. Islamic lunar calendar) date denoted by the Arabic letter ' ' (e.g. 1268 ) followed by the Gregorian calendar date denoted by the Arabic letter ' ' (e.g. 1547 ).
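As a rough illustration of the two extraction tasks just described, the following minimal sketch matches cue phrases to pull out a place name and splits a dual-calendar date. It is not the authors' implementation: the cue phrases are English glosses standing in for the Arabic patterns, the function names are hypothetical, and the calendar markers (heh plus tatweel for Hijri, meem for Gregorian) are an assumption based on common convention rather than something stated in the text above.

```python
import re

# Hypothetical English glosses of cue phrases; the study matched Arabic phrases.
PLACE_CUES = {
    "birth": ["his birth place was", "he is from the people of"],
    "death": ["he died in"],
}

# Assumed calendar markers, written as escapes to avoid rendering problems.
HIJRI_MARK = "\u0647\u0640"  # Arabic heh + tatweel (assumption)
GREGORIAN_MARK = "\u0645"    # Arabic meem (assumption)


def extract_place(text, kind):
    """Return the capitalised word that follows the first matching cue phrase."""
    for cue in PLACE_CUES[kind]:
        # The cue is matched case-insensitively; the place must be capitalised.
        match = re.search(r"(?i:" + re.escape(cue) + r")\s+([A-Z][\w-]*)", text)
        if match:
            return match.group(1)
    return None


def split_dual_date(text):
    """Separate a dual-format date string into Hijri (AH) and Gregorian (AD) years."""
    hijri = re.search(r"(\d{1,4})\s*" + HIJRI_MARK, text)
    gregorian = re.search(r"(\d{1,4})\s*" + GREGORIAN_MARK, text)
    return {
        "AH": int(hijri.group(1)) if hijri else None,
        "AD": int(gregorian.group(1)) if gregorian else None,
    }


# Illustrative inputs only; they are not taken from the dataset.
notice = "He is from the people of Baghdad and he died in Aleppo."
birth_place = extract_place(notice, "birth")
death_place = extract_place(notice, "death")
years = split_dual_date("1268 " + HIJRI_MARK + " - 1852 " + GREGORIAN_MARK)
```

A pattern this simple would misfire on real notices (for instance, a number followed by a word that begins with meem), which is one reason a second, more careful step for isolating the place or date is needed.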
We used both regular expressions and string functions to separate the dates into two independent categories: AH (Hijri) and AD (Gregorian). As for the era, it was easily separated as the scraped data had a separate category for the era of each biographical
To calculate the semantic distance between Ibn Ḥanbal's document and other entries, we applied the cosine similarity model. Cosine similarity is a commonly used metric that measures similarity as the cosine of the angle between two vectors, in our case vector representations of biographical entries Xia et al. [2015]. Figure 2 shows the cosine similarity values of the five individuals in our dataset most closely related to Ibn Ḥanbal, based on the SBERT embeddings and the TF-IDF values, in blue and cyan respectively. Although SBERT embeddings are considered state-of-the-art semantic representations in NLP research, we found that the TF-IDF values provided more informative similarity measures than the SBERT embeddings, which produced nearly indistinguishable similarity values between Ibn Ḥanbal's entry and a large number of other entries. We therefore opted to use the cosine similarity between the TF-IDF vector of Ibn Ḥanbal's biographical summary and the TF-IDF vectors of the other entries in our dataset to create the relational network.
For the visualisation of this network, we used NetworkX 4 to represent these scores in a network where Ibn Ḥanbal is a node and the similar documents are connected to it by edges whose weights vary according to the similarity score. Figure 3 is a graphical representation of a section of Ibn Ḥanbal's network 5 based on cosine similarity scores. It should be noted that the graph shows only a cross-section of Ibn Ḥanbal's network; the network can be expanded as far as the research purpose requires. We aim here to show how network analysis and visualisation of Arabic biographical dictionaries can highlight relations between individuals in an insightful and elegant way that is not feasible through manual research.
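The TF-IDF and cosine-similarity ranking described in this section can be sketched with the standard library alone. The snippet below is an illustration under stated assumptions, not the authors' pipeline: the tokenisation is a plain whitespace split, the IDF uses one common smoothed variant, the function names are hypothetical, and the toy entries are invented English glosses rather than records from the dataset.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Return one sparse {term: weight} TF-IDF vector per tokenised document."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_edges(target, entries, top_n=5):
    """Return (target, other, weight) edges for the top_n most similar entries."""
    names = list(entries)
    vectors = dict(zip(names, tfidf_vectors([entries[n].split() for n in names])))
    scored = [
        (target, name, cosine_similarity(vectors[target], vectors[name]))
        for name in names
        if name != target
    ]
    return sorted(scored, key=lambda edge: edge[2], reverse=True)[:top_n]


# Hypothetical toy summaries used only to exercise the functions.
entries = {
    "entry_a": "jurist hadith scholar baghdad",
    "entry_b": "hadith scholar baghdad jurist teacher",
    "entry_c": "poet damascus court",
}
edges = similarity_edges("entry_a", entries)
```

The resulting weighted triples form a star around the target entry; in NetworkX they could be loaded with `Graph.add_weighted_edges_from` before drawing.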

Relation Network to a Thematic Vector
The second type of network we created was between women's biographical entries and a vector
Figure 4 shows a section of this multi-graphical visualisation of these two networks.
The numbers on the edges in Figure 4 show
The most interesting part of this network is that it was able to correlate women fighters across different ages. For example, it shows that several prominent women figures from the Islamic era such as Umm imāra and Ṣafiyya (the Prophet's aunt) are connected to women in later ages such as 'issmat Al-Iskandarāni', an author whose father died in the Crimean War (1853-1856) between Turkey and Russia during the Ottoman Empire era. The latter documented a number of battles in her writings. We also attempted to calculate the centrality measures of these two connected networks. Centrality measures identify the most important nodes (individuals) in the network. The first centrality measure we calculated was the 'degree of centrality', which is based on the assumption that important nodes have the largest number of connections in a network Opsahl et al. [2010]. We found that the highest degree of centrality can be seen that the network as well as its visualisation can aid in connecting individuals based on a common theme and bringing out close and remote connections to figures across different ages. As previously mentioned, this is only a section of the women's biographical networks; this network can be further expanded thematically and temporally.
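To make the degree-based centrality measure mentioned above concrete, here is a minimal sketch that counts each node's connections and normalises by the maximum possible number, n - 1, which is the same normalisation NetworkX applies in `degree_centrality`. The edge list and node labels are hypothetical placeholders, not the study's data.

```python
def degree_centrality(edges):
    """Map each node to degree / (n - 1) for an undirected graph without self-loops."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbours.items()}


# Hypothetical network: a thematic node linked to three entries,
# two of which are also linked to each other.
toy_edges = [
    ("theme", "entry_1"),
    ("theme", "entry_2"),
    ("theme", "entry_3"),
    ("entry_1", "entry_2"),
]
centrality = degree_centrality(toy_edges)
most_central = max(centrality, key=centrality.get)  # "theme"
```

In this toy case the thematic node is connected to every other node and so receives the maximum score of 1.0, while an entry with a single connection scores 1/3.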

Spatio-Temporal Networks
Another type of topological network with which we experimented was the geographical network.
We aimed to link individuals eminent in a particular field with respect to the geographical proximity of their places of birth and death. For this type of network, we supplemented the information we extracted from the Biographical Encyclopedia with contextual knowledge. As mentioned in Section 3.2, the classical biographies mention cities and countries whose names and borders have changed in the geography of the modern world. We searched the modern names and
Figure 5 illustrates a section of the geographical network for the places of birth and death of poets during these three periods. The size of the black circles in the figure indicates the frequency count of poets related to the respective location on the map, and the edges between circles indicate a movement from one place to another, i.e. birth place to death place or vice versa. It can be seen from the geographical network that the central density of the circles is located in what is historically known as the Levant, a large area in the Eastern Mediterranean region of Western Asia. In modern times, it is equivalent to Syria, Lebanon, Jordan, Palestine and most of Turkey southwest of the middle Euphrates. Also, the edges connected to this focal area are indicative of movements to and from other parts of the world 7. This shows that the Levantine cities have been a literary capital for poets from the older eras up to modern times. We also wanted to explore any difference in the geographical pattern of poets' locations in older and modern times. For this purpose, we used the Palladio facet filter to visualise the Umayyad and Abbasid times as one unified period independent from the modern times. Figures 5 and 6 illustrate the older and modern time periods, respectively.
An interesting fact revealed by the two figures is that poets in the Umayyad and Abbasid eras were geographically centred in Syria and Iraq, specifically Damascus and Baghdad, which were the capital cities of these Caliphates respectively. In modern times, however, we see a shift to the West, where Egypt, Palestine and Jordan become focal locations for poets, whether for birth or death. It should be noted that this is a drastically simplified geographical network of poets' locations across different eras. It does point to some facts about the complexities of past events relevant to men and women eminent in one field of knowledge, i.e. poetry. It does not, however, suffice to generate full insight into the geo-temporal aspects of the biographical data collected for this experiment. This example of a poets' geographical network only highlights the potential conclusions that can be drawn from visualising similar geo-temporal networks for persons distinguished in other fields of knowledge and across diverse periods in Islamic history.
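The two quantities that drive the geographical visualisation described in this section, namely the circle size per location and the birth-to-death edges, can be derived from simple counts. The sketch below assumes each record has already been reduced to a (birth place, death place) pair with modernised place names; the function name and the sample records are hypothetical and are not drawn from the dataset.

```python
from collections import Counter


def build_geo_network(records):
    """Count persons per location (node size) and birth-to-death moves (edge weight)."""
    node_sizes = Counter()
    edge_weights = Counter()
    for birth, death in records:
        # A person counts once per distinct known location.
        for place in {birth, death} - {None}:
            node_sizes[place] += 1
        # An edge is drawn only when both places are known and differ.
        if birth and death and birth != death:
            edge_weights[(birth, death)] += 1
    return node_sizes, edge_weights


# Hypothetical (birth place, death place) pairs; None marks a missing value.
sample_records = [
    ("Damascus", "Baghdad"),
    ("Damascus", "Damascus"),
    ("Cairo", "Damascus"),
    ("Baghdad", None),
]
node_sizes, edge_weights = build_geo_network(sample_records)
```

Filtering the records by era before calling the function would give the per-period views that the facet filter provides in the visualisation tool.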

V CONCLUSION
Many network analysis projects in the social sciences rely on pre-existing data where the attributes are created for network analysis. In this research, we experimented with a large unstructured dataset for a collection of Arabic biographical dictionaries that covered ten eras in
Moreover, the geographical networks have proven to be a helpful tool in highlighting a geo-temporal difference in the literary capital of poets across the three eras included in our dataset. It remains to be said that our experiments have provided only examples of Arabic biographical networks, which revealed hitherto unexpected cross-connections between nodes in the networks. Which aspects of the relations between individuals in a biographical network matter, and which attributes matter, depends solely on the researcher's viewpoint and research purposes. Our experiment showed that the creation and visualisation of topological networks for biographical data can significantly help researchers in a systematised interpretation of text and help unravel complex relations that may not be easy to extract from crude unstructured biographical material.
7 Due to space limitations, the full map does not show locations between the Levantine cities and other cities in other continents such as London, Brazil and Indonesia.