Mosaic: recovering surviving census records and reconstructing the familial history of Europe

In recent years, there has been a marked increase in the demand for global data on historical family systems, both in the social sciences and in the humanities. Until lately, however, scholars interested in historical global family variation had to rely on simplified and often ahistorical world-scale classifications of family systems by world geographic regions. This article communicates Mosaic to the scholarly community – one of the largest infrastructural projects in the history of historical demography and family sociology. The article provides a brief history of the project, a discussion of the main issues involved in creating the database (including sampling and representativeness), and Mosaic's data structure and coverage. In the remainder of the article, the authors provide an overview of methodological and research opportunities that the project can offer to scholars, showing how the most pertinent problems of historical family demography can be tackled in more systematic ways than previously.


Introduction and rationale
In recent years, there has been a marked increase in the demand for global data on historical family systems, both in the social sciences and in the humanities (Dennison & Ogilvie, 2013; Duranton, Rodríguez-Pose, & Sandall, 2009; Klüsener, Szołtysek, & Goldstein, 2012; Rijpma & Carmichael, 2013; Ruggles, 2010; Therborn, 2004; Todd, 2011). Until lately, however, scholars interested in large-scale patterns of family variation in the past had to rely on simplified and often ahistorical world-scale classifications of family systems by world geographic regions (Kok, 2009; Rijpma & Carmichael, 2013; Therborn, 2004; Todd, 1985), or were forced to use 'meta-databases', which combined dozens of individual studies into one common scheme (Dennison & Ogilvie, 2013; Kaser, 2002; Wall, 2001). The Integrated Public Use Microdata Series (IPUMS) and the North Atlantic Population Project (NAPP), 1 the largest systematic census microdata efforts to date, are either confined to the populations of the North Atlantic region or cover mainly the twentieth century (Ruggles, Roberts, Sarkar, & Sobek, 2011). Early infrastructure efforts, which started in the 1970s at the Cambridge Group for the History of Population and Social Structure and in Vienna, are no longer able to fill this vacuum because they relate to only some parts of Europe, or for reasons of data accessibility. 2 According to conventional wisdom, major hindrances to furthering the development of 'global' accounts of the family history of Europe have stemmed from constraints on data availability. For example, while the current large-scale initiatives like IPUMS International or NAPP are based on existing complete census material for entire countries, for many parts of historical Europe such complete census material is extremely difficult to obtain or non-existent.
This might foster a tacit assumption that the materials of relevance to the study of family and co-residence in the more remote past are scarce, or that acquiring such data would be extremely difficult. Because the historical listings have not survived or never existed, or their exploration is risky, it could appear that some aspects of the regional patterning of social and family structures in the historical landmass of Europe will never be fully explored.
Over the course of the past several years, however, the assumption that future scholars will need to conduct their research 'below the data poverty line' has been shattered (see Ruggles, 2012), as has the 'image of limited good', whereby the most desired items (i.e. the surviving and accessible population listings from historical Europe) exist in a finite quantity and are always in short supply. Whereas a census microdata initiative that covers the entire territory of historical Europe may never be fully realized, the use of scientific harmonized samples of census microdata comparable across time and space has already begun to revolutionize historical demography. This article elucidates the importance of the current disciplinary moment by presenting Mosaic, one of the largest infrastructural projects in the history of historical demography and family sociology, to the scholarly community. The article provides a brief history of the project, a discussion of the main issues involved in creating the database (including sampling and representativeness), and Mosaic's data structure and coverage (with plans for its future expansion). In the second part of the article, we provide an overview of the methodological and research opportunities that the project can offer to scholars, pointing out the ways in which the most pertinent research questions of historical family demography might be answered in more systematic ways than previously.

Starting Mosaic
The Mosaic project originated in the research and data-infrastructure activities of the Laboratory of Historical Demography at the Max Planck Institute for Demographic Research (MPIDR) in Rostock, Germany. Since 2008, the Laboratory has been involved in several large data-transcription projects, as well as in depositing and inventorying various microdata collections from historical eastern Europe. At the same time, comparative analysis of family systems has been actively pursued by the Laboratory's team members. The ultimate stimulus came from Joshua R. Goldstein, the then director of the MPIDR, who has since taken the lead in an effort to create a database for historical census and census-like material for the whole of continental Europe and beyond.
In May 2011, the Mosaic project was launched with a conference on Reconstructing the Population History of Continental Europe by Recovering Surviving Census Records, held at the MPIDR in Rostock. The 2011 conference helped to establish and cement the international collaboration necessary to enforce the new agenda, and to promote the idea of the Mosaic project among wider scholarly circles. It harnessed the energies of a large number of historians, demographers and archivists, who jointly committed themselves to the shared purpose of recovering surviving census records of historical Europe (see Figure 1).
The Mosaic project was funded by the MPIDR until mid 2013, when the then director, Joshua R. Goldstein, left the institute to take a chair in demography at the University of California, Berkeley. Currently, the project is headed by a managing board consisting of Siegfried Gruber (University of Graz), Mikołaj Szołtysek (Max Planck Institute for Social Anthropology, Halle/Saale), Joshua R. Goldstein (University of California, Berkeley), Kees Mandemakers (International Institute of Social History, Amsterdam), Péter Őri (Demographic Research Institute, Budapest) and Steven Ruggles (Minnesota Population Center, Minneapolis).
In the years following the launch conference, the work continued along several primary paths: the identification, digitization, transcription, coding and harmonization of surviving census or census-like materials from different parts of Europe.

Country inventories and recovering surviving census records
Data for the Mosaic project was derived from two main sources: either existing data was donated or new data was identified and transcribed using the MPIDR's resources. The original database for the Mosaic collection consisted mainly of donated files for Albania, Mecklenburg-Schwerin (including Rostock) and part of Poland-Lithuania, and the Vienna database. Consequently, the Albanian and Polish-Lithuanian projects (the latter in its wider extension) became the 'prototype' for further Mosaic-type data sets, both in terms of database structure and with regard to the particular research framework they were embedded in (see Kaser, Gruber, Kera, & Pandelejmoni, 2011; Szołtysek, 2012, 2014). Next, the search for already existing machine-readable data which would fit into the Mosaic format (see below) was undertaken. It targeted not only historians or historical demographers, but also genealogists and their respective associations, especially in Germany, but also elsewhere.
Second, we developed a long-term strategy for searching for surviving census and census-like microdata in various parts of Europe. In each case, the first step was the creation of an inventory of local surviving census and census-like material for a given historical region, an administrative area or the whole country (either within its contemporary or historical boundaries). This was necessary because the majority of countries under consideration lacked any sort of overview of such material, or knowledge of it had long since been forgotten, while most historians and historical demographers continued mistakenly to assume that only a few manuscripts survived.
The inventories were made by local historians visiting and communicating with local archives, backed up by back-and-forth consultations with the Mosaic core team. In some cases, it turned out to be quite a straightforward, if still very laborious, task. 3 However, in most countries, data identification was a complicated and thorny process because the archival materials of interest were not centrally stored, but scattered in many regional or even local archives. Altogether, such inventories have been made for Austria, Catalonia, Germany, Hungary, Lithuania (including parts of Belarus), the Netherlands, Poland, Romania, Serbia, Slovakia and Western Ukraine (see Figure 2; most of them are available for download as part of the Mosaic Working Paper series at http://censusmosaic.org/web/publications). They demonstrate that, even though enumeration forms for national censuses taken before the early twentieth century were preserved to drastically varying degrees (from large proportions to almost nothing), in virtually every country, census manuscripts or various individual-level census-like listings (church lists of parishioners, tax lists, local estate inventories) survived in great quantities.
Inventorying the surviving census and census-like records from different cultural milieus in Europe was seen as crucial to sustaining not only their long-term access, management and dissemination across the worldwide scientific community, but also to safeguarding their preservation. Many historical data sets (especially in eastern Europe) exist only as manuscripts, deteriorating Xerox copies or disordered scans. If they were to be properly identified, preserved and computerized, the risk of a large part of the world's historical heritage literally going to waste would be significantly diminished. In addition to recording detailed information on the European ancestry of past centuries, the historical listings at stake recount important social practices carried out in various spatial and cultural contexts, thus forming a vital part of Europe's joint historical and sociocultural heritage.
Furthermore, the inventories provide necessary regional contextual information about the census-taking practices which led to the production of a given material, often including discussion of the categories and indicators employed during the information-gathering. In the future, standardized accounts of which data sets are comparable to each other regarding certain specific research topics should be incorporated into the existing body of inventories.

Sampling and representativeness
Finally, country-specific inventories of surviving census microdata proved indispensable in providing selection frames for strategically digitizing relatively small numbers of records which would be useful for research, and which would make their way into the final database upon transcription.
One of the unique points of Mosaic when juxtaposed with existing large-scale data-infrastructure efforts is that it relaxes the requirement of full-count census data. In order for the project to foster expansion of historical demographic knowledge beyond the well-known confines, it had to deal not only with full-count data or representative samples of surviving complete census material (like in NAPP), but also with incomplete censuses or census-like materials. This, in fact, applied to the majority of European countries and regions for which no complete historical census coverage survived or ever existed, and which were therefore omitted from previous projects.
This very feature of the Mosaic data raises some vital concerns about its representativeness. Before prospective users can take full advantage of the new collection, the representativeness of the data has to be clearly spelled out.
Before the advent of full-count census data in historical demographic research, associated with the IPUMS and NAPP projects, the problem of data representativeness had made only a rare, if not whimsical, appearance in family history literature. In the past, it was customary for scholars to examine the material or populations they had chosen to study by deliberately picking what appeared to be the most 'representative' cases. For Le Play, the doyen of family sociology and an early advocate of the typologization of European family systems, it was sufficient to seek out what he considered a 'typical' or 'representative' family of a particular region (and some of his 'regions' could spread over thousands of kilometres) by inspecting selected cases believed to prove a particular manifestation of a given social trait, either in its 'average' or 'extreme' forms (Zimmerman & Frampton, 1935, quoted in Kruskal & Mosteller, 1979b). But Le Play could not prove that his families were representative of the groups which he studied (see Le Play, 1877-1879).
No more scrutinizing in this respect was the approach of the Cambridge 'School' of historical family sociology. At no point does Laslett's discussion of the well-known 'English sample of 64 settlements' used for the comparative study of domestic groups feature a proper exploration of the sampling scheme or representativeness of the entire collection. Referring to the 64 listings included in the collection, which dated from 1574 to 1821 and stretched from Devonshire to Durham and Westmorland to Kent, Laslett wrote:
It would be idle to expect 64 settlements to be representative of the whole of England, or to be effective for indicating change over time. Nevertheless, 25 of the 40 counties of the country appear with one or more settlements, and only 1 county . . . can be said to occur much too frequently for its relative importance. (Laslett, Wachter, & Laslett, 1978, pp. 73-74; cf. also Laslett, 1969)
Referring to the 'one hundred most informative listings' used for the analysis of the mean household size, Laslett (1969) wrote: 'It was decided in 1967 to analyze the hundred most informative listings; where there was a choice some respect was paid to geographical distribution, but for the most part selection had to be arbitrary as regards time and place' (p. 199).
Mosaic provides a more thoughtful consideration of these issues not only by making imperfect samples more readily comparable, but also by scrutinizing more closely what particular data sets stand for, while making statistical inferences from them. Table 1 shows the basic characteristics of various Mosaic samples, providing information about their size, their creator and their release date. Table 2 is more detailed, classifying the countries and regions contained in Mosaic by the proportion of the surviving manuscript material, the kind of sampling applied and the assumed representativeness of the sample (representative of the country, representative of the region or representative of particular social strata of the region). In addition, it gives information about whether urban populations are included in the samples or not.
To this end, based on Table 2, we distinguish between several major groups of country or region samples within Mosaic:
1. Country samples based on random sampling: these samples are based on complete or almost complete coverage of the census area, in most cases covering the entire territory of the respective country, either within its historical or contemporary boundaries. These samples are representative of the census area. The strata for sampling are generally administrative or geographic areas within a country. One example of such a sample is the Albanian census of 1918, where most of the material has been preserved and a random sampling based on seven regional strata could be applied. In addition, the complete urban and minority populations were added to the sample. Another example is a sample of the 1838 census of the Principality of Wallachia (fourteenth century to 1859), which covers the southern part of present-day Romania. It is a representative sample of the rural population based on four regional strata (east, north, south and south-west).
2. Country or region samples based on known criteria: these samples are based either on low shares of surviving material or sampling schemes which were restricted by financial resources or the accessibility of manuscripts. Sampling should ensure regional coverage or different ecotypes/legal statuses of the peasant population. We can assume that these samples are to some degree representative of the sampled regions, but we cannot be sure about it. The data of the 1867 census of Mecklenburg-Schwerin, for instance, is based on clusters of villages and cities covering different regions and different legal statuses of the peasant population.
3. Country or region samples based on other sampling schemes: these country samples were, for the most part, donated to Mosaic, and sampling was based on criteria other than representativeness. Ultimately, these country samples could cover the whole country because all of the manuscript material is available and the data transcription is still ongoing. The censuses of western Flanders in 1814 and of Schleswig and Holstein in 1803 have been transcribed by genealogists, and part of this material is already available for research.
4. Country or region samples based on complete surviving material: as no other material has survived, the representativeness of these samples cannot be directly verified, but can only be gauged through an informed speculation based on a thorough knowledge of the respective local or regional conditions and socioeconomic or environmental properties. The Central European Family Forms Database (CEURFAMFORM), covering the peasant population of historical Poland-Lithuania, is such a country sample (see Szołtysek, 2014, ch. 2), as is the data set for western Ukraine in 1863.
While in some cases the relative scarcity of primary materials fostered an inclusive approach (category 4 above, i.e. all of the primary materials for a particular area found in the archives or in printed editions were included in the database), in the main, sampling schemes were necessary either due to convenience (there was too much data to work on and process by a single team) or financial constraints (see categories 1-3). Sampling has generally been done by complete settlements (villages), with subdivisions of countries (or regions) used as strata for sampling, but in each case it heavily depended on the extent of the data coverage of the surviving material. Sample sizes vary (Table 1), depending primarily on the expense of primary data collection and the financial resources available. Some of the country samples have already been released, and most of the others are scheduled for release within the next two years. Since, in the majority of the surveyed countries, the surviving material was not distributed evenly, truly representative (in statistical terms) country samples could be obtained only for a few countries. The Mosaic samples are therefore generally more representative of the surviving census material than of the population of the concerned region or country, but they remain the best samples for the concerned regions or countries available for now, and for some areas better samples may never be obtained (for example, Poland-Lithuania). In order to better indicate the differential status of the Mosaic data, a column called 'assumed representativeness' was added to Table 2, which gives information about the different levels of representativeness we assume for particular populations.
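The sampling logic described above (whole settlements drawn at random within regional strata) can be illustrated with a minimal Python sketch. It is not the procedure actually used for any Mosaic sample: the function name, the input layout (pairs of settlement name and stratum), the uniform sampling fraction and the rule of drawing at least one settlement per stratum are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_settlements(settlements, fraction, seed=0):
    """Draw a stratified random sample of whole settlements.

    settlements: list of (settlement_name, stratum) pairs (hypothetical layout).
    fraction: share of settlements to draw in each stratum, 0 < fraction <= 1.
    Returns a dict mapping each stratum to a sorted list of sampled names.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced

    # Group the sampling frame by stratum (e.g. an administrative region).
    by_stratum = defaultdict(list)
    for name, stratum in settlements:
        by_stratum[stratum].append(name)

    sample = {}
    for stratum, names in sorted(by_stratum.items()):
        # Assumption: keep at least one settlement per stratum so that
        # no region drops out of the sample entirely.
        k = max(1, round(fraction * len(names)))
        sample[stratum] = sorted(rng.sample(names, k))
    return sample
```

Because entire settlements are the sampling units, every household and person listed in a drawn village enters the sample, which keeps the co-residence information within each settlement intact.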
Despite this shortcoming, the data gathered in the Mosaic database seems better safeguarded than that used in pre-Mosaic family history research. One of the definitions of 'representative sample' is the absence of selective forces (Kruskal & Mosteller, 1979a). In this sense Mosaic decidedly departs from earlier studies, in which the cases presented and popularized in the literature were hardly ever discussed in terms of representativeness or selection bias; they were simply either the most appealing to the scholarly imagination (for various reasons) or the easiest to get (see, for example, Laslett, 1983; for Czap's Mishino estate data, see Czap, 1983). In this respect, Mosaic has no selective forces involved, except for the 'random' forces of history which delimited the very survival of the material. 4 Some areas might have been prioritized for archival scrutiny in the early phase of the project, but this has since changed.
Another feature of 'representative sampling' that is not met in our case is that the majority of the Mosaic samples (except for those in category 1 above) cannot be ascertained to have the same distribution as the respective country population from which they were drawn; in other words, it cannot be asserted that they are made up of typical units of that general population (Kruskal & Mosteller, 1979b). Although it is not unlikely that several more sizeable 'samples' in fact represented the average tendencies of the much bigger spatial entities to which they belonged, it is not possible to know exactly to what extent this is true because more encompassing data collections do not exist for those areas. At present, there is simply no way of knowing how representative of Spain the data from Catalonia is, or how representative of the whole of France the data from the 1846 census is.
Still, it would not do to say that this data just happened to come to hand and we have no notion whatsoever of the relation between what would statistically be called the 'target' population and the 'sample' population. First, this data may still be representative of certain areas or administrative regions, some of which are quite large (for example, Schleswig and Holstein or Catalonia). Whereas it might be difficult to argue that the collection of data from Münsterland may be representative of the entire German population of that time (see Table 3), much better grounds exist to assume that it exhibits on a smaller scale the relevant familial characteristics of the population of the Prince-Bishopric of Münster (a large administrative area in the northern part of today's German states of North Rhine-Westphalia and western Lower Saxony) from which it came. Furthermore, some of the regions included in the collection have data coverage which successfully represents the given region's, or even country's, population and socioeconomic heterogeneity (for example, Poland-Lithuania and the Kingdom of Hungary), which tallies well with the generally acclaimed sense of representativeness (Kruskal & Mosteller, 1979a).
In addition to country or regional samples, Mosaic also includes a large collection of data of single villages (or cities) or regional clusters of villages (for example, Istanbul or Italy and some locations from European Russia; see Table 3). This data is usually the outcome of various research projects of different scholars which have been donated to Mosaic, most of them in the form of case studies. The caveat of unrepresentativeness is particularly relevant in this regard, as what we have for those areas is so unlikely to represent the grander 'whole' fairly that every small addition of empirical data can be expected to change the patterns revealed by this data rather dramatically. Special caution needs to be exercised when using this data in a larger context. Nevertheless, this data might be valuable for some comparative research exercise, and Mosaic provides a portal for making it available to the international scientific community.

Mosaic's data structure
The crucial component for the success of Mosaic is the integrated character of the transcultural and cross-temporal data it contains. There are three primary requirements for a data file to fulfil in order to be included in the Mosaic project: (1) the data source should list individual persons, preferably by name; (2) the data source should list all individuals in a settlement or an area, not just the household heads, men or adults; and (3) the data source should list all individuals by clearly delineated residence units (houses, hearths, domestic groups or households). The file cannot be included in the Mosaic project unless the primary sources provide important details either explicitly or implicitly, i.e. the individual's age, sex, relationship to the household head, marital status, place and year of enumeration, and first and last names for the whole population. Sex might only be available as a variable derived from first names or the relationship to the head of the household. Marital status might be accessible only in the case of married people and possibly widowed people. Age should be available for the whole population, not only for adults or children. Occasional missing ages, though, do not pose a problem. The relationship to the head of the household is essential for analyzing co-residence patterns, and thus can only be missing for a few persons (Ruggles & Heggeness, 2008). More marginal characteristics, such as occupational title or status, literacy, religion and language, are of interest too, but are not obligatory for the source to be included in the Mosaic framework. We are aiming at pre-1900 data from all over Europe: state censuses, church listings (for example, lists of souls), local enumerations, tax lists, etc. Longitudinal material, i.e. material referring to periods of time (for example, population registers) or vital events (for example, church books), remains outside the scope of the Mosaic project, in which preference is given to cross-sectional data (cf. Dillon & Roberts, 2002).
The starting point for getting all of the material properly organized, in line with some of the best practices for handling historical nominal sources for the purposes of creating large databases (Kelly Hall, McCaa, & Thorvaldsen, 2000; Mandemakers & Dillon, 2004; Thorvaldsen, 1994), was the full transcription of original manuscripts or published sources. Information on all individuals from selected listings was entered into a computerized database according to the domestic groups they inhabited. The process progressed with full conformity to the original manuscripts, retaining the spellings for first and last names, occupational terms and interpersonal relationships, as well as the original order of appearance of individuals within a residential group. Thus transcribed, the original data was then subjected to standardization and, subsequently, to various coding decisions (cf. Mandemakers & Dillon, 2004, p. 36). 5 The task of transcribing the census-like material was undertaken by student helpers in Rostock or was commissioned to partners within or outside Germany (several partner teams responsible for specific country inventories subsequently became involved in the data transcription for their particular country-specific 'samples'). Harmonization processes (also known as data 'integration'), on the other hand, were conducted mostly by student helpers in Rostock. This was also the case for many of the small databases donated to the project, which had often been designed according to various research needs and varied database practices. 6 The generous support of the MPIDR made it possible to employ up to 2 research assistants and 24 student helpers at its peak, in addition to the core members of the Laboratory of Historical Demography. On a few occasions, however, it proved much more convenient for harmonization tasks to be performed by the local researchers who were responsible for data collection, based on careful instructions received from Mosaic's core team.
Each harmonized Mosaic data file contains 30 variables, which refer to three different levels of information: 4 variables defining the data set, 8 variables defining the household (such as household size and group-quarter status), and 18 variables defining the person (for example, age, sex, marital status, occupation and relationship to the head of the household). In general, most of the variables are designed according to their IPUMS International and/or NAPP counterparts, and can therefore be readily used in comparative research with data from these databases. Occupational titles are coded into the OCCHISCO codes also used in NAPP. Those who are interested may wish to consult the documentation file of all variables with all values, available at the Mosaic website (http://www.censusmosaic.org). First and last names, relationship terms and occupational titles are available in a non-harmonized form, unless the data was donated in an already harmonized way, or sometimes only in a coded version. Under no circumstances are scholars wishing to work with the Mosaic data expected to harmonize the data by themselves. Should weighting the data prove necessary, weighting variables are provided. Basic checks with regard to the completeness of the codes and the internal coherence of the data are performed before the data is released. Some of the data needed correction (for example, female sons or parents being younger than their children), and the changes that were introduced are flagged in quality flag variables.
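The kind of internal coherence check mentioned above (for example, 'female sons' or children who are not younger than their parent) can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's actual checking routine: the dictionary keys `relation`, `sex` and `age` and their values are hypothetical stand-ins for the harmonized variables, whose real names and codes are given in the documentation file. In line with the flagging approach described in the text, the sketch reports problems without altering the records.

```python
def coherence_flags(household):
    """Return {person index: [problem descriptions]} for one household.

    household: list of dicts with the hypothetical keys 'relation'
    ('head', 'son', 'daughter', ...), 'sex' ('male' or 'female') and
    'age' (int, or None if not recorded).
    """
    # Age of the household head, if a head with a recorded age exists.
    head_age = next(
        (p["age"] for p in household if p["relation"] == "head"), None
    )

    flags = {}
    for i, person in enumerate(household):
        problems = []
        # Sex must agree with a gendered relationship term.
        if person["relation"] == "son" and person["sex"] == "female":
            problems.append("female son")
        if person["relation"] == "daughter" and person["sex"] == "male":
            problems.append("male daughter")
        # A child of the head should be younger than the head.
        if (
            person["relation"] in ("son", "daughter")
            and head_age is not None
            and person["age"] is not None
            and person["age"] >= head_age
        ):
            problems.append("child not younger than head")
        if problems:
            flags[i] = problems
    return flags
```

Only the head-child pair is checked here because the relationship-to-head variable identifies that link directly; verifying other parent-child pairs would require additional assumptions about kin links within the household.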
Another component of integration is variable documentation, the aim of which is to highlight important comparability issues that are not self-evident from the coding structure for the variable. A general comparability discussion has been integrated into the documentation file on the project's website, emphasizing concerns for inter-regional comparisons, which can be called on when making intra-regional comparisons.
The harmonized Mosaic data files are distributed free of charge for scholarly purposes via the data section of the Mosaic website. Every individual wishing to download Mosaic data files first has to register as a user and then to confirm his/her acceptance of the terms (no misuse and no improper citation).
Each data set comes in a zipped file, containing comma-separated values files and a readme file with the appropriate citation of the respective data file. Geographic Information System (GIS) files are also available for download. The MPIDR Population History GIS Collection supports demographic and socio-economic research by filling in the gaps in the European GIS data infrastructure on historical national and regional administrative boundaries and historical place names. Currently, there are GIS files with administrative boundaries available for Albania, Austria-Hungary, Courland, Germany, Poland and Serbia, and maps for the whole of Europe with the borders in 1900, 1930, 1960, 1990 and 2003. All of the settlements of the released Mosaic data files are geo-referenced. A table with this information will soon be available for download directly from the project's website. Table 3 presents a detailed list of the data sets included in the most current version of Mosaic. Table 4 shows the distribution of those regions across time and urban and rural contexts, and Figure 3 reveals the spatial patterning in the distribution of the data across Europe.
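For readers who want to work with a downloaded data set programmatically, the following Python sketch shows one way of reading the comma-separated files out of such a zipped file using only the standard library. The function name is hypothetical, and the UTF-8 encoding and the presence of a header row in each file are assumptions rather than documented properties of the Mosaic files; the readme and documentation file should be consulted for the actual layout.

```python
import csv
import io
import zipfile


def read_csv_members(zip_source, encoding="utf-8"):
    """Read every .csv member of a zip archive into a list of row dicts.

    zip_source: path to the zip file or a binary file-like object.
    encoding: assumed text encoding of the CSV files.
    Returns a dict mapping member name to its rows, each row being a
    dict keyed by the column names in the file's header line.
    """
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip the readme and any other non-CSV member
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding=encoding, newline="")
                tables[name] = list(csv.DictReader(text))
    return tables
```

All values are returned as strings, so numeric variables such as age need to be converted explicitly before analysis.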

Current data coverage
The current data covers 92 regions/locations in Europe, stretching from the Atlantic Ocean to the Ural Mountains. A slight majority of the included locations comes from the nineteenth century (56%), with more or less equal numbers covering earlier and later periods. The collection contains both rural and urban sites, although rural societies clearly predominate (82%).
As of now (see Figure 3), the data is largely concentrated in the central continental belt, providing quite good coverage of the French, German, Austro-Hungarian, Polish and Balkan areas. However, some areas that are important for the investigation of the European geography of family systems are not yet included, or their coverage in the database is very limited. These areas include the Low Countries, which are often assumed to have encompassed the essential features of 'north-west European' family systems (De Moor & Van Zanden, 2010); the Italian territories, which, according to some scholars, exemplify the 'Mediterranean' zone (Smith, 1981); and Russia. As the Mosaic database expands, some of these deficits will soon be addressed.
However, in addition to covering both urban and rural communities, the current database runs across many important fault lines in the European geography of family systems, including places located 1. eastward of the Hajnal-Mitterauer line (parts of Poland, Russia, Ukraine, Belarus, Hungary and Latvia); 2. within the south-eastern Europe zone (Albania, Serbia, Turkey, Romania and Bulgaria); and 3. in the 'intermediary central European zone' of Laslett (1983), e.g. Austro-Hungarian and German areas, as well as parts of historical Poland and western Europe (France). The collection encompasses societies which varied significantly in terms of basic principles of family and household organization, including strictly nuclear and neolocal populations (like urban Rostock, but also southern Ukraine, the Braclav area and Podolia); stem-family societies (like those in the area of Münster in Germany, in south-western France and in parts of western Poland); complex societies exhibiting a 'classic' eastern European joint-family pattern (like those in Mishino near Moscow studied by Czap (1983) or in Polesia in eastern Poland-Lithuania) or Balkan versions of this pattern (Albania and Serbia); and a range of intermediate patterns with varying degrees of intermingling of nuclear- and stem-family organization (Poland proper, Germany in 1846 and Austria around 1910), or stem- and joint-family patterns (Red Ruthenia in Poland and fifteenth-century Italy). Furthermore, even in its present scope, the database already covers much of European variability in terms of geographical features, populations, cultures and socioeconomic geography (Jordan-Bychkov & Bychkova-Jordan, 2002): i.e., plains, mountains, and coastal areas; free and unfree peasantries; a range of ethnicities and religions; and a range of regional patterns of economic growth in the early modern and modern eras.
Millions of records that have already been identified by our team and a European network of collaborators, arranged according to Mosaic's structural requirements, currently remain outside the scope of the project, although samples from them are potentially obtainable. The primary future goal of the Mosaic initiative is, thus, to expand the existing data infrastructure into further regions of Europe, and ultimately to extend the reach of comparative analysis into the landmasses of Asia. Spain, Italy and Russia need to be targeted first, as it is indispensable to include these territories if major European tendencies in family organization are to be captured by the Mosaic database (eastern Europe, the Balkans, the Mediterranean and north-west Europe).
The prospects for further expansion are particularly strong in eastern Europe, especially Russia, and in Siberia, where hundreds of digitized nineteenth-century individual-level taxation lists (so-called 'revision lists') are known to exist in the Urals and surrounding area, including also some parts of central Asia (for example, for the Astrakhan Governorate, which borders with Kazakhstan). 7 In addition, nationally representative samples of census microdata from Canada, Great Britain, Germany, Iceland, Norway, Sweden and the USA from 1801 to 1910 can also be easily combined with the Mosaic data; initial attempts have already been made (see below). Finally, it may be possible to harmonize historical representative country samples from the expanded Mosaic collection with contemporary European microdata samples available through the Integrated European Census Microdata project. 8 All these data sets have generally comparable formats and can yield comparable basic information on coresidence patterns.

Research significance
Space limits our possibilities to discuss this issue at length, but it has to be stressed that apart from its major focus on the development of data infrastructure, Mosaic has primarily been a research-driven initiative, and the information it produces can become a powerful tool in transforming the historical demography of Europe. It can make substantial contributions to other research fields as well.
Despite the lack of consistency of the sampling schemes and concerns about the rigid statistical representativeness of the data, the sheer amount of data assembled and its spatial regional distribution make Mosaic capable of putting the study of historical family forms on a new footing. The vast majority of the quantitative evidence used in the debate on the geography of historical family and household composition has consisted of studies of a single community or a small group of communities (see, for example, Fauve-Chamoux & Ochiai, 2009; Laslett, 1977, 1983; Wall, 2001; cf. Ruggles, 2012), raising questions about the extent to which their results can be generalized for larger populations and for larger territorial units. Apart from that, these studies have often relied on a range of different methodologies (for example, ad hoc coding schemes or non-transparent operationalization of the variables), posing obvious challenges in terms of systematic data comparability. Although Mosaic is definitely not immune to the first set of flaws just mentioned, there are at least two respects in which the project might steer the study of family and residence patterns far away from its usual drawbacks.
First, Mosaic is better equipped to tackle the methodological challenges involved in studying small populations, such as the stochastic variations in demographic behaviour at a micro level, which may themselves affect the observable co-residence and other demographic patterns (Ruggles, 1987). Unless it is done with careful sampling considerations (which, as already mentioned, has hardly ever been accomplished in pre-Mosaic research on family history), focusing on local micro-populations runs up against what is known as the 'law of large numbers', which implies that the smaller the number of observations, the less accurate the parameter estimates, owing to large random (sampling) errors. 9 This statistical principle has long been known to demographers, who have assumed that the size of a population affects the pattern of relationships between the elements of the natural movements within that population (Ruggles, 1999, pp. 126–127; Vallin, 2006, p. 6; Wachter, 1978). Using larger populations (of regions or macro-regions) usually causes random errors stemming from stochastic variation to average each other out, and thus yields more accurate and more parsimonious estimates. To us, the difference between assuming regularities based on a single study (Laslett, 1983) and inferring them by averaging a few dozen or a few hundred observations lies in the incomparably more robust estimates that the latter approach yields.
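The statistical point made above can be illustrated with a small simulation. The sketch below is not drawn from Mosaic data: it assumes a made-up 'true' share of complex households (30%) and compares how widely the estimated share scatters when it is computed from many hypothetical communities of 50 households as opposed to regional populations of 5000 households. For a proportion p estimated from n independent observations, the standard error is sqrt(p(1 - p)/n), so the spread should shrink roughly tenfold when n grows a hundredfold.

```python
import random
import statistics


def simulate_estimates(true_share, sample_size, n_samples, seed=0):
    """Draw repeated samples of households and return the estimated share in each."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        # Each household is 'complex' with probability true_share.
        hits = sum(rng.random() < true_share for _ in range(sample_size))
        estimates.append(hits / sample_size)
    return estimates


TRUE_SHARE = 0.30  # assumed proportion of complex households (illustrative only)

small = simulate_estimates(TRUE_SHARE, sample_size=50, n_samples=300)
large = simulate_estimates(TRUE_SHARE, sample_size=5000, n_samples=300)

# Standard deviation of the estimates across the simulated populations.
spread_small = statistics.pstdev(small)
spread_large = statistics.pstdev(large)
print(f"spread of estimates with 50 households:   {spread_small:.4f}")
print(f"spread of estimates with 5000 households: {spread_large:.4f}")
```

The simulation treats households as independent draws, which is a simplification; real communities are clustered, so actual gains in precision from pooling may be smaller than this sketch suggests.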
The second major asset of Mosaic in this regard is that all of the microdata samples discussed above are very similar to one another in terms of their structure, organization and available information, and manifest a core set of common variables. By this token, Mosaic makes imperfect samples more readily comparable, allowing researchers to measure family systems or other concepts systematically across space. Recently, Gruber and Szołtysek (2015) capitalized on this feature and used a range of harmonized Mosaic variables related to familial behaviour – including nuptiality and age at marriage, living arrangements, post-marital residence, power relations within domestic groups, the position of the aged and the sex of the offspring – to develop a composite measure of differences in sex- and age-related social inequality in 91 regions of historical Europe, covering more than 700,000 individuals from the Atlantic to Moscow. Based on a large set of harmonized variables, this index has the potential to become the new 'master variable' for cross-cultural studies of family organization and relations.
Another significant issue is Mosaic's data coverage and its spatial distribution. Within the general aim of collecting surviving microdata from different areas of Europe, the initial focus of the project has been on gathering data from central and eastern Europe. While much data on western Europe was already available through NAPP, in countries like Poland, Lithuania, Hungary, Russia, etc., although a wealth of data has been preserved, it has barely been collected and analyzed. These countries present a particular challenge to historians and social scientists – central and eastern European societies have been long believed to have had different demographic regimes from those of western European societies (for example, different marriage and fertility patterns, more complex family systems, etc.). The fact that Mosaic contains a large amount of previously unavailable data from these under-researched areas makes the possibility of future comparative research in historical family demography especially promising.
Accordingly, one can reasonably argue that, by the very amount of material amassed, Mosaic might elevate the ongoing discussions of the geography of European family forms to new heights. Szołtysek (2014), for example, in his prototype Mosaic study, used measures of co-residence and marriage patterns to reveal that, at the end of the eighteenth century, three household and family patterns with substantial numerical and qualitative differences existed in the Polish-Lithuanian territories, thus challenging the notion of a homogeneous family system in what used to be one of the largest regions of east-central Europe. Similarly, Őri and Pakot (2014), working on evidence from the Hungarian Mosaic sample, revealed a great deal of patchiness in the patterns of marriage and household formation across pre-industrial Hungary, Slovakia and Transylvania, which called for a move beyond the stereotypical and artificial divisions of Europe into 'western' and 'eastern'. Pre-industrial Germany, too, has been discovered to have been only slightly less variegated (Szołtysek, Gruber, Klüsener, & Goldstein, 2014).
The outcomes of these and future investigations might call into question the homogeneity of family forms in any region of Europe, thus turning it into a dubious endeavour to further promote a spatial construction of European family systems by branding major areas. At the same time, however, it may lead to compelling discoveries of broader, perhaps more complex, regularities, and thus yield refined spatial classifications of family types. These new geographies may still be incomplete, subject to change or partly disputable – either because of white spots, which are likely to remain on the map of the surviving microdata recovered by Mosaic, or because of the fragmented nature of many of these pieces of evidence (even if assembled in great numbers). Nevertheless, in comparison with how these things were handled in the pre-Mosaic world, this project represents a major breakthrough. Provided that future scholars maintain their interest in regionalizing family systems, these attempts may very well profit from discussing regional 'sets of familial tendencies' based not on just a few local case studies (as Laslett did – see Laslett, 1983; see also Polla, 2006; Wall, 1995, 2001), but on a large pool of regionally differentiated data on households and families. In addition to regional 'averages', these future investigations could also incorporate various measures of variation within the spatial clusters involved. At present, in the case of several countries – Poland-Lithuania, historical Hungary, most of the German-speaking areas and the Balkans, and to some extent also Ukraine – a spatially diverse data coverage will facilitate an exploration of the extent to which meso-level regional demographic regularities corresponded to these countries' internal long-term socioeconomic and/or cultural divisions.
Szołtysek (2014), for example, found that the broad regional variation of family and marriage patterns across Poland-Lithuania tallied neatly enough with the country's historical socio-economic and cultural fault lines discovered by sociocultural anthropologists of the 1920s and 1930s, thus prompting serious questions about the relationship between the familial sphere and other domains of social life. A preliminary study of household structure in Ukraine revealed a striking correspondence to the country's east-west sociocultural and (later) political divides, which have recently resurfaced as some of the underlying factors of the harsh military conflicts (Szołtysek, 2014). On the other hand, however, the variation in residence and marriage patterns within Germany and Hungary precludes any straightforward classification in terms of long-presumed socio-economic or cultural frontiers.
These examples invite us to look at the Mosaic data through the prism of more than just statistical representativeness. Almost 100 European regions currently stored in the Mosaic database encourage meso-local investigations of a large number of topics that, thus far, have been poorly explored from a wider territorial perspective. A study of leaving home may serve as an example. Currently, the most comprehensive account of historical nest-leaving patterns looked at differences in home-leaving in the nuclear- and stem-family societies of Eurasia (Van Poppel, Oris, & Lee, 2004; Great Britain, Scania, two provinces in The Netherlands, two Japanese villages, two local sites in Italy, 27 villages in the Pyrenees and 1 rural site from Belgium were analyzed), but did not cover joint-family societies or any historical sites in eastern Europe. Instead of dealing with dispersed cases through variable methodologies, a thoughtful Mosaic-based study of leaving-home patterns could grapple with the problem for dozens (if not hundreds) of regional populations, for all of which harmonized variables could be designed, quantified and analyzed with common statistical methods.
The same holds true for many other issues, such as the residential patterns of the aged, household formation, etc. Integrated across space, the Mosaic microdata has already spawned a comparative study on the residential arrangements of the elderly in different joint-family societies; other comparisons are equally possible. Such efforts could provide important added value to our understanding of the complexity and variability in family, life-course and residential situations in the past. Even if the Mosaic data may not always make it possible to give definite answers to questions such as how much a particular observation tells us about historical Germany, Spain or France, the indisputable gain would be to reveal variability in human agency that is not an artefact of data structure or of differences in variable operationalization. This is particularly true should attempts be made to understand the meso-level diversity of demographic behaviour captured in the Mosaic data by linking it to potential environmental, socio-economic and cultural factors. Though still a largely unoccupied terrain, were it approached properly, it could bypass the project's representativeness problems by shifting scholarly attention away from universal claims to classic anthropological-like concerns about determinants of cross-cultural variation (see, for example, Levinson & Malone, 1980; Whyte, 1978). Understanding the determinants of historical family systems has potentially enormous importance for contemporary demography, if only by providing understanding of the extent to which certain demographic patterns have been persistent and universal, and to what degree they were moulded by diverse economic, environmental and social conditions.
Finally, regional clusters of data assembled within Mosaic – which are more or less representative of their constitutive larger wholes – through offering a grid of information on households and families in hundreds of localities, might facilitate better micro-level research in the future. First, they would provide a natural – and necessary – benchmark against which the more in-depth investigations of family historians could be compared. It is in this realm that the cooperation and synergy between Mosaic and the European Historical Population Samples Network (EHPS-Net) with its Intermediate Data Structure (IDS) is expected to be particularly fruitful (see http://www.ehps-net.eu/; Alter & Mandemakers, 2014), in at least three ways: by facilitating the development of joint relational databases relying on the record linkage of Mosaic cross-sectional data with vital statistics, along the path followed earlier by the Eurasia Project in Population and Family History (see Tsuya, Feng, Alter, & Lee, 2010); by providing a test bed against which longitudinal and cross-sectional observations about life course can be mutually compared; and by providing meso-regional quantifiers of the demographic and familial domains to be used as contextual variables in the investigation of other topics (for example, fertility).
Another advantage of the historical Mosaic data is that, due to its harmonized structure, it can easily be compared with other large data-infrastructure efforts, such as IPUMS International (worldwide contemporary data) and NAPP (historical data, but generally of later provenance than in Mosaic). Gruber and Szołtysek's (2012) research serves as a good example of a fruitful combination of these data sets, which materialized as a reaction to the study by Ruggles (2010), who, comparing 87 censuses from 34 countries around the world, argued that there is virtually no cultural variation in the living situations of the elderly with regard to stem-family arrangements; rather, it is economic and demographic change within particular fractions of populations employed in agriculture that acts as the main driver of observed differences. On the other hand, Gruber and Szołtysek (2012) have established that the incorporation of the first set of Mosaic data from eastern and central Europe reveals distinct differences, which go beyond the economic development explanation. The problem with earlier analyses is that the available data from North America and northern Europe was too limited in time and space to reveal much diversity.
Other applications of the Mosaic data beyond demography and family history should also be possible. We shall highlight only two examples from the neighbouring field of economic history. Historical demography has attracted new attention in recent years, as economists have begun to argue that certain patterns of marriage, female celibacy, individual life course and household structure might have been more conducive to economic growth than others (De Moor & Van Zanden, 2010;Duranton et al., 2009;Greif, 2006;Kick, Davis, Lehtinen, & Wang, 2000). The Mosaic data can be of immense value in testing global-level hypotheses about how family types affect different societal and economic outcomes by providing a large number of localized indicators of demographic and familial, as well as gender and age relations, which can be used as independent variables in regression models, provided that at least partial compliance between Mosaic and other data sets is ensured.
Secondly, the Mosaic project can be of great assistance to economic history in that severe data limitations have forced researchers to resort to proxy indicators in measuring the human capital of historical populations. This concerns especially the project's capacity to equip economic historians with, to date, the most detailed geographical coverage of numeracy (quantitative literacy) patterns in historical societies, by focusing on the phenomenon of 'heaping' in the self-reported age data contained in all of Mosaic's micro-level censuses (see A'Hearn, Baten, & Crayen, 2009; also Baten & Szołtysek, 2014).
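Age heaping of this kind is commonly summarized with the Whipple index and its linear transformation, the ABCC index, as used in the numeracy literature cited above. The following Python sketch shows the standard calculation for ages 23 to 62: the Whipple index is 100 when no preference for ages ending in 0 or 5 is present and 500 when all reported ages end in those digits, while the ABCC index rescales this to an approximate share of people reporting an exact age. The sample ages are invented for illustration and are not Mosaic data.

```python
def whipple_index(ages, low=23, high=62):
    """Whipple index: 100 * 5 * (ages ending in 0 or 5) / (all ages), within low..high."""
    in_range = [age for age in ages if low <= age <= high]
    if not in_range:
        raise ValueError("no ages in the requested range")
    heaped = sum(1 for age in in_range if age % 5 == 0)
    return 100.0 * 5 * heaped / len(in_range)


def abcc_index(whipple):
    """ABCC index: (1 - (W - 100) / 400) * 100 for W >= 100, otherwise 100."""
    if whipple < 100:
        return 100.0
    return (1 - (whipple - 100) / 400) * 100


# Invented example: one person at every age from 23 to 62,
# plus ten additional people who all report the round age of 40.
reported_ages = list(range(23, 63)) + [40] * 10

w = whipple_index(reported_ages)
print(f"Whipple index: {w:.1f}")
print(f"ABCC index:    {abcc_index(w):.1f}")
```

In applied work the index is usually computed separately by region, sex and birth cohort; the function above leaves any such grouping to the caller.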
Mosaic also offers a partial possibility to analyze change over time: there are already a few places or regions for which Mosaic holds data from more than one census. For other places, there is microdata for different points in time, but it has not yet been transcribed. Future scholars might be interested in transcribing additional census microdata (information on existing microdata may be obtained from the country inventories mentioned above) and linking it to already existing microdata within Mosaic.
Research with Mosaic's microdata is not restricted to analyses of household structures and similar research questions; other demographic questions can also be addressed. One is fertility, which could be analyzed by using the own-child method or the child-woman ratio (see Breschi, Kurosu, & Oris, 2003). Studies of societal differentiation could be conducted with the use of information about occupations, which could then be further applied to investigations about the proportion of the servant population. Naming patterns and their change over time could be juxtaposed with and analyzed against the respective patterns of different (older) generations, including the change from more religiously based names to more secular ones.
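Of the fertility measures mentioned above, the child-woman ratio is the simplest to derive from a cross-sectional listing, since it needs only age and sex. The sketch below uses the conventional definition (children aged 0 to 4 per 1000 women aged 15 to 49); other age bounds are also used in the literature and can be passed as parameters. The record layout, a (sex, age) pair with sex coded as "f" or "m", is a hypothetical simplification and not the coding of the harmonized Mosaic variables.

```python
def child_woman_ratio(persons, child_max_age=4, woman_min_age=15, woman_max_age=49):
    """Children aged 0..child_max_age per 1000 women aged woman_min_age..woman_max_age.

    persons: iterable of (sex, age) pairs, with sex coded "f" or "m"
    (a hypothetical coding chosen for this example).
    """
    persons = list(persons)
    children = sum(1 for _sex, age in persons if 0 <= age <= child_max_age)
    women = sum(
        1
        for sex, age in persons
        if sex == "f" and woman_min_age <= age <= woman_max_age
    )
    if women == 0:
        raise ValueError("no women of childbearing age in the data")
    return 1000.0 * children / women


# Invented listing: two women of childbearing age and two children under five.
listing = [("f", 30), ("f", 25), ("m", 32), ("m", 2), ("f", 0), ("f", 60), ("m", 10)]
print(f"child-woman ratio: {child_woman_ratio(listing):.0f} per 1000")
```

Because the ratio counts only surviving and co-enumerated children, it reflects infant and child mortality as well as fertility, which should be kept in mind when comparing regions.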
Last but not least, some data sets have the potential to utilize information not included in the harmonized variables, such as place of birth, language spoken, legal status or property. Although spatial coverage of these listings is far from comprehensive, they may open a window onto studies of migration patterns, multiethnic societies and historical patterns of societal diversity.