A method for statistical analysis of repeated residential movements to link human mobility and HIV acquisition

We propose a method for analyzing repeated residential movements based on graphical loglinear models. This method allows an explicit representation of residential presence and absence patterns from several areas without defining mobility measures. We make use of our method to analyze data from one of the most comprehensive demographic surveillance sites in Africa that is characterized by high adult HIV prevalence, high levels of poverty and unemployment and frequent residential changes. Between 2004 and 2016, residential changes were recorded for 8,857 men over 35,500.01 person-years, and for 12,158 women over 57,945.35 person-years. These individuals were HIV negative at baseline. Over the study duration, there were a total of 806 HIV seroconversions in men, and 2,458 HIV seroconversions in women. Our method indicates that establishing a residence outside the rural study area is a strong predictor of HIV seroconversion in men (OR = 2.003, 95% CI = [1.718,2.332]), but not in women. Residing inside the rural study area in a single or in multiple locations is a less significant risk factor for HIV acquisition in both men and women compared to moving outside the rural study area.


Introduction
This paper is concerned with modeling repeated residential movements of a group of individuals over a certain period of time, and with the assessment of the predictive associations between these multivariate patterns of residential changes and health outcomes of interest such as HIV acquisition. To a good extent, the statistical literature on human mobility has focused on the estimation of migration flows [1][2][3]. Migration flows are represented as origindestination migration flow tables. These are square tables in which the rows and columns correspond with areas of interest. The (i, j) cell contains a count of the number of individuals that left from area A i and moved to area A j over the course of a specified time frame. The inclusion of other categorical variables lead to higher-dimensional migration flow tables. However, migration flow tables cannot capture the movement of those individuals that resided in more than three areas during the time frame of observation. An example individual that left from an area A 1 to move to another area A 2 , then moved again to area A 3 , would contribute with a count of 1 to the (1,2) and (2,3) cells of the resulting migration flow table. But, the link between residential movements that are associated with the same individual will be lost.
Other important classes of statistical models for human mobility are Lèvy flights models [4] and multiplicative latent factor models [5]. Lèvy flights models make use of a power law to represent the probability that an individual changes their residence over a certain distance. Under this model, moving over a shorter distance is more likely than moving over longer distances, but residential movements over longer distances can still take place even if they occur less often. Multiplicative latent factor models improve the Lèvy flights models with their ability to quantify the desirability of residing in certain areas over other areas. Both the Lèvy flights models and the multiplicative latent factor models are based on the crude assumption that human travel can be seen as a Markov process in which the probability of residing in an area depends only on area in which the previous residence was located, and does not depend on the locations of previous residences. However, it is possible that individuals move repeatedly across multiple areas over longer time periods of several years. Markov process models break residential trajectories that involve multiple residential locations into pairs of consecutive locations of residency, and, by doing so, loose key dependencies that are induced by multiple locations of residency of the same individuals in a reference time frame.
Information about residential locations has also been used in statistical models through the construction of mobility measures-see, for example, [6] and the references therein. These measures are summaries of distances between consecutive residencies, or of time spent in certain locations. While mobility measures can be successfully used as independent variables in a wide range of statistical models, the connection between these measures and the areas in which individuals have resided is lost. The method for analyzing repeated residential movements we follow in this paper allows an explicit representation of residential presence and absence patterns from several areas without defining mobility measures. As such, this method offers a new perspective on what can be learned from this important type of human mobility data.
We assume that the residential locations of N individuals belong to K areas denoted by {A 1 , A 2 , . . ., A K }. For each individual, we know which areas they resided in. These data can be represented as a N × K mobility matrix M = (m nk ), where if individual n resided in area k; 0; if individual n did not reside in area k: Our framework does not impose any constraints on the number of individuals N, or on the number of areas K. Other categorical variables of interest can be recorded as additional columns in the mobility matrix M. By counting the number of times the same combination of levels of the categorical variables in M appear as rows of this matrix, a multi-dimensional contingency table is formed [7]. We propose representing the multivariate patterns of associations in this contingency table with graphical loglinear models that are a special type of hierarchical loglinear models [8,9]. These models are determined by graphs that have vertices associated with each area. They characterize the multivariate dependency structure (e.g., independence or conditional independence) among random variables using graphs [9]. The complete subgraphs of these graphs define interaction terms of joint presence and absence patterns from two, three or several areas. A missing edge between two areas means that, conditional on presence or absence in the rest of the areas, the presence or absence of a random individual in the first area is independent of the presence or absence of the same individual in the second area.
A key step in data analysis with graphical loglinear models consists of the estimation of the underlying graph. This is called the structural learning problem [10,11], and it becomes a very difficult computational problem when many random variables are involved [12,13]. Bayesian methods provide a flexible framework for incorporating uncertainty of the graph structure: inference and estimation are based on averages of the posterior distributions of quantities of interest, weighted by the corresponding posterior probabilities of graphs [14]. Here we follow a Bayesian approach for solving the structural learning problem.
The goal of our statistical analysis is to identify graphs that have vertices associated with each area in the corresponding graphical loglinear models. Based on this approach, we examine the predictive value of residential locations as a driver of HIV transmission risk in a comprehensive population-based demographic surveillance site in the KwaZulu-Natal Province, South Africa-the Africa Centre, now Africa Health Research Institute (AHRI) [15]. Specifically, we analyze mobility patterns of 21,015 individuals who were HIV negative at baseline, and were registered in the AHRI demographic surveillance system. Their mobility patterns are defined by residential histories over the study period. The AHRI site is characterized by high adult HIV prevalence (24% in adults aged 15 years 30 and older in 2011), and high levels of poverty and unemployment (in 2010, 67% of adults over the age of 18 in the rural study area were unemployed) [16]. The geographical location of this demographic surveillance area is ideal for our aim.

Background
Historically, human mobility has been one of the key drivers in the spread of HIV at a global scale [17][18][19][20][21][22][23][24][25][26][27][28][29]. Many studies have provided significant evidence linking increased population mobility with multiple sexual partners, reduced condom use, increased risky behavior (e.g., encounters with commercial sex workers, engaging in transactional sex) [30][31][32], increased sexual behavior [20,[33][34][35][36][37][38], and increased likelihood of HIV acquisition [6,28,39]. Mining settlements, transport corridors, or poor urban or periurban communities exacerbate the effect of the risk factors of HIV acquisition [28,40,41]. It has been empirically demonstrated that an individual's risk of acquisition of HIV is strongly driven by community-level HIV prevalence [16], community-level migration intensity [42], mean number of sexual partners in the surrounding local community [43], as well as ART coverage and population viral load in the local community, respectively [16,44]. These community-level risk factors confer substantial additional risk of new HIV infection after controlling for a suite of well-established individual-level risk factors.
In South Africa, which is the focus of this study, the risk of HIV infection has been shown to be increased by human mobility [19,45,46]. South Africa is one of the countries with the highest burden of HIV, and has a long history of internal labor migration of men that periodically leave their areas of permanent residence to seek temporary employment in mines and factories due to the scarcity of local employment [47]. During the apartheid era which imposed travel restrictions for Blacks, women were typically left behind to take care of families, while men submitted remittances back to their households. Because of economic conditions, this way of life continues to exist in poor rural regions of South Africa including this rural study community. However, as opposed to the aparheid era, in the last decade both men and women frequently establish residencies for various periods of time to work or for many other reasons in locations within the KwaZulu-Natal Province (e.g., Richards Bay or Durban), or in other more distant locations in South Africa (e.g., Johannesburg, Pretoria or Cape Town) [6].
The rapid increase in the adult HIV prevalence in South Africa, from 0.7% in 1990 to 13% in 2000 [48,49], is broadly consistent with ongoing patterns of circular labor migration within the country and increased in-migration from neighboring countries after the collapse of Apartheid [50][51][52]. For example, a phylogenetic study from the KwaZulu-Natal province reveals that external introductions in the early 1990s, via human movement from neighboring countries, played a vital role in driving the early HIV epidemic [53]. Historically, patterns of circular migration in South Africa were shaped by the migrant labor policies of the Apartheid system. From the 1950s until the democratic transition in 1994, Apartheid authorities sought to consolidate white rule by developing urban centers and resettling black Africans into rural and undeveloped homelands. Racial segregation and resettlement was seen as a more rational distribution of African labor between white cities, industries, mines, and farms [54]. Men had to migrate from their homeland residencies to their work place for long periods of time, without the possibility of their families joining them [55]. Because of separate spheres of living, migrant men took other partners and formed second families at the places where they worked [56,57], thus increasing the risk of HIV infection and the probability of transmission upon returning home. Apartheid policies had a profound effect on the stability of the family system, a demographic reality that drove the spread of HIV in the 1990s and thereafter.
Efforts to contain the HIV epidemic after 2000 were stalled by the South African government's refusal to make ART available at public health-care facilities nationwide [58,59]. This refusal was motivated by AIDS denialism among government officials, who claimed that HIV was not the cause of AIDS, that ART was toxic, and that the spread of HIV was being over-sensationalized [60,61]. During this time, the adult HIV prevalence increased to 15.2% [49] and was as high as 29.5% among pregnant women attending antenatal clinics [48]. Following public pressure from AIDS activists and civil society organizations, the South African government made ART with a CD4+ T-cell count eligibility criteria of <200 cells/μL in 2004 [62]. In 2010, treatment eligibility was extended to pregnant woman with CD4+ T-cell counts <350 cells/μL and patients with active tuberculosis [62]. By 2012, the HIV prevalence among 15-49 yearolds was at 18.8% [63] and at 20.6% in 2017 [64].

Study setting
The study was conducted in the Africa Health Research Institute (AHRI) Population Intervention Platform Study Area (PIPSA), formerly the Africa Centre Demographic Information System (ACDIS), in uMkhanyakude District, KwaZulu-Natal Province. PIPSA was commissioned in 2000 by the Wellcome Trust as a platform for longitudinal population based studies of epidemiology and intervention research. This rural study area covers 438 km 2 , and comprises approximately 11,000 households with 100,000 individuals. This community is characterized by high HIV prevalence frequent migration, low marital rates, late marriage especially for men, polygamous marriages and multiple sexual partnerships, as well as by poor knowledge and disclosure of HIV status [15,39,56,65]. Incidence peaked at 6.6 per 100 person-years in women aged 24 years, and at 4.1 per 100 person-years in men aged 29 years over the same period [39].
For over 15 years, PIPSA has continuously collected longitudinal surveillance data on a range of health care and social intervention exposures, as well as health, socio-economic and behavioral outcomes [15]. During the household data collection cycle, households are visited every 6 months by fieldworkers and information supplied by a single key informant.
Population-based HIV surveillance and sexual behavior surveys take place annually. Since 2003, annual HIV testing became part of household surveillance.

Study eligibility
Starting with 2007, all adults and adolescents aged 15-17 residing in the rural study area who were able to provide written consent were eligible to participate in the study. From 2003 to 2006, eligibility was restricted to women aged 15-49 years and men aged 15-54. We note that, although individuals under 18 are legal minors, under South African law, they can consent independently to medical treatment from the age of 14. Minors can legally consent independently to an HIV test from the age of 12, when it is in their best interest, and below the age of 12 if they can understand the benefits, risks and social implications of an HIV test [66].

Ethics statement
Informed written consent was obtained from all eligible individuals. After signing the consent, eligible participants are interviewed in private by trained fieldworkers, who also extract blood from consenting individuals by finger-prick for HIV testing, prepare dried blood spots for HIV testing according to the Joint United Nations Programme on HIV/AIDS (UNAIDS) and World Health Organization (WHO) Guidelines for Using HIV Testing Technologies in Surveillance [15]. Ethics approval for data collection and use was obtained from the Biomedical Research and Ethics Committee (BREC) of the University of KwaZulu-Natal (Durban, South Africa), BREC approval number BE290/16. The BREC was aware that some of the study participants were legal minors, and approved the age range of participation and the specific consent procedure for minors.

Cohort description
From the entire population under surveillance in PIPSA between January 1, 2004 and December 31, 2016, we selected those individuals who consented to test at least twice for HIV after the age of 15, and whose first test was negative. Although the annual participation rates in HIV testing are not high (see Table 1), a number of 8,857 men and 12,158 women satisfy these inclusion criteria. Participants seldom test every year, and, in this cohort, the median time between the last HIV negative and the first HIV positive tests in men was 3.34 years (IQR = 4.64), and in women 2.58 years (IQR = 3.69). The date of HIV seroconversion was assumed to occur according to a uniformly random distribution between the date of the last negative and first positive HIV test [67]. Here seroconversion refers to the transition from infection with the HIV virus to the detectable presence of HIV antibodies in the blood. Fig A from S1 Supporting Information gives the crude annual consent rates, while Fig B from S1 Supporting Information shows the consent rates by age group and gender. Although the overall consent rate changes over time, there does not seem to be any relevant difference in consent by sex and age.
PIPSA collects data about all the individuals that are members of a family unit or a household in the rural study area irrespective of the current residency status. It collects longitudinal residential information about the exact periods of time each study participant spent living in each location. Fieldworkers record changes in residency as the origin place of residence, the destination place of residence and the date of the move. Residencies can be located inside or outside the rural study area. The residential locations inside the rural study area have been comprehensively geolocated to an accuracy of <2m [68]. Repeat-testers can change their place of residence multiple times: they can move between two residencies located inside the rural study area, between two residencies located outside the rural study area, or between a residency inside the rural study area and another residency outside the rural study area. The relevance of looking whether repeat-testers have resided outside the rural study area comes from the findings of Dobra et al. [6]. Their results indicate that, for the same rural study area, the risk of HIV acquisition is significantly increased for both men and women when they spend more time outside the rural study area, or when they change their residencies over longer distances.
For the purpose of this study, the geolocations of the homesteads have been mapped into 45 non-overlapping communities that cover the rural study area-see Figs E and F in S1 Supporting Information. The division of the rural study area into communities is motivated by the results of Tanser et al. [69]. Their study identified a significant geographical variation in HIV incidence in the same rural study area. Specifically, they identified three large irregularlyshaped clusters of new HIV infections. Although these clusters cover only 6.8% of the rural study area, about 25% of the sero-conversions that occurred over this study's period are associated with residencies in them. This suggests the existence of clear corridors of HIV transmission inside the rural study area. Together, the results of Dobra et al [6] and Tanser et al [69] indicate that men and women who reside outside the rural study area, or occupy residencies that are located in the corridors of HIV transmission inside the rural study area are at an increased risk of acquiring HIV.
We note that the exposure period for a repeat-tester starts at the time of their first HIV test, and ends at their HIV seroconversion date for seroconverters, or at the time of their last HIV negative test for those that did not seroconvert. The residential locations occupied before seroconversion coud have contributed to changes in sexual behavior that led to HIV acquisition, while residential locations occupied after seroconversion could be associated with repeat-testers seeking family support, health care or moving away to avoid social stigma [22,38]. For this reason, the residential locations occupied by seroconverters after they acquired HIV were discarded.

Statistical analyses
We determined in which of the 45 communities each of the 8,857 men and 12,158 women lived in during the study period. This information was recorded as binary variables C1, C2, . . ., C45 with levels "yes" or "no" in two mobility matrices, one for men and one for women. We also determined whether a repeat-tester moved outside the rural study area. This information was recorded as a binary variable Outside with levels "yes" or "no". Furthermore, we determined whether a repeat-tester has seroconverted, and whether a repeat-tester was younger than 30 years at start of their observation period. This information was recorded as two additional binary variables Seroconverted and Young with levels "yes" or "no". For example, a repeat-tester that lived in communities C1 and C2, moved outside the rural study area, was older than 30 years at baseline, and has seroconverted, would have C1 = C2 = Outside = Seroconverted = yes and C3 = . . . = C45 = Young = no. The data in the resulting mobility matrices involve 48 binary variables. The mobility matrix for men is available in S1 Data, and the mobility matrix for women is available in S2 Data. They define two dichotomous contingency tables with 2 48 cells, one table for men and another table for women. These tables which we call mobility tables are hyper-sparse: most of their counts are zero. The mobility table for men has only 598 positive counts-see Table E in S1 Supporting Information. Among these counts, there are 292 (48.83%) counts of 1, 48 (8.03%) counts of 2, 30 (5.02%) counts of 3, 28 (4.68%) counts of 4, and 13 (2.17%) counts of 5. The top five largest counts are 192, 186, 180, 177 and 168, respectively. They correspond with men that were less than 30 years old at the start of their observation period, did not seroconvert by the end of their observation period, never moved outside the rural study area, and lived in exactly one of these communities: C7, C37, C40, C39 and C22. The mobility table for women has only 939 positive counts-see Table F

Statistical modeling framework
In this paper we make use of a Bayesian framework for solving the structural learning problem that is suitable for the analysis of hyper-sparse contingency tables with p = 48 variables. This framework [70] determines graphical loglinear models that are a special type of hierarchical loglinear models [8,9]. A graphical model for a random vector X = (X 1 , X 2 , . . ., X p ) is specified by an undirected graph G = (V, E) where V = {1, . . ., p} are vertices or nodes, and E � V × V are edges or links [9]. A vertex i 2 V of G corresponds with variable X i . The absence of an edge between vertices i and j in G means that X i and X j are conditional independent given the remaining variables X V\{i,j} . The graph G also has a predictive interpretation. Denote by nbd G (i) = {j 2 V: (i, j) 2 E} the neighbors of vertex i in G. Then X i is conditionally independent of X Vnðnbd G ðiÞ[figÞ given X nbd G ðiÞ which implies that, given G, a mean squared optimal prediction of X i can be made from the neighboring variables X nbd G ðiÞ . The structural learning problem estimates the structure of G (i.e., which edges are present or absent in E) from the available data x = (x (1) , . . ., x (n) ) by sampling from the posterior distribution of G conditional on the data x, i.e.
where Pr(G) is a prior distribution on the graph space G p with p variables, and Pr(x | G) is the marginal likelihood of the data conditional on G [10]. We use a prior on the space of graphs G p that encourages sparsity by penalizing for the inclusion of additional edges in the graph G = (V, E) [10]: where β 2 (0, 1) is set to a small value, e.g. b ¼ 1=ð 48 2 Þ � 0:00089. Under this prior, the expected number of edges for a graph is 1. This means that sparser graphs with few edges receive larger prior probabilities compared with denser graphs in which most edges are present.
Determining the graphs with the highest posterior probabilities (1) is a complex problem since the number of possible undirected graphs 2 ð p 2 Þ becomes large very fast as p increases. For example, our two mobility tables involve p = 48 variables, and the number of possible undirected graphs in G 48 is approximately 10 325 . This motivated the development of computationally efficient search algorithms for exploring large spaces of graphs that have the ability to move quickly towards high posterior probability regions by taking advantage of local computation. Among them, the birth-death Markov chain Monte Carlo (BDMCMC) algorithm [70] determines graphical loglinear models. BDMCMC is a trans-dimensional MCMC algorithm that is based on a continuous time birth-death Markov process [71]. Its underlying sampling scheme traverses G p by adding and removing edges corresponding to the birth and death events. This algorithm is implemented in the package BDgraph [72,73] for R [74]. By employing the BDgraph package, we ran the BDMCMC algorithm for 250,000 iterations to sample graphs from the posterior distribution (1) on G 48 for the mobility tables for men and women. Figs C and D in S1 Supporting Information give the estimated posterior inclusion probabilities of the ð 48 2 Þ ¼ 1128 edges across iterations. We see that, after about 50,000 iterations, the subsequent posterior edge inclusion estimates stabilize. For this reason, the first 50,000 sampled graphs were discarded as burn-in, and the remaining 200,000 sampled graphs were used to estimate posterior edge inclusion probabilities.

Limitations
Representing residential locations data as mobility matrices leads to information loss, as follows: (i) the order in which an individual resides in two or more areas is no longer accounted for; (ii) residential movements that occur within the same area are missed; (iii) the amount of time an individual maintains a residence in the same area is overlooked; and (iv) the number of times an individual establishes a residence in the same area is lost. Although this loss of information can be seen as significant, the major advantage of our proposed methodology for analyzing repeated residential movements is its ability to capture repeated presence and absence patterns from several areas. For this purpose, mobility matrices suffice.
Another limitation is related to the graphs identified by structural learning in graphical loglinear models. The prior on the graph space (2) gives the same probability of existence of an edge between any two areas irrespective of the actual spatial distance between them. In this application, the use of this prior is justified: there is no reason to assume that more distant areas are less likely to be connected than areas that are closer to each other. In fact, as we will see in the Results section, the repeat-testers were more likely to make residential movements between more distant locations (e.g., a location inside the rural study area and another location outside the rural study area) than between less distant locations (e.g., two locations inside the rural study area). As such, while specifying a prior on the graph space that takes actual physical distances between areas into account is mathematically possible [75], the use of this type of spatial prior in this study was not necessary.
A third limitation of our study is related to mapping the locations inside the rural study area into 45 communities (spatial units), and of all the locations outside the rural study area into an additional spatial unit. These specific choices could induce biases related to the modifiable areal unit problem (MAUP) [76,77]. MAUP identifies the inevitable statistical bias that occurs due to scale (i.e., different sized spatial units) and zoning (i.e., different definitions of boundaries used to define spatial units). Due to MAUP, altering the choices of spatial units employed in a statistical analysis could potentially affect the results reported in a significant manner. However, in our application, the spatial units employed were not arbitrary: the 45 communities have not been defined for the purpose of this study alone. Instead, these communities were employed in several studies conducted in AHRI/PIPSA-see, for example, [69]. These communities have specific social, economic and demographic relevance for the rural study area. For this reason, reporting results based on spatial units constructed with respect to these 45 communities is meaningful.

Descriptive summaries
We recorded residency changes for 8,857 men over 35,500.01 person-years, and for 12,158 women over 57,945.35 person-years. The median observation period for men was 3.72 years (IQR = 4.00), while the median observation period for women was 4.41 years (IQR = 5.47). Tables 2, 3, 4 and 5 give cumulative durations of exposure of the repeat-testers stratified by age, calendar year, marital status and education level. The calculation of person-years is based on a random imputation of the seroconversion date between the date of the last negative and first positive test for HIV sero-converters [67], and on the date of the last negative test for those who are censored. We see that longer exposure periods are recorded for younger study participants between 15 and 24 years old. The length of exposure over calendar years remains relatively unchanged between 2005 and 2011, but has a slight tendency to decrease  Table 6 gives seroconversion rates stratified by gender, age (younger or older than 30 years at baseline), and residency outside the rural study area. The largest seroconversion rate 22.47% (95% CI: 21.49-23.45) is for young women who resided in the rural study area for their entire exposure period. The seroconversion rate for young women who resided outside the rural study area is slightly lower: 19.20% (95% CI: 17.88-20.52). The largest seroconversion rate for men is 13.24% (95%CI: 11. 76-14.72), and corresponds to the young group that moved outside the rural study area. The seroconversion rate for young men who did not move outside the rural study area is significantly lower: 7.56% (95% CI: 6.88-8. 24). Table 6 also shows that the seroconversion rates for both men and women in the older age group are higher for the repeat-testers that moved outside the rural study area as compared to the repeat-testers that did not move outside the rural study area. We determined the number of repeat-testers that moved their residence between any two communities, or between a community and a location outside the rural study area. The resulting mobility flow diagrams are shown in Figs 1 and 2. We see that, while men and women move between the 45 communities, substantially larger flows are associated with changes of residencies to and from locations outside the rural study area. Table 7 gives a summary of the frequency of residential movements inside the rural study area, and also between a location outside the rural study area and another location inside or outside the rural study area by age group and gender. Women in the 20-24 age group move outside the rural study area more often than men in the same age group (26.56% vs. 23.31%). Residential movements outside the rural study area become less frequent for women in the 25-29 age group, but are comparable in frequency with residential movements of men in the 25-29 age group. Men in the 30-34 age group move to and from locations outside the rural study area more frequently than women in the 30-34 age group. Residential movements outside the rural study area of women become significantly less frequent in the age groups 35-39, 40-44 and older than 45 as compared to residential movements of men in the same age group. Residential movements inside the rural study area of both men and women are substantially less frequent than residential movements to and from a location outside the rural study area in any age group. However, inside the rural study area, women tend to be more mobile than men in the younger age groups.
We remark that residential movements inside the rural study area occur over much smaller distances (mean = 10.44 km, IQR = 9.14 km) compared to residential movements that involve locations outside the rural study area (mean = 128.50 km, IQR = 178.33 km).  The repeat-testers are cross-classified by whether they moved outside the rural study area (Outside: Yes/No) and whether they were less than 30 years old at the start of the study (Young: Yes/No).

Graphical loglinear models for mobility tables
https://doi.org/10.1371/journal.pone.0217284.t006 includes the edges with estimated posterior inclusion probabilities greater than 0.5 as our estimate of the conditional independence graph. The median graph for men's mobility table has 995 edges, while the median graph for women's mobility table has 1,022 edges. We refer to these two graphs as men's and women's mobility graphs.
The overall structure of the two mobility graphs is remarkably similar. In the men's mobility graph, the vertex associated with the variable Outside is connected with the vertices associated with 33 out of the 45 communities-see the map from Fig E in S1 Supporting Information. The subgraph that involves vertices associated with the 45 communities is dense: it has 1,922 edges-97.07% of the 990 possible edges. In the women's mobilty graph, the vertex Outside is connected with vertices associated with 39 out of 45 communities-see the map from Fig F in S1 Supporting Information. The subgraph associated with the 45 communities is also dense: it has 1,962 edges-99.09% of the 990 possible edges. In both graphs, there is no edge between the vertices associated with variables Seroconverted and Young, and the community vertices. This implies that, conditional on the variable Outside, the variables Seroconverted and Young are independent of the community variables C1, . . ., C45 for both men and women. The most relevant differences between the two mobility graphs are related to the edges that link the variables Outside, Seroconverted and Young-see Figs 4 and 5. For men, vertex Outside is connected with vertex Seroconverted, but the edges between vertices Outside and Young, and between vertices Seroconverted and Young are missing. For women, the situation is reversed: the edges between vertices Outside and Young, and between vertices Seroconverted and Young are present, but the edge between vertices Outside and Seroconverted is missing. This has the following implications: (a) for men, variable Young is independent of variables Outside and Seroconverted; (b) for men, only variable Outside is predictive of variable Seroconverted; (c) for women, variable Young is predictive of variable Seroconverted; and (d) for women, given variable Young, variable Seroconverted is independent of variable Outside.
The presence of an edge between Outside and Seroconverted in the subgraph for men means that whether a man moved outside the rural study area is predictive of whether he seroconverts (unadjusted OR = 2.003, 95% CI = [1.718,2.332]). The absence of an edge between Young and Seroconverted in the same subgraph means that age has less predictive power for the HIV seroconversion of a man given that we know whether this man had a residence outside the rural study area. We point out that this does not imply that age is not a risk factor for HIV acquisition in men. For women, the relative predictive importance of moving outside the rural study area and age is reversed: the edge between Outside and Seroconverted is missing, while the edge between Young and Seroconverted is present. Whether a woman is younger than 30 years is predictive of whether she seroconverts (unadjusted OR = 3.091, 95% CI = [2.693,3.561]). However, given that we know the age of a woman, knowing whether she moved outside the rural study area has less predictive power for HIV seroconversion. As such, residential locations seems to matter less for women as a risk factor for HIV acquisition in the presence of age. As an aside, we mention that the presence of an edge that links vertices Outside and Young in the women's mobility graph makes sense: women younger than 30 years are more likely to move outside the rural study area (unadjusted OR = 3.176, 95% CI = [2.787,3.633]). This edge is missing in men's mobility graph because the relationship between variables Young and Outside is weaker (unadjusted OR = 1.306, 95% CI = [1.124,1.523]).
Since the structure of interactions among variables Outside, Seroconverted and Young is essential for our understanding of the mobility tables, we performed a second statistical analysis of the three-way tables cross-classifying these variables-see Tables A and B in Percentages of repeat-testers stratified by gender who changed residences between a location outside the rural study area and another location inside or outside the rural study area (outside residency changes, upper panel), or between two locations inside the rural study area (inside residency changes, lower panel) https://doi.org/10.1371/journal.pone.0217284.t007 S1 Supporting Information. This time we followed a classical approach to hierarchical loglinear model determination [7,78] that also solves the structural learning problem, but is conceptually different from the Bayesian approach implemented in the BDMCMC algorithm. We note that this classical approach is suitable for analyzing these two tables because they involve only three variables and they do not contain any counts of 0. However, this approach is not feasible for analyzing the 48-dimensional mobility tables for men and women due to sparsity and the number of variables involved. Specifically, we fitted the eight hierarchical loglinear models that contain main effects for variables Outside, Seroconverted and Young, and also one, two or all three of the pairwise interactions between these variables. The results are presented in Tables C and D in S1 Supporting Information.
For men, the loglinear model that contains interactions between variables Seroconverted and Outside, and between variables Outside and Young, and the loglinear model that contains all three pairwise interactions do not fit the data well: the p-values for the likelihood ratio test against the saturated loglinear model are 0.348 and 0.215, respectively. The other six hierarchical models fit the data well at the significance level α = 0.05. To select the most relevant model among the remaining six models, we calculated their AIC and BIC. The smallest values for both AIC and BIC are realized for the model that contains the interaction between Seroconverted and Outside, and no interaction involving variable Young. This is precisely the graphical loglinear model we determined before using the BDMCMC algorithm-see Fig 4. For women, the loglinear model that contains all three pairwise interactions does not fit the data well (pvalue = 0.264). The other seven hierarchical models fit the data well at the significance level α = 0.05. Among these seven models, the model that has the minimum value for both AIC and BIC contains interactions between variables Outside and Young, and between variables Seroconverted and Young. As for men, we found the same graphical loglinear model as we did before using the BDMCMC algorithm for the women's mobility table-see Fig 5.

Discussion
We proposed a framework for statistical analysis of repeated residential movements. In the first step, residential histories are converted into a mobility matrix that gives the presence and absence patterns from the areas in which study participants have lived. After the inclusion of additional categorical variables of interest, the resulting matrix is converted into a multidimensional contingency table called mobility table. The multivariate associations in this table are modeled with graphical loglinear models. The structure of the graphs that characterize these models induces independence or conditional independence relationships among the residential areas and the other categorical variables. This framework is able to explicitly account for individuals that moved across several areas. Existing models for human mobility are able to represent only the movement of an individual from one area to another area without consideration of the areas in which the individual has resided in the past. Our framework also goes beyond those approaches that involve the determination of mobility measures of different kinds; such measures loose an explicit connection with the areas in which residencies were located.
We used this framework to link human mobility and the risk of HIV acquisition based on data from a population-based cohort in a hyper-endemic, rural sub-Saharan African context. The residential locations occupied by every study participant were classified as outside or inside the rural study area. The residential locations inside the rural study area was further classified as belonging to one of 45 non-overlapping communities that fully cover the rural study area. We also included age (younger or older than 30 years at the start of the exposure period) as an additional risk factor for HIV acquisition.
We found that, for both men and women, the majority of residential moves involved a destination outside the rural study area, rather than a destination within the rural study area. Thus, households in the rural study area are typical net-senders of mobile individuals to destinations in the KwaZulu-Natal province, or to other, more distant places throughout South Africa [6]. This circular migration stream effectively links a poor, rural community with more affluent urban centers where many employment opportunities are usually available, and also with other rural areas that offer more specialized types of employment (e.g., mining).
Multivariate predictive relationships are revealed in the mobility graphs for men and women we identified. In both graphs, in order to reach any of the communities vertices C1, C2, . . ., C45 from the vertex Seroconverted by following paths of adjacent edges, we must first pass through the vertex Outside. Therefore, once we know whether a man or a woman moved outside the rural study area, knowing which communities inside the rural study area they lived in becomes less relevant for the purpose of predicting whether they seroconverted. For this reason the communities in which an individual resides seem to play a lesser role as risk factors for HIV seroconversion as compared with having a residence outside the rural study area. This finding is surprising because this rural study area has three large irregularly-shaped clusters of new HIV infections near a national road and in a rural node bordering a recent coal mine development [69]. These spatial areas are characterized by HIV incidence rates higher the other surrounding regions. We expected at least some of the communities spanned by these three clusters to be linked by an edge with vertex Seroconverted. However, none of these edges are present in the two mobility graphs. Consequently, while the places of residency inside the rural study area certainly play a role in predicting HIV acquisition risk given the significant clustering of HIV infections in this rural community, their predictive power vanishes when taking into account whether a study participant moved outside the rural study area. While this is true for both men and women, the predictive importance of having a residence outside the rural study area differs for men as compared to women. These differences are evidenced in the subgraphs of the two mobility graphs associated with variables Outside, Young and Seroconverted-see Figs 4 and 5.
Our results indicate that, even if the frequency, duration and distance traveled associated with residential moves is similar for men and women who live in this rural study area [6], there must exist key differences between the behavioral processes that lead to HIV seroconversion of mobile men and women. In order to formulate gender-specific combination HIV prevention strategies for high-risk mobile individuals, particularly in the light of attaining the UNAIDS 90-90-90 treatment targets [79], it is of paramount importance to understand these differences with respect to the complex network of structural, biological and socio-demographic factors that characterize places of residency outside the rural study area, and significantly alter the social context of mobile individuals [42]. Tanser.