Modeling of occupant behavior considering spatial variation: geostatistical analysis and application based on American time use survey data

Numerous occupant behavior (OB) models that simulate occupancy, activity and action at home have been developed to improve the accuracy and quality of energy demand estimations. Previous studies have revealed that the consideration of inter-occupant diversity improves the performance of OB models. However, existing models ignore spatial variation in OBs or partially consider it using a simple method without evaluating whether it is sufficient. Moreover, a modeling method to reproduce the spatial variation is missing. This study aims to develop a modeling method that can effectively reproduce spatial variation in OBs using American time use data. The global Moran's index test confirmed that spatial variations exist in OBs; however, they differ by time of day and activity for the studied population. Subsequently, two spatial variation representations were generated using the ordinary kriging and spatial autoregressive methods. Finally, three spatial logistic regression models that consider spatial variations were developed and evaluated. The developed models generated smaller errors and higher inter-occupant diversity than the conventional logistic regression models at the state level. The established method is applicable to any country and region. Using higher spatial resolution and richer time use datasets may further improve OB models to model region-specific characteristics of building energy demand.
© 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license.


Introduction
Over the past few decades, numerous occupant behavior (OB) models have been established to simulate the occupancy, activity, and behavior of building residents, to improve the prediction accuracy of building energy demands. OB is a major source of uncertainty in building energy demand modeling because energy-consuming appliances are generally operated to meet people's daily needs in response to activities performed by occupants, and building energy systems and indoor environments are adjusted by occupants for comfort [1]. Various methods have been applied to time use data integrated with additional survey data that cover social, economic, and building aspects, to develop representative OB models [2]. However, a significant gap exists between simulation and reality [3] owing to (1) the use of oversimplified assumptions, such as a fixed schedule rather than a dynamic schedule; (2) assumptions on when and how residents use appliances and building systems; and (3) neglect of inter-occupant diversity [4]. Although some studies have attempted to address the first two gaps, inter-occupant diversity, particularly in terms of spatial variation, has not been thoroughly investigated [5,6]. Druckman and Jackson [7] demonstrated that household energy use and the associated carbon emissions vary significantly with household socioeconomic conditions and locations. Rural/urban environments are another important factor in devising policies for a low-carbon society. Vega et al. [8] pointed out that although the spatial perspective has received limited attention in the literature, it is a significant factor in energy-related policy considerations. They observed that spatial factors are important, and ignoring them can lead to inaccurate conclusions. Furthermore, spatial variation also exists in time use. Several studies showed differences in the time use of occupants among countries, which revealed that spatial variation exists in the time spent on OBs [9][10][11]. Esteban et al. [12] showed that OBs are spatially varied across European countries in ways that cannot be effectively explained by economic or demographic differences. Such spatial variation in OBs may further occur within a country or even within a region. Studying how people spend their time over space provides an important perspective for understanding living conditions, economic opportunities, and general well-being. A consistent approach to empirically represent spatial variation in OB and to consider it in OB modeling is currently lacking, although useful spatial analysis and modeling methods have been developed in other fields.

⇑ Corresponding authors.
E-mail addresses: Lym19940224@yahoo.co.jp (Y. Li), yohei@see.eng.osaka-u.ac.jp (Y. Yamaguchi).
This study proposes and evaluates various methods learned from other fields for modeling OB considering spatial variation. The research gaps were addressed through three research questions: (1) When does spatial variation exist in OB? (2) How can spatial variation in OB be represented quantitatively? (3) How can spatial methods reproduce spatial variations in OB? We selected a spatial logistic regression model as the spatial method in this study as it is an extension of one of the most frequently used OB models. The remainder of this paper presents the methodology, results, and discussion, followed by our conclusions.

Review of methods for considering spatial variation in OB and energy modeling
Spatial variation essentially refers to the regularities or tendencies that the objects of study exhibit across a given space. Spatial variation can be represented and considered in modeling in different ways. There has been significant development in OB modeling that addresses space use. These space use studies considered spatial choice or individual preference based on geo-referenced data to determine space use [13,14]. Tabak [15] developed a model called the User Simulation of Space Utilization that simulates space utilization in an office building by calculating the distances between the locations of different activities based on measured data. In addition to spatial utilization, the mobility and occupancy patterns of people can also be estimated based on dynamic spatial choices or preferences [16][17][18][19][20][21]. However, the variation of OBs over space has not been considered in these studies.
Some studies have used spatial factors as independent variables to consider spatial variation during the modeling process and to enhance the inter-occupant diversity of the model [22]. Vega et al. [8] assessed various factors, including seven spatial factors (e.g., urban-rural gradient, city center, and village center), to develop a suitable policy for increasing the uptake of carbon emission reduction measures, and highlighted the importance of using spatial factors for designing energy policy frameworks. Marín-Restrepo et al. [23] identified OB patterns in office environments through data analysis and the Chi-squared test based on spatial (e.g., spatial layout and occupant orientation relative to control elements) and human factors. Wilke et al. [24] considered an independent variable that indicated whether an occupant lives in an urban/suburban area to simulate the starting probability of activities through a multinomial logit model. Okada et al. [25] applied the same method by considering city size as an independent variable to simulate the probability of undertaking activities. Rafiee et al. [26] revealed through regression analyses that spatial context (e.g., building density and urban form) is a significant determinant of household heat consumption. Abbasabadi et al. [27] presented an urban energy use model that captures both urban building operational energy and transportation energy consumption by localizing the energy performance data and considering various urban socioeconomic factors and spatial contexts (e.g., urban density and accessibility). Overall, little attention has been paid to spatial variation in OB modeling, and spatial variation has been insufficiently represented based on actual data in previous studies. Although some studies used spatial factors, modeling methods that better reproduce spatial variation in OB are lacking.

Review of methods for spatial analysis and modeling
Disciplines associated with the fields of epidemiology, environmental meteorology, and econometrics have applied sound spatial analysis methods to solve subject-specific problems [28][29][30][31][32][33][34][35]. This section summarizes such methods, which are used either to empirically represent spatial variation or to simulate the research object while considering spatial variation. Fig. 1 summarizes the methods.
Based on the mechanism and data input, the methods used in these studies can be classified as spatial interpolation and regression-based methods. Spatial interpolation methods simulate the spatial autocorrelation of surrounding observations to represent the spatial trend of the objects or to generate spatial predictions for unmeasured areas. Berke [36] applied trend surface analysis and universal kriging to simulate acid precipitation in Lower Saxony. Berke [37] also developed the modified median polish kriging method to generate more robust spatial predictions for the Wolfcamp Aquifer. Varouchakis [38] applied median polish kriging and sequential Gaussian simulation to explore the spatial distribution of source rock data in terms of total organic carbon weight concentration. Regression-based methods incorporate additional factors, such as sociodemographic variables, into the modeling process. Chasco et al. [35] analyzed the spatially varying impacts of some conventional factors, such as unemployment rate and average housing price, on the per capita household income in Spanish provinces based on geographically weighted regression. Xie et al. [39] employed spatial logistic regression to obtain the development patterns in regions and to assess the prognostic capacity of the model based on several factors such as population density and availability of usable sites. Paciorek [40] compared several models for fitting spatial logistic regression models and suggested that the spectral basis model provides the best compromise between the quality of fit and computational speed for the estimation of the spatial surface.
These spatial analysis methods may be useful in OB modeling; however, they have rarely been considered or applied in the energy field.

Data
The multiyear American Time Use and Leisure Activity Survey (ATUS0319) collected the activity diaries and sociodemographic conditions of the survey participants. These activity diaries were recorded for a 24-hour period beginning at 4:00 am on the survey day. The data collected between 2009 and 2019 were used to ensure the consistent coding of the variables. Although there were 124,941 participants in total, we used the data of women aged 30-59 living in the mainland U.S. to homogenize the sample and better observe the effects of spatial variation in time use. This subpopulation was selected because women generally conduct various activities involving both paid and unpaid work [41][42][43][44]. In particular, these unpaid activities (e.g., housework) may affect the operating conditions of many home appliances and building systems, thereby affecting residential energy demand. We checked statistically that this subpopulation features the highest level of unpaid work in the ATUS. In addition, this is supported by other empirical research on time use showing that women trigger residential peak electricity demand due to caring activities and unpaid work [45]. We also selected four typical activities (sleeping, cooking and washing up, watching television, and commuting) to evaluate the proposed method. Sleeping is a basic in-home activity, while commuting to work or school is one of the main out-of-home activities. These two activities had little influence on the use of home appliances. Watching television, cooking, and washing up are among the main indoor activities for women [46], and generally involve using appliances. Although the results are not sufficiently comprehensive to evaluate the applicability and usefulness of the proposed method for modeling the entire population and all activities, this design is sufficient to address the research questions described in the Introduction.
As a result of the selection of the subpopulation, the sample size was reduced to 36,438. However, this sample size is relatively large compared to many previous studies because we used eleven years of data, whereas single-year time use data have often been used in previous studies [47]. Notably, the ATUS sample is distributed approximately proportionally to each state's population, with the number of samples varying considerably from state to state, ranging from 55 to 3652. In addition, 70 % of each state's data were randomly selected as the training dataset and the remainder were used as the test dataset. This split between training and test datasets is in line with common practice for models that require training, and many previous studies used the same split [22,48,49].
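The per-state 70/30 split described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name `split_by_state`, the seed, and the toy state labels are assumptions introduced here.

```python
import numpy as np

def split_by_state(states, train_fraction=0.7, seed=0):
    """Return a boolean mask that is True for training rows, drawn per state."""
    states = np.asarray(states)
    rng = np.random.default_rng(seed)
    is_train = np.zeros(states.shape[0], dtype=bool)
    for state in np.unique(states):
        rows = np.flatnonzero(states == state)   # row indices of this state
        rng.shuffle(rows)                        # random order within the state
        n_train = int(round(train_fraction * rows.size))
        is_train[rows[:n_train]] = True          # first 70 % become training rows
    return is_train

# Toy example with two hypothetical states of different sample sizes.
states = np.array(["NY"] * 10 + ["WY"] * 20)
mask = split_by_state(states)
```

Splitting within each state, rather than over the pooled sample, keeps small states represented in both the training and the test datasets.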
We considered the states as the unit for modeling because the state was the only location information available for the entire nation. The location of each occupant was defined by the internal point of the state in which the occupant lived. Therefore, only one location point was used to represent each state, which smooths the spatial variations over the U.S. mainland. The 2018 cartographic boundary shapefile of the U.S. was used to visualize the spatial distribution of the probability on the map.
The spatial distribution of the activity probability at each time interval is referred to as the spatial probability in this study. Note that the 1-min resolution data in the time use diary were converted to hourly binary data by assigning 1 when an activity was conducted within a 1-hour interval delimited by clock times and 0 otherwise. Based on this principle, we quantified the probability that an activity occurs within each hour. In this paper, we refer to this probability as the "activity probability".
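As an illustration of this conversion, the sketch below turns diary episodes into 24 hourly flags and averages them over respondents. The episode format (activity name, start minute, stop minute, counted from 4:00 am) and the function names are assumptions made for this example; they are not ATUS variable names.

```python
import numpy as np

def hourly_flags(episodes, activity, n_hours=24):
    """Flag every clock hour in which `activity` occurs at least once."""
    flags = np.zeros(n_hours, dtype=int)
    for name, start, stop in episodes:
        if name != activity or stop <= start:
            continue
        first = start // 60          # hour slot containing the first minute
        last = (stop - 1) // 60      # hour slot containing the last minute
        flags[first:last + 1] = 1
    return flags

def activity_probability(all_flags):
    """Share of respondents performing the activity in each hour."""
    return np.mean(np.asarray(all_flags), axis=0)

# Two hypothetical diaries; minutes are counted from 4:00 am.
diary_a = [("sleeping", 0, 150), ("cooking", 150, 200), ("sleeping", 1080, 1440)]
diary_b = [("sleeping", 0, 60)]
flags = [hourly_flags(d, "sleeping") for d in (diary_a, diary_b)]
prob = activity_probability(flags)
```

Slot 0 corresponds to 4:00-5:00, so an episode that touches an hour for even one minute sets that hour's flag to 1.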

Method
When simulating OB, numerous stochastic models use several modeling parameters (e.g., probability of undertaking an activity, probability of starting an activity and corresponding duration) [22]. These modeling parameters were prepared during the presimulation process. Li et al. [22] revealed that many previous studies segmented the sample time use data and applied the logistic regression method to model the modeling parameters, to enhance the inter-occupant diversity originating from demographic and other influencing factors. Our modeling method follows this approach but adds a smooth function that represents the spatial variation in the modeling parameters. The whole methodology of this study is shown in Fig. 2. Steps 1-3 address the three research questions discussed in Section 1. Section 3.2.1 describes the segmentation of the time use data.
Sections 3.2.2 to 3.2.4 give a more detailed introduction to each step.

Segmentation
In the presimulation process, six groups were designed to represent different subpopulations of women. Each group was homogenized to avoid the influence of sociodemographic factors on the spatial variation, as shown in Table 1. The conditions for segmentation were the type of day (i.e., weekdays and weekends) and employment status, which are commonly used parameters in previous studies [19,20,23,24,50]. Groups 1 and 4 represent women with full-time jobs; Groups 2 and 5 represent women with part-time jobs; Groups 3 and 6 represent unemployed women. Groups 1-3 and Groups 4-6 comprise activities performed during weekdays and weekends, respectively.
Table 1 presents the total sample size for each group. The national-level sample size for each case satisfies the Whitemore formula [51] for most of the time intervals. However, some states, such as Delaware, District of Columbia, and Wyoming (numbers 10, 11, and 56), had small sample sizes, as shown in Fig. 3.
One approach to avoid a decrease in sample size is to use the group conditions as variables. To evaluate this approach, Group 7 was defined to represent the entire population of women aged 30-59 years, including Groups 1-6, using dummy variables representing each group in the developed spatial logistic regression model. The comparison makes it possible to determine which approach is superior for OB modeling: segmentation [52] or using the grouping conditions as variables. This analysis was conducted for the watching television activity.

Step 1: Existence of spatial variation
We employed the global Moran's index (Moran's I) test to confirm the time intervals during which the selected four activities exhibited spatial variation. The Moran's I test is used to verify the significance of the random distribution of qualitative determination in the areas of a map [53]. The Z score was calculated to evaluate the significance of Moran's I. If the Z score is not statistically significant (p > 0.05), it is probable that the objects are randomly distributed in space; if the Z score is positive and significant, the objects display a clustered distribution (similar tendency); if the Z score is negative and significant, the objects display a dispersed distribution (competitive tendency). The subsequent steps only considered the time intervals during which spatial variation existed.
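For readers who want to see the test in computational form, the following is a minimal sketch of global Moran's I with a Z score computed under the normality assumption. The binary adjacency weights and the five-unit chain used as toy input are assumptions of this example; the source does not state which weighting scheme or variance assumption was used.

```python
import numpy as np
from math import erf, sqrt

def morans_i(values, weights):
    """Global Moran's I with Z score and two-sided p-value (normality assumption)."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = x.size
    z = x - x.mean()
    s0 = w.sum()
    i_obs = (n / s0) * (z @ w @ z) / (z @ z)
    expected = -1.0 / (n - 1)                      # E[I] under no spatial autocorrelation
    s1 = 0.5 * ((w + w.T) ** 2).sum()
    s2 = ((w.sum(axis=0) + w.sum(axis=1)) ** 2).sum()
    variance = (n**2 * s1 - n * s2 + 3 * s0**2) / ((n**2 - 1) * s0**2) - expected**2
    z_score = (i_obs - expected) / sqrt(variance)
    # Two-sided p-value from the standard normal distribution.
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z_score) / sqrt(2.0))))
    return i_obs, z_score, p_value

# Toy input: five units in a chain, each adjacent to its immediate neighbors.
w = np.zeros((5, 5))
for k in range(4):
    w[k, k + 1] = w[k + 1, k] = 1.0
i_obs, z_score, p_value = morans_i([1.0, 2.0, 3.0, 4.0, 5.0], w)
```

A positive and significant Z score, as in this toy chain of steadily increasing values, corresponds to the clustered case described above.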

Step 2: Methods to represent spatial variation
Two representations of spatial variation that quantify the average probability of an activity in each state $s_i$ at each time interval were designed using the ordinary kriging and spatial autoregressive (SAR) methods. Because they measure the probability of an activity, their values were restricted to between 0 and 1. Furthermore, the ordinary kriging and SAR methods can generate representations at higher resolutions if detailed location data are available.
A) Ordinary kriging method.
The ordinary kriging method uses the observations of the surroundings to predict the values at unmeasured locations [54]. Considering a certain time interval during which spatial variation exists, the prediction $G_{s_0}$ for the location $s_0(u_0, v_0)$ is given by:

$$G_{s_0} = \sum_{j=1}^{N} \lambda_j \, G_{s_j}(u_j, v_j) \qquad (1)$$

where $G_{s_j}(u_j, v_j)$ is the average probability of an activity in the state $s_j$, represented by the internal point $(u_j, v_j)$; and $\lambda_j$ is the unknown weight, subject to $\sum_{j=1}^{N} \lambda_j = 1$ to obtain an unbiased estimate of $G_{s_0}$. We used the commonly used theoretical semivariogram, the spherical model, to estimate $\lambda$.
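The weighted-sum predictor and its unit-sum constraint can be illustrated with a small numerical sketch. The sill, range, and nugget values and the four toy points below are assumptions chosen for illustration; in practice the spherical semivariogram parameters would be fitted to the empirical semivariogram of the state-level probabilities.

```python
import numpy as np

def spherical(h, sill, range_, nugget=0.0):
    """Spherical semivariogram; gamma(0) = 0 by definition."""
    h = np.asarray(h, dtype=float)
    inside = nugget + sill * (1.5 * h / range_ - 0.5 * (h / range_) ** 3)
    gamma = np.where(h <= range_, inside, nugget + sill)
    return np.where(h == 0.0, 0.0, gamma)

def ordinary_kriging(points, values, target, sill, range_, nugget=0.0):
    """Solve the ordinary kriging system and return (estimate, weights)."""
    points = np.asarray(points, dtype=float)
    values = np.asarray(values, dtype=float)
    n = points.shape[0]
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    a = np.ones((n + 1, n + 1))            # last row/column enforce sum(weights) = 1
    a[:n, :n] = spherical(dists, sill, range_, nugget)
    a[n, n] = 0.0
    b = np.ones(n + 1)
    to_target = np.linalg.norm(points - np.asarray(target, dtype=float), axis=1)
    b[:n] = spherical(to_target, sill, range_, nugget)
    solution = np.linalg.solve(a, b)       # weights followed by the Lagrange multiplier
    lam = solution[:n]
    estimate = float(lam @ values)
    return min(max(estimate, 0.0), 1.0), lam   # keep the probability within [0, 1]

# Hypothetical internal points of four states and their activity probabilities.
pts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
obs = [0.10, 0.20, 0.15, 0.30]
center_estimate, center_weights = ordinary_kriging(pts, obs, [0.5, 0.5], sill=1.0, range_=3.0)
```

With a zero nugget, the predictor reproduces the observed value at an observed location, so smoothing of noisy state-level observations comes from the fitted nugget and range rather than from the predictor itself.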

B) Spatial autoregressive method.
The SAR method is used to examine the impact of the probability of an activity in one state on the neighboring states while including other factors in the modeling process. It is generated based on the cross-sectional spatial model defined by Equation (2):

$$y_{s_0} = \lambda W y_{s_0} + x\beta + \varepsilon \qquad (2)$$

where $y_{s_0}(u_0, v_0)$ is the average probability of an activity in the state $s_0$; $x$ represents the variables, with coefficients $\beta$ and error term $\varepsilon$; $W$ is the weighting matrix constructed in the form of adjacent edges or points corresponding to each state; and $\lambda$ is a scalar autoregressive parameter. The variable $W y_{s_0}$ is the spatial lag of $y_{s_0}$.
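To make the role of the spatial lag concrete, the sketch below builds a contiguity weighting matrix and evaluates the reduced form of a spatial-lag model. It assumes the standard form y = λWy + xβ + ε with a row-standardized W, and it takes λ and β as given; estimating them (typically by maximum likelihood in a statistical package) is not shown. The neighbor list, covariates, and coefficient values are hypothetical.

```python
import numpy as np

def row_standardized_weights(neighbors, n):
    """Binary contiguity matrix scaled so that each non-empty row sums to 1."""
    w = np.zeros((n, n))
    for i, js in neighbors.items():
        for j in js:
            w[i, j] = 1.0
    row_sums = w.sum(axis=1, keepdims=True)
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

def sar_reduced_form(w, x, beta, lam):
    """Expected y from y = lam * W y + x beta, i.e. (I - lam W)^-1 x beta."""
    n = w.shape[0]
    return np.linalg.solve(np.eye(n) - lam * w, x @ beta)

# Three hypothetical states in a row: 0 - 1 - 2.
w = row_standardized_weights({0: [1], 1: [0, 2], 2: [1]}, n=3)
x = np.array([[1.0, 0.2], [1.0, 0.5], [1.0, 0.9]])   # intercept + one covariate
beta = np.array([0.05, 0.10])
y = sar_reduced_form(w, x, beta, lam=0.4)
spatial_lag = w @ y                                   # neighbor average of y
```

With λ = 0 the model collapses to an ordinary linear model; a positive λ pulls each state's value toward the average of its neighbors, which is what allows clustered patterns to be represented.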

Step 3: Spatial logistic regression
In this study, we developed three spatial logistic regression models through Equation (3). In Model 1, spatial variation is represented by regional dummy variables, with the southern region being the reference group; $\gamma$ is the corresponding coefficient for each regional dummy variable. In Models 2 and 3, the estimates of $G_{s_i,t}$ were extracted from the ordinary kriging and SAR results in Step 2 to represent the spatial variation.
Stepwise analysis was applied to all the models to statistically test the significance of the considered variables, including the spatial factors and representations.
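The idea of adding a spatial term to a logistic regression can be sketched as below. This is not the model specification of this study: the functional form (a linear predictor with one sociodemographic dummy plus one spatial representation value per respondent), the synthetic data, and all names are assumptions made for illustration, and stepwise variable selection is omitted.

```python
import numpy as np

def fit_logistic(features, outcomes, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson iterations."""
    x = np.column_stack([np.ones(len(outcomes)), features])   # add intercept
    y = np.asarray(outcomes, dtype=float)
    beta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(x @ beta)))
        gradient = x.T @ (y - p)
        hessian = x.T @ (x * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def predict(beta, features):
    """Predicted activity probability for each row of `features`."""
    x = np.column_stack([np.ones(len(features)), features])
    return 1.0 / (1.0 + np.exp(-(x @ beta)))

# Synthetic respondents (all values hypothetical).
rng = np.random.default_rng(0)
n = 4000
employed = rng.integers(0, 2, size=n)        # sociodemographic dummy
g_state = rng.uniform(0.05, 0.30, size=n)    # spatial representation of the home state
true_logit = -2.0 + 0.5 * employed + 4.0 * g_state
watched = rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))
features = np.column_stack([employed, g_state])
beta = fit_logistic(features, watched)
p_hat = predict(beta, features)
```

Replacing the `g_state` column with regional dummy variables would give a variant in the spirit of Model 1, whereas a kriging-based or SAR-based value per state corresponds to the idea behind Models 2 and 3.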

Performance assessment
The performance of the models was assessed in terms of the reproducibility of the spatial variations in OB and the comprehensive performance. The ordinary kriging method was applied to visualize the spatial probability, thereby assessing the reproducibility of the spatial variation. The comprehensive performance was evaluated using indicators that assess the error and inter-occupant diversity for the training and test datasets.

Error indicators
Total absolute error (TAE) and root mean squared error (RMSE) were considered to measure the error between the estimations obtained from the models and the observations. These two indicators were quantified at the national and state levels. Previous studies only considered the national level, which measures the error for each time interval. At the state level, the errors were quantified for each combination of time interval and state, so they additionally capture the error arising from spatial variation.
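The difference between the two levels can be illustrated with a small sketch. The aggregation used here for the national level (a sample-size-weighted average over states for each time interval) and the toy numbers are assumptions of this example.

```python
import numpy as np

def tae(estimated, observed):
    """Total absolute error over all entries."""
    return float(np.abs(np.asarray(estimated) - np.asarray(observed)).sum())

def rmse(estimated, observed):
    """Root mean squared error over all entries."""
    diff = np.asarray(estimated) - np.asarray(observed)
    return float(np.sqrt(np.mean(diff ** 2)))

# Rows are two hypothetical states, columns are two time intervals.
est_state = np.array([[0.1, 0.2], [0.3, 0.4]])
obs_state = np.array([[0.2, 0.2], [0.1, 0.4]])
n_samples = np.array([100, 300])

# State level: one error term per (state, time interval) combination.
tae_state = tae(est_state, obs_state)
rmse_state = rmse(est_state, obs_state)

# National level: aggregate over states first, then compare per time interval.
est_nation = np.average(est_state, axis=0, weights=n_samples)
obs_nation = np.average(obs_state, axis=0, weights=n_samples)
tae_nation = tae(est_nation, obs_nation)
```

Because over- and underestimates in different states can cancel out in the national average, the state-level error is at least as large as the national one in this toy case.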

Inter-occupant diversity indicators
Inter-occupant diversity indicators assess the ability of the model to represent the total variations of OB among the simulated occupants. The indicator RMSE_GA [22], based on the Hosmer-Lemeshow test [55], measures the root mean squared error between the averaged estimated probability and the averaged observed probability of different subdivisions, and is given by Equation (4):

$$\mathrm{RMSE\_GA} = \sqrt{\frac{1}{D} \sum_{d=1}^{D} \left( \bar{p}^{\,\mathrm{est}}_{d} - \bar{p}^{\,\mathrm{obs}}_{d} \right)^{2}} \qquad (4)$$

where $\bar{p}^{\,\mathrm{est}}_{d}$ and $\bar{p}^{\,\mathrm{obs}}_{d}$ are the averaged estimated and observed probabilities in subdivision $d$, and $d$ denotes the subdivision ($D = 10$). RMSE_GA was only quantified at the national level owing to data limitations. To compare the inter-occupant diversity at the state level, another indicator, the mean standard deviation (MSD), was used to measure the deviation of each estimation from the mean at the national and state levels.
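A group-averaged error of this kind can be sketched as follows. Forming the subdivisions by sorting on the estimated probability and cutting into ten groups of near-equal size follows the usual Hosmer-Lemeshow convention and is an assumption here, as are the function name and the toy inputs.

```python
import numpy as np

def rmse_ga(estimated, observed, n_groups=10):
    """RMSE between group-averaged estimated and observed values.

    Rows are sorted by estimated probability and cut into `n_groups`
    subdivisions of near-equal size; needs at least `n_groups` rows.
    """
    est = np.asarray(estimated, dtype=float)
    obs = np.asarray(observed, dtype=float)
    order = np.argsort(est, kind="stable")
    squared_gaps = [
        (est[idx].mean() - obs[idx].mean()) ** 2
        for idx in np.array_split(order, n_groups)
    ]
    return float(np.sqrt(np.mean(squared_gaps)))

# Toy check: estimates that are uniformly 0.05 too high in every subdivision.
est = np.linspace(0.05, 0.95, 20)
gap = rmse_ga(est, est - 0.05)
```

A model that assigns nearly the same probability to everyone can still have a small error on the overall mean, but it will show a large gap in the lowest and highest subdivisions, which is why the indicator reflects inter-occupant diversity.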

Confirmation of the existence of spatial variation
Fig. 4 shows the representative probabilities of the women in Group 4 sleeping, those in Group 3 cooking and washing up, those in Group 6 watching television, and those in Group 1 commuting. As shown in Fig. 4, the probability of activities varied among the states at different times of the day. Such variation results from the combination of differences in sociodemographic variables, spatial variation, and sampling error [56], according to Equation (3). The effect of the first element was reduced by segmentation.
Spatial variations for each time interval for all combinations of groups and activities were confirmed using Moran's I test. Fig. 5 summarizes the results of the Moran's I tests. The results showed that spatial variation existed only during limited time intervals and varied with the type of day (weekdays or weekends), subpopulations with different employment statuses, and activities. For example, spatial variation in sleep existed at different time intervals for different groups because these groups may have different sleeping styles, containing various sub-activities, such as lying awake and napping, in the ATUS dataset. As shown in Fig. 5, considering sleeping on weekdays, employed women in Group 1 exhibited less spatial variation than unemployed women in Group 3 during the relevant time intervals. On weekends, the number of time intervals with spatial variation was the same irrespective of employment status.
Considering cooking and washing up, unemployed women in Group 3 exhibited more spatial variation during the weekdays, whereas women with full-time jobs in Group 4 exhibited more spatial variation during the weekends. No spatial variations existed for women with full-time jobs in Group 1 on weekdays, or for unemployed women in Group 6 on the weekends. Considering watching television, women with part-time jobs in Group 5 did not exhibit any spatial variation during the weekends. Women with part-time jobs in Group 2 also exhibited low spatial variation during the weekdays. Considering commuting, irrespective of their employment status, women in Groups 1-3 exhibited more spatial variation during the weekdays than those in Groups 4-6 during the weekends. Women with part-time jobs in Group 5 did not exhibit any spatial variation during the weekends.
In most time intervals, the spatial variation exhibited a clustered distribution, with only limited time intervals exhibiting a dispersed distribution. Fig. 6 illustrates the probability of the women in Group 6 watching television at 13:00. An obvious clustered distribution can be observed at the state level. The observed spatial probability ranged from 0 to 21 %.

Representations of spatial variation
Fig. 7 shows the spatial probability of the women in Group 6 watching television at 13:00 based on the representations of the spatial variation generated by the ordinary kriging and SAR methods. The kriging-based representation ranges from 6 to 14 %, whereas the SAR-based representation ranges from 4 to 17 %. Both ranges are narrower than that of the observations shown in Fig. 6. The kriging-based representation can simulate the changing tendencies of spatial probabilities. However, the clustered pattern was not identified. The SAR-based representation can provide more accurate estimations for certain states, while also providing a better representation of the cluster areas. Furthermore, we compared the two representations considering all the combinations of groups, activities, and states. Regarding TAE and RMSE at the state level, the kriging-based representation yielded 126.5 and 9.9 %, and the SAR-based representation yielded 61.2 and 3.0 %, respectively.
To better understand the cause of the error, Fig. 8 shows the observed probability, its 95 % confidence interval, and the estimated probabilities of each state. Note that the observations of some states contain large sampling errors because of their small sample size. Some states had a probability of 0 because activity occurrence was not observed, which could also be attributed to the small sample size. As the error indicators were quantified based on the difference from the observed probabilities, they were at the scale described above. However, as shown in the figure, the two representations are within the 95 % confidence intervals of most states.
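The dependence of the confidence interval on sample size can be made concrete with a short sketch. The source does not state which interval was used; the normal-approximation (Wald) interval below is an assumption, and the counts are hypothetical.

```python
from math import sqrt

def wald_interval(successes, n, z=1.96):
    """Normal-approximation 95 % confidence interval for a proportion."""
    p = successes / n
    half_width = z * sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# The same observed probability with a small and a large state sample.
small_state = wald_interval(6, 60)       # 10 % observed among 60 respondents
large_state = wald_interval(300, 3000)   # 10 % observed among 3000 respondents
no_occurrence = wald_interval(0, 55)     # activity never observed
```

Under this approximation the interval collapses to zero width when no occurrence is observed, so an observed probability of 0 in a small state carries no visible error bar even though its true uncertainty is large.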

Reproduction of the spatial variation
The reproducibility of the spatial variation by the developed spatial logistic regression models was evaluated based on four representative cases: (a) sleeping at 8:00 in Group 4; (b) cooking and washing up at 12:00 in Group 3; (c) watching television at 13:00 in Group 6; and (d) commuting at 10:00 in Group 1. These representative cases were selected from the time intervals with spatial variation, based on their high probability compared to other intervals, to compare the model performance. Fig. 9 illustrates the spatial probability of activity in each of the four cases, based on the observations and estimations. The visualization of the spatial variations in all the subfigures was interpolated using the ordinary kriging method. Considering the reproduction of the spatial variations in these four cases, the spatial distributions determined by the three spatial logistic regression models were more consistent with the observations than those determined by the reference model. However, Model 2 for Case (b) and Model 3 for Case (c) yielded the same results as the reference model. This is because the spatial representations, $g(s_i, h)$, were eliminated during the stepwise process. The reference model also showed limited spatial variations (see subfigures in Fig. 9 for Cases (b) and (c)), which is attributed to the variations in sociodemographic variables.
As shown in Fig. 9, neither the reference model nor the spatial logistic regression models were adequately consistent with the observations. To understand the reason, Fig. 10 shows the probabilities of each state. As shown in the figure, most of the estimated probabilities shown by the lines fall within the 95 % confidence intervals of the observations. The observations of states with a small sample size had either larger error bars or no error bars (probability = 0). Thus, the estimations were different from the observations for these states. However, the spatial logistic regression models are still more consistent with the observations than the reference model.

Comprehensive performance
Fig. 11 shows the stacked values of the performance indicators quantified at the national level, TAE_nation, RMSE_nation, MSD_nation, and RMSE_GA_nation, for all the models considering the six groups in the training and test datasets. The indicators are the cumulative values quantified for each activity and group combination. As shown, the error indicators exhibited similar performances with all the models for almost all the combinations of the training and test datasets. Considering the inter-occupant diversity, Models 3 and 1 exhibited a 7 % higher MSD_nation than the reference model. Considering RMSE_GA_nation, all the models exhibited similar results with both the training and test datasets.
Fig. 12 illustrates the TAE, RMSE, and MSD values of the models for the six groups quantified at the state level. As shown in Fig. 11 and Fig. 12, the magnitudes of the error indicators increased from the national level to the state level; however, MSD exhibited the opposite trend.
Considering the error indicators, improvements were observed in the spatial logistic regression models compared to the reference model. Model 3 exhibited the greatest improvement compared to the reference model, reducing the stacked TAE_state value by 9.9 and the stacked RMSE_state value by 11 % for the training dataset. This was followed by Model 1 (stacked TAE_state decreased by 4.4 and stacked RMSE_state decreased by 3.6 %) and Model 2 (stacked TAE_state decreased by 3.2 and stacked RMSE_state decreased by 2.1 %). However, the spatial logistic regression models, particularly Model 3, did not provide such advantages with the test dataset. Considering MSD, the spatial logistic regression models, particularly Models 1 and 3, performed better than the reference model with both the training and test datasets.
The above results are confirmed in Fig. 13, which shows the accuracy evaluations of each model at the state level. The estimations and observations were transformed using the base-10 logarithm. Two $R^2$ values, with and without logarithmic transformation, were quantified. All the models exhibited high accuracies. However, the points in the reference model were relatively scattered compared to those in the spatial logistic regression models. The spatial logistic regression models, especially Model 3, exhibited relatively higher $R^2$ values than the reference model.

In this section, the spatial logistic regression model was applied to Group 7, the entire population of women aged between 30 and 59 years, for watching television. The Moran's I test results indicated that spatial variation existed during the time intervals 9:00-17:00 and 22:00-0:00. Therefore, the spatial logistic regression models were developed and assessed only for these time intervals.
Fig. 14 shows the same type of visualization map as Fig. 9 for the spatial probability of watching television at 13:00, based on the observations and estimations of Group 7. The range of probability is narrower than that in Fig. 9 for Group 6 because Groups 1-6 were combined. The spatial logistic regression models, especially Model 3, showed a more accurate spatial distribution relative to the observations than the reference model. Table 2 shows the performance of all the models evaluated by the indicators, considering all the time intervals that exhibited spatial variation. The models performed effectively with Group 7. At the national level, all the models exhibited the same performance in terms of errors and MSD. However, the reference model showed a relatively lower RMSE_GA compared to the spatial logistic regression models. At the state level, the spatial logistic regression models exhibited lower TAE and RMSE values than, and similar MSD values to, the reference model.

Comparison of the segmentation approach and the approach using grouping conditions as variables
Fig. 15 depicts the accuracy, after base-10 logarithmic transformation, of Model 3 for watching television, considering Group 7 and different subpopulations at the state level. Only the time intervals that exhibited spatial variation for both Group 7 and the subpopulations of Groups 1-6 were considered in this analysis. Model 3 developed for Group 7 was applied to the subpopulations of Groups 1-6 for the corresponding time intervals to obtain estimations based on Group 7. The thick black line shown in the two subfigures of Fig. 15 represents the fitted line of the estimations obtained from Model 3 considering Group 7, which indicates the approach that uses variables, and the thick dashed line represents the estimations obtained from Model 3 considering the subpopulations, which indicates the approach using segmentation. The thin black line is the reference line, $y = x$. Model 3 fitted the observations well for both the entire population and the subpopulations. However, the thick dashed line was slightly closer to the reference line than the thick black line, which implies that the estimations obtained from Model 3 through segmentation were relatively more accurate than those obtained from the variable-based approach. Table 3 compares the comprehensive performance of the two approaches through the statistical indicators for all models at the state level. According to Table 3, all models performed adequately for both approaches. However, the segmentation-based approach yielded smaller TAE and RMSE for all the models. In contrast, for the inter-occupant diversity assessed by MSD, the variable-based approach showed a relatively better performance.

Discussion of the results
This study demonstrated the existence of spatial variations in OBs and established a modeling method to consider these spatial variations. The established method is an extension of the existing modeling method (i.e., the logistic regression method combined with time use data sample segmentation in the presimulation process). We confirmed with women aged 30-59 in the U.S. for the four representative activities that the method contributes to better reproduction of spatial variation and enhancement of inter-occupant diversity in OB modeling. The Moran's I tests in Section 4.1 showed that spatial variation exists and that it differed according to the time of day and activity for different study populations. Therefore, spatial variation should be carefully considered in OB modeling. To this end, SAR-based and kriging-based spatial representations were developed to better represent spatial variation empirically and were used in the subsequent spatial logistic regression models. The results in Section 4.2 showed the SAR-based representation to be superior to the kriging-based representation because the former accounts for the variation in other sociodemographic factors. Note that the representations deviated from the observations of each state, as shown in Fig. 8, particularly for those with small sample sizes. However, as the representations were within the 95 % confidence intervals of the observations, using the two representations helped to avoid carrying the effect of sampling error into the subsequent logistic regression modeling. If the location data required to develop a spatial representation are insufficient, spatial factors can be used to represent spatial variation for model development, as in the case of Model 1.
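As an illustration of the global Moran's I statistic mentioned above, the following sketch computes it from its standard definition, I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, where z_i are deviations from the mean and S0 is the sum of all weights. The spatial weight matrix, the permutation-based pseudo p-value, and the function names are assumptions made for illustration; the weights and the inference procedure used in this study may differ.

```python
import random

def global_morans_i(values, weights):
    """Global Moran's I for values at n locations.

    weights[i][j] is the spatial weight between locations i and j
    (hypothetical here, e.g. 1 for neighboring states and 0 otherwise).
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in z)

def permutation_p_value(values, weights, n_perm=999, seed=0):
    """One-sided pseudo p-value for positive spatial autocorrelation.

    Assumption: significance is judged by randomly reassigning the values
    to locations; an analytical normal approximation is an alternative.
    """
    rng = random.Random(seed)
    observed = global_morans_i(values, weights)
    shuffled = list(values)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if global_morans_i(shuffled, weights) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)
```

In such a sketch, `values` could hold the state-level activity probability at one time interval, so the test would be repeated for each activity and time of day.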
As discussed in Section 4.3, the developed spatial logistic regression models improved the inter-occupant diversity: compared to the reference model, the single-activity MSD for subpopulations improved by 0.6 %, and the stacked MSD for all combinations improved by 12.5 % at the state level with the training dataset. In particular, the developed models better reproduced the spatial variation of OB, as the error was further reduced (RMSE decreased by 0.3 %, and stacked RMSE decreased by 5.6 %). However, as shown in Figs. 10 and 12, the estimated results deviated significantly from the observed probabilities at the state level, mainly due to sampling error, as observed for the spatial representations. Note that the estimated probabilities were close to the estimation results of kriging and SAR; TAE and RMSE were 89.5 % and 9.8 % for kriging and 30.2 % and 1.8 % for SAR, respectively. In addition, owing to the influence of sampling errors, the results for the test dataset showed significant differences compared to the training dataset. This result implies that the segmentation approach is disadvantageous as it involves more sampling errors. The variable-based approach examined in Group 7 was useful for increasing the number of samples for each location. As discussed in Section 4.4, the variable-based approach was effective as it approximately reflected the inter-occupant diversity, and the error was only marginally larger than that of the segmentation-based approach (the stacked TAE and RMSE increased by 1.3 % and 0.1 %, respectively).
To overcome the sampling error issue, it is important to ensure that a sufficient number of samples is available for each study location. The 95 % confidence interval was calculated as $p \pm 1.96\,SE$ with $SE = \sqrt{p(1-p)/n}$, where p is the activity probability, SE is the standard error, and n is the sample size. Fig. 16 shows the required number of samples for the corresponding width of the confidence intervals based on this calculation. As shown, to narrow the width of the confidence interval by 10 times, the required sample size needs to be increased by nearly 100 times. To obtain enough samples, it would be effective to 1) use a variable-based approach instead of a segmentation approach, 2) use multiple-year time use data, and 3) merge neighboring areas. The last method is important when high spatial resolution data are available because considering spatial variation at a detailed level reduces the sample size per location. In this case, using spatial representation methods is effective in obtaining the spatial distribution of the modeling parameters of OBs.
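The relationship between sample size and interval width can be sketched as follows, assuming the usual normal-approximation interval p ± 1.96 SE with SE = sqrt(p(1 - p)/n). The function names are hypothetical, and "width" is taken here to mean the full width (upper bound minus lower bound), which may differ from the definition used for Fig. 16.

```python
import math

Z_95 = 1.96  # standard normal quantile for a two-sided 95 % interval

def ci_half_width(p, n, z=Z_95):
    """Half-width of the normal-approximation confidence interval."""
    return z * math.sqrt(p * (1.0 - p) / n)

def required_sample_size(width, p=0.5, z=Z_95):
    """Smallest n whose full interval width (upper - lower) is <= width.

    Solving 2 * z * sqrt(p * (1 - p) / n) <= width for n gives
    n >= p * (1 - p) * (2 * z / width) ** 2.
    p = 0.5 is the worst case because it maximizes p * (1 - p).
    """
    return math.ceil(p * (1.0 - p) * (2.0 * z / width) ** 2)

# Example: sample sizes for progressively narrower intervals.
for width in (0.2, 0.1, 0.02, 0.01):
    print(width, required_sample_size(width))
```

Because n scales with the inverse square of the width, a tenfold narrower interval requires about a hundredfold larger sample, which is consistent with the statement above.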

Limitation and future work
As mentioned in Section 3 and discussed in Section 5.1, this study used a limited sample extracted from ATUS data, representing women from states of the U.S. mainland for a limited number of activities, together with low-resolution location data. Therefore, the observations used to develop the models contain non-negligible sampling errors. Thus, the developed spatial logistic regression models showed large errors relative to the observed probabilities and no significant improvement with the test dataset. Further studies are required to address this issue.
Nevertheless, the developed modeling method can generate better results than traditional logistic regression methods, as revealed in Section 4.3. As time use data or equivalent datasets have been collected in many countries, the developed modeling method can be applied to different regions. For example, it can be used to show the differences in OBs between the areas in which lockdowns were implemented and those in which lockdowns were not implemented after the COVID-19 pandemic, thereby providing more useful references for relevant institutions. However, detailed information relevant to housing, households, and the environment should be supplemented by combining the data collected at the local level. Similarly, reliable new samples should be generated to enrich the sample size and represent spatial variation at the local level. In addition, advancements in geographic information systems are making high-resolution location data increasingly available. Thus, if the above conditions are satisfied, spatial representations can be generated with higher accuracy at the zip code or even household level, and the subsequent spatial logistic regression methods can facilitate further improvements.

Conclusion
Existing OB models lack a comprehensive and systematic consideration of spatial variation. These models were primarily established within limited locations based on geo-referenced data to determine space use or to simulate occupant mobility. Some studies used spatial factors, which consider the spatial variation in OBs or energy demand only insufficiently. However, the real spatial distribution of OBs has not been comprehensively investigated, and modeling methods that reproduce spatial variation in OBs are yet to be developed. This study showed that spatial variation exists in OBs and developed new OB models that can consider spatial variation. The developed models significantly enhanced the reproducibility of spatial variations in OBs and generated smaller errors at the state level than the conventional logistic regression model. The developed modeling method is an extension of the existing logistic regression method and can be applied in different countries for any application context (i.e., any spatial scale and population). However, our results were obtained with limited samples at the state level from the ATUS data and low-resolution location data. Model performance may be improved with high-resolution location data and with behavioral data that offer richer information and larger sample sizes. Therefore, with more comprehensive considerations of spatial variation in the new OB model, location-based OB patterns can be generated, which can be used in future studies to simulate more realistic energy demand profiles and to develop region-sensitive energy policies.

CRediT authorship contribution statement
Energy & Buildings 281 (2023) 112754. Contents lists available at ScienceDirect. Energy & Buildings journal homepage: www.elsevier.com/locate/enb

Fig. 1. Summary of the methods for spatial analysis and modeling.

Fig. 4. Probability of activities considering representative groups. The differently colored lines represent the different states and the black line represents the national estimate.

Fig. 6. Spatial probability of the women in Group 6 watching television at 13:00 at the state level based on observations.

Fig. 7. Spatial probability of the women in Group 6 watching television at 13:00 based on representations of the spatial variation generated by the ordinary kriging and SAR methods.

Fig. 8. Probability of activity of the women in Group 6 watching television at 13:00 based on representations and observations. Error bars indicate the 95% confidence intervals of the observations.

Fig. 9. Comparison of the spatial probability of activities based on the observations and reproductions of the spatial variation by the reference model and the three spatial logistic regression models for Cases (a)-(d), respectively. The spatial distribution results were interpolated by the ordinary kriging method.

4.4. Evaluation of spatial logistic regression models applied to the entire population
4.4.1. Application of Group 7

Fig. 10. Probability of activity of each state for cases used in Section 4.3.1. Error bars indicate the 95% confidence intervals.

Fig. 11. Results of indicators at the national level for all the models in the training and test datasets.

Fig. 12. Results of indicators at the state level for all the models in the training and test datasets.

Fig. 13. Accuracy of the spatial logistic regression models at the state level. The horizontal axis shows the observation probabilities of the different combinations of the groups, states, and activities. The vertical axis shows the estimations. The black line is the reference line y = x. Logarithmic transformation was performed in the range (−4, 0) × (−4, 0).

Fig. 14. Spatial probability of the women in Group 7 watching television at 13:00 based on observations and estimations.

Fig. 15. Accuracy of Model 3 at the state level, considering two approaches (variables and segmentation). The different colors in the figure represent different groups. The circular and triangular shapes represent the entire population and the subpopulations, respectively. Logarithmic transformation was performed in the range of (−2, −0.5) × (−2, −0.5).

Fig. 16. Required sample size for the corresponding width of the confidence interval. p is set to 0.5, which gives the maximum value of p(1 − p).

Table 1
Groups and their details.
Weekdays; Full-time; Survey year, age, presence of children, family income, carer, education, ownership of the housing unit, number of people in the household, region, and state
Entire population of women aged 30-59; Items in Groups 1-6, as well as employment status and type of day; 36,438

Fig. 3. Sample size of each group for each state.

Y. Li, Y. Yamaguchi, J. Torriti et al. Energy & Buildings 281 (2023) 112754

Table 2
Results of indicators considering all the models with Group 7 at the national and state levels.RMSE_GA was calculated only at the national level.

Table 3
Comparison of the approaches through statistical indicators at the state level.