Migration nowcasting using Google Trends: cross-country application

Analysis of migration flows is crucial for understanding and forecasting social and economic trends. This paper presents an algorithm for obtaining migration estimates with minimal time delay (nowcasting) using Google Trends Index (GTI) search queries. The predictive power of the models is assessed across different periods, including one marked by the restrictions imposed due to the COVID-19 pandemic, which significantly impacted migration opportunities. The paper evaluates models for estimating migration from six different countries to Germany. The key findings are as follows: first, in periods free from external shocks, using a single search query such as «work in Germany» in the official language of the migration origin country, along with its 12-month lags in SARIMAX or distributed lag models, yields higher accuracy in migration estimates compared to SARIMA models. Second, during periods with external shocks, a multi-query distributed lag model, which incorporates additional search queries related to migration intentions, demonstrates superior predictive quality. Finally, the paper proposes an enhanced method for migration forecasting based on GTI data. It highlights the importance of using a distributed lag model, which includes multiple GTI time lags, rather than models with individual GTI lags. Models employing GTI with lags consistently show better predictive power than SARIMA models across all countries and time periods considered.


Introduction
Migration significantly impacts demographic, economic, social, and other public policies in both destination and origin countries. However, migration processes are often influenced by external shocks such as pandemics, armed conflicts, and natural disasters. These shocks can undermine the reliability of migration forecasting and threaten the economic stability of states by accelerating the outflow of the working-age population. Timely analysis of migratory flows and migration intentions is essential for an effective policy response to emerging shocks, helping to mitigate economic risks for affected states. However, such analysis is challenging due to difficulties in data collection, delayed release of information, and the fragmentary nature of official statistics, which often capture only a limited range of migratory movements. The advancement of information technology and the rise of big data from Internet user activities have opened new avenues for research. This includes using alternative data sources to analyze various economic indicators with minimal time delay, a process known as nowcasting (Jun et al. 2018). For instance, researchers have begun estimating migratory flows by analyzing digital traces left by migrants on the internet (Tjaden 2021). Key data sources include social network activity, GPS tracking of SIM card locations, IP address geolocation, and search query statistics, such as the Google Trends Index (GTI), which provides real-time information on search behavior.
This paper aims to forecast international migration using Google Trends Index (GTI) data and to test the applicability of distributed lag models, SARIMA, and SARIMAX for several countries and different forecasting periods. The predictive power of these models is assessed across periods marked by external shocks as well as periods without such disruptions.
The migration modeling in this research builds on approaches previously applied by other authors (Fantazzini et al. 2021; Tsapenko & Yurevich 2022; Bronitsky & Vakulenko 2022) who utilized time series models (SARIMAX) with GTI as exogenous variables, as well as on studies that employed distributed lag models (Bronitsky & Vakulenko 2024). Specifically, this paper draws on Wanner's (2021) work, which compares migration forecasting models based on GTI data for estimating migration flows from various countries to Switzerland.
This paper enhances existing methods, including those outlined by Wanner (2021), by incorporating several GTI lags into the model. Additionally, it presents an algorithm for collecting and processing GTI data for migration estimation. The study tests single-query models with lags, using the example of the search query «work in Germany» in the migrant's native language, as well as multi-query models with lags (Bronitsky & Vakulenko 2024), which incorporate multiple migration-related search queries as explanatory variables. The paper concludes with recommendations for model selection in future research and summarizes the advantages and disadvantages of each model.
To forecast migratory flows using monthly time series data, this study compares several models: the seasonal time series forecasting model SARIMA; the SARIMAX model, which includes exogenous variables (specifically GTI data for the search query "work in Germany" and its lags from 1 to 12 months as explanatory variables); and a distributed lag model, a multiple regression model for migration where selected GTI values and their corresponding time lags serve as explanatory factors.
The accuracy metrics for these models are calculated using migration data from Poland, Italy, Romania, Spain, Bulgaria, and Russia to Germany. These countries were chosen because they showed the highest values for the indicator "arrivals of foreigners," according to the monthly data from the German Statistical Office for the period 01/01/2011 to 01/06/2023. The validation of the models was conducted over two distinct periods. The first period, 01/06/2020 to 01/06/2021, is associated with the shock caused by the COVID-19 pandemic, during which migration movements largely paused. The second period, 01/06/2021 to 01/06/2023, corresponds to the lifting of movement restrictions related to the pandemic. Additionally, starting in 2022, other migration-impacting shocks may have emerged due to the Special Military Operation launched by Russia.
The paper is structured as follows. The second section provides a review of the literature on the use of Google Trends Index (GTI) for forecasting various economic indicators, including migration, and examines models where GTI data enhances predictive accuracy. The third section details the data and algorithm used to collect search queries for use in the models, compares the different models, and draws conclusions about the observed patterns. In the conclusion, the paper summarizes the strengths and weaknesses of the models under study, discusses their limitations, and offers suggestions for further research in this area.

Literature review
With the advancement of information technology and the availability of big data on individuals' internet activity, a new strand of research has emerged that estimates economic indicators by analyzing the digital traces left by economic agents (Cesare et al. 2018). This type of data allows not only for rapid estimations of current indicators but also for forecasting the dynamics of target metrics several periods into the future (Wanner 2021). In the context of migration, these methods are particularly effective, helping to reduce the time lag between migration events and the release of corresponding statistical data (Tjaden 2021). Researchers are increasingly using new types of exogenous data to accelerate migration estimates (nowcasting). These data sources include:
• GPS coordinates of mobile devices (Bengtsson et al. 2011);
• Social networks data (Kim et al. 2020);
• IP addresses of devices (Zagheni and Weber 2012);
• Search query statistics (Jun et al. 2018);
• Other data sources (international flights data, news reports, etc.) (Gabriel et al. 2019).
However, some of these methods have been criticized (Chudinovskikh 2018) due to the difficulty of distinguishing between temporary and long-term migration. Other sources of migration data include surveys and databases, often compiled by private research agencies. However, these sources are not transparent in terms of methodology and are typically unsuitable for short-term projections.
Google Trends Index (GTI) has been widely used in academic research, offering new opportunities for studies in fields such as medicine, financial market analysis, and the forecasting of various economic indicators. One of the earliest studies in this area was conducted by Ginsberg et al. (2009), which forecasted the spread of the flu using search queries. In the decade following that study, over 650 more research papers utilizing GTI data were published (Jun et al. 2018). GTI data was first used in migration-related research by Choi and Varian (2012) to analyze tourist migration flows from various countries to Hong Kong. Their study demonstrated that incorporating search queries with lags improved the model's performance, likely because potential migrants often search for information about destination countries well in advance of their travel (Böhme et al. 2020; Fantazzini et al. 2021).
In total, only 10 papers have used Google Trends Index (GTI) data to study migration. Across these studies, the analysis typically follows a similar three-step algorithm for modeling migration based on search queries: first, building search query sets to collect the data; second, selecting the search queries to be included in migration models; and third, identifying the econometric models for migration estimations.
When relying on GTI for data collection, careful selection of search queries is crucial, as the chosen themes can significantly impact the predictive power of the model. Researchers studying migration often use queries related to visas, work, and housing in the native language of the migrant, combined with the name of the destination country. Some authors (Böhme et al. 2020; Golenvaux et al. 2020; Wanner 2021; Fantazzini et al. 2021; Jurić 2022; Tsapenko & Yurevich 2022) have developed models based on specific search queries, such as «moving to [destination country],» «work in [destination country],» and «wage in [destination country],» with the queries selected by experts.
Other researchers (Avramescu & Wiśniowski 2021) have partially automated data collection by using the WordNet corpus (Fellbaum 2005) to identify synonyms of key search terms, moving towards a more algorithmic approach to formulating multiple queries. In some cases, researchers employ machine learning methods, particularly natural language processing (NLP), to compile groups of queries and compare their efficiency (Bronitsky & Vakulenko 2022).
The second step in migration analysis using Google Trends Index (GTI) involves selecting specific search queries from the total set to be included in the model as factors. There are several approaches to this selection process. One method is expert-based, where the authors either select the search queries themselves or do not specify how the queries were chosen, sometimes using the entire set of search terms collected in the previous step (Fantazzini et al. 2021; Tsapenko & Yurevich 2022).
More commonly, search queries are selected based on their correlation with the dependent variable. For instance, some authors (Wladyka 2017; Avramescu & Wiśniowski 2021) choose the queries with the highest correlation to the term «migration.» Additionally, some researchers advocate for an initial search clustering approach (Avramescu & Wiśniowski 2021; Bronitsky & Vakulenko 2024), where all search terms are grouped into clusters, and each cluster is then used as a variable in the model.
It is worth examining migration estimation models using Google Trends Index (GTI) in more detail. Typically, researchers rely on ARIMA models without GTI data as baseline models. Another example of a commonly used model is the gravity migration estimation model (Böhme et al. 2020). One of the most frequently used models incorporating GTI data is SARIMAX, where GTI search queries are used as the exogenous variable X (Fantazzini et al. 2021; Avramescu 2021; Tsapenko & Yurevich 2022; Bronitsky 2024). Other approaches include pair regressions and linear regressions (Wladyka 2017; Wanner 2021; Jurić 2022; Bronitsky & Vakulenko 2024). For example, Wladyka (2017) estimates pair regression models in differences because the initial values are based on yearly time series. In such pair regressions used for migration estimations, the search queries themselves or their time lags serve as explanatory variables.
Other authors (Böhme et al. 2020; Golenvaux et al. 2020; Fantazzini et al. 2021) also include search query lags in their models when estimating migration. Typically, researchers use GTI data with a single period lag, either monthly or yearly, depending on the data frequency (Böhme et al. 2020; Golenvaux et al. 2020; Fantazzini et al. 2021). Some have further developed this approach by employing a multi-query model with lags, as seen in research estimating migration from Russia to Germany (Bronitsky & Vakulenko 2024).
Most often, the selected lags do not exceed one year. This is because, before migrating, individuals typically spend about a year preparing by obtaining visas, finding accommodation, and addressing other logistical needs (Benson-Rea & Rawlinson 2003).
Thus, the literature on migration forecasting using Google Trends Index (GTI) data has not yet explored the comparative advantages of forecasting models applied to multiple countries, nor has it assessed the performance of models using single search queries versus those using multiple search queries with lags. This paper aims to review the approaches used by other authors and to demonstrate that models incorporating all lags from 1 to 12 months simultaneously are more efficient than models where each lag is added individually.

Methodology and data
To achieve the research objectives, the following steps will be undertaken. First, the methods for international migration forecasting will be compared and improved to ensure timely estimates ahead of the official release of statistics. Second, the forecasting models will be validated both during periods of external shocks and times without disturbances. The research focuses on forecasting migration for several countries.
For comparability of results, the analysis will use data from similar sources for each pair of countries. Therefore, the study will utilize a single source of migration data, the German Federal Statistical Office, and apply a consistent approach to collecting Google Trends Index (GTI) data on search queries, which will be detailed in a subsequent section.
The monthly migration time series used for the analysis (Table A1) were obtained from the German Federal Statistical Office [Database of the Federal…, indicator 12711-0008]. Note that only yearly migration data is available online, published with a 3.5-month delay (see footnote 1). The paper analyzes migration to Germany from six countries (Poland, Italy, Romania, Spain, Bulgaria, and Russia) that reported the highest numbers of 'arrivals of foreigners' (Figure 1) for the period from January 1, 2006, to June 1, 2023, according to German statistics. This approach is known as 'mirror statistics' (UN 1998), where migration data reported by the destination country is used as a more reliable source for understanding migration flows from specific origin countries.
International recommendations suggest that migration data exchange should rely on data collected by destination countries to provide more accurate estimates of outward migration flows for countries of origin (UNECE 2014). The migration data analyzed are based on the number of first-time residency registrations of individuals moving to Germany for permanent settlement. Further internal movements within Germany are not registered and thus are excluded from international migration statistics. Although some individuals may not register their departure, which could affect migration stock data accuracy, this does not impact the current analysis, which focuses solely on arrival data.
1 Description of the data collection methodology is available at https://www-genesis.destatis.de/genesis/online, code 12711.
Google Trends Index (GTI) data on search queries are used as explanatory variables to produce migration estimates and to forecast migration trends. GTI data are open to the public and available in real time. GTI reflects the dynamics of search popularity $S_{d,r}$ among users for specific keywords over time ($d$) within a particular region ($r$). However, the index $S_{d,r}$ does not show the absolute number of queries for the chosen search term, $V_{d,r}$; rather, it shows the number of queries relative to all search queries in that region on that day, $T_{d,r}$. Thus, the index indicates the share of queries related to a specific search word relative to the total number of queries in the chosen geographic region at a given moment in time. The search popularity index values are collected for a migration origin country, and the language used in the search is the official language of that country. When using GTI data for the analysis, it is important to consider the following aspects:
1. The index reflects a value normalized (to 100%) within a given period and theme, rather than the actual number of searches for a selected query. This makes it problematic to compare several search queries, because they are normalized by different maximum values. One way to address this issue is to standardize the Google Trends data (Fantazzini et al. 2021), which makes the indices for different search queries comparable:
$$\tilde{X} = \frac{X - \bar{X}}{\sigma} \qquad (1)$$
where $\bar{X}$ and $\sigma$ are the mean value and the standard deviation of the random variable $X$, respectively.
2. Google introduced changes to its data collection algorithm on January 1, 2011; January 1, 2016; and January 1, 2022. These modifications have complicated the use of time series data for the entire available period, from January 1, 2004, to the present. The most significant change occurred on January 1, 2011, and affected the algorithm for determining query regions. This change made it impossible to combine data from periods before and after the modification. Therefore, this research recommends using Google Trends Index (GTI) data collected after January 1, 2011.
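Standardization (1) can be illustrated with a few lines of Python. This is a minimal sketch, assuming the population standard deviation is used (the text does not state whether the sample or population form applies); the helper name `standardize` and the GTI values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(series):
    """Rescale a series to zero mean and unit standard deviation, as in (1)."""
    mu = mean(series)
    sigma = pstdev(series)  # population form; an assumption of this sketch
    if sigma == 0:
        raise ValueError("a constant series cannot be standardized")
    return [(x - mu) / sigma for x in series]


# Hypothetical GTI values for two queries, each normalized by its own maximum.
gti_work = [40, 55, 100, 70, 35]
gti_study = [10, 20, 15, 30, 25]

z_work = standardize(gti_work)
z_study = standardize(gti_study)
```

After this step both series are expressed in standard deviations from their own mean, so their values can be compared or combined in one model.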
Alternatively, other data sources on search queries, such as Yandex Wordstat, could be used. Unlike Google Trends Index (GTI), Yandex Wordstat provides search statistics in absolute numbers, which simplifies forecasting. However, a major drawback of Yandex Wordstat is the limited availability of monthly data, with time series currently available only from January 1, 2018. Additionally, Yandex Wordstat has a geographical bias, as Yandex is predominantly used within the Commonwealth of Independent States (CIS) and is less common outside this region. This limitation makes it challenging to apply the algorithms to international migration analysis. Nevertheless, as Yandex Wordstat builds a larger database of search queries, it could become a viable alternative to GTI for estimating migration from the CIS.
The next step is to determine the set of search queries to be used as explanatory factors in migration modeling. For estimating migration from Russia to Germany, a multi-query model was employed by Bronitsky and Vakulenko (2024). This model included search terms related to themes such as 'work in Germany,' 'study in Germany,' 'embassy of Germany,' and their corresponding lags. The authors utilized linguistic tools and machine learning methods to select these search queries, taking into account the specific context of search behavior in Russia, including the requirement for Russian nationals to visit the German embassy before migration.
Tjaden (2021) also emphasized the importance of linguistic analysis, noting that search terms used by residents of Syria differ from those used by Canadians when seeking migration information. In this paper, only one search term, 'work in Germany' in the language of the migration origin country, is used to simplify the analysis. While including the search term 'embassy of Germany' might also be relevant, most of the countries studied (with the exception of Russia) are European Union members, whose residents do not need to visit the German embassy for migration purposes.
Below is the algorithm for collecting and processing the data, using Poland as an example:
1. Define the official language of the migration origin country (Polish).
2. Translate the search query 'work in Germany' into Polish to obtain the query in the migrants' native language: «Praca w Niemczech».
3. Using Google Trends, formulate a request to extract the search trend data for the term «Praca w Niemczech» from 01/01/2011 to 01/11/2023. The request URL can also be formatted to include the necessary parameters.
4. Additionally, extract similar popular queries within the same theme. To do that, select "leaders" in the "similar queries" tab ("Top" under "Related queries" in the English-language interface); other queries used by migrants before or after the search 'Praca w Niemczech' will be displayed. From those search queries, keep only those related to the selected theme 'work in Germany'. This approach addresses several challenges at once: first, when a country has more than one official language, it helps identify all transliterations of the search query in use; second, the linguistic component is taken into account, so the set of queries also includes slang terms used by potential migrants.
5. Before using GTI and before comparing GTI values across queries, standardize the time series using (1), so that the data are rescaled to a mean of zero and a standard deviation of one.
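The remark in step 3 about formatting the request URL can be sketched as follows. This is only an illustration: the parameter names `date`, `geo`, and `q` mirror what the Google Trends explore page displays in the browser address bar and are an assumption rather than a documented API, and `trends_explore_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode


def trends_explore_url(query, geo, start, end):
    """Build a Google Trends explore URL for one query, region, and date range."""
    params = {
        "date": f"{start} {end}",  # assumed format: "YYYY-MM-DD YYYY-MM-DD"
        "geo": geo,                # two-letter country code of the origin country
        "q": query,                # search query in the origin country's language
    }
    # quote_via=quote encodes spaces as %20 instead of "+".
    return "https://trends.google.com/trends/explore?" + urlencode(params, quote_via=quote)


url = trends_explore_url("Praca w Niemczech", "PL", "2011-01-01", "2023-11-01")
print(url)
```

Repeating the call with the translated query and country code of each origin country gives one request per country, which keeps the collection procedure consistent across the sample.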
Using the above algorithm, data for the search query 'work in Germany' was collected for the period January 1, 2011, to November 1, 2023, for the selected countries (Table A2). Depending on the country, a total of 1 to 6 search queries were collected. However, it is challenging to include all queries simultaneously in the models: each query enters the model with 12 lags, so the dimensionality grows quickly.
Given the limited number of observations (155 observations from January 1, 2011) and the need to divide the data into test and control groups, the number of explanatory variables may exceed the number of observations. Analysis of the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the migration data, along with the standardized GTI time series, revealed yearly seasonality. Due to detected nonstationarity (based on the Dickey-Fuller test), estimating the distributed lag model directly would be challenging. Therefore, both the dependent and explanatory variables are converted into seasonal differences with a 12-month period.
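The seasonal differencing step can be written compactly. The sketch below assumes a plain list of monthly values; the function name `seasonal_difference` and the sample numbers are illustrative only.

```python
def seasonal_difference(series, period=12):
    """Return series[t] - series[t - period] for every t with a full year of history."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Hypothetical monthly series: a yearly pattern plus a level shift of 10 per year.
pattern = [5, 3, 8, 12, 20, 25, 30, 28, 18, 10, 6, 4]
monthly = pattern + [v + 10 for v in pattern] + [v + 20 for v in pattern]

diffed = seasonal_difference(monthly)
print(diffed[:3])
```

Differencing removes the first 12 observations, so a series of 155 months leaves 143 usable points before any lags are added.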
Table A2 displays the selected search queries (in italics) for this purpose. Some search queries show correlation values below 20%, and for Spain and Russia, the correlation is even negative. This may be attributed to the time lag associated with migration. To test this hypothesis, additional calculations were performed to measure the correlation between the migration time series (in seasonal differences) and the search query lags from 1 to 12 months. The results (Table A3) indicate that it is crucial to consider not only the correlation with current values but also the correlation with lagged values when collecting search queries. For example, for Spain, the correlation is 25-26% at the 6th, 9th, and 10th lags.
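The lagged correlation check described above can be sketched as follows. The example is a minimal illustration with synthetic data in which migration follows the search index with a 3-month delay; `pearson` and `lagged_correlations` are hypothetical helper names, and both inputs are assumed to be already seasonally differenced and of equal length.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlations(migration, gti, max_lag=12):
    """Correlate migration[t] with gti[t - lag] for lag = 0..max_lag."""
    if len(migration) != len(gti):
        raise ValueError("series must have the same length")
    n = len(gti)
    return {lag: pearson(gti[:n - lag], migration[lag:]) for lag in range(max_lag + 1)}


rng = random.Random(0)
gti = [rng.random() for _ in range(60)]
migration = [0.0, 0.0, 0.0] + gti[:-3]  # synthetic: migration echoes GTI 3 months later

corr = lagged_correlations(migration, gti)
best_lag = max(corr, key=corr.get)
```

A query that looks uninformative at lag 0 can still be strongly related to migration at a later lag, which is the pattern the text reports for Spain.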
The accuracy of the models is assessed by dividing the initial dataset into 'test' and 'control' samples. This methodological step is crucial for out-of-sample validation. The mean absolute percentage error (MAPE) and mean absolute error (MAE) are used for evaluation. In all evaluations, the test and control samples are mutually exclusive: the 'test' sample covers the period starting January 1, 2011, up to the beginning of the 'control' sample period.
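For reference, the two accuracy metrics can be computed as in the following sketch. The standard definitions of MAE and MAPE are used; the arrival counts are invented for illustration.

```python
def mae(actual, predicted):
    """Mean absolute error, in the units of the series (number of arrivals)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; undefined if an actual value is 0."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical monthly arrivals in the 'control' sample and a model forecast.
actual = [100, 200, 400]
forecast = [110, 180, 400]

print(mae(actual, forecast), mape(actual, forecast))
```

MAE depends on the size of the flow, so it is not comparable between large and small origin countries, whereas MAPE is scale-free and suits cross-country comparison.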
Model validation and migration flow forecasting are conducted using time series in seasonal differences. For comparing the results and assessing model interpretability, forecast errors are evaluated on the original variables rather than on seasonal differences in migration. Pairs of test and control time series are used, corresponding to forecasting horizons of 2 and 3 years, respectively. The initial pair of test and control time series is used to evaluate model performance over a 2-year forecast period: from June 1, 2021, to June 1, 2023, with the test sample covering the period from January 1, 2011, to June 1, 2021.
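Evaluating errors on the original scale requires converting forecasts of seasonal differences back into levels. One way to do this is sketched below, assuming a 12-month period and at least one full year of observed history; `invert_seasonal_difference` is a hypothetical helper name and the numbers are invented.

```python
def invert_seasonal_difference(history, diff_forecast, period=12):
    """Turn forecasts of seasonal differences into levels: y[t] = d[t] + y[t - period]."""
    if len(history) < period:
        raise ValueError("need at least one full period of observed levels")
    levels = list(history)
    for d in diff_forecast:
        # levels[-period] is the value observed (or reconstructed) one period earlier.
        levels.append(d + levels[-period])
    return levels[len(history):]


# Hypothetical last observed year of arrivals and a 14-month forecast of differences.
last_year = [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111]
diff_forecast = [5] * 14

restored = invert_seasonal_difference(last_year, diff_forecast)
```

For horizons longer than one year the reconstruction reuses earlier forecasts, so forecast errors can accumulate over a 2-year or 3-year control period.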
It is important to note that, during the first period, most COVID-19 pandemic restrictions were lifted, so model performance is not significantly impacted by external shocks. The second pair of time series is used to investigate a 3-year forecast period from June 1, 2020, to June 1, 2023, with the corresponding test sample covering January 1, 2011, to June 1, 2020. At the beginning of this period, COVID-19 pandemic restrictions were still in place, leading to a decline in migration activity across all countries due to travel challenges. Analyzing this period allows for assessing how well the models perform amidst the shocks caused by the pandemic. The findings can be generalized to other contexts involving external factors, such as armed conflicts and natural disasters.
The third pair consists of a 2-year sub-sample within the 3-year period, from June 1, 2020, to June 1, 2022. This sub-sample is used to test the hypothesis that the accuracy of the forecast is more influenced by external shocks than by the length of the forecasting period.

Migration forecasting models
This section describes the models used to forecast migration flows from various countries to Germany. It provides an algorithm for estimating these models and identifying the best parameters using information criteria (such as AIC). The author compares the predictive power of two models for 2-year and 3-year forecasting periods and determines which model is more suitable considering the impact of external disturbances. Additionally, using migration flow data from Russia to Germany, the author compares single-query models with multi-query models and evaluates the necessity of including multiple lags in the models.
One model under study is a distributed lag model, where the 'arrivals of foreigners' from a particular country to Germany serve as the dependent variable. Lagged values (from 1 to 12 months) of the time series for the search query 'work in Germany' are used as explanatory variables. This approach accounts for the time lag between searching for information on the internet and the actual migration event. The lag values may vary across countries due to factors such as migration policy or logistical difficulties. The distributed lag model for variables in seasonal differences, with lags from 1 to 12 months, can be represented as follows:
$$Y_t = \alpha + \sum_{l=1}^{12} \beta_l X_{t-l} + \varepsilon_t \qquad (2)$$
where
• $Y_t$ is the dependent variable, the indicator 'arrivals of foreigners' in Germany;
• $X_{t-l}$ are the explanatory variables (search query GTI with lags $l = 1 \dots 12$);
• $\alpha$ and $\beta_l$ are the regression coefficients;
• $\varepsilon_t$ are the regression errors, $\varepsilon_t \sim iid(0, \sigma^2)$;
• $t = 1 \dots T$ is the time period (month).
To determine the required number of time lags in the distributed lag model, an algorithm was developed that tests all possible models including time lags from 1 to 12 months (a total of 8,192 models were estimated for each country). The best model was determined using the Akaike Information Criterion (AIC). When AIC is applied to compare models, all models must be estimated on the same number of observations. For SARIMAX models, the AIC was also employed to determine the model parameters p, d, q, where p is the order of autoregression, d is the order of integration, and q is the moving average order; the seasonal period s = 12 is chosen based on the autocorrelation function (ACF). These parameters are selected by fitting candidate models on the 'test' group data. In the SARIMAX model, the exogenous variables include the search query itself and its time lags, as determined for the distributed lag model. The lag set was not re-optimized for SARIMAX because of the larger number of model parameters to be tested and limited computing capabilities.
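The exhaustive search over lag subsets can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the estimation code used in the paper: the AIC is computed for a Gaussian OLS fit up to an additive constant, the helper names `ols_aic` and `best_lag_subset` are hypothetical, and the data are synthetic with a true lag of 2 months. All candidate models are fitted on the same observations, as the AIC comparison requires. The demo uses four candidate regressors (16 subsets); note that 8,192 equals 2^13, which would correspond to 13 candidate regressors, for example the current value plus 12 lags.

```python
from itertools import combinations

import numpy as np


def ols_aic(y, columns):
    """AIC of an OLS fit with intercept, up to an additive constant: n*ln(RSS/n) + 2k."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + list(columns))
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]


def best_lag_subset(y, x, candidate_lags):
    """Fit every subset of candidate lags on a common sample; return (best_aic, best_lags)."""
    max_lag = max(candidate_lags)
    target = y[max_lag:]  # same observations for every candidate model
    lagged = {l: x[max_lag - l:len(x) - l] for l in candidate_lags}  # x[t - l]
    best_aic, best_lags = np.inf, ()
    for size in range(len(candidate_lags) + 1):
        for subset in combinations(candidate_lags, size):
            aic = ols_aic(target, [lagged[l] for l in subset])
            if aic < best_aic:
                best_aic, best_lags = aic, subset
    return best_aic, best_lags


# Synthetic example: y depends on x two months earlier, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.zeros(200)
y[2:] = 2.0 * x[:-2]
y += rng.normal(scale=0.1, size=200)

best_aic, best_lags = best_lag_subset(y, x, candidate_lags=[0, 1, 2, 3])
print(best_lags)
```

The number of models doubles with each additional candidate lag, which is why repeating the search inside a SARIMAX grid over p, d, and q quickly becomes expensive.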
The paper compares various models for forecasting migration flows, including comparisons with SARIMA models that do not use GTI search query data. The following models are considered:
1. SARIMA: Seasonal Autoregressive Integrated Moving Average. This model includes a seasonal component based on the ACF analysis, which indicates seasonality at 12-month intervals. SARIMA is used to forecast migration flows without incorporating exogenous data.
2. SARIMAX: A variant of SARIMA that incorporates exogenous variables. In this model, GTI data and its lags from 1 to 12 months are included as explanatory variables to enhance the forecasting of migration flows.
3. Distributed Lag Model: A multiple regression model where the number of migrants is the dependent variable. GTI values and their corresponding lags from 1 to 12 months serve as explanatory variables in this model.
Besides the models based on the single query "work in Germany," a multi-query model incorporating several search queries linked to different themes was used for comparison (see Table A4). Due to the limited number of observations in the test sample, the Principal Component Analysis (PCA) dimensionality reduction technique was applied to manage the different search themes ("studying," "work," "embassy") (Bronitsky & Vakulenko 2024).
The multi-query model includes three PCA vectors corresponding to these three themes and their time lags from 1 to 12 months, resulting in a total of 35 variables. This approach is more time-consuming as it involves working with many search queries, which could also be determined using Natural Language Processing (NLP) methods and machine learning techniques. Additionally, it requires an analysis of the linguistic peculiarities of searching for information in the Russian language on the internet.
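Reducing the queries of one theme to a single PCA vector can be sketched with NumPy as shown below. This is a minimal illustration, assuming that each theme is summarized by its first principal component computed from standardized query series; the function name `first_principal_component` and the data are hypothetical, and the procedure in Bronitsky & Vakulenko (2024) may differ in detail.

```python
import numpy as np


def first_principal_component(theme_queries):
    """Project the query series of one theme (months x queries) on their first component."""
    data = np.asarray(theme_queries, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[0]
    explained = singular_values[0] ** 2 / np.sum(singular_values ** 2)
    return scores, explained


# Hypothetical theme with two perfectly collinear query series over 24 months.
months = np.arange(24, dtype=float)
work_queries = np.column_stack([months, 2.0 * months + 1.0])

scores, explained = first_principal_component(work_queries)
```

The resulting `scores` series replaces the individual queries of a theme in the regression, and its lags are then built in the same way as for a single query.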
In this paper, a simplified approach to data collection for multiple countries is applied. However, using migration flow data from Russia to Germany, the paper compares the predictive power of the one-query model with the multi-query model. The distributed lag model (2) for the multi-query approach, which includes multiple search queries (after PCA reduction) and their corresponding lags, can be written as follows:
$$Y_t = \alpha + \sum_{i=1}^{k} \beta_i X_{i,t} + \varepsilon_t$$
where
• $Y_t$ is the dependent variable, the indicator 'arrivals of foreigners' in Germany;
• $X_{1,t} \dots X_{k,t}$ are the explanatory variables (PCA reductions of GTI search queries with lags).
Using the above method, the author estimated three types of models (a distributed lag model, SARIMAX, and SARIMA) for six countries. Averaged predictions were also considered as a baseline. However, for most of the countries under study these estimates were less convincing than those of the SARIMA models and thus are not included in the comparison. For both the 2-year and 3-year forecasting periods, model parameter estimations and prediction accuracy metrics (MAPE, MAE) were calculated separately (see Table 2).
Figures 2-7 present the forecasting results of the distributed lag models and SARIMAX for the 2-year control period (see Tables 1 and A5). To test the hypothesis that the predictive accuracy of the model could be influenced by the length of the forecasting interval, validation metrics were also calculated for models with 2-year forecasting periods, both with and without external shocks. The evaluation of models with a 3-year forecasting period is not included in the paper because the results did not significantly differ from those of the 2-year forecasting period models.
The analysis of the results demonstrates that incorporating Google Trends Index (GTI) data into migration estimations enhances the predictive power of the models (Table 2). For instance, the distributed lag model yielded lower forecasting errors for four countries, and the SARIMAX model showed improved performance for all six countries when using out-of-sample data compared to the SARIMA models for the 2-year forecasting period. These results indicate that utilizing GTI as an exogenous variable significantly improves migration forecasts. When comparing SARIMA models for 2-year and 3-year forecasting periods, it was observed that the predictive accuracy deteriorates across all countries when the forecasting interval overlaps with the COVID-19 pandemic (Table 2). This decline in performance could be attributed to trends in the time series or the lack of exogenous data to account for the external shocks caused by the pandemic. However, other factors might also contribute to the reduced accuracy over longer forecasting periods.
Regarding the 3-year forecasting period, the distributed lag model outperforms both SARIMAX and SARIMA models for all countries. This result underscores the significance of incorporating external factors when estimating and forecasting migration, particularly during periods of disruption. However, it is important to note that the distributed lag models exhibit reduced accuracy when the forecasting period includes intervals affected by the COVID-19 pandemic.
For the distributed lag model (2), the contribution of each lag can be summarized by the average time lag $L_k$, calculated for each country separately. A large average lag indicates that the migration event occurs with a considerable delay relative to the moment of the information search. Since the average lag is a random variable, its confidence interval is estimated using Monte Carlo simulation.
Wanner (2021) analyzed 12 separate models for each country, each using only one lag (e.g., models with the first lag, second lag, etc.). This paper compares these single-lag models with models incorporating multiple lags simultaneously (distributed lag models). According to the AIC criterion and predictive accuracy metrics, distributed lag models demonstrate higher predictive power for five out of six countries.
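The average lag and its Monte Carlo confidence interval can be sketched as follows. The exact formula used in the paper is not reproduced here; the sketch assumes the textbook mean lag of a distributed lag model, i.e. the lag indices weighted by their estimated coefficients, and it assumes that coefficient uncertainty can be approximated by independent normal draws around the point estimates (covariances between coefficients are ignored). The function names and all numbers are hypothetical.

```python
import random


def average_lag(coefs, lags):
    """Coefficient-weighted mean lag: sum(lag_j * b_j) / sum(b_j)."""
    total = sum(coefs)
    if total == 0:
        raise ValueError("coefficients sum to zero; the average lag is undefined")
    return sum(lag * b for lag, b in zip(lags, coefs)) / total


def average_lag_ci(coefs, std_errs, lags, n_draws=2000, level=0.95, seed=0):
    """Percentile confidence interval for the average lag via Monte Carlo.

    Assumption: each coefficient is drawn independently from
    Normal(estimate, standard error).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        sample = [rng.gauss(b, s) for b, s in zip(coefs, std_errs)]
        if sum(sample) != 0:
            draws.append(average_lag(sample, lags))
    draws.sort()
    lo_idx = int((1 - level) / 2 * len(draws))
    hi_idx = max(lo_idx, int((1 + level) / 2 * len(draws)) - 1)
    return draws[lo_idx], draws[hi_idx]


# Hypothetical coefficients for lags of 1, 2 and 3 months (not from the paper).
lags = [1, 2, 3]
coefs = [1.0, 2.0, 3.0]
std_errs = [0.01, 0.01, 0.01]

point = average_lag(coefs, lags)
low, high = average_lag_ci(coefs, std_errs, lags)
print(point, low, high)
```

Because the statistic is a ratio, the simulated interval becomes very wide or unstable when the coefficients have mixed signs and their sum is close to zero; in that case the average lag should be interpreted with caution.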
The calculations revealed that the average lags exceeded 6 months for five countries, thereby supporting the hypothesis that including lagged GTI values in migration analysis is beneficial. In contrast, Bulgaria exhibits a significantly shorter average lag (2.65 months), which may be linked to the country's migration policy and the stronger migration intentions of its working-age population, leading to quicker migration decisions.
In the estimations of migration from Russia to Germany, the model coefficients are negative (Table 1). This observation may be due to two factors: first, migration flows from Russia to Germany have been decreasing (except during the shock period associated with the Special Military Operation), while search queries have decreased at a slower rate; second, the search queries (and their lags) show poor correlation with migration data (Table A3). This suggests that the search term might not be very effective and that incorporating related themes could improve the model.
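The correlation check between a search query, its lags, and the migration series can be illustrated with a short sketch. It is an assumption-laden example rather than the procedure behind Table A3: the function names are hypothetical, the Pearson coefficient is used as the correlation measure, and the series are synthetic (the "migration" series is simply the "GTI" series shifted by two periods).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlations(gti, migration, max_lag):
    """Correlation between gti[t - lag] and migration[t] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = gti[: len(gti) - lag]   # search interest `lag` periods earlier
        y = migration[lag:]         # migration observed `lag` periods later
        result[lag] = pearson(x, y)
    return result


# Synthetic series for illustration only: migration repeats GTI two periods later.
gti = [1, 3, 2, 5, 4, 6, 8, 7]
migration = [0, 0, 1, 3, 2, 5, 4, 6]

corr = lagged_correlations(gti, migration, max_lag=3)
best_lag = max(corr, key=corr.get)
print(corr)
print(best_lag)
```

Uniformly low correlations across all lags, as reported for the Russian series, would point to a weak search term rather than to a wrong choice of lag.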
For instance, a multi-query model, which includes multiple search themes, demonstrates significantly better predictive accuracy compared to a single-query model with just 'work in Germany' and its lags (Figures 7-8). The multi-query model also performs better during periods of shocks, including the pandemic. It is the only model that shows improved predictive accuracy when shocks are factored into the forecasting period. Notably, while the model forecasted an increase in migration flows following the launch of the Special Military Operation, the actual migration flows exceeded these predictions. One hypothesis for this discrepancy is that the average lag decreases during shock periods (i.e., prospective migrants make migration decisions more quickly after searching for information online) compared to quieter periods without shocks.

Conclusion
The paper explores international migration forecasting models that provide estimates without the delays typical of official statistical sources. The study evaluates model performance with and without Google Trends Index (GTI) data, specifically using the search query 'work in Germany'. The analysis focuses on migration flows from six countries to Germany, assessing forecasting accuracy for both 2-year and 3-year periods. The latter period is notable because it includes external shocks, such as the COVID-19 pandemic and its impact on migration.

Information about the author
Figures 2-7. Estimates of migration flows from different countries to Germany, using the distributed lag model for 2-year and 3-year forecasting periods

Table 1. Evaluation of distributed lag models (2) for a 2-year control period

Table 2. Evaluation metrics of the models estimating migration from various countries to Germany, 2-year and 3-year forecasting periods

Table A2. Descriptive statistics of the standardized GTI values used for estimating migration to Germany in 2013-2023
In total, there are 155 observations over time. The indices have been standardized to have a mean of zero and a standard deviation of one; these two statistics are therefore not included in the table. Search queries presented in italics indicate those with the highest correlation with the 'arrivals of foreigners' indicator in Germany from the respective country.