Reconstructing production networks using machine learning

,


Introduction
The literature on input-output economics is old and well-established, but the vulnerability of just-in-time supply chains -recently under the spotlight (Goodman & Chokshi 2021) -has led to a renewed interest in the study of shock propagation in production networks.While early research has been mainly carried out at the industry level (Leontief 1986, Miller & Blair 2009, Acemoglu et al. 2012), it is increasingly evident that more fine-grained data is needed to predict the impact of shocks.Unfortunately, information on firm-to-firm relationships is by nature confidential and, therefore, often hard to access and incomplete.In the US, public companies are required to disclose prominent customers (Atalay et al. 2011).In a few countries, such as Belgium or Hungary, VAT reporting allows national statistical o ces to provide production networks to researchers (Tintelnot et al. 2018, Diem et al. 2022); in others, such as Japan, large commercial datasets are available (Mizuno et al. 2014, Inoue & Todo 2019, Carvalho et al. 2021).In the Operations Research and Supply Chain Management literature, rich datasets have been analyzed (Brintrup et al. 2018, Demirel et al. 2019, Chauhan et al. 2021, Dolgui et al. 2018, Schueller et al. 2022), but they are usually limited to a specific industry or assembled to study the supply network of a specific firm.
In most countries and for most periods, however, the data on firm-to-firm relationships is unavailable, making it crucial to develop methods to reconstruct these networks based on available data.In this work, we develop a method for predicting links in production networks using data on firms' financial statements, industry, and location.For simplicity and due to data limitations, our focus is on reconstructing binary relationships (the existence of links) rather than their weight (the value of transactions).We approach this as a classification problem and tackle it with standard modern machine learning techniques.Let u and v be two nodes of the network G, f u and f v be vectors of u's and v's covariates (e.g., sales, industry, etc.), and f (u,v) be a vector of dyadic features (e.g., the geographical distance between the two companies).We can write the probability P u,v of a link between u and v as ⌘ , where is unknown and network-specific.This formulation encompasses a wide variety of models where is defined explicitly or implicitly.For instance, the literature on the reconstruction of financial networks uses explicit functional forms for , or varying complexity, from simple gravity models to more complicated fitness models (De Masi et al. 2006, Garlaschelli et al. 2005, Garlaschelli & Lo↵redo 2004).In the production network growth literature (Atalay et al. 2011, Carvalho & Voigtländer 2014), is often implicit but could be derived from the knowledge of the stochastic mechanisms generating the network.Here we propose to learn using a typical supervised learning framework.We train a machine learning model on a portion of the network and study its capacity to predict links in the unobserved part.We validate the predictions of our model through its Receiving Operator Characteristic (ROC) curve.Our method shows remarkable results for all the tested datasets.In addition, these methods make it possible to understand which features of the firms are key to predicting trade connections through an analysis of the features' importance.For our datasets, firms' industrial sector, geographical location, and size are the main performance drivers.
Literature.Our approach is related to two streams of research: network reconstruction and link prediction.In general, network reconstruction tries to infer as much as possible about the network from the available data (often nodes' degrees and strengths) while limiting the number of unsupported assumptions.These methods have been widely applied to financial networks and systemic risk estimation (Squartini et al. 2018, Almog et al. 2019, Squartini & Garlaschelli 2011, Squartini et al. 2015), but their application to firm-to-firm production networks is still in its infancy (Hooijmaaijers & Buiten 2019, Mattsson et al. 2021, Ialongo et al. 2022).Similar techniques were also applied to the international trade literature (Squartini & Garlaschelli 2014, Garlaschelli & Lo↵redo 2004, 2005, Garlaschelli et al. 2007, Almog et al. 2019) to reconstruct the binary topology of the trade network between countries.
Link prediction instead only tries to infer whether two network nodes are connected.Some of the most popular techniques in link prediction (L ü & Zhou 2011) are based on computing similarity scores between nodes.These scores are then assumed to be a proxy for the likelihood of a link.There are many methods to compute these scores.The most celebrated ones, like Jaccard (Liben-Nowell & Kleinberg 2007), Katz (1953), LHN (Leicht et al. 2006), Preferential Attachment (Barabási & Albert 1999), Adamic-Adar (2003), and Resource Allocation (Zhou et al. 2009) are derived knowing the neighbors of each node.
There is little work being done on link prediction for production networks specifically.Reisch et al. (2022) used mobile phone data to reconstruct the production network of an undisclosed European country.Ialongo et al. (2022) uses a maximum entropy approach to reconstruct the Dutch firm-level interactions.In Hillman et al. (2021), the authors designed an algorithm that stochastically links customers and suppliers so that the production network matches the sectoral linkages provided in the OECD Input-Output Tables.Brintrup et al. (2018) and Kosasih & Brintrup (2021) pioneered the use of machine learning for link prediction in supply chains.An important di↵erence between these studies and ours is that they use features derived from the network's topology, either manually, as in Brintrup et al. (2018), or automatically through Graph Neural Networks as in Kosasih & Brintrup (2021).In contrast, here, we consider firms' features.We believe this to be eventually advantageous from an operational point of view, as, in countries where no supply chain data is available, firm-specific information (like sales or geographical position) is still widely available.
The outline of this paper is as follows.Section 2 describes the data and the methods.Section 3 provides the results; We conclude in Section 4.

Data
Datasets.We test our methods on three datasets: Compustat, FactSet, and a firm-level administrative dataset from Ecuador1 .Compustat is a financial, statistical, and market information database on active and inactive publicly listed companies.It provides several company-level fundamentals (such as income statements and balance sheets) and information on firms' commercial relationships.Compustat primarily draws its data from Security and Exchange Commission (SEC) filings, and standardized financial statements required from the US SEC.SEC filings require companies to indicate those customers who account for 10% or more of their total revenues, allowing the identification of supplier-customer relations between di↵erent companies.Like Compustat, FactSet is a proprietary database of financial and market data.It also collects information on companies' trade partners from SEC filings but integrates them with press releases, news, and other sources of business insights.The third dataset, which we call "Ecuador" for short, is assembled by Ecuador's Tax authorities from firms' tax declarations.It contains information on companies' legal status, sales, and location.Most importantly, it has detailed information on every firm's trading partners for virtually all the firms in Ecuador's formal economy2 .
We downloaded Compustat from Wharton Research Data Services.Firms' annual fundamentals can be found in the eponymous table in the Compustat directory.Supply Chain data can be found in the "WRDS Supply Chain" table in the "Linking Suite by WRDS" folder.No preprocessing was performed on this data.We accessed the FactSet data through FactSet's proprietary data feed.Firms' fundamentals and supply chain information can be found in the folders with the same names.The supply chain data was aggregated at the ultimate parent company level, using FactSet's ownership structures data, while the monetary variables in the fundamentals were converted to USD (see Online Appendix A for details) 3 .
The Ecuador dataset was provided by Ecuador's government to one of the authors.Additional details on this dataset can be found in Astudillo-Estevez (2021).Bacilieri et al. (2022) reviews existing firm-level production networks datasets and their key properties, including Ecuador and Factset, and contains further references to papers using these datasets.
Compustat and FactSet's data are provided at a yearly frequency, but we only retain a oneyear snapshot, choosing the year with the highest number of links (2013 for Compustat, 2018 for Factset).In each dataset, we remove firms with incomplete information and retain only firms with at least one connection in the supply chain.For Ecuador, we restrict our analysis to the largest 10.000 private companies due to computational constraints.We now motivate and describe three sets of variables that we will use as features: financial variables, geographical variables, and industry a liation.
Financial variables.Larger firms are likely to have more links (Krichene et al. 2019, Bernard, Dhyne, Magerman, Manova & Moxnes 2019, Bacilieri et al. 2022).As a result, firm sales are likely to be an important feature.In FactSet and Compustat, we also retain two other indicators: labor productivity (sales per worker) and R&D intensity (R&D expenses over sales).For Ecuadorean companies, we include expenses among the features. 4eographical variables.An extensive literature going back to Marshall (1890) in economic geography and Tinbergen (1962) in international trade has documented that firms tend to trade with physically closer firms (see also Bernard, Moxnes & Saito (2019)).The three datasets contain the addresses of firms' headquarters.We merged this information with that in the GeoNames database to compute the geographical distance between each pair of firms. 5Moreover, we used a firm's country (for Compustat and FactSet) or province (for Ecuador) as a feature.Specifically, we created a dyadic feature listing all the possible ordered combinations of countries (provinces), and assigned to each possible link the corresponding value given the supplier's and the customer's location.Note that we include only dyadic features (distance and location pair), and we do not include location as an individual firm's feature.
Industrial sector.The type of product that two firms produce should be a strong determinant of their probability to trade.In the extreme case where a product has a fixed "recipe", as in Leontief production functions, a producer will buy only from firms producing the required inputs.All the datasets contain information on companies' industrial sector.We used 3-digit NAICS codes for Compustat, 3-digit SIC codes for FactSet, and 3-digit ISIC codes for Ecuador.As for firms' geographical location, we used the industrial sectors to create a dyadic feature for every possible link.For instance, if firm 1 is in sector A and firm 2 is in sector B, the industrial sector feature for the couple (1, 2) will be AB; and if firm 1 is in sector B and firm 2 is in sector A, the industrial sector feature for the couple (1, 2) will be BA.As for geographical location, we include industry only as a pairwise feature, that is, we do not include industry as a feature of an individual firm.
Table 2: Summary of the features used in our model for each dataset.

Setup
Structure of the dataset.We create a row for each possible (directed) pair of firms 6 .First, we fill the row with suppliers' and customers' individual features (sales, and labor productivity, R&D intensity, total expenses).Second, we include dyadic features (geographical distance, and the two categorical variables containing the industrial sector and the country/province of the two firms).The column existence provides the classification target for prediction, that is, 1 if a link is present in the dataset and 0 otherwise.globe with a population > 500.The dataset contains the geographical coordinates of each settlement, and can be downloaded from http://download.geonames.org/export/dump/.The two datasets can be merged on the cities' name, the state and the country ( "State" is only available for US, Australia, Brazil, and a few other federal countries).Once we have the geographical coordinate of each firm, the distance is computed as the geodesic distance between the two sets of coordinates.
6 Self-loops are excluded by default, despite being sometimes observed in the data.Dealing with sparsity using subsampling.Only a tiny fraction of all possible links exist, so the existence column contains vastly more zeroes than ones.If untreated, this imbalance drives the model always to predict a non-existing link.We tackle this issue by randomly undersampling the datasets (He & Garcia 2009, More 2018); that is, we retain all the positive entries but we keep only a small randomly selected fraction of zero entries.We call the undersampling ratio the ratio between the number of elements in the two classes in the subsampled dataset.We choose an undersampling ratio of 200 for Compustat and Factset and 50 for Ecuador (compare to the ratios in the non-undersampled datasets, reported in Table 1) -these provide a good balance between model performance and computational requirements.For each network, we generate five di↵erent subsampled datasets.We then split each of these 5 datasets into a training and a testing set in a 70 : 30 ratio7 .
Randomly undersampling the data is not the only possible solution to learning on imbalanced datasets, nor is it an inconsequential choice.By deleting a portion of the data, undersampling might lead to an information loss and hinder a model's performance.Several "informed" undersampling algorithms have been proposed to delete links with minimal information loss (e.g., Zhang & Mani (2003)).However, these methods are computationally more demanding, as they usually require computing some definition of distance between the di↵erent data points and, thus, are harder to adopt when dealing with large datasets.Another approach, oversampling, consists in making copies of the datapoints associated with existing links (in a possibly sophisticated way, see e.g., Chawla et al. (2002)), but again this is computationally intensive and might lead to overfitting if implemented naively.
Algorithm.Our main approach is an ensemble method, specifically Gradient Boosting (Friedman 2001).Ensemble methods combine multiple algorithms (weak or base learners) to obtain predictive performance that any constituent algorithms alone could not achieve alone.These are considered to be among the best algorithms for classification and predictions on tabular data (Grinsztajn et al. 2022).They also have the advantage of being widely available in software packages, and are fast enough for us, given the size of our datasets.
The idea at the core of boosting is to train several learners sequentially, each trying to compensate for its predecessors' shortcomings.Assume a given dataset of n examples and m features , y i 2 R), and a function (x i ) = y i that maps inputs into outputs.
Gradient Boosting tries to build an approximation ⇤ K (x i ) as a sum of K functions, where the functions f k = f (x i , ✓ k ) are the ensemble's base learners, parametrized by ✓ k .The approximation ⇤ K minimizes the expected value of a loss function L (y i , ŷi ) and is built in K steps.First, a constant approximation is obtained as (2) The following models are then built sequentially, where ⇢ m and f m minimize Ideally, to solve the minimization problem in equation 4, we would choose f m to be equal to the negative gradient of the loss function, and find the value of ⇢ m with a line search, However, equation 5 can't be always satisfied, and we settle for the learner f m (x i ) = f (x i , ✓ m ) that mostly correlates with g m over the data distribution.This is the solution of the problem A common choice for base learners is using classification and regression trees (Breiman et al. 1984, Sutton 2005).Broadly speaking, trees are made of branches, starting at the same node.Each branch is composed of a set of internal nodes and terminates with a leaf.Internal nodes host decision rules; by starting at the tree's root and following the decision rules, each data point can be allocated to one of the leaves, or a set of scores can be assigned to each leaf, and later combined into a single prediction.The goal is to create a model that predicts a target variable's value by learning the correct decision rules inferred from the data features.For this class of functions, finding the optimal parametrization in equation 7 corresponds to finding the optimal tree structure and leaf weights.This is a very demanding computational task: a simple "greedy" approach requires to enumerate all the possible split points for every feature of the training data.Recently, a series of algorithms and engineering solutions have been proposed to train gradient boosting models more e ciently (see, e.g, Tyree et al. (2011), Chen & Guestrin (2016) and Ke et al. (2017)).Among these, LightGBM (Ke et al. 2017) was developed with the goal of optimizing training time on large datasets.According to Bentéjac et al. (2021), LightGBM significantly outperforms the other gradient boosting implementations in terms of computational speed and memory consumption with minor compromises on predictive performance.In line with LightGBM's default recommendation, we treat categorical features as numeric (see Appendix C for a discussion).We mostly stick to the default parameters; Appendix A reports what we use in details.
ROC curves.A model trained to distinguish between existing and non-existing links is an example of a binary classifier.To test its performance, one can compute True Positives (TP), True Negatives (TN), False Negatives (FN) and False Positives (FP) (see Fig. 1).In practice, our classifier is predicting a probability p that a link exists.It is up to us to decide the threshold ⌧, such that if p > ⌧, the link is predicted as existing; the model's confusion matrix (Fig. 1) ultimately depends on the threshold we adopt.To evaluate the model in a way that does not depend on the threshold, we use the Receiving Operator Characteristic curve (ROC).The ROC curve is created by plotting the True Positive Rate (TPR=TP/(TP+FN)), also called Recall or Sensitivity, against the False Positive Rate (FPR=FP/(FP+TN)) at various values of the threshold ⌧.In our framework, the ROC curve can be thought of as the set of points in the FPR/TPR space obtained by sequentially adding links in the network, from the most to the least probable.
We can summarize the information in a ROC curve in a single metric, the Area Under the Curve (AUC): the higher the AUC, the better the model performance.AUC can take values between 0 and 1, and a "random" classifier, that is, a classifier that makes its prediction by drawing from a Bernoulli distribution, achieves an AUC equal to 0.5.
In strongly unbalanced datasets, it is extremely easy to predict the negatives, so the di culty lies in making a small number of excellent predictions, that is, predicting only a fairly small number of links and doing so accurately (having TP and few FP).AUROC does not measure this ability very well, because even when many of our predicted links are non existing (many FP), the FPR=FP/(FP+TN) remains relatively small due to the huge number of TN.Precision-Recall Curves (PRCs) are interesting alternatives to ROC in this context (see, e.g., Brintrup et al. (2018)).Precision (TP/(TP+FP)) gives the number of correct guesses out of all guesses, and Recall is the TPR defined above (TP/(TP+FN)), which gives the number of correct guesses out of all the positives in the dataset.The area under the precision-recall curve (PR-AUC) can be used to summarize the performance of the model.Nevertheless, here we present our results in terms of AUROC (AUC for short) for two reasons (see Appendix B).First, when a model has a curve that dominates on the TPR-FPR space, it dominates on the P-R space.Since these curves convey relatively similar information, it makes sense to present the more commonly used metric.Second, PR-AUC, in contrast to AUROC, is highly sensitive to the undersampling ratio.Since the undersampling ratio is a relatively arbitrary choice we make, and future researchers would likely make a di↵erent choice, we prefer to establish our benchmark performance using AUROC and include Precision-Recall Curves in Appendix D.

Results
We first show the performance of our approach and compare it with those of a few relevant benchmarks.Next, we show which features substantially impact the model's performance.Finally, we train the model with data from a specific country and show its performance in predicting links in other countries, mimicking a real-world application more closely.

Prediction performance
Fig. 2 shows the results of our machine learning model on the three di↵erent datasets.The model provides very good results, with a value for the AUC always above 0.9, vastly outperforming the 0.5 AUC benchmark value of random classifiers.These results are in line with those obtained by Kosasih & Brintrup (2021), who also get AUC values slightly above 0.9, although the comparison is not straightforward because the two methods di↵er substantially in their inputs, the networks analyzed, and the overall approach.
Fig. 3 shows the corresponding ROC curves.Recall that the ROC curve is built by ranking all pairs of firms by their probability of being connected, and considering that a link exists only for the n pairs with the highest probability.The steep ascent at the beginning of the curves in Fig. 3 tells us that if we increase n a little (i.e., if we move on the curve in the right direction), we will correctly predict more and more links at the cost of misplacing a few.
What would these numbers imply for a real-world, truly out-of-sample test case?In such a case, we would not be able to undersample the set where predictions are made, since, by defini- tion, we wouldn't know whether links exist or not.To get better understand what these numbers would imply in practice, Appendix B provides an analysis of Compustat with no undersampling.We found that if we predicted a number of links equal to the existing number of links in the test set (310), 23% of the predicted links would be true links (and by definition, these predictions would recover 23% of all the positive links).

Benchmarks
To further assess the performance of our model, we provide three relevant benchmarks: a salesdriven maximum entropy model, a gravity model, and an exponential random graph model (ERGM).All the benchmark models were tested on the same test sets used for the gradient boosting model.However, the training procedure and the information used vary from benchmark to benchmark.
Sales-driven Maximum Entropy model.We use a model similar to the model used by Almog et al. (2019), Squartini & Garlaschelli (2014), Garlaschelli & Lo↵redo (2004, 2005), Garlaschelli et al. (2007) to predict the topology of the International Trade Network.In one of its simplest forms, in the context of trade between countries, the model predicts that, if i and j have GDP Y i and Y j respectively, the probability of trade between i and j (i.e., of goods flowing from i to j) is where z is a parameter to be estimated.We use the previous formula and substitute firms' sales for countries' GDP to compute the probability of a link between two companies.Since p ij is an increasing monotonic function of Y i Y j , assuming z > 0, we can simplify the expression above and compute a score s ij as We build the ROC curves by using the score s ij to rank the links from the most to the least likely to exist.
The advantages of the sales-driven maximum entropy model is that it does not need training (it can be used directly on the test data) and it requires very little data.A substantial drawback, however, is that while reciprocity tends to be low in production networks (e.g.around 5% in the Ecuador network and lower in FactSet and Compustat, Bacilieri et al. (2022)), this model predicts perfect reciprocity, p ij = p ji .
The next benchmark we introduce keeps a similar structure but allows for non symmetric predictions and uses more information.
Gravity model.The gravity model owes its name to a loose analogy with Newton's gravitational law.First pioneered by Ravenstein (1889) in the study of migration patterns, it was later used by Tinbergen (1962) to explain international trade flows.The model was immensely successful in this field due to the good fit to observed trade flows, and its parsimonious and tractable representation of economic interactions (Anderson 2010).In a generalized form, the Gravity Model of international trade states that the expected amount of trade where d ij is the geographic distance between the countries and K, ↵, , and are free parameters.
We test whether

D
w ij E can be used as a meaningful score for link prediction.Specifically, if we define a score

E⌘
we can rewrite Eq. 8 as To estimate this model, we take the "existence" variable as the dependent variable, replacing s ij .Since it is binary, we estimate the model using logistic regression, which we perform on the training samples8 .A limitation of this model is that it does not use the information on firms' industrial sectors.While we could, in principle, add a set of dummies, we had limited success doing this, partly because many industry-pairs appear only once or, more rarely, appear in the test set but not in the training set.We refrain from pursuing this further while noting that the transparency of the logit (or linear probability) models may make them useful in practice.
The estimated values for the parameters ↵, , and are shown in Table 3.The logistic regression picks up a few relevant features of the network.In all three datasets, takes positive values -unsurprisingly, as distant firms are less likely to be connected than closer ones.The values of ↵ and are more interesting, as they o↵er some insights about the di↵erences between the datasets.Recall that Y i denotes the sales of the supplier, and Y j the sales of the customer.For Compustat, the value of ↵ is negative, while is positive.These values suggest that, holding customer size constant, pairs with larger suppliers are less likely, and holding supplier size constant, pairs with larger customers are more likely.This somewhat counterintuitive result is a consequence of Compustat's way of collecting supply chain data: it is hard to find large firms that sell more than 10% of their production to a single customer.The ↵ value becomes positive again when this bias is lower (FactSet) or absent (Ecuador).
Exponential Random Graph Model (ERGM).An ERGM is a probability distribution P e over the set of possible networks G, where x (G) is a vector of network G's statistics and the vector ✓ contains the model's parameters.
The statistics can include individual, dyadic or global information of a network, such as the sales of firms, the geographical distance between pairs of firms, and the average density of the network.These parameters are estimated so that the expected network statistics match the observed ones,

⌘
. ERGMs are popular in the study of socio-economic networks, in part because they can shed light on the mechanisms driving the network formation process.For instance, looking at Japanese firms, Krichene et al. (2019) find that link formation is driven by geographical distance, industrial sector, size (although with dissasortative mixing), common main bank, reciprocity, and transitivity with common partners.
Finally, ERGMs make link prediction tasks straightforward.Let G +ij and G ij be two identical networks, except that i is connected to j in G +ij but not in G ij .Thus the odds ratio p ij of an edge from i to j being present rather than absent is ⌘ .
We provide a more thorough discussion on link prediction with ERGMs and explain how we fit the model in Appendix B.
Results.Fig. 4 shows the results.The Gradient Boosting model substantially outperforms the three benchmarks.An interesting result is that, on the Compustat dataset, the maximum entropy model has weak performance and is vastly outperformed by the gravity model.This is again due to the way Compustat collects information on the supply chain.The correlation between sales and indegree (number of suppliers) is 0.76, but only -0.16 between sales and outdegree (number of customers).As a result, good models should be able to assign greater probability to pairs in which a large firm is the customer rather the supplier, something that the gravity and the gradient boosting model are flexible enough to do, but the sales-driven maximum entropy model fails to do because it predicts p ij = p ji .

Importance of di↵erent features
Computing features' importance means -in general -quantifying the relative predictive power of the features.Here we compute each feature's permutation importance (Breiman 2001).A feature's permutation importance is the decline in the model's performance when the values of the feature are randomly shu✏ed.Shu✏ing breaks the relationship between the feature and the target and helps us assess how strongly our predictions depend on that feature.The algorithm works as follows.Let m be a fitted predictive model, D be a dataset with units in row and variables in columns (here D is the test set), and K be a given number of repetitions of the randomization.We first compute the reference performance P of the model m on D. Then, for each repetition k = 1, . . ., K, and for each feature j in D, we first randomly shu✏e the column j of the dataset to generate a corrupted version of the data Dk,j , and then compute the score P k,j of m on the corrupted data Dk,j .Finally, we compute importance I j for feature j as Permutation feature importance can give misleading results in correlated features that need to be permuted together and whose contribution is hard to disentangle.In our data, the features "country pair" and "geographical distance" are highly correlated, so we permuted these jointly (that is, we randomized both columns simultaneously).
In all the datasets, we observe that the industrial sector is the main driver for the performance (see Fig. 5).This is a sensible result.Firms producing similar goods will buy similar inputs, and, consequently, knowing the industrial sectors of a pair of firms helps us a lot in predicting commercial partnerships.
It is hard to make an unambiguous ranking of the other features; however a few facts can be highlighted.The combination of Geographical distance and Country pair (Province pair for the Ecuador dataset) is very relevant for Ecuador and FactSet.These features are less relevant in Compustat.This could be because most Compustat firms are based in the U.S., so knowing a pair of firms' countries is not very informative.
Finally, features related to size, while less important, do appear significant.In Compustat, and to a lesser degree in FactSet, the sales of the customer is an important feature; again, this makes sense since Compustat and to a lower extent FactSet include data arising from the "disclosure of large customers" rule; the sales of the supplier is also important, but less.In Ecuador, the expenses variables appear more important than the sales variables.
R&D intensity and labor productivity appear to have some mild importance in Compustat, but none in FactSet (these variables are not available for Ecuador).This is an interesting negative result, suggesting that overall, most of the predictive ability comes from intuitive and widely available data: industry pairs, distance, and firm sizes.Of course, we expect that future studies should be able to identify and design better features, based on network and economic theory.

Unobserved countries
In many countries, including several large advanced economies, no production network data is available.Can we predict the production network of these countries, using what we learn from countries where the production network is available, coupled with standard data on firms' industries, locations, and sizes?
In principle, yes.We can train a model on a country where network data is available and apply this model using only firm-level data.Here we demonstrate that this is technically feasible (we only need to renormalize the variables to make the model portable from one country to another), and we establish two benchmark results.
The first uses the fact that FactSet contains data on several di↵erent countries.We remove a country from FactSet, train the model on the remaining data, and predict the network of the country that has been removed.If we perform well, we could, in principle, predict the production network of a country where no production network data exists "as if FactSet had collected it".
We then attempt a harder prediction task: Can we train the model on Ecuador, and predict FactSet?And vice-versa?Our results here will be much less promising, and we will explain why.
Normalizing variables.Given our results on features' importance, we consider only the most important features: firm sales, industrial sector, and geographical distance.Working with raw quantities is sometimes not feasible (e.g. because the classification systems for industries are di↵erent), sometimes non-sensical (e.g. if sales are expressed in a di↵erent currency), and sometimes sub-optimal (e.g. because the geography of the countries is very di↵erent; for instance, the distance between any pair of Japanese firms is lower than the distance between Boston and Los Angeles).
To make the features more homogeneous across countries so that learning in one can be used in the other, we rescale each feature such that within a given country, it ranges between 0 and 1.If x i represents the sales of firm i based in country c, and if ! is the set of all the firms based in c, we compute the quantity X i as Similarly, we substitute for the distance d ij between i and j the quantity 9 Finally, to homogenize the industry classification systems, we convert both FactSet's and Ecuador's industrial sector code to NAICS classification 10 .
Di↵erent countries in FactSet.FactSet contains information on companies based all over the world.However, most firms are based either in the US, China, or Japan: each of these countries hosts roughly one third of the firms in the dataset.These countries are thus excellent candidates test cross-country predictability, as taking 2 out 3 in the training set implies roughly the same train-test ratio as in the main task (0.7/0.3).We build a dataset as described in Section 2.2, and then filter it to retain only pairs of firms based in the same country. 9To avoid computing the logarithms of null values, we added a small quantity = 10 2 to the sales and the distance of each firms couple.
10 SIC to NAICS crosswalk was provided by NAICS association https://www.naics.com/sic-naics-crosswalk-search-results/.ISIC (Revision 4) to NAICS concordance table was downloaded from https://unstats.un.org/unsd/classifications/Family/Detail/27.We take SIC, ISICs, and NAICS at the thirddigit aggregation level.When the mapping between codes is not 1-to-1, we choose the more common combination (e.g., a SIC sector S 1 might be mapped 75% of the times to a NAICS sector N 1 and 25% of the times to a NAICS sector N 2 .We consider S 1 !N 1 as the correct mapping).If more than one combination of codes appear with the same frequency (11% of the SIC codes and 10% of the ISIC codes), we select one at random. Figure 6: In the previous section, the links to predict (dahed lines in light blue) were randomly picked from the network.Now, the network is split into disjoint parts.Note that in training set, we remove inter-country (blue-to-orange) links.
More precisely, while previously we considered all links and split them into a testing and training set at random (Fig 6,left), we now take all the within-country links in a specific set of countries as training set, and all the links within a target country as a testing set (Fig. 6, right).Note that all the between-country links are entirely discarded -they are part of neither the training nor the testing set.
FactSet on Ecuador, and vice-versa.Aside from normalizing and harmonizing the variables, we again remove from FactSet all the links between firms based in di↵erent countries.For both the datasets, we kept the undersampling ratios of Sec.3.1.
Results.Fig. 7 shows the results for the cross-country prediction tasks using FactSet.Our approach retains a decent predictive performance with an AUROC greater than 0.8; while the quality of the prediction decreased compared to the previous section (Figs. 2 and 3), our approach is still consistently better than the benchmarks.The simple maximum entropy model is a particularly interesting benchmark for this task, because it requires no training, and is therefore a straightforward method already available in many countries to reconstruct production network data.
To understand why our approach is not as e↵ective as the previous cases, we look at the distribution of the rescaled quantities D (distances) and X (sales) for the three di↵erent countries (Fig 8 and 9).The point here is that we cannot expect an algorithm to predict well on a dataset that is very di↵erent from the training sample, so we explore basic statistical properties of each dataset separately to see if they appear similar (i.e., as if they were drawn at random from the same sample).
We see that Japan's D distribution has a prominent peak for small values, which is not present for the other countries, and another peak around D = 0.9, while the distributions for the US and China peak around D = 0.95.The distribution of rescaled sales X also appear quite di↵erent: while most of the mass of the distributions for China and the US is between X = 0.5 and X = 0.9, that of Japan is between X = 0.4 and X = 0.7.These di↵erences are noticeable and likely to contribute to the decline in performance, but overall, there is a good degree of homogeneity in FactSet, making cross-country prediction possible.
By contrast, the results of the second experiment, where we predict FactSet using Ecuador and the other way around, are not as encouraging.The performance of our model hardly surpasses those of simpler classifiers (see Fig. 10; Maximum Entropy would have similar performance).We again attribute this outcome to the considerable di↵erences between the two datasets.The distributions of rescaled sales X and rescaled distances D, shown in Fig. 11 and 12, support this intuition11 .In particular, the distributions of firm sizes are very di↵erent in FactSet, which is based on large, listed firms, and in Ecuador, which is an administrative dataset.
Aside from firm sizes and distances, the key features helping prediction are the industry pairs.In Fig. 13, we ask, for each dataset and each sector-pair, "if we observe two firms with a specific sector-pair, what is the (empirical) probability that there is a link between them?".In other words, for each sector pair, we check the share of observations in the (undersampled) dataset that correspond to existing links.The percentages di↵er dramatically between Ecuador and FactSet, showing basically no correlation.
We think this is the result of di↵erences in structure of the economies, di↵erences in data collection methods, and issues with matching classification systems.
Overall, the results suggest that our approach can predict links on an unobserved country as long as the data on the production network of the target country is collected using similar methods.We cannot be sure that the good results we have for cross-country predictions using  For ease of comparison, we report in the first five rows the results of Fig. 2 and Fig. 7 FactSet would extend to cross-country predictions using administrative datasets, but we think this should be tested and our work here provides a clear benchmark.

Conclusions
We used machine learning classifiers to infer the presence of commercial relationships between companies.Our approach shows solid predictive performance.Given how parsimonious our model is regarding training features and how consistent the results are across datasets, we believe this is a striking result.
Our approach outperforms a few well known-benchmarks, although the comparison is difficult because the models have di↵erent data requirements.Nevertheless, the strength of our model lies in the possibility of leveraging company-specific features, numerical and categorical.For supply chains, these properties (sales, industry, and location) are often easier to find than network-specific metrics that other methods require.
Our results also suggest that reconstructing the production network of country A, given the production network of another country B, might be a feasible challenge.In this paper, we made one first attempt to establish a benchmark that we expect can be beaten in the future, for example, by including in the predictions some previous knowledge on the target production network.If successful, this e↵ort would dramatically cut the e↵orts required to obtain production networks' data and make fine-grained data much more widely available to researchers.
An obvious extension of our work would be to include and design of new features, company and pair-specific, from both network and economic theory.A further step would be to include network topology.Simple link prediction models based on local similarity indices (Zhou et al. 2009), or more sophisticated models based on topological information have proven to be e↵ective in predicting links for a wide set of networks, including supply chains (Brintrup et al. 2018, Kosasih & Brintrup 2021).In addition to this, as is well known in the forecasting community,  Figure 13: Percentage of existing links in each sector couple in the two datasets.The two quantities are uncorrelated (the correlation coe cient ⇢ is only 0.09), suggesting a significant di↵erence in the economies' structures and the data collection process.forecasts combination often improves performance; this principle also applies in the context of link prediction: optimal predictions are often obtained by stacking together the output of several di↵erent models (Ghasemian et al. 2020).Combining the approach described in this work with other topology-based link prediction methods is an interesting and important future direction for research.
A related avenue for further research would be to find better metrics for evaluating performance.Here we have used the classic AU-ROC, noting its limitations, but in the future, it would be interesting to find performance metrics that focus on the ability to predict existing links, are invariant to the undersampling ratio, evaluate the ability of the model to predict topological features, and evaluate whether the reconstructed network is useful when plugged in economic models.

Appendix A Model details
The experiments were performed on an Amazon AWS EC2 c5 machine.The model we used is the gradient boosting classifier provided in the LightGBM python library, which turned out to be the best-performing across the di↵erent experiments.Table 4 reports the models' parameters for the di↵erent experiments.We performed a grid search around a few of the parameters' default values and the default values of another well-known gradient boosting implementation (XGBoost) on a very coarse grid.The tweaking of these parameters did not appear to make a significant di↵erence in our results, and we did not pursue a more fine-grained optimization.

B Undersampling and evaluation of model performance
As the main text explains, our primary metric for comparing models is the Area Under the Receiving Operating Curve (AUROC).This metric has a well-known drawback in the case of strongly unbalanced datasets such as ours: The ROC curve uses the FPR=FP/(FP+TN), so a large change in the number of FP leads to only a minor change in the FPR due to the vast number of TNs.
In other words, ROCs fail to put emphasis on the performance obtained when predicting only a small number of existing links.This issue is well-known, and the main alternative suggested in the literature is the Precision-Recall Curve (PRC) (see Fig 1B for definitions).While PRCs are very intuitive and useful for link prediction tasks, there are three reasons why we prefer to use AUROCs in the main body of the paper.First, to a large extent, ROCs and PRCs convey the same information; in fact, it is not di cult to show that if a model has a ROC that strictly dominates that of another model, then its PRCs also strictly dominates, although the ranking between models can change when their ROCs cross (Davis & Goadrich 2006).Second and more importantly, in contrast to ROCs, PRCs depend substantially on the undersampling ratio: if we construct datasets with many more positives, our guesses of positives are more likely to be true.In this paper, we need to undersample the data to create training and testing samples of manageable sizes, so the dependence of the performance metric on the undersampling ratio is potentially problematic.
To explain the issue in more detail, we explore ROCs and PRCs for a large span of values for the undersampling ratio for Compustat, which is small enough to allow us to estimate the models even if we don't undersample at all (see Table 1).Fig. 14  line with Kosasih & Brintrup (2021, Figs. 5 & 6).While ROCs are fairly stable under di↵erent undersampling ratios, the PRCs change dramatically.Essentially, if we remove many negatives, it becomes easier for any guess of a positive to be correct.This observation also serves to note a trivial but important point: in a case where we really do not know the labels (positive/negative), we cannot undersample the dataset.Therefore, to get a sense of the performance of the model in a genuine out-of-sample task, we need to compute these metrics in a non-undersampled test set.Since Compustat is small enough to do this, we provide a few specific points along the PR-curve (Table 5).If we predict as many links as the true number of links, we recover 23% of the true links, and 23% of our predicted links are indeed existing links.If we wanted to be sure that 80% of our predictions are correct, we should only pick ⇡ 67 links, thus identifying roughly 6% fo the links in the network.If instead we wanted to identify half of the links in the network, we would have to make ⇡ 1446 guesses, of which only ⇡ 10% would correspond to an existing link.
We expect these numbers would be somewhat lower for Factset and Ecuador, but we have not tested.
While we could have compared all the various models using AUPRCs throughout the paper (see Online Appendix D for additional results), here we prefer to report AUROCs, which provide a more robust benchmark for future researchers, who will use undersampling ratios appropriate to their network density and computational capability.

B Exponential-Family Random Graph Models
An ERGM is a probability distribution over the set of possible networks connecting a collection of N nodes.It takes the form: is a random adjacency matrix, • x is a specific realization of X, • ✓ is a vector of model parameters, • z(x) is a vector of network statistics, • k(✓) is a normalization constant.
ERGMs are popular in the study of socio-economic networks because they can deal with nodes' covariates (e.g., the sales of a firm), dyadic properties (e.g., the reciprocity of an edge), and the features of the full network (e.g., the expected density); as a result they can shed light on the mechanisms driving network formation (see Krichene et al. (2019)).We briefly discuss how we fitted this model and used it for link prediction.
Fitting.The ergm R library is a standard for working with ERGMs.From a network and a list of features to include, it provides estimates of the coe cients of an ERGM through a (pseudo) likelihood maximization procedure.ERGMs are hard to calibrate on large networks, and we have only succeeded in making the calibration process converge for Compustat, the smallest of our networks.For FactSet and Ecuador we have adopted a di↵erent strategy.First, we have subsampled ten di↵erent subnetworks for each of the two datasets.These smaller networks were sampled by randomly choosing a node and then retaining all its tier-1 and tier-2 neighbors (a procedure known as snowball sampling).We have calibrated an ERGM for each subnetwork and computed the average of their coe cients.We have used the average coe cients to make predictions on the larger network.The statistics used in the three datasets are reported in Table 6.
Link prediction.Once the distribution is fitted to the data (i.e., once we have an estimate for ✓), using an ERGM for link prediction is straightforward.Consider predicting a link between firm i and firm j, that is, predicting whether the adjacency matrix entry X ij is equal to one or equal to zero.Let us define X c as the rest of the network, X c = {X kl } 8 (k, l) , (i, j).For example, consider the following network G, where we know the presence/absence of each link except the one between 2 and 3: We may represent the adjacency matrix as x = 2 6 6 6 6 6 6 6 6 6 6 6 4 0 1 1 0 0 0 ? 1 0 0 0 1 0 0 0 0 3 7 7 7 7 7 7 7 7 7 7 7 5 .
Compustat FactSet Ecuador edges number of edges X X X transitive number of triangles / transitivity X nodecov(f) nodeicov(f) absdiff(f) Table 6: ERGM statistics.The first columns shows the R functions used, the second column their explanation.X + is equal to the set of the coordinates of existing links and f is either sales, productivity, R&D intensity.The first two functions have a straightforward interpretation: they measure the expected number of edges and transitive triads inthe network.The following two measure the e↵ect of the feature f (i.e., they answer questions like: is a link more likely to exist if the suppliers' sales are larger?).The last one computes the expected di↵erence between connected firms' features.For a complete description of these functions, see the ergm package documentation (Handcock et al. 2019).We want to find the probability that x 2,3 = 1, while the rest of the matrix x c is equal to x c = 2 6 6 6 6 6 6 6 6 6 6 6 4 0 1 1 0 0 0 • 1 0 0 0 1 0 0 0 0 3 7 7 7 7 7 7 7 7 7 7 7 5 .We can define two networks: G +23 , where x 23 = 1 (figure on the left), and G 23 , x 23 = 0 (figure on the right); we call x + and x their adjacency matrices.This procedure can be generalized to any desired network and link.Note that throughout the previous discussion, we assumed a fixed value for ✓,i.e., we assumed that -once calibrated -the parameters of our model would not change.This assumption is coherent with our experimental procedure: we first calibrate the model using the whole network data (thus obtaining a single value for ✓) and later use this model for link prediction.The previous discussion would have been in agreement with a di↵erent yet sensible approach: calibrate the model on the observed portion of the network, again obtaining a single ✓, and then use this model for link prediction12 .A consequence of using a single ✓ is that, as can be seen in the last formula for p + , one does not need to go through the di cult challenge of computing the normalizing constant k (✓) (also known as the partition function) to find a link's odds to exist.However, it is worth mentioning that in the literature, one can encounter a di↵erent approach, where p+ and p are computed using two di↵erent models, one fitted on G + and the other fitted on G .This procedure leads to a slightly di↵erent formula (see Kumar et al. (2020)), which falls back to the one we showed, assuming that, in a large network, the presence or absence of a single link would not generate a significant di↵erence in the values of ✓.

C Categorical Features
As we saw in the main body of the paper, the industrial sector of firms plays a crucial role in predicting supply connections, and it is represented as a categorical variable in our work.Consequently, it is important to provide the most salient facts on how the LightGBM implementation deals with categorical variables.Fig. 16 shows the equivalent of Fig. 4.There is a somewhat higher variability in the performances when evaluated using PR-AUCs compared to AUROCs.The performance on Factset is now more clearly lower than on Compustat.The performance on Ecuador is higher, which is due to the fact that PRCs are sensitive to undersampling ratios (Appendix B).
Fig. 17 shows the PR-AUC for the three di↵erent datasets and all their respective benchmarks.Again, this confirms the higher performance of the GBM.
Fig. 18 shows the PR-AUCs for the Factset cross-country prediction task, for di↵erent models, to be compared with Fig. 7, showing similar qualitative conclusions.

Figure 1 :
Figure 1: (A): True Positives, True Negatives, False Positives and False Negatives are often reported in the confusion matrix.(B): TPR, FPR, and Precision can help us summarize the information in the confusion matrix.

Figure 2 :
Figure 2: AUC values for the Gradient Boosting model on the three datasets.Average values (bars) and standard deviations (error bars) are computed on the five di↵erent realizations of the subsampled datasets.Each error bar shows ± one standard deviation from the average value.

Figure 3 :
Figure 3: ROC curves of the Gradient Boosting model.For each dataset, we plot 5 ROC curves, obtained on five di↵erent train-test splits of the datasets

Figure 4 :
Figure 4: Values of the AUC for the benchmark models.Average values (bars) and standard deviations (error bars) are computed on the five di↵erent realizations of the subsampled datasets.Each error bar shows ± one standard deviation from the average value.

Figure 5 :
Figure 5: Features' permutation importance.Average values (bars) and standard deviations (error bars) are computed on ten random permutations of one of the subsampled dataset.Each error bar shows ± 1 standard deviation from the average value.

Figure 7 :
Figure 7: AUROCs for the Factset cross-country prediction task, for di↵erent dataset splits.Average values (bars) and standard deviations (error bars) are computed on the five di↵erent realizations of the subsampled datasets.Each error bar shows ± 1 standard deviation from the average value.

Figure 8 :
Figure 8: Distribution of rescaled distances D for the US, China, and Japan

Figure 9 :
Figure 9: Distribution of rescaled sales X for the US, China, and Japan

Figure 11 :
Figure 11: Distribution of D for FactSet and Ecuador.

Figure 12 :
Figure 12: Distribution of X for FactSet and Ecuador.

Figure 14 :
Figure 14: Compustat's Receiver-Operating (Left) and Precision-Recall (Right) curves, for di↵erent values of the undersampling ratio (SSR), with Area Under the Curve (AUC) shown in the legend.

Figure 15 :
Figure15: Same decision rule implemented with a categorical variable or its ordinal encoding.

Figure 16 :
Figure16: Area under the Precision-Recall curves for the three di↵erent datasets for the subsampling ratio specified in the main body of the paper.

Figure 17 :
Figure 17: Area under the Precision-Recall curves for the three di↵erent datasets and the respective benchmarks.

Figure 18 :
Figure 18: Area under the Precision-Recall curves for the three di↵erent splits of FactSet into di↵erent countries' networks.

Table 1 :
Table 1 reports the number of nodes and links in each dataset.
Number of nodes and links in the three datasets.The last column shows the dataset's imbalance, i.e., the ratio of the number of pairs that do not have a link to the number of pairs that do have a link.

Table 3 :
Average value and standard deviation of the three coe cients (across the five subsampled datasets).

Table 4 :
Model parameters across the di↵erent experiments.Values in bold font are LightGBM's default values.

Table 5 :
shows the results, which are in Precision and Recall at various points of the PRC, corresponding the darkest line in Fig.14, right panel.The first row corresponds to the true number of links in the testing set.