Spatial Structured Prediction Models: Applications, Challenges, and Techniques

Spatial structure patterns are prevalent in many real-world datasets and applications. For example, in biochemistry, the geometric topology of a molecular surface indicates protein functions; in hydrology, irregular geographic terrains and topography on the Earth’s surface control water flows and distributions; in civil engineering, wetland parcels in remote sensing imagery are often made up of contiguous patches. Spatial structured prediction aims to learn a prediction model whose input and output data contain a spatial structure. Modeling spatial structural information in prediction models is critical for interdisciplinary applications for two reasons. First, explicit spatial structural information often indicates the underlying physical process, and thus enhances model interpretability. Second, spatial structural constraints also have the positive side effects of enhancing model robustness against noise and obstacles and regularizing model learning when training labels are limited. However, spatial structured prediction also poses several unique challenges, such as the existence of implicit spatial structure in continuous space, structural complexity in geometric topology, and high computational costs. Over the years, various techniques have been proposed for spatial structured prediction in different applications. This paper aims to provide an overview of the spatial structured prediction problem. We provide a taxonomy of techniques based on the underlying approaches. We also discuss several future research directions. The paper can potentially not only help interdisciplinary researchers find relevant techniques but also help machine learning researchers identify new research opportunities.


I. INTRODUCTION
Structured prediction (also called structured learning, or structured output learning) aims to learn a predictive model whose input or output data contain a structure between samples [1], [2]. Structured prediction is important in many real-world applications. In natural language processing, words in a sentence often follow a syntactic structure called a parse tree. In biology, a protein consists of amino acids in a network structure, and an amino acid in turn consists of atoms such as carbon, nitrogen, and oxygen, again in a network structure. In speech recognition, vocals follow a sequential structure with a current element correlated with its precedents. The existence of structure is non-trivial in supervised learning because learning samples are no longer independent and identically distributed (i.i.d.). The i.i.d. assumption is prevalent in many supervised learning methods such as decision trees, random forests, and maximum likelihood classifiers. The assumption simplifies model representation and learning algorithms since an identical model can be used to predict each sample independently. Removing the i.i.d. assumption dramatically increases the complexity of models and learning algorithms.
The associate editor coordinating the review of this manuscript and approving it for publication was Berdakh Abibullaev.
Spatial structured prediction, as its name suggests, aims to learn a prediction model for which input or output samples follow a spatial structure [3]-[6]. Such a prediction model is also called a spatial structured (prediction) model. For example, in biochemistry, the geometric topology of a molecular surface indicates protein functions; in hydrology, irregular geographic terrains and topography on the Earth's surface control water flows and distributions; in civil engineering, wetland parcels in remote sensing imagery are often made up of contiguous patches.
Modeling spatial structure is important because the structural information is often closely related to the underlying physical process. It provides an opportunity to bridge machine learning models with existing theories or knowledge in a scientific discipline. For example, hydrologists have long studied the distribution and flow of water on the Earth's surface based on geographic terrain and topography, and have developed various simulation models. Incorporating such geographic terrain structure into machine learning models helps enhance interpretability of the black-box models. From this perspective, structured prediction is intrinsically interdisciplinary, involving the convergence of multiple disciplines. It represents an intersection between data-driven machine learning models in computer science and physics-driven models in an application discipline.
However, learning spatial structured models is non-trivial due to several unique challenges. First, the spatial dependency structure is often implicit in continuous space. Second, the dependency structure can be complex both in Euclidean space and in geometric topology space. Third, spatial structural patterns can be further complicated by additional spatial effects such as spatial heterogeneity (local effects) and multi-scale effects. Fourth, learning spatial structured models is computationally challenging due to model structural complexity. Finally, the problem can be challenging when training labels are limited.
Over the years, various techniques have been developed for different applications. Existing surveys on relevant topics such as spatial prediction [3] often focus on the general challenges of spatial data (e.g., spatial autocorrelation, heterogeneity, multi-scale hierarchy, etc.) without a systematic review of methods from the structured prediction perspective, particularly geometric topology structures. This paper fills the gap with a systematic review of different types of spatial structures. We provide a taxonomy of techniques based on the underlying approaches (Section V), including spatial contextual feature generation, spatial structured model representation, and deep neural networks. Specifically, within structured model representation, we further group the methods based on the types of spatial structures, including spatial neighborhood graph, spatial distance kernel, and geometric topology. Finally, we discuss several future research directions, including geometric deep learning, enhancing model transparency and interpretability, and scalable inference (Section VI). The paper can potentially not only help interdisciplinary researchers find relevant techniques but also help machine learning researchers identify new research opportunities.

II. PROBLEM STATEMENT
To formally define the spatial structured prediction problem, we first need to introduce the concept of a spatial learning sample. In traditional prediction problems, input data is often viewed as a collection of sample records. Similarly, in spatial structured prediction, input spatial data can be viewed as a collection of spatial data samples, whereby each sample corresponds to a spatial object (e.g., point, line or polygon), or a cell in a raster grid. A spatial data sample has multiple non-spatial attributes, one of which is the target response variable to be predicted and the others are explanatory features. Additionally, a spatial data sample also has location information and spatial attributes (e.g., distance to another point, length of a line, area of a polygon). This additional information distinguishes spatial data samples from traditional data samples in two important ways: first, the location information and corresponding spatial attributes can provide additional contextual information in the explanatory feature list; second, and more importantly, implicit spatial structure exists based on sample locations, violating the common i.i.d. assumption. For example, in ground sensor observations on soil properties, a sample corresponds to information from one geo-located sensor. Explanatory features can include soil texture, nutrient level, and moisture. The response variable can be whether a type of plant can grow or not at the location.
Formally, spatial data is a set of $n$ spatial data samples $(s_i, x(s_i), y(s_i))$, $i = 1, \ldots, n$, where $n$ is the total number of samples, $s_i$ is a 2 by 1 spatial coordinate vector for the $i$th sample, $x(s_i)$ is an $m$ by 1 explanatory feature vector ($m$ is the feature dimension), and $y(s_i)$ is a scalar response (categorical for classification and continuous for regression). The set of spatial samples can also be written in matrix format, $(X, Y)$, where $X = [x(s_1), x(s_2), \ldots, x(s_n)]^T$ is an $n$ by $m$ feature matrix, and $Y = [y(s_1), y(s_2), \ldots, y(s_n)]^T$ is an $n$ by 1 response vector. A list of all symbols and descriptions is summarized in Table 1.
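As an illustration of this representation, the following minimal Python sketch stores each sample as a location, a feature vector, and a response, and stacks the samples into the matrix form (X, Y). The class name `SpatialSample`, the helper `to_matrices`, and the soil-sensor values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class SpatialSample:
    """One spatial data sample: location s_i, features x(s_i), response y(s_i)."""
    location: Tuple[float, float]   # s_i, a 2D coordinate
    features: Tuple[float, ...]     # x(s_i), length m
    response: Union[float, int]     # y(s_i): class label or continuous value


def to_matrices(samples: List[SpatialSample]):
    """Stack samples into an n-by-m feature matrix X and a length-n response vector Y."""
    m = len(samples[0].features)
    if any(len(s.features) != m for s in samples):
        raise ValueError("all samples must share the same feature dimension m")
    X = [list(s.features) for s in samples]
    Y = [s.response for s in samples]
    return X, Y


# Hypothetical soil-sensor samples: (texture index, nutrient level, moisture),
# with a binary response saying whether a plant type can grow at the location.
samples = [
    SpatialSample((0.0, 0.0), (0.3, 1.2, 0.25), 1),
    SpatialSample((1.0, 0.0), (0.4, 1.0, 0.30), 1),
    SpatialSample((5.0, 4.0), (0.9, 0.2, 0.05), 0),
]
X, Y = to_matrices(samples)
```

Note that the locations are kept alongside the features rather than discarded: they are what later allows a model to define neighborhoods or distances between samples.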
Given a set of spatial data samples with explanatory features $X = [x(s_1), x(s_2), \ldots, x(s_n)]^T$ and responses $Y = [y(s_1), y(s_2), \ldots, y(s_n)]^T$, the spatial structured prediction problem aims to learn a model (or function) $f$ such that $Y = f(X)$. Once the model is learned, it can be used to predict the responses at other locations based on their explanatory features. The problem can be further categorized into spatial classification for categorical response and spatial regression for continuous response.
The problem defined above is called spatial structured prediction because there exists spatial dependency (instead of independence) between variables at different locations $(x(s_1), x(s_2), \ldots, x(s_n), y(s_1), y(s_2), \ldots, y(s_n))$. It is different from other types of structured prediction (e.g., classifying sentences based on syntactic structure such as a parse tree, recognizing speech signals based on the sequential structure of vocals) in that the dependency structure is related to spatial locations. Spatial structured prediction is also different from traditional non-structured prediction problems. In non-structured prediction problems, samples are commonly assumed to be independent and identically distributed (i.i.d.). Thus, the same model can be used to predict every sample independently. In other words, the same model $y(s) = f(x(s))$ can be applied to each location $s$ individually. However, the i.i.d. assumption is often violated in spatial data due to implicit spatial structure based on sample locations. Therefore, a model needs to capture the relationship between the entire feature maps and the response variable map, as indicated by $Y = f(X)$.
Scope: There are other types of spatial structured prediction problems beyond the geospatial domain. Examples include image and video classification as well as retrieval in the computer vision community, where spatial structures between image parts play a critical role in prediction models, such as fine-grained image classification [7], [8], video captioning [9] and multi-media retrieval [10], [11]. However, the literature on these topics in the computer vision community is so extensive that these topics alone could be a separate survey. Thus, we do not provide exhaustive coverage in this paper.

A. TYPE OF SPATIAL STRUCTURES
This subsection introduces several common types of spatial structures based on the underlying spatial domain. A spatial domain is an underlying space in which locational coordinates are measured. Here we list the three most common spatial domains: the two-dimensional (2D) Euclidean plane, the spatial network, and the three-dimensional (3D) geometric topological surface. Various spatial structures exist in each spatial domain. For example, in a 2D Euclidean plane, spatial dependency structure can be established based on the distance between spatial points, adjacency between spatial raster cells, or topological relationships between geometric objects. On spatial networks, point samples often follow linear patterns along network paths (e.g., traffic crashes along road segments). On a geometric topological surface, spatial structure can include both regular geometric shapes and irregular terrains and topography, as well as contours and flow directions. We now introduce each category in detail below.

1) SPATIAL PROXIMITY ON 2D EUCLIDEAN PLANE
In a 2D Euclidean plane, spatial samples can be points, raster grid cells, or geometric objects (lines, polygons). For spatial point samples, implicit spatial structure exists between spatial points based on their distances. According to Tobler's first law of geography, "everything is related to everything else, but near things are more related than distant things" [15]. The law indicates that geographical data is essentially structured, with nearby locations mutually dependent on each other. The law is consistent with our common sense, e.g., households in the same neighborhood often have similar income levels. Thus, spatial structure can be modeled as a function of distance, with stronger dependency (or correlation) between sample attributes or classes that are closer to each other, as illustrated in Figure 1(a).
For raster grid cell samples such as earth image pixels, the most common spatial structure is the cell adjacency in a grid graph (Figure 1(b)). The most common assumption is the Markov property, i.e., a cell's attribute only depends on its neighbors. In this way, the joint distribution of all cells' attributes can be decomposed into a set of neighborhood cliques. The entire map is called a Markov random field, which has been widely used in image segmentation applications.
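The cell adjacency structure can be made concrete with a short sketch. The code below builds the neighborhood graph of a small raster grid and lists its size-two neighbor cliques. It assumes a 4-neighbor ("rook") adjacency, which is one common choice and not necessarily the one used in any particular cited model; the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, col) index of a raster cell


def rook_neighbors(n_rows: int, n_cols: int) -> Dict[Cell, List[Cell]]:
    """Map each cell to the cells that share an edge with it (4-neighborhood)."""
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    graph: Dict[Cell, List[Cell]] = {}
    for r in range(n_rows):
        for c in range(n_cols):
            graph[(r, c)] = [
                (r + dr, c + dc)
                for dr, dc in offsets
                if 0 <= r + dr < n_rows and 0 <= c + dc < n_cols
            ]
    return graph


def pairwise_cliques(graph: Dict[Cell, List[Cell]]) -> List[Tuple[Cell, Cell]]:
    """Return each undirected neighbor pair exactly once (the size-two cliques)."""
    pairs = {tuple(sorted((a, b))) for a, nbrs in graph.items() for b in nbrs}
    return sorted(pairs)


grid = rook_neighbors(3, 3)
cliques = pairwise_cliques(grid)
```

Under the Markov property, the attribute of cell `(1, 1)` depends on the rest of the map only through the cells in `grid[(1, 1)]`, and the joint distribution factorizes over the pairs in `cliques`.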
For spatial samples that are complex geometric objects such as lines or polygons, the spatial structure between sample objects can be based on topological relationships, such as touch, overlap, disjoint, and within, as shown in Figure 1(c). For example, a railroad (spatial line) may "cross" the Mississippi (spatial line) that "passes through" a national park (polygon) "within" the state of Minnesota (polygon). As another example, on a construction site, the spatial relationship structure between the footprints of buildings, pathways, and instruments is critically important for safety maintenance.
FIGURE 2. An example of US interstate highway networks (source: [18]).

2) LINEAR DEPENDENCY ALONG SPATIAL NETWORKS
A spatial network (also called geometric graph) is a unique type of network whose nodes and edges are geometric objects. In other words, a spatial network combines the spatial relationships both in the Euclidean space and in the network space. Common examples include road networks and river stream networks. As illustrated by Figure 2, in a road network, a node is a road intersection (spatial point), while an edge is a road segment between two intersections (spatial line segment). The spatial structure of samples in a spatial network domain has several unique characteristics. First, samples are embedded in the geometric lines of network edges. Thus, samples show a linear dependency structure along network edges [16], [17]. For example, when predicting traffic volume, samples do not spread across an entire Euclidean plane but are constrained by the direction of a network path. Second, between different network edges, there is also a network topology structure (e.g., direction, neighbor, predecessor, and successor). For example, in a highway system, a road segment (edge) has one or two directions, and different road segments are connected based on intersections (turns).
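To illustrate how the network space differs from the Euclidean space, the sketch below represents a toy road network whose nodes carry coordinates and compares the shortest-path distance along edges with the straight-line distance. The three-node network and all names are hypothetical; Dijkstra's algorithm is used here only as a standard way to compute network distance.

```python
import heapq
import math

# Hypothetical road network: node name -> (x, y) coordinates of an intersection.
nodes = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (3.0, 4.0)}
# Undirected road segments between intersections.
edges = [("A", "B"), ("B", "C")]


def euclidean(p, q):
    """Straight-line distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def build_adjacency(nodes, edges):
    """Adjacency list whose edge weights are the segment lengths."""
    adj = {name: [] for name in nodes}
    for u, v in edges:
        w = euclidean(nodes[u], nodes[v])
        adj[u].append((v, w))
        adj[v].append((u, w))
    return adj


def network_distance(adj, source, target):
    """Shortest-path length along network edges (Dijkstra's algorithm)."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf  # target not reachable


adj = build_adjacency(nodes, edges)
d_net = network_distance(adj, "A", "C")    # travel along A-B-C
d_euc = euclidean(nodes["A"], nodes["C"])  # straight line from A to C
```

In this toy case the network distance exceeds the Euclidean distance, which is why dependency between samples on a network is better measured along paths than across the plane.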

3) GEOMETRIC TOPOLOGICAL SURFACE
A geometric topological surface is a 2D manifold embedded in a 3D Euclidean space. Complex spatial structure patterns can exist on the surface based on geometric and topological features. For example, on a geographical terrain map collected from LiDAR point clouds (Figure 4(a)), the terrain structures not only show the shape of the Earth's surface but also control water distributions (surface extents) and water flow directions. Another example is 3D point clouds in building information modeling (Figure 3). The geometric shape of objects and their topological relationships can indicate the structural properties of buildings, which is very important in construction and structural engineering.

III. APPLICATIONS
This section introduces several representative applications of spatial structured prediction, with a particular focus on applications that involve complex spatial structures on geometric topological surfaces.

A. HYDROLOGY
One fundamental problem in hydrology is to map the distribution of water (e.g., flood, river streams) on the Earth's surface from remotely sensed imagery. The problem plays a critical role in flood disaster response [23] and national water forecasting at the NOAA National Water Center on the University of Alabama campus [24], [25]. The problem, however, is significantly more complex than image classification, since the flow and distribution of water are constrained by the complex terrain on the Earth's surface, as shown in Figure 4(a). Integrating such complex spatial structures into a data-driven model can significantly enhance the model's interpretability and allow for comparison with existing hydrological simulation models.

B. MATERIAL SCIENCE
The properties of materials arise from atomic structures. Exploring atomic structures requires a comprehensive understanding of the potential energy landscape, a function between all atomic positions and their potential energy [26]-[32]. Figure 4(b) is an illustrative example. Complex topological structures on the landscape indicate the physical process of materials transitioning from a stable state (local minimum energy) to a saddle state, and then relaxing to a nearby stable state. Incorporating such complex spatial patterns into data-driven models helps reveal the relationship between local atomic displacement and material properties (e.g., shear strain) towards developing new materials.

C. BIOCHEMISTRY
Analyzing the chemical properties of a molecular system plays an important role in understanding the life process. Various data have been collected on the spatial structures of molecules through X-ray crystallography, NMR, and electron microscopy [33]. Theoretical chemistry has shown that spatial structural information can predict protein function and interaction [34]. Building transparent data-driven models with such complex geometrical and topological structures (Figure 4(c)) can enhance model interpretability through the existing theories in the field [22], [34], [35].

IV. CHALLENGES
Learning spatial structured prediction models is challenging for several reasons. First, spatial structural dependency is often implicit in continuous space. For example, in disease risk mapping, nearby locations tend to have a similar level of disease risk due to the spatial autocorrelation effect. The effect is implicit in the input data, which contains continuous maps such as environmental variables and human movements. This requires machine learning algorithms to explicitly model the spatial dependency structure. Second, spatial structural dependency can be very complex, both in a Euclidean plane and on a geometric topological surface. For example, in flood mapping, flood locations are derived not only based on spectral pixels in earth imagery, but are also constrained by the topography (e.g., contours, flow directions) of an elevation map (a 3D surface). Third, spatial structure can be further complicated by additional effects such as spatial heterogeneity and the multi-scale effect. Spatial heterogeneity means that the dependency may be local instead of global. The multi-scale effect means that the spatial dependency structure may exist at different spatial scales in a spatial hierarchy. For instance, in traffic prediction along road networks, the spatial network structured patterns can be modeled both at a coarse scale (only with major highways) and at a fine scale (with county roads and local streets). Fourth, learning a spatial structured model for a large data volume can be computationally challenging, due to the structural complexity in model representation. Many learning problems related to graph structures are NP-hard. Finally, there can be limited ground truth. For example, in earth science applications, collecting ground truth training class labels often involves sending a field crew on the ground or visually interpreting earth images, which is tedious and time consuming. This is a particular challenge for spatial structured prediction since a complex model often requires more training samples in the learning process in order to avoid overfitting.

V. A TAXONOMY OF TECHNIQUES BY APPROACHES
We now introduce a taxonomy of existing spatial structured prediction models, as shown in Figure 5. Based on the underlying approaches to capture spatial structure dependency, existing works can be categorized into three groups: spatial contextual feature generation, spatial structured model representation, and deep neural networks. The spatial contextual feature generation approach focuses on creating or learning contextual features that capture spatial dependency structure. Examples of such features include spatial relationships between geometric objects (e.g., within, overlap), outputs of spatial raster operators (e.g., texture), as well as learned contextual features that implicitly reflect spatial dependency (e.g., spatial network embedding). The spatial structured model representation approach aims to explicitly capture spatial dependency structure within the machine learning model representation. Examples include modeling spatial neighborhood graphs in Markov random fields, modeling spatial correlation over distance in Gaussian process (Kriging), and modeling geometric topology in flow direction trees. Finally, the deep neural network approach focuses on learning a black-box model that automatically captures complex spatial structure dependency in an end-to-end manner. Specifically, it can be categorized into deep convolutional neural networks for raster imagery or regular spatial grids and graph neural networks for spatial graphs. We now introduce each approach and its specific methods, and compare the advantages and disadvantages of different methods.

A. SPATIAL CONTEXTUAL FEATURE GENERATION
One way of incorporating spatial dependency into prediction methods is to augment input data with additional spatial contextual features. The spatial context of a sample location refers to information related to the spatial structure surrounding it, such as relationships to other objects or locations, attributes of nearby samples, and auxiliary semantic information from additional data sources. Once spatial contextual information is added into explanatory features, traditional non-spatial prediction methods can be used. We now introduce several approaches to generate spatial contextual features.

1) SPATIAL RELATIONSHIP FEATURES
Spatial contextual features can be generated based on spatial relationships with other locations or objects, such as distance or direction, touching, lying within, or overlapping with another object. Spatial relationship features can be readily used in rule-based or decision tree-based models. Examples of techniques include the spatiotemporal probability tree model to classify meteorological data on storms [36], [37], multi-relational spatial classification [38], and prediction based on spatial association rules [39].
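A minimal sketch of how such relationship features might be computed for a point sample is shown below: a "within" indicator for a polygon, plus distance and direction to another object. The function names, the square polygon, and the landmark are hypothetical; the point-in-polygon test is the standard ray-casting method and does not handle points exactly on the boundary.

```python
import math


def point_in_polygon(point, polygon):
    """Ray-casting test: True if the point lies inside a simple polygon.

    Points exactly on the boundary are not handled specially.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def relationship_features(point, polygon, landmark):
    """Contextual features of a point relative to a polygon and a landmark."""
    dx, dy = landmark[0] - point[0], landmark[1] - point[1]
    return {
        "within_polygon": point_in_polygon(point, polygon),
        "distance_to_landmark": math.hypot(dx, dy),
        # Direction measured counterclockwise from the positive x-axis.
        "bearing_deg": math.degrees(math.atan2(dy, dx)) % 360.0,
    }


# Hypothetical square parcel and landmark location.
parcel = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
features = relationship_features((1.0, 1.0), parcel, landmark=(4.0, 5.0))
```

The resulting dictionary can be appended to a sample's explanatory features and passed to a non-spatial learner such as a decision tree.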

3) SPATIAL CONTEXTUAL FEATURE LEARNING
Spatial contextual feature learning aims to learn a low dimensional feature representation of spatial samples that preserves spatial structural information in the original feature space. Common techniques include embedding [54]-[56], matrix factorization [57], as well as recent unsupervised deep learning methods such as auto-encoders [58]. Recently, deep convolutional neural networks have been used to automatically learn complicated structured features in spatial raster data with convolutional operators [58]-[64]. One unique property of spatial data is that information from different sources can be fused into the same spatial framework, providing important spatial contexts for learning samples. For example, when predicting a fine-grained air quality map for an entire city, we can generate contextual features by fusing air quality records at ground stations, weather information, road network and traffic data, as well as POIs [65]. When predicting human behaviors from location history, auxiliary data from geosocial media can provide important semantic annotations [66], [67]. Generating contextual features through data fusion can have its own challenges (e.g., multi-modality, sparsity, noise). Various techniques have been explored, such as coupled matrix factorization, and context-aware tensor decomposition with manifold [68].
Spatial contextual feature generation is important in many practical applications (e.g., urban computing) due to two main advantages. First, generating appropriate contextual features can significantly enhance prediction accuracy due to the effectiveness of those features in explaining the response variable. Second, after spatial contextual features are generated, many traditional non-spatial predictive models can be used (e.g., random forest, support vector machine). This is sometimes convenient since there is no need to modify non-spatial prediction models or learning algorithms. At the same time, spatial contextual feature generation may require significant knowledge about the application domain.

B. STRUCTURED MODEL REPRESENTATION
Another general approach for spatial structured prediction problems is to incorporate spatial structural constraints into the machine learning model representation. One of the most common frameworks to model dependency structure in machine learning is probabilistic graphical models [69]-[74], which use nodes and edges in a graph to model variables and dependency between variables. Graphical models can be further grouped into Markov random fields for undirected graphs [75], Bayesian networks for directed graphs [76], and Markov logic networks for graphs with first-order logic [77]. We now introduce several categories of techniques based on the type of spatial structural constraints being incorporated, including spatial neighborhood graph, spatial distance kernel, and geometric topology.

1) SPATIAL NEIGHBORHOOD GRAPH BASED MODELS
Spatial neighborhood graph based models assume that the spatial dependency structure follows a neighborhood graph, i.e., dependency exists between neighboring locations. A spatial neighborhood graph can be derived based on cell adjacency on a raster framework or spatial distance in a continuous field. These models are generally considered part of the Markov random field model family.
Markov random field (MRF) [71], [73], [74], [78]-[81] is a widely used model for areal data such as earth observation images, MRI medical images, and county-level disease count maps. An MRF is a random field that satisfies the Markov property: the conditional probability of the observation at one cell given observations at all remaining cells depends only on observations at its neighbors. This property is consistent with the first law of geography that "nearby things are more related than distant things". According to Brook's lemma [82], the joint distribution of cell observations can be uniquely determined based on the conditional probabilities specified in the Markov property. Furthermore, according to the Hammersley-Clifford theorem [83], the corresponding joint distribution of an MRF has a unique structure: it can be expressed by a set of potential functions on spatial neighbor cliques (i.e., symmetric functions that are unchanged by any permutations of input variables within a clique). Such a joint distribution is also called a Gibbs distribution. Equation 1 is an example. Its potential function is $W_{i,j}(y(s_i) - y(s_j))^2$ based on cliques of size two $(s_i, s_j)$. The Markov property simplifies the modeling process: as long as the neighborhood structure is specified, the joint distribution of an MRF can be expressed by a potential function on neighbor cliques. Spatial prediction methods based on MRF include ones that explicitly capture spatial dependency such as Simultaneous Autoregressive models (SAR), ones that implicitly capture spatial dependency such as Conditional Autoregressive models, and ones integrating MRF with other models such as Bayesian classifiers and support vector machines.
$$P(y(s_1), \ldots, y(s_n)) \propto \exp\left\{-\frac{1}{2\sigma^2} \sum_{i,j} W_{i,j}\,(y(s_i) - y(s_j))^2\right\} \quad (1)$$
Simultaneous Autoregressive models (SAR) explicitly express spatial dependency across response variables [84], [85]. The SAR model extends traditional linear regression with an additional spatial autoregressive term, as shown in Equation 2, where $Y$ is an $n$ by 1 column vector of all response variables, $X$ is an $n$ by $m$ sample covariate (feature) matrix, $\beta$ is an $m$ by 1 column vector of coefficients, $\epsilon$ is an $n$ by 1 column vector of i.i.d. Gaussian noise (residual errors), $W$ is the row-normalized neighborhood weight matrix (W-matrix), and $\rho$ reflects the strength of the spatial dependency effect.
$$Y = \rho W Y + X\beta + \epsilon \quad (2)$$
The $i$th row of the spatial autoregressive term $WY$ is a weighted average of response variables at all neighboring locations of $s_i$. Parameters in SAR can be estimated based on the maximum likelihood method. The SAR model can also be extended for spatial classification via logit transformation. It is worth noting that the spatial autoregressive term can also be added to variables other than the responses, such as covariates as in the spatial Durbin model or residual errors as in the spatial error model [86].
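To make the autoregressive term concrete, the sketch below assumes the standard SAR form Y = ρWY + Xβ + ε with a row-normalized W and solves it for Y by fixed-point iteration, which converges when |ρ| < 1. This is an illustrative solver for given parameters, not the maximum likelihood estimation procedure mentioned above; the function names and the three-location chain are hypothetical.

```python
def row_normalize(adjacency):
    """Turn a 0/1 adjacency matrix into a row-normalized W-matrix."""
    W = []
    for row in adjacency:
        total = sum(row)
        W.append([v / total if total else 0.0 for v in row])
    return W


def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def sar_solve(W, X, beta, rho, eps=None, tol=1e-12, max_iter=10_000):
    """Solve Y = rho*W*Y + X*beta + eps for Y by fixed-point iteration."""
    if abs(rho) >= 1:
        raise ValueError("this simple iteration assumes |rho| < 1")
    n = len(W)
    eps = eps or [0.0] * n
    b = [xb + e for xb, e in zip(matvec(X, beta), eps)]  # X*beta + eps
    Y = b[:]
    for _ in range(max_iter):
        Y_new = [rho * wy + bi for wy, bi in zip(matvec(W, Y), b)]
        if max(abs(a - c) for a, c in zip(Y_new, Y)) < tol:
            return Y_new
        Y = Y_new
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical example: three locations in a chain 0 - 1 - 2, intercept-only model.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
W = row_normalize(adjacency)
X = [[1.0], [1.0], [1.0]]
Y_hat = sar_solve(W, X, beta=[2.0], rho=0.5)
```

With a constant mean of 2 and ρ = 0.5, each location's response is pulled up by its neighbors' average, so the noise-free solution is larger than Xβ alone; a larger ρ would strengthen this effect.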
Another model similar to MRF is conditional random field (CRF) [87]-[90], which directly models spatial dependency within the conditional probability function $P(Y|X)$. Its potential function on class labels within a clique is conditioned on feature vector $X$. Several variants of CRF have been proposed including decoupled conditional random field [88], discriminative random field [89] and support vector random field [90]. The difference between MRF and CRF is that the former is generative while the latter is discriminative.
The advantage of MRF-based models is that spatial dependency can be modeled in a very intuitive and simple way (designing potential functions). The limitations include high computational costs in parameter estimation and strong assumptions on the structure of joint probability distribution. In addition, neighborhood relationships are assumed to be given as inputs. Thus, fixed neighborhoods such as square windows are often used for simplicity. For applications where spatial data is anisotropic, determining spatial neighborhood structure is also a challenge.
There are other models for spatial model representations, including spatial statistics and spatial econometrics [84], [85] such as the spatial panel model and the spatial Durbin model [86], [91], as well as spatial regularization on the loss function [92]-[94]. Methods based on spatial network connectivity are relatively less studied, including spatial network Kriging [95], and spatial network autoregressive models [96].

2) SPATIAL DISTANCE BASED MODELS
Spatial distance based methods assume that the strength of dependency is only determined by the distance between sample locations, regardless of directions. The most common method is the Gaussian process model (also called Kriging).
Different from MRF based models that are used for areal data, Gaussian process based models are used for spatial prediction (interpolation) on point reference data [85]. Given observations of a variable at sample locations in continuous space, the problem aims to interpolate the variable at an unobserved location. The Gaussian process assumes that observations at any set of sample locations jointly follow a multivariate Gaussian distribution. The mean term at a location is determined by its local covariates. The residual error term at a location is assumed to be weakly stationary and isotropic, so that the covariance matrix can be expressed as a function of distance (covariogram).
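The idea that covariance depends only on distance can be sketched in a few lines. The code below builds a covariance matrix from sample locations using an exponential covariogram, which is one common parametric choice assumed here for illustration; the parameter names `sill` and `range_` and the sample locations are hypothetical.

```python
import math


def exponential_covariogram(h, sill=1.0, range_=2.0):
    """C(h) = sill * exp(-h / range_): covariance decays with distance h."""
    return sill * math.exp(-h / range_)


def covariance_matrix(locations, covariogram):
    """Covariance between every pair of locations, using distance only."""
    n = len(locations)
    return [
        [covariogram(math.dist(locations[i], locations[j])) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical sample locations in a 2D plane.
locations = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
Sigma = covariance_matrix(locations, exponential_covariogram)
```

Because the covariogram takes only the distance as input, two pairs of points separated by the same distance in different directions receive the same covariance (isotropy), and shifting all locations by a constant leaves the matrix unchanged (stationarity).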
Specifically, Gaussian process (Kriging) assumes that any set of sample observations $Y = [y(s_1), \ldots, y(s_n)]^T$ follows a multivariate Gaussian distribution $N(\mu, \Sigma)$, where $\mu = X\beta$ and $\Sigma$ is the covariance matrix with $\Sigma_{ij} = \mathrm{Cov}(y(s_i), y(s_j)) = C(s_i - s_j)$. The main difference between a Gaussian process and classical regression is that the residual errors are not mutually independent: the covariance matrix is not diagonal, and its off-diagonal elements are determined by the covariogram $C(h)$, a function describing the degree of spatial dependence over location differences [85]. It can be shown that the optimal predictor (minimizing expected square loss) of $y(s_0)$ given other observations $y(s_1), \ldots, y(s_n)$ is the conditional expectation as shown in Equation 3, which can be estimated based on the covariance structure from the covariogram [85]. This method is also called universal Kriging since it involves covariates $X$. Special cases without covariates include simple Kriging (with known constant mean) and ordinary Kriging (with unknown constant mean) [97].
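As a sketch of the form this conditional expectation takes, the standard conditioning identity for a multivariate Gaussian gives the expression below when $\beta$ and the covariogram are treated as known. The vector $c_0$ is notation introduced here for illustration, and the expression may differ in presentation from the authors' Equation 3.

```latex
\hat{y}(s_0) = \mathbb{E}\left[\, y(s_0) \mid Y \,\right]
             = x(s_0)^T \beta + c_0^T \Sigma^{-1} (Y - X\beta),
\qquad
c_0 = \left[\, C(s_0 - s_1), \ldots, C(s_0 - s_n) \,\right]^T
```

The first term is the regression mean at the new location; the second term corrects it with the residuals observed at the sample locations, weighted by how strongly each sample covaries with $s_0$. When the residuals are independent, $c_0$ is zero and the predictor reduces to classical regression.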
The Gaussian process shares some similarities with MRF in that both incorporate spatial dependency (autocorrelation) across sample locations into the model structure. There are several differences, however. First, the Gaussian process is developed for point reference data while MRF is developed for areal data. Second, MRF models spatial dependency through the W-matrix ($W$) while the Gaussian process models spatial dependency through a covariance function on spatial distance (covariogram). Just as MRF assumes a given neighborhood structure and the Markov property, the Gaussian process makes assumptions of spatial stationarity and isotropy.

3) GEOMETRIC TOPOLOGY BASED MODELS
Geometric topology based models learn spatial structures on a topological surface (e.g., a 3D terrain map) beyond the common Euclidean space (e.g., a raster image or video). Such spatial structure is often related to geometric shapes or contour patterns and has been extensively studied in computational geometry, or more specifically computational topology [33], [98]-[100].
One such spatial structured model is the geographical hidden Markov tree (g-HMT) [101], [102], a probabilistic graphical model that generalizes the common hidden Markov model (HMM) from a total order sequence to a partial order tree. The tree structure is motivated by disciplinary knowledge in hydrology. As illustrated in Figure 6, due to gravity, water flows from a high elevation to nearby lower elevations. If location 5 is in the flood class, locations 2-4 and 6-7 should also be in the flood class. Such a partial order is captured by a tree structure (Figure 6(b)) in the hidden class layer. Each hidden class node has an associated observed feature node for the same pixel. Such a spatial dependency structure can potentially reduce classification errors caused by noise, obstacles, and heterogeneity among the spectral features of individual pixels. Compared with existing spatial structured models (e.g., Markov random field, post-processing label propagation), g-HMT can reflect directed dependency that follows the physics in hydrology. Empirical evaluations have shown that g-HMT significantly improves accuracy over existing methods by overcoming noise and obstacles in feature maps [101]. The model has also been extended to address limited observations based on physics-aware spatial structural constraints (e.g., water flow directions on a terrain surface) [103].
Another example is the hidden Markov contour tree (HMCT) [104]. HMCT is a probabilistic graphical model that generalizes the common hidden Markov model from a total order sequence to a partial order contour tree. It extends the geographical hidden Markov tree with 2D structures such as contours. Figure 7 illustrates an example in flood mapping in hydrology. Due to gravity, floodwater flows from one location to nearby lower locations, as illustrated by the arrows in Figure 7(b). Such a dependency structure is represented by a contour tree, as shown in Figure 7(c), where a node represents the class (flood or dry) of a contour segment (e.g., the three contiguous pixels with elevation 1), and an edge represents the flow direction between two adjacent contour segments. In the HMCT model, each hidden class node in the contour tree is also associated with a set of observed feature nodes, corresponding to the spectral information of pixels in the same contour segment, as shown in Figure 7(d). In this way, class inference is based not only on local spectral information but also on flow direction and contour structure. Empirical evaluations on datasets from Hurricane Matthew floods showed that HMCT outperformed existing methods with a 10% to 30% gain in F-score by overcoming noise and obstacles in features [104]. It can be more robust than g-HMT in cases where the dependency structure on different sides of a river bank should be modeled separately.
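The flow dependency that motivates g-HMT and HMCT can be illustrated with a deliberately simplified, deterministic sketch: if a cell is labeled flood, every cell connected to it through cells of equal or lower elevation is also labeled flood. This is not the probabilistic inference of [101] or [104]; the function names, the 4-neighbor connectivity, and the toy elevation profile are assumptions made only for illustration.

```python
from collections import deque

def flood_from_seed(elevation, seed):
    """Return all cells connected to `seed` through cells no higher than the seed's elevation."""
    rows, cols = len(elevation), len(elevation[0])
    level = elevation[seed[0]][seed[1]]
    reached = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in reached
                    and elevation[nr][nc] <= level):
                reached.add((nr, nc))
                queue.append((nr, nc))
    return reached

def enforce_flow_constraint(elevation, flood_cells):
    """Extend a set of (possibly incomplete) flood labels so that they respect flow dependency."""
    flooded = set()
    for seed in flood_cells:
        flooded |= flood_from_seed(elevation, tuple(seed))
    return flooded

# Toy 1 x 7 elevation profile; only the cell with elevation 3 is observed as flood.
elevation = [[5, 4, 3, 1, 2, 4, 6]]
flooded = enforce_flow_constraint(elevation, [(0, 2)])
```

In this toy profile the two lower neighboring cells (elevations 1 and 2) are pulled into the flood class, while the higher cells on either side are left unchanged. A hidden Markov tree replaces this hard rule with class transition probabilities along the same partial order.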
Geometric topology based methods differ from the previous two methods in that they capture topological structure on a 3D surface instead of a 2D spatial plane. Such topological structures can be directed (e.g., flow directions based on surface gradient). The specific techniques discussed above can be considered special cases of Bayesian networks (they are directed trees or polytrees instead of general directed graphs). Such unique topological structures on a 3D surface are important for applications in hydrology, material science, and biochemistry.

C. DEEP NEURAL NETWORKS
Deep neural networks are a newer approach that has become increasingly popular in recent years for structured prediction problems. Compared with the previous two approaches, i.e., spatial contextual feature generation and structured model representation, the deep learning approach can automatically learn structured features in an end-to-end manner (without manually handcrafting spatial contextual features or explicitly specifying spatial structure in the model representation). Deep learning methods for spatial structured prediction can generally be categorized into deep convolutional neural networks (DCNN) and graph neural networks (GNN). DCNN was originally developed for image data in the computer vision community, and thus is naturally applicable to spatial raster data such as earth imagery [60]-[62] and discretized spatial grids [63], [64]. Graph neural networks are a generalization of deep neural networks from images to graphs. GNN can be used for spatial structured prediction because spatial data can be represented as graphs with nodes representing locations and edges representing spatial relationships between locations [105], [106]. We treat deep learning as a separate approach from the previous two due to its wide popularity in recent years. Since DCNN is widely known, we introduce more details on GNN in this subsection.
Graph neural networks are a set of techniques that generalize deep neural networks to graphs. Generalizing deep learning to graph data is non-trivial due to the lack of a regular input domain like imagery or videos, and the structural complexity of graphs. Several surveys exist on this important emerging topic [107]-[110]. Existing techniques can be categorized based on the underlying approaches to capture graph dependency between nodes, such as graph convolutional neural networks, graph recurrent neural networks, and graph attention neural networks.
Graph convolutional neural networks generalize deep convolutional neural networks to graph data. The main challenge is that there is no regular neighborhood window as in imagery data. There are two general approaches: the spectral-based approach and the non-spectral-based approach. The spectral-based approach uses the spectral representation of a graph (i.e., eigenvalues and eigenvectors of the graph Laplacian matrix). Techniques have been proposed to reduce the number of parameters and computational complexity, such as ChebyNet [111] and simple Graph Neural Network [112]. The non-spectral approach directly generalizes convolutional operations from a regular grid to a graph, such as using different weight matrices for nodes with different degrees [113] and graph diffusion neural networks [114]. In both cases, graph convolutional networks generalize convolutional filters from raster imagery to graph neighborhoods. One potential issue is that these methods often assume a fixed graph topology (and corresponding Laplacian matrix), and thus are not applicable to cases where the underlying graph structure changes from training data to test data.
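The spectral-based approach can be sketched in a few lines: a graph signal is projected onto the eigenvectors of the graph Laplacian, each spectral coefficient is scaled by a filter response, and the result is projected back. The sketch below assumes an undirected graph, the unnormalized Laplacian $L = D - A$, and a fixed (not learned) heat-kernel response; all names are hypothetical and it is not the implementation of [111] or [112].

```python
import numpy as np

def graph_laplacian(adjacency):
    """Unnormalized graph Laplacian L = D - A for an undirected graph."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

def spectral_filter(adjacency, signal, response):
    """Filter a node signal in the eigenbasis of the graph Laplacian."""
    lap = graph_laplacian(adjacency)
    eigvals, eigvecs = np.linalg.eigh(lap)      # symmetric matrix, real spectrum
    coeffs = eigvecs.T @ signal                 # graph Fourier transform
    return eigvecs @ (response(eigvals) * coeffs)  # scale and transform back

# Path graph with four nodes and an alternating (high-frequency) signal.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
signal = np.array([1.0, -1.0, 1.0, -1.0])
smoothed = spectral_filter(adjacency, signal, lambda lam: np.exp(-lam))
unchanged = spectral_filter(adjacency, signal, lambda lam: np.ones_like(lam))
```

The sketch also shows why the approach is tied to one graph: the eigenvectors are computed from a specific Laplacian, so a filter expressed in this basis does not transfer directly to a graph with a different topology. The full eigendecomposition is also costly for large graphs, which is one motivation for the cheaper approximations mentioned above.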
Graph recurrent neural networks generalize recurrent neural networks from sequences to graphs. Their main advantage is the capability of capturing directed dependency (directed graphs). The main idea is to use a gating mechanism to control the flow of information across graph nodes along directed paths. The most common technique is graph Long Short-Term Memory (graph LSTM) [115]-[119]. An LSTM [58] is a special recurrent neural network that uses internal gates to control the memory of information from previous steps in a sequence, which allows it to capture long-range dependency. An LSTM can be generalized from sequences to trees and graphs by allowing for multiple predecessors or successors of a node. Such models have been applied to natural language processing [115]-[117], but the underlying directed graph is often small. Graph recurrent neural networks can potentially be applied to directed spatial graphs (e.g., river networks, road networks considering traffic directions). A potential issue is the high computational cost of learning when the graph is large.
Graph attention networks [120] use the self-attention mechanism on graphs to perform node classification on a graph structure. The main idea is to compute the hidden representation of each node in the graph by learning weights over its neighbors. The weights between a node and its neighbors are learned by a self-attention strategy. Self-attention means mapping a graph to itself while learning the weights between a node and its neighbors (the amount of attention on each neighbor required to classify the current node). The self-attention mechanism can address the issue of varying node degrees in graph convolution (which violates the assumption of a fixed window size in traditional convolutional operations) by averaging hidden representations from all neighbors based on their weights.
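The weighted averaging over neighbors can be sketched as a single attention head in NumPy. The sketch assumes the commonly used scoring form in which a score for each node pair is a LeakyReLU of a learned vector applied to the two concatenated hidden representations, followed by a softmax over each node's neighborhood. The parameters here are random rather than learned, self-loops are added, the output nonlinearity is omitted, and all names are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def graph_attention_layer(features, adjacency, weight, attn_vec):
    """One attention head: per-neighbor weights via softmax, then a weighted average."""
    n = len(adjacency)
    mask = np.asarray(adjacency, dtype=bool) | np.eye(n, dtype=bool)  # add self-loops
    hidden = features @ weight                      # shared linear map, shape (n, d_out)
    d_out = hidden.shape[1]
    # a^T [h_i || h_j] splits into a source term plus a destination term.
    src = hidden @ attn_vec[:d_out]
    dst = hidden @ attn_vec[d_out:]
    scores = leaky_relu(src[:, None] + dst[None, :])
    scores = np.where(mask, scores, -np.inf)        # non-neighbors get zero weight
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ hidden, alpha

rng = np.random.default_rng(0)
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]])
features = rng.normal(size=(4, 3))
weight = rng.normal(size=(3, 2))
attn_vec = rng.normal(size=4)   # length 2 * d_out
out, alpha = graph_attention_layer(features, adjacency, weight, attn_vec)
```

Because the softmax is taken over however many neighbors a node has, the same parameters apply to nodes of any degree, which is the property the paragraph above describes.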
Research on deep neural networks, particularly graph neural networks, is still growing quickly. The main advantages of this approach over the traditional approaches are that it does not require handcrafting or learning spatial contextual features, and that it can learn complex spatial structure dependency patterns. The main limitations are that deep neural networks are often black-box models that are hard to interpret and that they require a large amount of training labels.
Table 2 provides a summary of comparisons between the three major approaches of spatial structured prediction models: spatial contextual feature generation, structured model representation, and deep neural networks. Incorporating spatial structures through contextual feature generation is an effective approach when the generated features are good predictors of the response variable. It does not require changing the machine learning model, and thus traditional non-spatial machine learning methods can be used (e.g., random forests, support vector machines). For this reason, it is often used in practice by interdisciplinary application domain scientists. The limitation is that the approach requires sufficient domain knowledge of the application, and the process needs to be repeated for different applications. In contrast, the approach of incorporating spatial structure in model representation provides general tools that can be used for different applications, but it has the limitation of high computational costs of model learning compared with traditional non-spatial models. Finally, the emerging deep learning approach reduces the burden of handcrafting spatial features (the deep neural networks automatically learn effective feature representations) and is able to learn complex spatial structure patterns. The main disadvantages are that the models are often hard to interpret and that model training requires a large amount of training labels.

VI. FUTURE DIRECTIONS
This section identifies research gaps and discusses some potential future research directions in the field of spatial structured prediction.

A. GEOMETRIC DEEP LEARNING
Recently, the topic of geometric deep learning [121] has attracted growing interest from the deep learning research community. Geometric deep learning generalizes traditional deep learning from regular raster frameworks (e.g., images and videos) in Euclidean space to irregular spatial structures on a geometric surface (e.g., 3D point cloud, protein surface). The problem is very challenging for several reasons. First, there is no fixed input spatial domain on which we can easily learn a locally invariant convolutional operator. Second, explicit topological relationships can exist on a geometric surface (e.g., water flow direction along a terrain map). Third, there may be limited training labels. According to a recent overview article [121], existing methods are largely based on graph convolutional neural networks [107], [108], which rely on a fixed graph topology and thus cannot generalize to cases where the graph structure changes from one dataset to another.
Addressing these challenges requires the integration of deep learning with computational geometry (or computational topology, computer graphics). There is already extensive research on modeling complex spatial structures in the field of computational geometry (e.g., topological data analysis, contour trees, shape analysis). These existing works provide unique opportunities to extract spatial structural constraints from observed geometric surface data. Such spatial structural constraints can then be integrated with graph neural network models. The integration of data-driven approaches (graph neural networks) with physics-aware structural constraints (computational topology) into the backbone of model representation has several advantages. First, explicit spatial structural information often indicates the underlying physical process, and thus enhances model interpretability. Second, spatial structural constraints also have a positive side-effect of providing regularization for model learning when training labels are limited. For example, we can potentially integrate a contour tree from an elevation surface with graph recurrent neural networks to capture both non-linear relationships between variables and the flow direction and contour patterns on an elevation surface.

B. MODEL TRANSPARENCY AND INTERPRETABILITY
Another potential future research direction is to improve model transparency and interpretability. Existing spatial structured prediction models based on deep learning techniques often rely on a black-box model representation, making them hard to interpret using the existing knowledge and theories of an application domain. This is a particular handicap for scientific applications such as earth science, biology, and chemistry. Recently, there has been growing interest in interpretability in the machine learning research community. Interpretable machine learning [122]-[126] (also called explainable machine learning or AI [127], [128]) refers to techniques that help human users better understand the behavior of machine learning models. According to a recent survey [124], existing approaches can be categorized into intrinsic interpretation, which focuses on constructing self-explanatory models, and post-hoc interpretation, which creates a second model to provide explanations for an existing black-box model. Techniques for intrinsic interpretation include transparent model structures [129] such as decision trees, rule-based models [130], and linear models with sparse features; globally interpretable constraints [131] such as regularizing the loss to learn disentangled representations in CNN [132] or constructing a capsule network [122]; and locally interpretable structures such as the attention mechanism [133]. Techniques for post-hoc interpretation [134], [135] include learning interpretable models to mimic a black-box model [136]; testing model sensitivity to input features [137]-[139]; and explaining representations in black-box models [140]-[143]. Spatial structured prediction seems more suitable for intrinsic interpretation techniques given the existence of structural information. However, existing methods that build transparent model representations often assume simple structures such as trees or rules. Techniques for transparent models with complex spatial structural constraints are largely underexplored.
There are two potential strategies to address the challenge. One strategy is to incorporate domain knowledge into the design of model structures or model components [144], [145] so that the model is aware of the underlying physics. The other strategy is to explicitly model the spatial structural constraints (e.g., from computational topology) and build them into the backbone of model representation.

C. SCALABLE INFERENCE
Inference in structured prediction models is computationally very challenging due to structural complexity, which violates the common assumption that data samples are independent and identically distributed. Indeed, inference on a general graph is NP-hard [72]. Existing approximate inference algorithms include loopy belief propagation [146], variational inference [147], and sampling methods [148]. However, these methods are largely based on statistical properties and do not consider the unique properties of spatial structure. For example, spatial data can show a local effect (i.e., dependency is stronger within local areas) and a multi-scale effect (i.e., spatial patterns can be analyzed at different spatial scales and resolutions). Based on the local effect, we can potentially develop divide-and-conquer techniques that conduct inference within individual zones and then synchronize the results towards a global solution. Based on the multi-scale effect, we can potentially investigate multi-resolution filtering that first infers unknown classes on a coarse spatial scale (with a much smaller data size) and then refines the inferred class boundaries on a finer spatial scale. Another potential direction is to develop parallel algorithms for scalable inference. The problem is further complicated by the lack of a regular grid structure in the input data (unlike images or videos), which makes it hard to utilize existing parallel computing platforms such as GPUs. Several works have parallelized graph neural networks on GPUs [149], [150], but the problem remains largely underexplored.
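The multi-resolution idea can be illustrated with a small sketch in which per-pixel thresholding of a class probability grid stands in for a real inference step. Classes are first assigned on block-averaged (coarse) cells, and fine-scale pixels are revisited only in coarse cells that touch a class boundary. The block size, the threshold, and the function name are assumptions for illustration, and the coarse labels are an approximation that a structured model would replace with proper inference.

```python
import numpy as np

def coarse_to_fine_labels(prob, block=2, threshold=0.5):
    """Label a probability grid coarse-first, refining only cells on coarse class boundaries."""
    rows, cols = prob.shape
    assert rows % block == 0 and cols % block == 0
    # Coarse scale: average probabilities within each block and classify the block.
    coarse_prob = prob.reshape(rows // block, block, cols // block, block).mean(axis=(1, 3))
    coarse = coarse_prob > threshold
    # A coarse cell is on a boundary if any 4-neighbor has a different label.
    padded = np.pad(coarse, 1, mode="edge")
    boundary = np.zeros_like(coarse)
    n_r, n_c = coarse.shape
    for dr, dc in ((0, 1), (2, 1), (1, 0), (1, 2)):
        boundary |= padded[dr:dr + n_r, dc:dc + n_c] != coarse
    # Fine scale: copy coarse labels down, then re-classify pixels in boundary cells only.
    labels = coarse.repeat(block, axis=0).repeat(block, axis=1)
    refine = boundary.repeat(block, axis=0).repeat(block, axis=1)
    labels[refine] = prob[refine] > threshold
    return labels, int(refine.sum())

# Toy 4 x 8 grid: flood on the left, dry on the right, a ragged boundary in between.
prob = np.array([
    [0.9, 0.9, 0.9, 0.9, 0.6, 0.1, 0.1, 0.1],
    [0.9, 0.9, 0.9, 0.9, 0.4, 0.1, 0.1, 0.1],
    [0.9, 0.9, 0.9, 0.9, 0.7, 0.1, 0.1, 0.1],
    [0.9, 0.9, 0.9, 0.9, 0.2, 0.1, 0.1, 0.1],
])
labels, n_refined = coarse_to_fine_labels(prob)
```

In this toy grid only half of the pixels are revisited at the fine scale, and the final labels match what per-pixel classification of the whole grid would give. That agreement is not guaranteed in general, since a uniform-looking coarse cell can hide fine-scale variation.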

VII. CONCLUSION
This paper provides an overview of the spatial structured prediction problem. We define the problem with different types of spatial structures and their applications. We also provide a taxonomy of common techniques categorized by the underlying approaches. In the end, we discuss several potential research directions.