Journal of Network and Computer Applications

This work contributes a real-time, edge-centric inferential modeling and analytics methodology, introducing the fundamental mechanisms for (i) predictive model updating and (ii) diverse model selection in distributed computing. Our objective in edge-centric analytics is time-optimized model caching and selective forwarding at the network edge by adopting optimal stopping theory: communication overhead is significantly reduced because only inferred knowledge and sufficient statistics are delivered instead of raw data, while a high quality of analytics is obtained. Novel model selection algorithms are introduced that fuse the inherent diversity of models over distributed edge nodes to support inferential analytics tasks for end-users/analysts and applications in real-time. We provide statistical learning models, establish the corresponding mathematical analyses of our mechanisms, and offer a comprehensive performance and comparative assessment using real data from different domains, showing the benefits of our approach in edge computing.


Introduction
Real-time inferential analytics (Lazerson et al., 2016; Cormode, 2013) support exploratory (hypothesis formulation), diagnostic (why is it happening), predictive (what is likely to happen), and descriptive (what is happening now) data analysis via predictive statistical models, e.g., multivariate linear & quantile regression over live data (Renart et al., 2017). The derived and incrementally updated predictive models, mainly regression and time-series forecasting models, are used in such analyses to support analysts/applications in terms of: (i) real-time prediction over new/unseen data (regression), (ii) investigation of how observed data fit such models (function estimation), and (iii) forecasting of future trends of incoming data (Renart et al., 2017).
Real-time inferential analytics are materialized after contextual data are transferred from sensing devices and data sources to the Cloud, aiming to build global on-line models over all observed data (Konečny et al., 2016). Then, analysts/applications issue arbitrary regression & exploratory queries over such models for real-time data exploration, on-line prediction, and adaptive knowledge extraction (Jain and Tata, 2017; Ferreira and Ruano, 2009). This refers to query-driven predictive analytics (Anagnostopoulos and Triantafillou, 2017a, 2017b), which has been adopted in large-scale distributed computing systems.
However, major challenges arise in adopting this baseline approach to support query-driven & real-time inferential analytics. Firstly, massive raw data transfer is needed for building and updating such central models. Since this is prohibitive for the Internet of Things, i.e., energy-/bandwidth-constrained environments with constraints like limited network bandwidth, computational power, latency, and energy, Edge Computing (EC) comes into play (Garcia Lopez et al., 2015; Anand et al., 2017; Giang et al., 2015). This paradigm can be adopted to cope with the challenge by pushing as much intelligent computing logic for inferential analytics as possible close to computing & sensing Edge Devices (EDs) and/or Edge Gateways (EGs) (Renart et al., 2017; Wei et al., 2017), that is, to the network edge, as illustrated in Fig. 1. It is desirable then for the EDs to deliver only data summaries, e.g., sufficient statistics & regression model coefficients, to the Cloud for query-driven inferential analytics. Moreover, current inferential analytics methodologies like statistical summaries (synopses, multidimensional histograms, topographical maps, data digest methods), data-driven sampling schemes, and query-driven methods (Savva et al., 2020; Savva et al., 2019) use global models built over all transmitted data, as elaborated in the related work section. This limits the perspective of local data/knowledge diversity experienced on each ED or EG, which reflects the local context awareness of the ED's or EG's surroundings. Disregarding the inherent diversity of local models due to the (geo-)distribution of EDs and EGs degrades the locality of the sensed/captured contextual information, thus eliminating the specificity of locally inferred knowledge, which is of high importance for model selection and quality of inferential analytics, as will be evidenced in this paper. Even if we desired to exploit such diversity by building, e.g., different models, data would first have to be transmitted to the Cloud and then processed and maintained centrally.

Motivations & goals
We envisage an edge-centric inferential analytics paradigm, where cliques of EDs and EGs are employed as first-class analytics platforms (Sharma and Wang, 2017). Our motivation is based on establishing a methodology that supports inferential analytics materialized at the network edge, including, e.g., physical sensors (sensing contextual information), mobile EDs, unmanned vehicles for participatory sensing, and Edge Gateways (EGs) interacting with EDs; see Fig. 1.
Moving real-time data from EDs and EGs to remote data centers incurs network latency, which is undesirable for interactive, real-time data exploration and inferential analytics applications; e.g., urban surveillance applications generate humongous volumes of data (speed cameras, environmental time-series, earthquake monitoring) that are bandwidth-prohibitive to move completely to the Cloud in real-time (Sharma and Wang, 2017). Moreover, network connectivity can be intermittent, causing loss of functionality if Cloud connectivity is lost.
The Cloud should not be the panacea of the inferential analytics paradigm shift. We advocate edge-centric inference and data analysis by pushing the analytics frontiers from centralized nodes to the network periphery, fostering at the same time the inherent model diversity at the network edge. The pushed inferential intelligence is distributed among EDs and EGs in order to (i) serve regression & exploratory queries issued by analysts/applications and (ii) advocate model/knowledge diversity in a distributed computing ecosystem, which is adopted for appropriate model selection. This triggers the idea that EDs, or a clique of co-operating EDs, locally build on-line predictive models, which are maintained and selectively delivered to the EGs for efficient model selection and sophisticated aggregation, instead of sending raw data from EDs to EGs and/or to the Cloud. Based on this diversified knowledge-only communication between EDs and EGs, we desire to obtain the same quality of analytics, e.g., prediction accuracy and model/curve fitting, as the centralized approach while being communication-efficient.
We stress that our edge-centric approach retains the core advantages of using the Cloud as a support infrastructure but moves the inferential analytics processing back to the edge, given that the computing capacity of EDs and EGs keeps increasing (Sharma and Wang, 2017). Our approach establishes two fundamental task flows (and their sub-tasks) to support edge-centric analytics:

• Task Flow 1: Knowledge Transfer from EDs to EGs, comprising (1) incremental local model building at the EDs, (2) communication-efficient model updating that reacts timely to incoming information, thus preventing concentration of raw data at central locations, and (3) respecting the privacy of sensitive information generated/gathered at EDs.
• Task Flow 2: Query-driven Analytics Provision at EGs, which are equipped with novel model selection strategies over diverse models to determine, per query issued by analysts/applications during data analyses, the most appropriate models received from the EDs to engage.
Task Flow 1 refers to data thinning: determining which statistically sufficient knowledge to transfer from the EDs to the EGs and when to update such pieces of knowledge in a dynamic environment. Task Flow 2 refers to the provisioning of inferential query-driven analytics at the edge over streams of queries, deciding on the most appropriate collection of models cached at the EG per regression query. Both task flows converge at the EGs, materializing edge-centric analytics.
The predictive analytics and inferential challenges at the edge infrastructure in real setups are associated with the capability of the edge computing environment to extract the most relevant data for model training and inference without significant delays, so as not to break the analysts-EGs interactivity constraint, which in real setups is set around 500 ms (Liu and Heer, 2014). This constraint implies that any answer returned over that limit can have negative effects on analysts' experience, productivity, and decision making. Concretely, analysts engage in exploratory analysis (Idreos et al., 2015) to better understand the data. Such analysis is an invariable step in the process of further constructing hypotheses or constructing and training inferential models to answer business questions. As an indicative real-life scenario, we consider predictive queries received over EGs regarding crime-index indicators in regions of the city of Chicago. A workload for this dataset consists of range-queries with aggregation functions over spatial coordinates. Local models are trained to forecast crime indicators over city regions, while dedicated EGs aggregate these models to provide a holistic insight into the crime trend in larger areas of the city (or even the whole city). The analysts-EGs interaction to obtain these trends is achieved by query workloads over EGs, which guide the inferential analytics process, ranging from which ED to gather which data from (Task Flow 1) to which local models to aggregate to secure timely predictive analytics that 'follow' the analysts' exploratory methodology (Task Flow 2). Within this spectrum, the challenges elicited from the data-management perspective include (among others): missing-value imputation; fast and efficient update and adaptation of trained models; efficient feature selection and data dimensionality reduction; identification of relevant data; and prediction of the induced query workload over the EGs.
Our proposed scheme addresses this spectrum of challenges through the two task flows.

Challenges & desiderata
Multidimensional contextual data have special features such as bursty nature and statistical transiency, i.e., values expire in a short time while statistical dependencies among attributes change over time (Kaneda and Mineno, 2016; Cormode, 2013; Lazerson et al., 2016). Hence, the challenges for edge-centric inferential analytics are: (i) local model learning on EDs, requiring real-time model updating and selective model forwarding to EGs to minimize communication overhead (Task Flow 1 challenges); (ii) selection of the best diverse models at EGs per regression query; and (iii) model caching techniques that achieve analytics quality/accuracy as high as the centralized approach (Task Flow 2 challenges).
The desiderata of our approach are: (1) a model update mechanism from EDs to EGs with model caching at EGs, proven to significantly reduce the communication overhead, as only the model's parameters and sufficient statistics are disseminated instead of raw data. This meets the desired latency and energy efficiency, and reduces the closed-loop latency for analyzing contextual data in real-time.
(2) A model selection mechanism at EGs that fuses diverse local models per query w.r.t. sufficient statistics coming from the EDs, thus retaining the locality of knowledge on each ED without transferring and processing data at EGs.

Rationale & problem fundamentals
Consider inferential analytics, e.g., (Ferreira and Ruano, 2009; Wang and Li, 2016; Kaneda and Mineno, 2016), in a (d + 1)-dimensional data space (x, y) ∈ ℝ^{d+1}, where analysts/applications seek to learn the dependency between an input x ∈ ℝ^d and an output y ∈ ℝ. The input x = [x_1, x_2]^⊤ can refer, e.g., to the attributes temperature x_1 and CO_2 emission x_2, while y is humidity; or x = [x_{t−1}, x_{t−2}, x_{t−3}]^⊤ ∈ ℝ^3 can refer to the SO_2 concentration at the previous time instances and y = x_t to its current value.
Let us consider regression/exploratory queries issued by analysts/applications over real-time contextual data. Such a query can be represented via a point q ∈ 𝒳 ⊂ ℝ^d, such that we locally desire to explore the behavior of f(x) around q and are provided with the prediction ŷ = f(q) and prediction error e(q) = y − f(q); e.g., predict humidity y given q = [q_1, q_2]^⊤ with temperature q_1 and CO_2 q_2. Moreover, an exploratory query in terms of forecasting or diagnosis is also represented via a point q ∈ ℝ^d, where we investigate the past/future values y_t = f(q) of a bunch of time series for the recent/future time horizon of d (the embedding dimension); for instance, forecast the sulphur dioxide SO_2 in the next hour in a specific city area given its current values and the recent CO values, obtaining a forecasting error e(q). Furthermore, let a bunch of query points {q_l}_{l=1}^{L} lie in the input data space 𝒳 ⊂ ℝ^d. Analysts/applications are interested in obtaining an estimation of the underlying function f (function estimation task) such that it best fits the input-output space 𝒳 × 𝒴 for those actual (observed) input data points x closest to the query points q_l, i.e., f fits (explains) the observed pairs (x, y). For instance, estimate the correlation function of temperature and humidity in a specific area on the sea surface, as recorded by Unmanned Surface Vehicles (USVs) in the last 30 min, obtaining a model fitting deviation, e.g., goodness of fit, constructed from the individual errors e(q_l).
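To make the query representation concrete, the following sketch uses a hypothetical linear model and illustrative values (none of which come from the paper) to show a regression query q, its prediction ŷ = f(q), and the prediction error e(q):

```python
import numpy as np

# Hypothetical local linear model f(x) = b^T x + b0 over d = 2 inputs,
# e.g., temperature x_1 and CO2 emission x_2, predicting humidity y.
b, b0 = np.array([0.5, -0.2]), 1.0

def f(x):
    return b @ x + b0

# A regression query is a point q in the input space; the analyst asks
# for the prediction y_hat = f(q).
q = np.array([21.0, 3.4])   # temperature q_1, CO2 q_2 (illustrative)
y_hat = f(q)

# If the true outcome y is later observed, the prediction error is
# e(q) = y - f(q), the quantity the rest of the methodology reasons about.
y = 11.0
e_q = y - f(q)
```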
Such edge-centric inferential analytics learns the unknown function f on-line. However, due to the diverse nature/contextual surroundings of each ED, e.g., environmental urban monitoring sensors in a smart city experiencing different and/or overlapping data ranges of temperature, CO_2 emission, UV radiation, and humidity in different city regions and sea surfaces (Kaneda and Mineno, 2016), a global model f_G fitting all data and interpreting all statistical and/or spatiotemporal dependencies among attributes cannot capture the very specific characteristics of the data subspaces in each ED i. This raises the necessity of estimating local predictive models f_i per ED i, representing their specific local data {(x, y)_i} and knowledge, as will be discussed later. (In Section 8, we experiment with inferential analytics over data captured by USVs funded by the EU/GNFUV project.)
We ought to efficiently and effectively combine such naturally diverse local models f_i, built over different data, at an EG, so that the EG is able to interpret the diverse statistical dependencies and provide accurate predictions to queries in real-time. The rationale behind the intelligence on the EDs is that they sophisticatedly decide when to deliver their local models f_i to the EG, where the EG caches these models, denoted f_i^o, to provide real-time analytics. Evidently, the cached model update mechanism is an imperative task of the ED's intelligence, trading off analytics quality at the EG for communication overhead, especially in dynamic data spaces.
The EG then supports inferential analytics given an ensemble ℱ = {f_i^o}_{i=1}^n of cached local models, introducing a sophisticated model selection over ℱ. The final fused model should perform as accurately as if one were told beforehand which local model(s) from ℱ were the best for a specific query and which was the best global model f_G over all the collected data from all EDs. Obviously, given a query, the best possible subset of local models to be engaged for prediction/forecasting/function estimation cannot be known in advance on the EG, and its estimation will be proved to be NP-hard later. Moreover, due to the above-mentioned constraints, we cannot build the global f_G on the EG or even at the Cloud over all the data; the EDs do not transfer raw data, for efficiency. As a (naive) alternative, model selection can simply average all local models: f_AVG(x) = (1/n) Σ_{i=1}^n f_i^o(x). However, as will be analyzed in Section 7 and shown in our experiments in Section 8, f_AVG induces unnecessarily large variability in prediction, resulting in significantly degraded quality of analytics.
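As a sketch of the naive alternative above, f_AVG simply averages the cached local models, ignoring how well each one fits the queried region; the toy models below are illustrative, not the paper's:

```python
# Three cached local models f_i^o at an EG (illustrative toy functions).
cached_models = [
    lambda x: 2.0 * x + 1.0,   # f_1^o
    lambda x: 0.5 * x - 3.0,   # f_2^o
    lambda x: -1.0 * x + 4.0,  # f_3^o
]

def f_avg(x):
    """Naive fusion: uniform average of all cached models, regardless of
    which ones are actually relevant to the query point x."""
    return sum(f(x) for f in cached_models) / len(cached_models)
```

Because every model votes with equal weight, a model trained on a very different data subspace drags the fused prediction away from the locally best answer, which is the variability problem analyzed in Section 7.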

Problem formulation
Let us focus firstly on Task Flow 1 of Section 1.1. The rationale behind the intelligence of each ED i is to locally and incrementally build a model f_i and then to efficiently decide, based on sufficient statistics, when to update its connected EG with the up-to-date model. Our first challenge is to provide a time-optimized and communication-efficient mechanism for statistics-dependent model updates. Consider now Task Flow 2 of Section 1.1. The rationale behind the EG intelligence is to selectively engage some of the cached local models received from its connected EDs given a query, by appropriate weighting rather than averaging, while global f_G modeling over all data at the EG is not feasible; no data are transferred from the EDs to the EG. Given a query q, our second challenge is to predict the most appropriate subset of local models ℱ′ ⊆ ℱ to engage at the EG, being as accurate compared to f_G as possible, given (i) communication constraints, (ii) cached model replacement, and (iii) no knowledge of the distribution of the queries over the input data {𝒳_i}_{i=1}^n. We desire to predict the most appropriate ℱ′ per query that achieves almost the same or better accuracy than f_G and f_AVG without having to send all data from the EDs to the EG. Our third challenge is to establish the sufficient statistics that continuously represent knowledge of the performance of the local models, which will be used by the ED's model update mechanism (Problem 1) and the EG's model selection mechanism (Problem 2). Specifically, each ED i seeks to derive sufficient knowledge of the underlying data and the associated trained local model f_i in order to optimally assess when the model should be updated. Such derived sufficient knowledge should also be exploited by the EG in order to judge whether the up-to-date cached model f_i^o is to be involved in the local model subset ℱ′ given a random predictive analytics query.

Problem 1. Given a local model f_i at ED i and its cached counterpart f_i^o at the EG, determine when to update the EG with the up-to-date f_i such that communication overhead is minimized while the quality of analytics at the EG is preserved.

Problem 2. Given an ensemble ℱ = {f_i^o}_{i=1}^n of cached local models at the EG and a random query q, determine the most appropriate subset ℱ′ ⊆ ℱ to engage for prediction such that the achieved accuracy is comparable to, or better than, that of the global model f_G and the average model f_AVG.
Problem 3. Given a local model f_i at ED i, define the sufficient statistics and derived knowledge of the underlying data and the associated local model performance, which will be used by (i) the efficient model update mechanism (Problem 1) and (ii) the model selection mechanism at the EG (Problem 2).
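To illustrate why the exact selection of ℱ′ in Problem 2 is intractable, the following sketch enumerates all 2^n − 1 non-empty model subsets by brute force; the toy models and the averaging-based fusion are assumptions for illustration only, since the paper instead proposes efficient selection schemes:

```python
from itertools import combinations

def best_subset(models, q, y_true):
    """Brute-force oracle for the best subset F' of local models for a query q,
    assuming the subset is fused by simple averaging. Enumerating all
    2^n - 1 non-empty subsets is exactly what becomes infeasible as the
    number of cached models n grows."""
    best, best_err = None, float("inf")
    for r in range(1, len(models) + 1):
        for subset in combinations(models, r):
            pred = sum(f(q) for f in subset) / len(subset)
            err = abs(y_true - pred)
            if err < best_err:
                best, best_err = subset, err
    return best, best_err
```

In practice the EG never knows y_true in advance, which is why the paper resorts to sufficient statistics (Problem 3) to predict which cached models are worth engaging.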

Related work
In centralized approaches (Kaneda and Mineno, 2016; Bottou and Bousquet, 2007), all collected data are transferred centrally for analysis; thus, centralized predictive modeling and maintenance suffer from the heavy burden of massive data transfer and expensive fusion centers, including significant delay in providing real-time inferential analytics (Bilal et al., 2018). In some cases, network nodes might not be willing or allowed to share their original data due to privacy issues (Bilal et al., 2018). Our approach pushes analytics to the edge, coping with such constraints via communication-efficient methods for model updates and achieving high quality of analytics via diversity-oriented model selection.
Distributed approaches for predictive analytics (Wang and Li, 2016; Gabel et al., 2015; Lazerson et al., 2016) focus explicitly on the distributed estimation of a specific global model's parameters over nodes, where the goal is to achieve the same prediction performance as the corresponding centralized model, given that gathering all data centrally is expensive and/or impossible. Distributed predictive modeling (1) does not exploit data subspace locality and local model diversity (the key components in ensemble-based inferential analytics, as evidenced in our experiments), (2) focuses on training a pre-defined global model, on which all involved nodes have to agree in advance, thus allowing no heterogeneity in predictive models across EDs, and (3) requires extra techniques for parameter updates and synchronization protocols, especially in real-time inferential analytics. Such an approach forces nodes to adopt the same predictive algorithm, which is not required in our approach: we provide the flexibility of employing different predictive models on the EDs, since our approach relies on the prediction performance of local models independently of the predictive algorithms adopted on the EDs. Moreover, semi-distributed approaches like federated learning (Konečny et al., 2016) involve building a global model centrally, which is incrementally updated through partially built local models on the EDs. In this context, all nodes focus on maintaining such a model via regular updates, and the model is in turn disseminated back to the EDs. Apart from the inherent limitation of focusing on a commonly agreed unique model and the lack of communication-efficient mechanisms for model updates, such an approach disregards data locality and model diversity at the EDs. In addition, it does not support model selection apart from naive model averaging, whose poor quality of analytics (Mendes-Moreira et al., 2012) is shown in Section 8.
Such limitations are not present in our methodology as evidenced in our comparative assessment.
Recently, approaches for pushing analytics to the edge have been proposed (Kamath et al., 2016), either reduced to distributed predictive modeling (Wang and Li, 2016) (whose limitations are discussed above) or to selective data forwarding (Raza et al., 2015). Specifically, Harth and Anagnostopoulos (2017) deals with time-optimized data forwarding among EDs and EGs in light of maximizing the quality of inferential analytics. Such an approach reduces data communication; however, data processing and model training are still performed on EGs. This requires careful data transfer to control model maintenance & adaptation. Our work further pushes model building and sophisticated model update and maintenance to the network periphery (EDs), thus avoiding data transfer completely (coping also with data privacy), while only parameters & sufficient statistics are conditionally disseminated for model adaptation and selection via time-optimized, communication-efficient mechanisms. The methods in Raza et al. (2015) and related approaches deal with data suppression based on local forecasting models on sensors in light of re-constructing data at the sink. However, they do not focus on inferential analytics and learning statistical dependencies at the EDs (sensors), but only on reducing data communication via data suppression using forecasting models. These methods selectively disseminate data and univariate re-construction models used at the sink; thus, actual predictive modeling is achieved at the EG/sink with no guarantee on the analytics quality/prediction performance. Moreover, such predictive modeling does not scale, since the EG lacks model selection and caching mechanisms for selecting and maintaining the best models per query, other than simple model averaging, whose limitations were discussed above, in Brown et al. (2005), and shown in Section 8.
This work, as the first edge-centric, real-time, and communication-efficient inferential analytics methodology, significantly extends our previous work in Harth and Anagnostopoulos (2018). The fundamental extensions are as follows. In Task Flow 1: (1) we depart from the instantaneous model update mechanism in Harth and Anagnostopoulos (2018) by introducing an error-tolerant model update mechanism based on the theory of optimal stopping (Shiryaev, 2008), which achieves significantly higher communication efficiency and higher quality of inferential analytics compared with Harth and Anagnostopoulos (2018); (2) we establish the optimality achieved by our mechanism via mathematical analyses; (3) we introduce a mechanism that provides immediate feedback to the novelty/familiarity technique at the EDs in light of reducing misjudgments in local model adaptation. In Task Flow 2: (4) we establish a diverse model selection theory at the EG per query and prove that such selection is NP-hard; (5) we introduce computationally efficient, error-aware model selection schemes; (6) we analyze the expected communication overhead of Harth and Anagnostopoulos (2018) and of this work; and (7) we provide analytical upper bounds on sufficient statistics updates.
It is worth mentioning that the proposed synergies among EDs and EGs trigger the introduction of incentives in EC. Incentivisation mechanisms in such environments, introduced in Liu et al. (n.d.), could encourage EDs to participate in, e.g., knowledge sharing and crowd-sensing (Liu et al., n.d.), trading off privacy, edge analytics provision, and predictive analytics quality. Moreover, our scheme could incorporate incentive mechanisms for content sharing, e.g., video ads, in opportunistic device-to-device networks as introduced in Liu et al. (2018), which integrates users' mobility with crowd-sourcing. Finally, synergies among networks of EDs and EGs, e.g., femtocell and macrocell edge nodes in mobile computing environments, require optimized strategies to allocate transmission powers efficiently and share network resources, as proposed in Liu et al. (2019b); this will guarantee optimized network deployment and resilient provision of inferential analytics.

Contribution
Our contribution comprises:
• An optimal, communication-efficiency-aware scheme based on the theory of optimal stopping for model updates, solving Problem 1;
• Model selection algorithms at EGs, solving Problem 2;
• A novel input-error hetero-associative statistical learning algorithm, with its convergence analysis, extracting sufficient statistics and solving Problem 3;
• Stochastic algorithms that establish the optimality of our schemes w.r.t. communication efficiency and analytics quality;
• A feedback-based mechanism for tuning model adaptation at the EDs;
• Analysis of the expected communication overhead and estimation of the upper bound of the expected statistics;
• A comprehensive comparative assessment against the global/baseline method, model averaging (Konečny et al., 2016; Harth and Anagnostopoulos, 2018), and (Raza et al., 2015), using three real datasets derived from static and mobile EDs/computing/sensing devices and unmanned vehicles. We experiment with well-known regression and time-series forecasting models over various data dimensionalities.

Predictive familiarity & model update: overview
In this section, we provide a bird's eye view of our approach to edge-centric inferential analytics, introducing the model update mechanism at the EDs (Task Flow 1) and the model selection mechanism at the EGs (Task Flow 2). The ED i in Fig. 1 locally learns a parametric model f_i(x; b_i) based on the recent local data in a sliding window 𝒲_i. The ED i is responsible for updating the EG when there is a significant discrepancy between the prediction performance of the local f_i and the cached f_i^o at the EG. The ED i keeps a copy of f_i^o locally to drive its decision making, discussed later, and sends the parameters b_i and some sufficient statistics, only if deemed necessary, to its EG. This decision has to be taken in real-time by sequentially observing input-output pairs and the current prediction discrepancy between the local and cached models.
Consider a discrete time domain t ∈ 𝕋 = {1, 2, …}. The ED i at time t captures the t-th input-output pair (x, y)_t and, in real-time:

• Case A: Decides whether the pair (x, y)_t significantly changes the prediction performance of the current local f_i or not. In the former case (A.I), the ED i appends (x, y)_t to the window 𝒲_i, discarding the oldest pair, and incrementally adjusts or partially re-trains f_i based on the updated 𝒲_i. Otherwise (A.II), f_i is neither adjusted nor re-trained given (x, y)_t.

• Case B: Decides whether the updated local f_i (from Case A.I) should be sent to the EG or not. In the former case (B.I), ED i updates the EG with the up-to-date f_i, provided that a significant prediction performance discrepancy is observed compared with the cached f_i^o. Otherwise (B.II), no model update and no delivery is performed between ED i and the EG.
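The per-pair logic of Cases A and B can be sketched as follows; the trivial familiarity test, the placeholder "model" (a running mean over the window), and the fixed-threshold discrepancy check are simplifications standing in for the paper's mechanisms, not the actual ones:

```python
class EdgeDevice:
    """Minimal sketch of an ED's per-pair decision logic (Cases A and B)."""

    def __init__(self, window, tol=0.5):
        self.window = list(window)   # sliding window of (x, y) pairs
        self.tol = tol               # placeholder familiarity/discrepancy tolerance
        self.cached = self._fit()    # local copy of the EG-cached model f_i^o
        self.updates = 0             # number of model deliveries to the EG

    def _fit(self):
        # Placeholder "model": the mean of the windowed outputs (ignores x).
        return sum(y for _, y in self.window) / len(self.window)

    def on_new_pair(self, x, y):
        local = self._fit()
        if abs(y - local) < self.tol:             # Case A.II: familiar pair,
            return                                 # no adaptation, no messages
        self.window = self.window[1:] + [(x, y)]   # Case A.I: slide window, adapt f_i
        local = self._fit()
        if abs(local - self.cached) > self.tol:    # Case B.I: significant discrepancy
            self.cached = local                    # ship the new model to the EG
            self.updates += 1
        # Case B.II: otherwise the EG keeps its cached model as-is
```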
In Case A, the ED i should be able to instantaneously determine whether the new pair is drawn from the input-output subspace (𝒳_i, 𝒴_i) defined by the pairs in 𝒲_i or not. In the former case, the new pair interpolates within the current input-output data subspace and is thus considered familiar. This familiarity indicates that the current model f_i already captures the pair's input-output dependency. In this case, ED i does not need to adapt or re-train the current model f_i given that the t-th pair is familiar (Case A.II); thus, no communication with the EG is needed.
If the t-th pair is considered unfamiliar, or a novelty w.r.t. the current input-output subspace, it renders a re-training or adaptation of the current model f_i (Case A.I), depending on the regression model f_i. For instance, f_i is adapted to new pairs using recursive least squares & incremental support vectors (Kaneda and Mineno, 2016; Engel et al., 2004) or incremental/gradient Radial Basis Functions (Schwenker et al., 2001), or re-training over 𝒲_i is required; see Appendix B. In general, a new local model f_i is derived after adaptation or re-training, thus leading ED i to examine: (1) the instantaneous model performance discrepancy between the new f_i and the cached model f_i^o (Case B), and (2) the past behavior of such discrepancy, to obtain a holistic insight on whether to update the EG with the updated local model f_i or not. We quantify this discrepancy at time t as the absolute difference of the prediction errors of f_i and f_i^o:

z_t = |e_i(x_t) − e_i^o(x_t)| = |f_i^o(x_t) − f_i(x_t)|.    (1)

Based on the current discrepancy z_t and its evolution {z_1, z_2, …, z_t} since the last model update, the ED i decides on updating the EG with the new model f_i and locally updating the cached model f_i^o. Otherwise, there is no need for ED i to update the EG, since the cached and new models behave similarly in terms of prediction performance, as expressed by this discrepancy. We enforce the ED and the EG to both have inferential and predictive models that behave the same in terms of prediction performance for the same input.
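A minimal sketch of the discrepancy z_t and a simplified update trigger follows; the cumulative-threshold rule here is only a crude surrogate for the paper's optimal-stopping mechanism, and the threshold value is illustrative:

```python
def discrepancy(y, f_local, f_cached, x):
    """z_t: absolute difference between the prediction errors of the
    local model f_i and the cached model f_i^o on the pair (x, y).
    Note this equals |f_cached(x) - f_local(x)|, since y cancels out."""
    return abs((y - f_local(x)) - (y - f_cached(x)))

def should_update(z_history, threshold=1.0):
    """Surrogate stopping rule: trigger an EG update once the accumulated
    discrepancy since the last update crosses a threshold. The actual rule
    in the paper is derived via optimal stopping theory."""
    return sum(z_history) >= threshold
```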
Remark 1. Our approach is generic in predictive models. Our algorithms extract knowledge only from the input space and the prediction error, being independent of the nature of the predictive algorithms/models/parameters on the EDs and of their statistical expressiveness; which models to adopt for inferential analytics is an application-specific decision of the data analysts. This supports the flexibility and heterogeneity of edge-centric inferential modeling and analytics.
Focusing on Task Flow 1, the ED proceeds by assessing the familiarity of incoming input-output pairs (Section 4.2) and deciding on model updates (Section 5).

Familiarity inference & local sufficient statistics
The first challenge is to define an on-line method for assessing the novelty of a new pair, i.e., implementing the decision in Case A. Based on the outcome of Case A, the ED_i might trigger a model update to the EG. The novelty of an incoming pair (x, y) might trigger both a local model adaptation and a cached model update. In order to assess the novelty of a pair in terms of prediction accuracy, our idea is to associate the input vector space x ∈ 𝒳 with the prediction error space e_i(x) ∈ ℝ w.r.t. model f_i, thus being capable of approximating the expected prediction error given an unseen input x. We learn this association by jointly quantizing the input-error space, generating input and error representatives of the 𝒳 × ℝ space. Specifically, the ED_i incrementally learns the k-th input subspace and simultaneously associates the model prediction error with that subspace. To achieve this (hetero) association, we on-line quantize the input space into K unknown subspaces, each one represented by an input prototype w_k ∈ ℝ^d, k ∈ [K], and then associate the prediction error e(x) = y − f_i(x) over an input x lying around prototype w_k with an error prototype u_k ∈ ℝ. That is, a new input x is first mapped to its closest w_k, and then the corresponding error prototype u_k is retrieved as the expected prediction error. In our running example, the ED locally learns the autoregressive-recursive least squares (AR-RLS) model; see Section 8.1. Fig. 2 (right) shows the input-error space 𝒳 × ℝ where, for a specific area around the input prototypes w_k, we estimate the associated error plane u_k. The outcome of this associative quantization is that we estimate the expected prediction error 𝔼_x[e(x)] of any model given an unseen input x. The error prototype u_k refers to the conditional expectation 𝔼_x[e(x) | w_k = arg min_{l∈[K]} ‖x − w_l‖] of the model error around the input subspace represented by w_k, which is the closest prototype to input x. As shown in Fig. 2 (right), there are subspaces of 𝒳 where the model prediction performance is relatively satisfactory and others where the expected error is relatively high. Such statistical information not only drives the input familiarity/novelty inference at the ED but also provides the basis for model selection at the EG.
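The associative lookup described above can be sketched as follows; `expected_error`, `W`, and `U` are hypothetical names for illustration, not the paper's implementation:

```python
import math

def expected_error(x, W, U):
    # Map an unseen input x to its closest input prototype w_k and return
    # the associated error prototype u_k, i.e., the (estimated) conditional
    # expected prediction error around that input subspace.
    k = min(range(len(W)), key=lambda j: math.dist(x, W[j]))
    return U[k]

# Toy prototypes: the model is accurate near the origin, poor far away.
W = [[0.0, 0.0], [5.0, 5.0]]
U = [0.1, 2.3]
```

A query near the origin returns the low expected error 0.1, while one near (5, 5) returns 2.3, mirroring the "good" and "bad" subspaces of Fig. 2 (right).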

Hetero-associative input-error learning
We associate the local performance of f_i in the input subspace, represented by w_k ∈ ℝ^d, with the local prediction error, represented by u_k ∈ ℝ. We propose a novel methodology for incremental (hetero) associative input-error space quantization at ED_i with an unknown number of prototypes K. The joint objective function in our case minimizes the combination of (i) the conditional Expected Quantization Error (EQE) in the input space, used for learning the best input prototypes representing novelty in the input space, and (ii) the conditional Expected Prediction Error (EPE), used for learning the best error prototypes capturing local model performance. The condition is based on the closest input prototype, i.e., we optimize the input/error prototypes, which are hereinafter referred to as sufficient statistics. In (3), the condition is the prediction error, and α ∈ [0, 1] is a regularization factor weighting the importance of the input-error space quantization. Notably, α = 1 refers to the known EQE (Shen and Hasegawa, 2006), while α → 0 indicates pure prediction-error based quantization. The expectation in (3) is taken over input-error pairs (x, e(x)) ∈ ℝ^d × ℝ, and the multiplying fractions 1/(d+1) and d/(d+1) are present for normalization.

Remark 2. The input prototypes {w_k} in (2) do not solely refer to the optimal representatives of the input space as, e.g., they could have been derived from K-means (Shen and Hasegawa, 2006), Self-Organizing Maps (Kohonen et al., 2001) or Adaptive Resonance Theory (Carpenter and Grossberg, 1988). Instead, based on (3), the position of the input prototypes in 𝒳 is the one that minimizes both the quantization error and the prediction error. We expect a high density of input prototypes in the input subspaces where the model prediction performance is poor compared to other subspaces where the model behaves more accurately. This is reflected by the joint optimization objective, which drags the input prototypes to areas where the model accuracy varies significantly.

Obviously, the number of prototypes K is not known a priori, and the ED_i incrementally decides when to add a new input-error prototype based on the input novelty and the model performance. Hence, we propose an evolving algorithm that minimizes (3), starting with one (K = 1) input/error prototype pair (w_1, u_1) corresponding to the first input x_1 and prediction error u_1 = f_i(x_1) − y_1 given the first pair (x_1, y_1). Then, current prototypes are conditionally adapted and new ones conditionally created w.r.t. incoming pairs, materializing the concepts of familiarity and novelty, respectively. Specifically, based on a familiarity threshold ρ_I between the new input x and its closest prototype w_k, and a dynamically changing error tolerance ρ_O for the current error y − f_i(x), the pair (x, y) is classified as novel or not w.r.t. the so-far observed pairs. If the new pair is considered familiar w.r.t. the recent history, the closest input prototype and the corresponding error prototype are adapted to the familiar pair. However, if the current prediction error over the closest input subspace is not tolerated, i.e., greater than ρ_O, then this tolerance ρ_O decreases, denoting less tolerance in the error space for future inputs. On the other hand, if input x is relatively far from its closest w_k w.r.t. ρ_I, then a new input-error prototype is created. If the current prediction error is not tolerated, i.e., greater than ρ_O, then this pair is considered novel, which immediately triggers model re-learning/adaptation. Otherwise, this pair is familiar, since the current error is tolerated, thus avoiding model adaptation/re-training. Nonetheless, ρ_O decreases, denoting less tolerance in the error space for future novel inputs.
Familiarity threshold ρ_I represents a threshold of similarity between input x and prototype w_k, thus guiding us in determining when a new input-error prototype pair should be formed. Combined with the prediction error tolerance, the methodology then decides on a novel or familiar input w.r.t. the prediction performance of the local model. Moreover, the gradual decrease of ρ_O, upon deciding the familiarity/novelty of an input-output pair or a model adaptation, signals a decrease in prediction error tolerance, enforcing model adaptation with higher probability for future pair observations. This prevents familiarity decisions from monopolizing the process, thus urging model adaptations, and possible updates, that maintain a high quality of inferential analytics at the EGs.
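A minimal sketch of this decision logic, with ρ_I and ρ_O as reconstructed threshold symbols and a hypothetical `classify_pair` helper (not the paper's Algorithm 1, which also adapts prototypes and thresholds):

```python
import math

def classify_pair(x, y, f, W, U, rho_I, rho_O):
    # Distance of x to its closest input prototype decides whether a new
    # input-error prototype is spawned; the prediction error against the
    # tolerance rho_O decides familiarity vs. novelty.
    k = min(range(len(W)), key=lambda j: math.dist(x, W[j]))
    err = abs(y - f(x))
    if math.dist(x, W[k]) <= rho_I:      # input lies in a known subspace
        return ("familiar" if err <= rho_O else "novel", k)
    W.append(list(x)); U.append(err)     # input far from every prototype
    return ("familiar" if err <= rho_O else "novel", len(W) - 1)

f = lambda x: x[0]                       # toy local model
W, U = [[0.0]], [0.0]
```

With this toy model, a well-predicted nearby pair is classified familiar, while a distant, badly predicted pair spawns a new prototype and is flagged novel.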

Familiarity & novelty inference algorithm
The evolving Algorithm 1 minimizes the objective (3) by incrementally adapting the input and error prototypes as stated in Theorem 1. Note that w_k and u_k converge to the centroid (mean vector) of the inputs x and to the median of the absolute prediction error in the k-th input-error subspace, respectively, as stated in Theorem 2. These (converged) prototypes are the sufficient statistics 𝒮_i (Problem 3), which will be exploited by the EG for determining the most appropriate diverse models given a query.
7: else
8:   prototypes adaptation in (4); familiarity ← TRUE
9: end if
10: else
11:   novelty (new prototype)

Theorem 1. The prototypes (w_k, u_k) minimize (3) iff, given a pair (x_t, y_t), they are updated as in (4), where η_t ∈ (0, 1) is a learning rate.

Theorem 2. The prototypes (w_k, u_k) ∈ 𝒮_i converge to the centroid of the input vectors and the median of the prediction error, respectively, of the k-th input-error subspace.
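The convergence claims of Theorem 2 can be illustrated with standard stochastic-approximation updates (a sketch under the assumption of a 1/t learning rate; the paper's exact update rules in (4) may differ):

```python
def run(xs, abs_errors):
    # With learning rate eta_t = 1/t, the input prototype w tracks the
    # running mean (centroid) of its subspace, while the sign-driven
    # update drags the error prototype u toward the median of |e|.
    w, u = xs[0], abs_errors[0]
    for t, (x, e) in enumerate(zip(xs, abs_errors), start=1):
        eta = 1.0 / t
        w += eta * (x - w)                                  # -> mean
        u += eta * (1 if e > u else -1 if e < u else 0)     # -> median
    return w, u

xs = [float(i % 7) for i in range(20001)]           # inputs, mean ~ 3.0
errs = [float(i % 5) + 1.0 for i in range(20001)]   # |e| in {1..5}, median 3
w, u = run(xs, errs)
```

After 20,001 pairs, w sits at the empirical mean of the inputs and u within a small oscillation of the error median.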

Proof. See Appendix A.2 □
Algorithm 1 on ED_i (i) optimally quantizes the input-error space by minimizing (3), and (ii) on-line decides whether (x, y) is familiar or not, which is used for triggering model adaptation/re-training and/or a cached model update. Based on the evolution of discrepancy values over time, as will be discussed in Section 5, ED_i efficiently decides on delivering the new model to the EG, updating its cache. The ED_i now has all the available knowledge for its input-error space encoded in 𝒮_i.
Remark 3. The advantages of Algorithm 1 are threefold. Firstly, it incrementally minimizes both the EQE and the EPE based on the predictability performance and the distributions of the input and error spaces concurrently, which aligns with the principles of hetero-associative learning. Secondly, it classifies on-line the familiarity of an input-output pair conditioned on the predictability of the local model, thus acting as a classifier. Thirdly, it conditionally updates the local model to follow the underlying distribution of the input-output space towards convergence (see Appendix C for an upper bound on the model adaptation rate). Hence, Algorithm 1 adopts three behaviours: (i) an incremental input-error vector quantizer conditioned on a tolerated prediction error and a familiarization decision threshold; (ii) an on-line novelty classifier, which controllably expands the sufficient statistics (prototypes); and (iii) an adaptation mechanism, which conditionally updates the local model based on the model's current predictive performance.

Model discrepancy & tolerance
We introduce a time-optimized mechanism for: (i) deciding when to update the cached model f_i^o at the EG with the updated model f_i at the ED_i (and, potentially, the partial changes in 𝒮_i); and (ii) adjusting Algorithm 1 to reduce novelty mis-classifications via positive and negative feedback.
At time instance t, the ED_i captures the pair (x_t, y_t) and obtains the prediction errors e_t = y_t − f_i(x_t) and e_t^o = y_t − f_i^o(x_t) of the local and cached models, respectively, along with the instantaneous discrepancy z_t = |e_t − e_t^o|. Without abuse of notation, we drop the subscript i referring to ED_i in the remainder of this section for the sake of readability. The instantaneous policy of updating the EG whenever z_t exceeds a threshold ϵ does not scale with the number of EDs: we obtain an expected communication of nP(z_t > ϵ) at any time instance t, which cannot be neglected (especially since P(z_t > ϵ) differs across EDs in real-life scenarios). Moreover, the hard decision threshold ϵ over the absolute difference between f and the cached f^o is affected by the inherent variance of both models given a random input x; the error variances σ_e² and σ_{e^o}² add up when considering the error difference |e − e^o|. Hence, an instantaneous excess of z_t over ϵ does not necessarily indicate an actual change in the predictability of the two models in the recent past.
Our idea is to observe the evolution of the discrepancy values {z_t} at multiple time instances between the two models within a (yet unknown) time horizon. Based on this observation, the ED decides on a model update. This means that we avoid instantaneous, sudden model updates, with the major aim of reducing the communication overhead due to possibly highly frequent model and sufficient statistics updates to the EG. Evidently, the length of such a time horizon cannot be determined a priori due to the stochastic nature of z_t. Let Z_t denote the positive random variable of discrepancy, with z_t > 0 a realization at time instance t. We allow the ED to tolerate an accumulated discrepancy over a specific time horizon since the last model update (resetting t = 0), where the expected discrepancy between the local and cached models is bounded by a discrepancy tolerance Θ > ϵ. In this context, the ED is given the opportunity to postpone model updates as long as the accumulated discrepancy since the last update, S_t, is less than Θ. During this time horizon, starting from the last model update, the ED saves communication resources by not reacting to sporadic, sudden excesses of Z_t > ϵ. Evidently, this comes at the expense of a potential difference in prediction accuracy between the model f at the ED and the cached model f^o at the EG. We strictly enforce such expected difference to be bounded, in light of reducing the communication overhead, which might be relatively significant if Z_t stochastically oscillates around ϵ.

Model update based on cumulative discrepancy
To observe the evolution of the discrepancy variables between the two models, we focus on the cumulative sum up to t, i.e., S_t = Σ_{τ=1}^t Z_τ since the last model update, where t = 1 indicates the first landmark time instance right after the last model update. S_t is a random variable formed by the summation of the random variables {Z_τ}_{τ=1}^t up to t. This sum is adopted as an indicator of the evolution of the discrepancy, which is expected to be bounded by Θ, indicating the minimum acceptable quality of inferential analytics based on the cached model at the EG. We desire S_t to be as close to Θ as possible, to avoid updating the model at the EG during this time horizon (thus saving communication resources), but not to exceed this threshold, since then there might be a significant predictability discrepancy between the local and cached models.
Evidently, the size of S_t, i.e., the number of random variables Z_τ it comprises, increases with a stochastic step and depends on the length of the time horizon during which the ED does not communicate with the EG for a model update. The value of S_t is governed by the stochasticity of the discrepancy values z_τ over time. Hence, we do not know, and cannot forecast, when S_t will reach Θ and when S_{t+1} will exceed it, i.e., when S_t ≤ Θ and S_{t+1} > Θ. Our problem then is to determine the time instance t* > 0 since the last update such that the sum S_{t*} is as close to Θ as possible without exceeding it. In this context, we cast this as a time-based stochastic optimization problem (Bruss and Le Cam, 2000) of finding the optimal stopping time t*, which minimizes the expected difference between the accumulated discrepancy and the tolerance without exceeding this boundary. Obviously, the case S_t > Θ is undesirable, since it induces a penalty: we should have stopped earlier, before the accumulated discrepancy exceeded the boundary. In our mechanism, we delay the model update up to the best time instance t*, in hopes of reaching as close to Θ as possible and thus reducing the communication overhead. If we gathered more discrepancy than Θ (S_{t*} > Θ), then we should have stopped before t* to avoid an expected discrepancy between the models greater than Θ.
The benefit of adopting (stochastic) cumulative discrepancy-based decision making, rather than the instantaneous decision making found in the literature (see Section 8.3), is that it enforces the ED to either take a decision on a model update or continue with another observation, thus avoiding redundant model updates. On the other hand, if the application needs highly accurate predictions, a certain (controlled) delay must be tolerated by tuning the boundary Θ. Obviously, we cannot delay forever to avoid any communication between ED and EG: in a dynamic environment, the current model f is bound to change significantly should the underlying joint distribution of the input-output space change (indicated by a relatively high number of 'novelty' pairs), and both models f and f^o have to represent the current (or, better, recent past) state of nature.
We provide the function G_t(S_t) in (6), representing the tolerance return at the ED, which involves the boundary Θ and a penalty when S_t exceeds Θ, thus penalizing a 'delayed' decision for model updates. The ED attempts to maximize the expected return 𝔼[G_t(S_t)] by delaying model updates, thus saving communication resources, while not exceeding the established boundary, to secure a minimum discrepancy between the f and f^o models at the ED and EG, respectively. We now formally state the model update problem, which specifies the generic Problem 1:

Problem 4. Given a boundary Θ > 0 and a sequence of discrepancy variables {Z_t}, find the optimal stopping time t* such that the supremum of the expected return, sup_{1≤t≤∞} 𝔼[G_t(S_t)], is attained. The maximum expected return is 𝔼[G_{t*}].
When the optimal stopping time t* is determined, as we will show later in this section, the ED updates the EG with the model f, and the mechanism starts off a new era, observing new discrepancy behavior and resetting the time landmark to t = 1.

Solution fundamentals
In order to establish the solution fundamentals for Problem 4, we provide preliminaries on Optimal Stopping Theory (Shiryaev, 2008), which help us classify Problem 4 as an optimal stopping problem. In this context, we first need to prove the existence of the optimal stopping time t* for our problem and, then, provide our optimal stopping rule, which is the decision rule for updating the model at the EG. The reader may skip Section 5.3.1 if familiar with the principles of optimal stopping theory.

Optimal stopping theory
The theory of optimal stopping (Bruss and Le Cam, 2000; Shiryaev, 2008) is concerned with the problem of choosing a time instance at which to take a certain action in order to maximize an expected return. A stopping rule problem is associated with: (i) a sequence of random variables Z_1, Z_2, …, and (ii) a sequence of return functions (G_t(z_1, …, z_t))_{1≤t}, which depend only on the observed values z_1, …, z_t of the corresponding random variables. An optimal stopping rule problem is described as follows: we observe the sequence (Z_t)_{1≤t} and at each time instance t we choose either to stop observing or to continue. If we stop at time instance t, we gain a return G_t. We desire to choose a stopping rule that maximizes our expected return.

Definition 1. An optimal stopping rule problem is to find the optimal stopping time t* which maximizes the expected return, i.e., 𝔼[G_{t*}] = sup_{1≤t} 𝔼[G_t]. The available information up to t is the sequence ℱ_t of the values of the random variables Z_1, …, Z_t, a.k.a. the filtration.

Definition 2. The 1-stage look-ahead (1-sla) stopping rule is the stopping criterion t* = min{t ≥ 1 : G_t ≥ 𝔼[G_{t+1} | ℱ_t]}.

In other words, t* calls for stopping at the first time instance t > 0 for which the return G_t for stopping at t is at least as high as the expected return of continuing to the next time instance t + 1 and then stopping.

Definition 3. Let A_t = {G_t ≥ 𝔼[G_{t+1} | ℱ_t]} denote the set on which the 1-sla rule calls for stopping at t. A stopping rule problem is monotone if A_t ⊂ A_{t+1} for all t (a.s.).

A monotone stopping rule problem can be described as follows. The condition A_t ⊂ A_{t+1} means that if the 1-sla rule calls for stopping at t, then it will also call for stopping at t + 1, no matter what Z_{t+1} happens to be. Similarly, A_t ⊂ A_{t+1} ⊂ A_{t+2} ⊂ … means that if the 1-sla rule calls for stopping at t, then it will call for stopping at all future times, no matter what the future observations turn out to be.
Theorem 3. The 1-sla rule is optimal for monotone stopping rule problems.

Optimal stopping rule for model update
Our target is to determine a 1-stage look-ahead optimal stopping rule for model updates by observing a sequence of discrepancy values. Before proceeding with our optimal stopping rule, we need to prove the existence of the optimal stopping time of Problem 4. That is, we check whether the ED, by applying our proposed 1-sla stopping rule, maximizes the expected return, getting as close to the boundary as possible without exceeding it.

Lemma 1. The optimal stopping time t* that maximizes 𝔼[G_t(S_t)] in Problem 4 exists.

Proof. See Appendix A.3 □
Given the existence of the optimal stopping time t* for Problem 4, we provide a solution by defining the optimal stopping rule adopted by the ED. We report a 1-sla rule based on the principle of optimality (Shiryaev, 2008) in Theorem 4, under which the ED stops observing discrepancy values and updates the EG at the first time instance t such that G_t ≥ 𝔼[G_{t+1} | ℱ_t]. That is, any additional observation of a discrepancy value at time t + 1 would not further contribute to maximizing our return in Problem 4. The 1-sla rule is optimal since the underlying stopping problem is monotone, as will be proved in Theorem 4. Based on the principle of optimality for our 1-sla stopping rule, we provide the optimal rule for model updates:

Theorem 4. Given a sequence of discrepancies Z_1, …, Z_t, the optimal stopping rule t* for Problem 4 is given in (8).

The optimal stopping rule in Theorem 4 involves the cumulative sum of discrepancies and the conditional expectation of the tolerance up to t given S_t. This clearly demonstrates the dynamic (optimal) tolerance threshold, departing from the instantaneous one in Harth and Anagnostopoulos (2018), where the ED monitors the stochastic behavior of the discrepancies at every time instance.
The ED decides to update the model at the first time instance at which the cumulative discrepancy up to t is higher than the expected discrepancy up to t multiplied by a factor (1 − F_Z(Θ − S_t))^{−1} > 1. Since the 1-sla rule in Theorem 4 is optimal by Lemma 1, the ED with fixed Θ guarantees that the expected return is as close to Θ as possible, and no other stopping rule can guarantee more. Through Θ, our model update mechanism can flexibly treat and control the expected delay and the expected discrepancy between the local and cached models at the ED and EG, respectively. In the remainder of this section, we demonstrate the optimality of the model update rule for different discrepancy distributions and the practicality of the mechanism.

Optimal model update rule in action
Let us demonstrate the optimality of our rule in Theorem 4 by adopting different probability distribution functions F_Z(z) of the discrepancy Z. Fig. 3 shows the probability density function (PDF) of the discrepancy Z for an ED adopting the RBF and LM models over dataset D2 (Section 8.1). For illustration purposes, we fit the PDF of Z via Weibull and Exponential distribution functions, from which the corresponding F_Z is obtained. Fig. 4 shows the expected return 𝔼[G_{t+1} | ℱ_t] at t + 1 vs. S_t, with S_t ∈ [0, Θ], for the Gamma, Weibull, Exponential and truncated Normal distributions of Z (as fitted from the real datasets in our experiments). The optimal stopping time t* is the first time instance t at which the expected return at (t + 1) is less than S_t. The optimality and uniqueness of our rule is evident: the ED immediately updates the EG with the local model when the current return at time t, G_t = S_t, is greater than the expected return (since from that time instance onwards, any expected return is strictly less than any G_t, so we will never maximize our return). Practically, we update the model at the EG when S_t ≥ S*, where S* is estimated by solving (8) w.r.t. S_t; for instance, in Fig. 4 the arrow points to the optimal discrepancy value S*, and the ED updates the EG when the current S_t exceeds this value. Note that S* is unique; thus, the optimal stopping time t* provided by our model is unique. Alternatively, F_Z can be estimated on-line in a non-parametric fashion (Trevor et al., 2009): the ED incrementally updates the estimate F_Z^t at the t-th observation based on the previous F_Z^{t−1}. In both cases, the ED evaluates the criterion in (8) in O(1).
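For exponentially distributed discrepancies, the 1-sla threshold S* admits a closed form. The sketch below assumes the return G_t = S_t·1{S_t ≤ Θ} (our reading of (6), which may differ from the paper's exact form) and Z ~ Exp(λ); under these assumptions, stopping is optimal once S_t ≥ Θ − ln(1 + λΘ)/λ:

```python
import math

def expected_next_return(s, theta, lam):
    # E[G_{t+1} | S_t = s] for Z ~ Exp(lam): continue one step and stop,
    # gaining S_t + Z if the sum still lies below theta, and 0 otherwise.
    a = theta - s
    return s * (1 - math.exp(-lam * a)) + 1 / lam \
        - math.exp(-lam * a) * (a + 1 / lam)

def s_star(theta, lam):
    # closed-form 1-sla threshold: stop at the first t with S_t >= s_star
    return theta - math.log(1 + lam * theta) / lam

theta, lam = 5.0, 1.0
s = s_star(theta, lam)   # ~ 3.21 for theta = 5, lam = 1
```

Below S*, one more observation raises the expected return, so the ED keeps observing; above S*, it updates the EG immediately, matching the crossing point marked by the arrow in Fig. 4.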

Adaptive decision making
The ED_i now has a sequential decision making mechanism for updating the EG with the model f_i based on the sequence of discrepancies since the last model update. Departing from the instantaneous decision relying only on the current discrepancy in Harth and Anagnostopoulos (2018), the ED_i judiciously postpones a model update to save communication resources and, more importantly, adapts its novelty detection mechanism of Section 4.1 based on feedback from the discrepancy value.
The adaptive model update algorithm is provided in Algorithm 2.
Firstly, the ED_i receives a pair (x, y) at time t. Based on Algorithm 1, this pair is classified as novel or familiar. In the former case, the pair is appended to the current sliding window of ED_i (while the oldest pair is discarded) and, then, the local model f_i is either re-trained or adjusted, depending on the regression algorithm. The ED_i uses the locally updated f_i to instantaneously provide a prediction ŷ = f_i(x), thus obtaining the error e_i(x) = y − f_i(x). Moreover, to assess the discrepancy, the error e_i^o(x) = y − f_i^o(x) of the cached model is obtained. Based on both errors, we calculate the discrepancy z_t = |e_i − e_i^o| (see Lines 4-9). Under instantaneous decision making, we would have updated the EG with the new f_i whenever z_t > ϵ. In our mechanism, however, we update S_t = S_{t−1} + z_t and examine whether the optimality criterion in Theorem 4 is met. If the criterion in (8) holds true, the ED_i updates the EG with f_i and a new decision era starts off; see Lines 10-14. The decision on actually updating the cached model is delegated to the criterion in (8) to secure the optimal expected return given Θ. Upon reception of a pair (x, y), the ED_i also adjusts Algorithm 1 based on the value of z_t (see Lines 15-20). We exploit the occurrence of the event {z_t ≤ ϵ} together with the fact that the pair (x, y) was classified as novel to provide feedback. If Algorithm 1 classifies (x, y) as novel and, after updating the model f_i (re-training or adaptation), we experience the event {z_t ≤ ϵ}, then we impose a penalty (negative feedback), since there was no reason to update the model only to end up with an up-to-date model exhibiting the same prediction behavior as the cached model f^o. That is, we could have saved the computational resources of the model adaptation/re-training, since, in terms of prediction error, the model performance is the same as before the adaptation.
Such feedback is reflected by adjusting the error threshold ρ_O in Algorithm 1, i.e., increasing ρ_O by a factor (ϵ − z_t)/ϵ ∈ (0, 1] to obtain more accurate classification results and thus save computational resources. On the other hand, when {z_t > ϵ} and the pair is classified as novel, we reward the model adaptation/re-training, since we obtain an expected discrepancy greater than ϵ between the updated f_i and the cached f_i^o. In this case, we shrink ρ_O by a factor (z_t − ϵ)/z_t ∈ (0, 1), z_t > ϵ > 0, to enforce model adaptation/re-training for future pairs. Fig. 5 (left) shows the process at ED_i, including the familiarity/novelty inference and the model update of Task Flow 1.
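One plausible reading of this reward-penalty update, sketched with reconstructed symbols (`rho_O` for the error tolerance, `eps` for the discrepancy threshold; the paper's exact combination of the factor with ρ_O may differ):

```python
def adjust_tolerance(rho_O, z, eps):
    # Reward-penalty adjustment of the error tolerance rho_O.
    if z <= eps:
        # penalty: the adaptation was unnecessary -> relax the tolerance
        return rho_O * (1 + (eps - z) / eps)   # factor in (0, 1] for z < eps
    # reward: genuine discrepancy observed -> tighten the tolerance
    return rho_O * (1 - (z - eps) / z)         # factor in (0, 1) for z > eps
```

Low discrepancy after a 'novel' classification relaxes ρ_O (fewer unnecessary adaptations later), while high discrepancy tightens it (more eager adaptation later).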
Remark 5. The advantage of Algorithm 2 is twofold. Firstly, it optimizes the decision of when to update the local model based on Theorem 4, conditioned on the non-familiarity of an input-output pair. Secondly, and most importantly, it adopts a reward-penalty mechanism to adjust the error tolerance threshold ρ_O based on the discrepancy threshold ϵ. Algorithm 2 provides feedback to Algorithm 1 in order to improve its novelty detection capability, which plays a significant role in future local model adaptations and computational resource usage. Algorithm 2 directly implements a closed-loop feedback controller over the decisions on local model adaptation in light of saving computational resources. It increases the certainty of Algorithm 1 about when to adapt/re-train the local model, based on the feedback from the time-optimized model adaptation mechanism.

Expected communication
We estimate the expected communication of the instantaneous model update (Harth and Anagnostopoulos, 2018) and of delivering partially updated sufficient statistics.

Instantaneous model update communication
In the instantaneous model update (Harth and Anagnostopoulos, 2018), the ED decides on a model update or not at the t-th pair (x, y)_t, t = 1, …, T, within T observations. At time t, the ED delivers its local model f to the EG only after classifying that pair as novel and re-training/adapting f accordingly. The resulting expected communication is expressed as a percentage of messages w.r.t. the baseline solution, where all data are transferred from the ED to the EG; the expected communication is accurately predicted for ϵ ≤ 2.
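The per-ED message rate of the instantaneous policy can be checked with a small Monte Carlo sketch; the Exp(1) discrepancy distribution is a stand-in assumption (the paper fits Z from real datasets):

```python
import math
import random

# Under the instantaneous policy a message is sent whenever z_t > eps,
# so over T pairs roughly T * P(Z > eps) messages are sent per ED.
random.seed(7)
T, eps = 200_000, 2.0
sent = sum(1 for _ in range(T) if random.expovariate(1.0) > eps)
empirical = sent / T
analytic = math.exp(-eps)   # P(Z > eps) = e^{-eps} for Exp(1)
```

With n EDs, the expected traffic per time instance is then n·P(Z > ϵ), which motivates the cumulative policy of Section 5.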

Sufficient statistics update communication
The sufficient statistics 𝒮_i are conditionally adapted upon a new pair (x, y), while converging as proved in Theorem 2. The adaptation, which is fundamental for convergence, results in incremental changes of the prototypes closest to the incoming pairs; these changes have to be reflected at the EG for model selection, as will be shown in Section 7. Let Δ(w_k, u_k) ∈ ℝ^{d+1} be the change vector after the reception of a pair (x, y) based on Theorem 1. After convergence, the expectation of the change vector is 𝔼[Δ(w_k, u_k)] = (0, 0); however, until convergence, the ED should regularly update the EG with the incremental changes in 𝒮_i. Such updates are sent from the ED to the EG along with a model update decision, or can be sent independently, should the changes not be significant enough to be considered within a model update. Proposition 2 reports an upper bound on the changes in the sufficient statistics, which determines the frequency at which the ED updates the EG with only partial, incremental updates of 𝒮_i.
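A minimal sketch of such a partial update, with a hypothetical significance threshold `delta_min` (not specified in the paper):

```python
import math

def changed_prototypes(prev, curr, delta_min=1e-3):
    # Ship to the EG only those (w_k, u_k) pairs whose change vector in
    # R^{d+1} exceeds delta_min; near convergence the changes vanish
    # and partial updates become rare and small.
    out = []
    for k, ((w0, u0), (w1, u1)) in enumerate(zip(prev, curr)):
        if math.dist(w0 + [u0], w1 + [u1]) > delta_min:
            out.append((k, w1, u1))
    return out
```

Only the modified prototypes travel to the EG, instead of the whole 𝒮_i snapshot.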

Proposition 2 bounds the expected magnitude of changes in 𝒮_i. Based on this bound, the ED sends partial updates to the EG referring only to the modified input/error prototypes, i.e., those that have changed since the previous update.

Cached models selection at the edge gateway
Up to this point, we have elaborated on Task Flow 1, where ED_i generates the sufficient statistics 𝒮_i to optimally update the cached model at the EG. In Task Flow 2, the 𝒮_i statistics received by the EG guide the selection of the most appropriate diverse models per query. Our desideratum is that inferential analytics be achieved in real time, with low communication overhead and high accuracy. Communication overhead refers to the delivery of 𝒮_i and f_i from every ED_i to the EG, and high accuracy refers to low error for random queries.

The EG caches all models
Based on Algorithm 2, each ED_i autonomously decides when to update the EG with f_i, independently of the other EDs. Partial updates of the statistics 𝒮_i are also sent to the EG to drive model selection, as discussed in Section 6.2; the EDs deliver only knowledge (models and statistics) to the EG, not actual data.
Assume that analysts/applications issue a query stream {q ∈ ℝ^d} to the Cloud, which is directed to the EG; see Fig. 1. The EG should return an accurate prediction ŷ and/or the relevant local models around the input subspace defined by the query point q. These outcomes should be highly accurate and delivered in real time, without any further communication with the EDs. Hence, given a query q, the challenge for the EG is to (i) efficiently select the most appropriate subset of models ℱ′ ⊆ ℱ providing an ensemble prediction (Zhou, 2012) ŷ, whose prediction error is as close to that of the global f_G as possible, and (ii) deliver the most representative models in ℱ′ that best explain the input-output dependency.

On computing appropriate models subset
We show (i) that computing the best subset ℱ′ of models per query q is computationally hard, and (ii) that, as exemplified, it is highly beneficial to engage only a good models subset per query. This showcases the traits and benefits of our approach to model selection at the EG. The EG can, trivially, engage all cached models: each cached model f_i^o produces an estimate ŷ_i = f_i^o(q) and the EG then takes their average ŷ = (1/n) Σ_{i=1}^n ŷ_i. We denote this method as Simple Model Aggregation (SMA), as adopted, e.g., in Konečny et al. (2016), to differentiate it from the EG's sophisticated methods. SMA implies that all models are equal candidates and available for providing an estimate. It would be preferable if the EG could engage a subset ℱ′ ⊂ ℱ whose average estimate ŷ′ = (1/|ℱ′|) Σ_{f_i^o ∈ ℱ′} ŷ_i equals ŷ or, more interestingly, if the EG could engage the minimum subset of models whose average estimate is at least as close to the actual y as the average ŷ, for each query q.
Determining the minimum models subset whose aggregate estimate is close to ŷ calls to mind the Subset Sum Problem (SSP) (Przydatek, 2002): given a pair (𝒜, s), where 𝒜 is a set of n > 0 positive integers and s is a positive integer, SSP asks for a subset of 𝒜 whose sum is closest to, but not greater than, s. SSP is NP-hard (Garey and Johnson, 1990). Consider now the following problem, referred to as the Minimum Subset Average Problem (MSAP): given a pair (𝒜, s), find the minimum subset 𝒜′ with average s′ subject to ⌊s′⌋ = s or ⌈s′⌉ = s.

Theorem 5. The MSAP is NP-hard.

Proof. See Appendix A.8 □
Based on Theorem 5, we obtain:

Corollary 1. Given a query q, the problem of finding the minimum subset of cached models ℱ′ ⊂ ℱ at the EG, whose average estimate ŷ′ gives the same error as ŷ w.r.t. the actual y, is NP-hard.
Proof. See Appendix A.9 □

SSP and MSAP are NP-hard; however, one is often satisfied with an approximate, sub-optimal solution obtained in polynomial time; see Przydatek (2002) for SSP. Nevertheless, even if the EG were able to use such a heuristic to find the minimum set ℱ′ for a given query (letting n be small), this would still not be preferable given our goals. That is because, in order to obtain ℱ′ for a given query, the EG would first have to engage all cached models and only then, based on their estimates, produce ℱ′. Instead, the EG has to predict the most appropriate ℱ′, which gives the same or, hopefully, smaller prediction error than that of ℱ. The task of predicting ℱ′ per query involves the following issues: (a) the probability (density) distribution of the queries is evidently unknown, since analysts/applications randomly issue queries whose patterns are not trivially revealed and/or provided to the EG; (b) it is not feasible to identify the distribution that generates query q, since we have only one sample from it at a time; (c) it is not sound to assume that the query q issued at time t is produced by a certain distribution that remains the same for subsequent queries q_τ, τ > t. This becomes even harder when dealing with non-stationary distributions of query patterns, which is not a rare situation (Anagnostopoulos and Triantafillou, 2017a). The only available knowledge we can exploit for predicting an appropriate models subset per query is the set of sufficient statistics {𝒮_i}_{i=1}^n delivered (and updated) from the EDs to their EG. Based on such knowledge, we propose computationally efficient and accurate model selection algorithms for edge-centric inferential analytics.
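A brute-force MSAP solver makes the hardness tangible: exact per-query subset selection requires examining up to 2^n subsets, which is infeasible at the EG for realistic n (the solver below is an illustrative sketch, not part of the proposed mechanisms):

```python
import math
from itertools import combinations

def msap_bruteforce(A, s):
    # Smallest subset of A whose average s' satisfies floor(s') == s or
    # ceil(s') == s; exhaustive search, exponential in |A|.
    for r in range(1, len(A) + 1):
        for sub in combinations(A, r):
            avg = sum(sub) / r
            if math.floor(avg) == s or math.ceil(avg) == s:
                return list(sub)
    return None
```

Even for this four-element instance the search touches many subsets; the EG instead predicts ℱ′ from the sufficient statistics, avoiding any exhaustive enumeration.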

Cached model selection algorithms
We introduce model selection methodologies exploiting knowledge coming from the EDs. The ensemble prediction ŷ is the weighted sum of the individual predictions, ŷ = Σ_{i=1}^{n} λ_i(q) f_i(q) (9). The weight λ_i(q) in (9) is a function of the current query q that interprets the importance of the performance of the local model f_i in the locally familiar input subspace around query q, derived by the sufficient statistics 𝒮_i; the EG then engages only the models in ℱ′ for this specific query. The baseline solution is the SMA, which does not exploit the statistics 𝒮_i in the ensemble outcome; i.e., the EG simply aggregates the individual predictions into the final one by setting λ_i(q) = 1/n. In SMA the EG is only updated independently by each ED i with f_i, while no reception of 𝒮_i is required from any ED. The ensemble subset is ℱ′ ≡ ℱ, i.e., there is no model selectivity, and prediction accuracy is not favored compared to the global f_G; see the evaluation in Section 8. We now propose the following model selection methodologies departing from SMA.
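The SMA baseline above can be sketched as follows; the toy linear models are illustrative, and the point is only that every cached model is averaged with equal weight λ_i(q) = 1/n, ignoring the sufficient statistics 𝒮_i.

```python
def sma_predict(models, q):
    """SMA ensemble prediction: y_hat = (1/n) * sum_i f_i(q)."""
    preds = [f(q) for f in models]
    return sum(preds) / len(preds)

# Three toy cached models f_i(q); their predictions at q = 3.0 are 6.0, 7.0, 5.0
models = [lambda q: 2.0 * q, lambda q: 2.0 * q + 1.0, lambda q: 2.0 * q - 1.0]
print(sma_predict(models, 3.0))  # -> 6.0
```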

Input-space aware top-κ model (IAM)
We first present the top-1 (best) model selection scheme (κ = 1). The EG selects only one (best) model f* ∈ ℱ to engage in analytics tasks, i.e., ℱ′ = {f*} given query q. The model selection is achieved by using the input-space prototypes {w_{i,k}} of the sufficient statistics 𝒮_i received at the EG. It is worth mentioning here that the input prototypes w_{i,k}, ∀i, are dragged to the subspaces of the input-error space to reflect the prediction performance of the considered model f_i (please recall Remark 3). In this rationale, IAM selects the model f* whose input prototype w* is the closest to query q compared to all prototypes 𝒲 = {w_{i,k}} from all n models: w* = arg min_{w∈𝒲} ‖q − w‖. The EG selects the f* whose input subspace (represented by w*) is the most familiar with (closest to) the query point q; thus, the associated predictive model f* can provide the best prediction. Without having obtained all input prototypes 𝒲_i from each f_i, the EG could not discriminate which model's input subspace is the most familiar with the given query point. The weight function in IAM indicates the closest distance of q to w*. The EG engages only the f* associated with the closest prototype for prediction, i.e., ŷ = f*(q). For κ > 1, the EG ranks all prototypes w ∈ 𝒲 w.r.t. their distance from query q and selects those models f*_1, …, f*_κ ∈ ℱ′ ⊂ ℱ whose closest input prototypes are ranked in the top-κ closest distances; the ensemble prediction then follows (9). The influence of the distance ‖q − w‖, i.e., the closer to q the higher the weight importance, is achieved by the exponential inverse squared distance weighting e^{−‖q−w‖²} (Lukaszyk, 2004). Note that for κ = n we obtain the Weighted SMA, where the normalized weights reflect the distance of query q to the closest local prototypes from each model.
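A minimal sketch of IAM, assuming scalar inputs for brevity: each model's closest prototype distance is computed, the top-κ models are ranked, and the surviving predictions are fused with the e^{−‖q−w‖²} weighting; the models and prototype values are illustrative.

```python
import math

def iam_predict(models, prototypes, q, kappa):
    """prototypes[i]: input prototypes of cached model f_i (scalar inputs here)."""
    # closest-prototype distance per model
    dists = [min(abs(q - w) for w in protos) for protos in prototypes]
    # rank models by that distance and keep the top-kappa
    ranked = sorted(range(len(models)), key=lambda i: dists[i])[:kappa]
    weights = [math.exp(-dists[i] ** 2) for i in ranked]  # e^{-||q - w||^2}
    z = sum(weights)
    return sum((wgt / z) * models[i](q) for i, wgt in zip(ranked, weights))

models = [lambda q: q, lambda q: q + 10.0]
prototypes = [[0.0, 1.0], [5.0, 6.0]]
print(iam_predict(models, prototypes, 1.2, kappa=1))  # -> 1.2 (model 0 is selected)
```

With κ = 1 the weight normalizes to one and only the single closest model is engaged, matching the top-1 case above.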

Error-aware top-κ model (EAM)
The EAM scheme combines κ cached models from ℱ under the double-exponential error weighting function g(u) = 0.5e^{−|u|}, u ∈ ℝ. The information on local errors per model is provided by the corresponding input-error associations, and specifically by the error representatives. A model f_i is weighted according to the assessment of its predictions in terms of the average absolute prediction error u_{i,k} around the input subspace represented by w_{i,k}, with k = arg min_{l∈[K_i]} ‖q − w_{i,l}‖. Specifically, given a query q we derive the closest input prototype w_{i,k} and its associated error prototype u_{i,k}. Then, by adopting the double-exponential density g(u) over the average absolute prediction errors represented by the prototypes u_{i,k}, we obtain the weight λ_i(q). The prediction outcome is achieved by selecting the κ ≥ 1 models from ℱ with the top-κ weights λ_i(q), i.e., ŷ follows (13).
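The EAM weighting can be sketched as below, again with scalar inputs for brevity: each model's weight is the density g(u) = 0.5·exp(−|u|) evaluated at the error prototype paired with the input prototype closest to q. The prototype and error values are illustrative.

```python
import math

def eam_weights(prototypes, errors, q):
    """prototypes[i][k] pairs with errors[i][k]; returns one weight per model."""
    weights = []
    for protos, errs in zip(prototypes, errors):
        # index of the input prototype closest to the query
        k = min(range(len(protos)), key=lambda j: abs(q - protos[j]))
        weights.append(0.5 * math.exp(-abs(errs[k])))  # g(u) = 0.5*e^{-|u|}
    return weights

# Model 0 has a low local error (0.0), model 1 a high one (2.0)
w = eam_weights([[0.0], [0.0]], [[0.0], [2.0]], q=0.3)
print(round(w[0], 3), round(w[1], 3))  # the accurate model gets the larger weight
```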

Input & error-space aware top-κ model (IEAM)
The EG exploits all the knowledge from 𝒮_i, ∀i, combining the familiarity of the input subspace of a model w.r.t. query q, through the closest input prototype w_{i,k}, with the associated performance reflected by the error prototype u_{i,k}. IEAM selects the best or the top-κ best models from ℱ, which are not only familiar w.r.t. the queried input but also effective in providing accurate predictions, based on their local prediction performance over the familiar subspace represented by the input prototypes closest to the query point. The combination of the two directions, input-space familiarity and associated prediction performance, enables the EG to proceed with a more sophisticated model selection. The weight λ_i(q) represents a degree of model closeness to an issued query, taking into consideration the (inverse) closest input distance to w_{i,k} ∈ 𝒲_i and the associated median of the absolute prediction error u_{i,k} around this subspace. Specifically, λ_i(q) interprets the relative closeness of model f_i to query q, where ū_{i,k} = u_{i,k}/Σ_{u∈𝒰} u is the normalized median of the prediction error of model f_i over the k-th input/error subspace among all error medians 𝒰 = {u_{1,k}}_{k=1}^{K_1} ∪ ⋯ ∪ {u_{n,k}}_{k=1}^{K_n} from all n models. The prediction outcome is achieved by selecting the κ ≥ 1 models from ℱ with the top-κ highest degrees of closeness ranked by λ_i(q), i.e., ŷ follows (14).
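A hedged sketch of IEAM follows. The exact closed form of λ_i(q) is given in (14); here an illustrative combination exp(−‖q−w‖²)·(1 − ū) is used, where ū is the error median normalized over all models' medians as described above, so a model must be both close to q and locally accurate to be selected.

```python
import math

def ieam_predict(models, prototypes, errors, q, kappa):
    total = sum(u for errs in errors for u in errs)   # normaliser over all error medians
    scored = []
    for i, (protos, errs) in enumerate(zip(prototypes, errors)):
        k = min(range(len(protos)), key=lambda j: abs(q - protos[j]))
        u_bar = errs[k] / total                        # normalised error median
        lam = math.exp(-abs(q - protos[k]) ** 2) * (1.0 - u_bar)  # illustrative weight
        scored.append((lam, i))
    top = sorted(scored, reverse=True)[:kappa]
    z = sum(lam for lam, _ in top)
    return sum((lam / z) * models[i](q) for lam, i in top)

# Model 1 is close to q but error-prone; model 0 is close AND accurate.
models = [lambda q: q, lambda q: q + 10.0]
pred = ieam_predict(models, [[1.0], [1.1]], [[0.1], [5.0]], q=1.05, kappa=1)
print(pred)  # -> 1.05 (the accurate model 0 is selected)
```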

Computational & space complexity at the edge
We provide the inherent computational complexity of our methodology, including EDs and EGs, trading off communication efficiency against accuracy of analytics. The computational complexity of Algorithm 1 in Task Flow 1 is as follows. Given a pair (x, y), the ED i adopts a d-dim. tree structure over the K_i prototypes to classify the pair as familiar in O(d log K_i) time. The decision on a model adaptation due to a novelty classification, occurring with probability p_i < 1, is then O(d) (see Appendix B).
In the context of model adaptation, the complexity depends on the underlying model algorithm. Should the algorithm be incremental, the complexity for adaptation is upper bounded by O(dN), given a sliding window of N d-dim. vectors. However, there are incrementally updated algorithms whose complexity is significantly reduced and does not directly depend on the window size N. The selection of incrementally updated models in an ED is evidently guided by the computational capabilities and resource availability of the ED. For instance, in a resource-constrained environment, EDs normally adopt regression/classification models with complexity O(d) in an on-line adaptation mode, like online passive-aggressive algorithms or stochastic gradient descent/RLS (Crammer et al., 2006) (see Appendix B for the model update/adaptation complexity). Algorithm 2 incorporates the complexity of the model update criterion in Theorem 4, which is O(1) adopting incremental KDE with space complexity O(dN) given the sliding window of N d-dim. vectors. The feedback mechanism computes the prediction error in O(1), since the input-output pair is observed at ED i. Hence, in total, the model update and adaptation overhead for ED i to compute and decide is O(d log K_i) + p_i O(d + 1) time, which is a lightweight process given the current advances in ED technology. The expected communication overhead for the model update and knowledge transfer to the EG is analyzed in Proposition 1. The space complexity of such knowledge transfer from ED i incorporates the parameters of the adapted model f_i and the potential changes/deltas of the input-error prototypes {Δw_k, Δu_k} ∈ 𝒮_i. The space complexity is O(K_i(d + 1)), given that all prototypes may have changed.
In Task Flow 2, given a query q, the EG performs a one-nearest-neighbor (1NN) search over the input prototypes in all statistics {𝒮_i}_{i=1}^n to find the closest prototype, which is used by all the proposed model selection schemes. By adopting a d-dim. tree over the n′ = Σ_{i=1}^n K_i prototypes, the time complexity per query is O(d log n′) to provide inferential analytics, with O(d n′) space complexity.
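The query-time lookup above can be sketched as follows. A k-d tree would give the O(d log n′) per-query cost; the linear scan below is the O(d n′) fallback and is only a sketch with illustrative prototypes tagged by their owning model.

```python
def nearest_prototype(q, prototypes):
    """prototypes: list of (model_id, vector) pairs; returns the owning model_id
    of the prototype closest to query q (1NN by squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda mw: sqdist(q, mw[1]))[0]

protos = [(0, (0.0, 0.0)), (0, (1.0, 1.0)), (1, (5.0, 5.0))]
print(nearest_prototype((0.9, 1.2), protos))  # -> 0
```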

Discussion on data & predictive task offloading at the edge
An essential mechanism in EC is computation task and data offloading, the process of delegating computation tasks and/or data to an EC server or the Cloud (Alghamdi et al., 2019a). As emerging predictive applications require intensive computation, task/data offloading is a promising mechanism that potentially overcomes the limitations of EDs. There are three fundamental offloading decisions: local task execution/data processing; full task/data offloading; and partial offloading (Alghamdi et al., 2019b), where in the latter part of the computation is processed locally while the rest is offloaded to an EC server or the Cloud. EDs and EGs in our context have tasks, e.g., model training, inference, adaptation, and updates, and data to be potentially offloaded to guarantee quality of service. Therefore, EDs can autonomously make suitable offloading decisions based on the current context, e.g., computational resources and load, to secure resilience in delivering predictive tasks (Alghamdi et al., 2019c). In our scheme, the model update mechanism is timely optimized to offload meta-data, i.e., model parameters and statistical prototypes, from EDs to EGs to achieve knowledge aggregation. Our future agenda includes sophisticated task/data/knowledge offloading mechanisms balancing the expected workload among EDs and EGs w.r.t. the load and computational capacity of EDs in resource-constrained environments. Notably, our evaluation section offers comprehensive evaluation results over unmanned vehicles, which is undoubtedly a resource-constrained environment.

Datasets, predictive models & parameters
We experiment with real multivariate contextual datasets from EDs including stationary sensors and unmanned vehicles. For each dataset, we define scenarios corresponding to regression and time series forecasting adopting multivariate linear regression (LM), autoregressive (AR) recursive least-squares embedding (AR-RLS), radial basis function network regression (RBF), and nonlinear AR-RBF, which are widely used for analytics (Babcock et al., 2002;Triantafillou, 2017a, 2017b;Tatbul et al., 2003). The reader could skip this sub-section, should they be familiar with such predictive models.
Regression Models: In LM, an ED i learns the linear model y = f_i(x) = b_i^⊤[1, x^⊤]^⊤, with parameter vector b_i including the offset coefficient. The RBF model has an input layer, a hidden layer with a non-linear RBF function h, and a linear output layer. The ED i learns the RBF model f_i(x) = Σ_{m=1}^{M_i} a_{im} h(‖x − c_{im}‖), where M_i is the number of neurons in the hidden layer, c_{im} is the centroid vector for neuron m, and a_{im} is the weight of neuron m in the linear output. The function h(·) depends on the input distance from a centroid and is radially symmetric about the centroid; it is commonly taken to be Gaussian, i.e., h(‖x − c_{im}‖) = exp(−0.5‖x − c_{im}‖²). The parameters B_i are sequentially estimated by optimizing the fit between f_i and the pairs (x, y). Appendix B provides the incremental rules for the models' parameters.
Time-series Forecasting Models: For time series forecasting, we adopt the embedding mechanism (Cerqueira et al., 2017) to transform the time series and then apply AR-RLS and AR-RBF. Consider the time series of values u_1, u_2, …, u_t at regular time intervals. The time series is reconstructed into a higher-dimensional space with embedding dimension d by generating a matrix A of N embedding vectors in d dimensions, each row corresponding to an embedding vector u_t = [u_t, u_{t−1}, …, u_{t−d+1}]^⊤. A slides with the sliding window of size N of the ED i. Based on this transformation, there are no long-term time dependencies in the series; thus, the embedding vectors are deemed uncorrelated (Takens, 1981), which allows the use of any regression technique in the literature, in our experiments AR-RLS and AR-RBF. We obtain the pair (x, y)_t from the embedding vector u_t with y_t = u_t and x_t = [u_{t−1}, u_{t−2}, …, u_{t−d+1}]^⊤ ∈ ℝ^{d−1}. The forecasting task is to predict y_{t+p} = u_{t+p} at time t + p, with lag p > 1, given the d − 1 recent values x_t = [u_{t−1}, …, u_{t−d+1}] and the N recent embedding vectors u ∈ A. In AR-RLS, the AR coefficients are obtained via recursive least squares (Appendix B).

Each ED i learns a LM y = f_i(x) with input x = [x_1, x_2]^⊤, where x_1 is temperature, x_2 is humidity and the output y is light, with parameter b_i ∈ ℝ³ (including the offset coefficient), and an RBF with inputs and output as described above. A USV represents an ED capturing 1672 2-dim. vectors every 10 s and is equipped with a Raspberry Pi for locally computing predictive models and obtaining wireless access to an EG; a testbed framework provided by the RAWFIE project.¹¹ For regression, we monitor the temperature-humidity correlation on the sea surface captured by the swarm, with output y (temperature) and input x (humidity) per ED/USV. The EDs learn LM and RBF models with b_i ∈ ℝ³ and B_i (M = 6), respectively. The models are incrementally updated over a window of size N = 6 (1 min of history).
For time series forecasting, we predict the future temperature and humidity using AR-RLS and AR-RBF with d = 6 dim. embedding vectors (estimated by the AIC (Burnham and Anderson, 2002)). The AR-RLS and AR-RBF parameters b_i ∈ ℝ⁶ and B_i with M_i = 3, respectively, are incrementally adjusted over a window of size N = 6.
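The embedding mechanism described above can be sketched as follows: a series u_1..u_t is unrolled into supervised pairs (x_t, y_t) with y_t = u_t and x_t holding the d − 1 preceding values, after which any regressor can be applied. The series values are illustrative.

```python
def embed(series, d):
    """Return (x, y) pairs from a univariate series with embedding dimension d:
    y = u_t and x = [u_{t-1}, ..., u_{t-d+1}] (d-1 lagged values)."""
    pairs = []
    for t in range(d - 1, len(series)):
        y = series[t]
        x = [series[t - j] for j in range(1, d)]  # most recent lag first
        pairs.append((x, y))
    return pairs

pairs = embed([1, 2, 3, 4, 5], d=3)
print(pairs[0])  # -> ([2, 1], 3)
```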
Parameters: In Algorithm 1, the learning rate is set to 0.1 (Bottou and Bousquet, 2007) and the regularization factor to 0.5 in (3), putting equal importance on EQE and EPE. The familiarity threshold θ_I is normalized in the input domain [0, 1]^d, i.e., θ_I/√d ∈ (0, 1); a value close to 1 refers to coarse vector quantization, thus few prototypes K, while a value close to 0 refers to fine-grained quantization, thus many prototypes K.
For model updates, in each ED i the discrepancy threshold is θ_i = γ MED_i with factor γ ∈ (0, 3], where MED_i is the median of the error differences |e_i(x) − e_i^o(x)| in Algorithm 2, to control the expected communication between the ED and the EG. Based on θ_i, the initial error tolerance is θ_O = θ_i, with minimum θ_O* = θ_i/20. The boundary Θ ∈ {2θ_i, …, 10θ_i}, thus being proportional to the median of the discrepancy z. Table 1 summarizes the parameter ranges/values used in our experiments.

Performance metrics
The performance metrics reflect the diverse objectives of this research. Firstly, regarding the time-optimized model update mechanism, we assess the expected maximum return 𝔼[G_{t*}] when the ED decides on a model update at the optimal stopping time t*, compared to Θ. An expected return close to Θ indicates that the ED intelligently decides on a model update without exceeding this boundary, thus enjoying less communication overhead. In terms of communication overhead, we assess the update rate of the proposed mechanism, defined as 1/𝔼[t*] given a specific time horizon T. We desire our mechanism to decrease redundant model updates without exceeding the tolerance boundary and without spoiling the quality of analytics at the EG. We define as consistency the ratio 𝔼[S_{t*}/t*], indicating the average discrepancy at the EG due to model updates. This metric indicates whether our time-optimized mechanism, compared with (Harth and Anagnostopoulos, 2018) and other model update mechanisms (provided later), results on average in the same discrepancy at the EG. Hence, the proposed mechanism does not spoil the quality of analytics at the EG in light of reducing the communication overhead. Instead, as we will show, our mechanism knows when to update the model at the ED, enjoying the same discrepancy levels as other model update mechanisms while being communication efficient. The consistency and expected return metrics showcase the optimality of our scheme for model updates. We also define the percentage of the expected ED-EG communication savings of all model selection schemes compared to the Global/baseline approach; recall that the baseline approach sends all raw data from the EDs to the EGs to construct a global model f_G. In our case, we measure the communication overhead for delivering models and sufficient statistics from the EDs to the EG.
Finally, concerning the prediction accuracy of the inferential analytics per query, we measure the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE) over the queries issued from the applications/analysts to the EG. The queries were generated by splitting the datasets into training and test sets using 10-fold cross-validation (Trevor et al., 2009).

Comparison schemes & rules
Predictive Model Update Rules: In the spirit of finding the best time to update the models from the EDs to the EG, we compare our mechanism with two model update mechanisms. Firstly, Harth and Anagnostopoulos (2018) consider the rule t* = min{t > 0 : z_t = |e_t(x) − e_t^o(x)| > θ}, hereinafter referred to as the Instantaneous Rule (INST). We also consider an update rule where the ED updates the EG when the average discrepancy since the last update is greater than θ, i.e., t* = min{t > 0 : (1/t) Σ_{τ=1}^{t} z_τ > θ}, hereinafter referred to as the Mean Rule (MEAN). MEAN might at first sight resemble our Optimal Stopping Time Rule (OST). However, it only considers constraints over the current mean discrepancy and does not optimize the expected return. MEAN is rather intuitive, but it neither optimizes the distance from any tolerance threshold nor delays the model update to reduce the communication overhead.12 Note: as proved in Theorem 4, the optimality of our OST is guaranteed; thus, any other model update mechanism does not maximize (6).
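The two baseline rules above can be sketched on a stream of discrepancies z_t (synthetic values below): INST fires on the first instantaneous z_t above the threshold, MEAN on the first running mean above it. For brevity the running mean is taken from the stream start, i.e., from the last update.

```python
def inst_stop(z, theta):
    """INST: t* = min{t > 0 : z_t > theta}."""
    for t, zt in enumerate(z, start=1):
        if zt > theta:
            return t
    return None

def mean_stop(z, theta):
    """MEAN: t* = min{t > 0 : (1/t) * sum_{tau<=t} z_tau > theta}."""
    s = 0.0
    for t, zt in enumerate(z, start=1):
        s += zt
        if s / t > theta:
            return t
    return None

z = [0.6, 0.8, 1.2, 0.1]           # synthetic discrepancy stream
print(inst_stop(z, 1.0), mean_stop(z, 0.6))  # -> 3 2
```

Neither rule maximizes the expected return (6); they only trigger on threshold crossings, which is the behavior OST is shown to dominate.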

Schemes under Comparison:
We compare our approach with schemes found in the literature. DBP (Raza et al., 2015) uses forecasting models to predict an ED's data, one forecasting model per attribute independently, and compares the predicted values with the current ones. If the difference is less than a tolerance, then DBP remains idle; otherwise, DBP builds a new forecasting model per attribute and transmits the models and data to the EG. In DBP, the model selection scheme at the EG is the SMA. HOVF uses an AR model for local ED data prediction per attribute, independently. HOVF decides whether or not to send only data to the EG based on a stopping time mechanism in Shiryaev (2008). Both HOVF and DBP require the EG to reconstruct the data (if they are not transmitted from the EDs to the EG) in order to build the models f_i at the EG, while the model selection scheme in HOVF is the SMA. Our previous work (Harth and Anagnostopoulos, 2018) employs model updates by adopting INST, with model selection as in (2016). Finally, all schemes are compared against the Global scheme, where no models are obtained, no model selection is provided, and only data are transferred to the EG to build the global f_G. The window size and tolerance for DBP and HOVF are the same as in OST for the sake of comparison.

12 The interested reader could refer to (Shiryaev, 2008), where the optimal stopping time for maximizing the average S_t/t (considering no constraints) is shown to be intractable, even for specific simple cases of expectations of Z.

Performance & comparative assessment
We assess our hypothesis on the optimality of OST compared to INST (Harth and Anagnostopoulos, 2018) and MEAN. INST produces redundant updates, especially with the accurate predictive AR-RBF, obtaining a high distance of the expected discrepancy from Θ, while OST intelligently delays any update as much as possible, being closest to Θ, as shown in Fig. 7 (left). Similar results are obtained using D1 and D2 with all predictive models (not shown due to space limitations). Fig. 7 (right) shows the optimality of OST, achieving an expected return relatively close to every Θ value, trading off analytics quality against update rate. Given a low Θ, OST increases the update rate by maximizing the expected quality and being closest to Θ; this is not achieved by any other rule. Fig. 8 (left/right: lower) shows the consistency of all rules, indicating that the EG experiences the same discrepancy of analytics quality when adopting OST, while OST is significantly more communication efficient, as shown in Fig. 9.

We now assess and compare our scheme with the above-mentioned schemes w.r.t. communication efficiency and accuracy of inferential analytics at the edge. We assess our hypothesis that the best local model f_i to involve at the EG per query q is unknown, comparing our schemes and Global against the known best f_i per ED i, i.e., the ideal case of knowing which model to engage. Using IEAM and EAM, 44% of the cases obtain accuracy similar to Global; IAM achieves the same accuracy in 20% of the cases, while SMA obtains accuracy similar to Global in only 6% of the cases, resulting in the highest difference and thus being highly inappropriate for model selection. This indicates the capability of IEAM and EAM to identify the most appropriate local models per query at the EGs without raw data transfer, thus being communication efficient and generating predictions as accurate as Global. We obtain similar results for D1 (n = 25) and D3 (n = 4) (not shown for space limitations). Fig. 10 (right) shows the average number of prototypes K per ED vs. the ratio θ_I/√d for all predictive models; as the ratio tends towards 1, K decreases in a negative-exponential fashion, indicating the minimum storage requirement on EDs retaining prototypes for achieving predictions as accurate as Global without transferring data (θ_I/√d = 0.1 obtains K = 73 per ED).
To illustrate the efficiency of our scheme variants trading off RMSE against communication, Fig. 11 and Fig. 12 show a significant 50-88% decrease in communication for OST + IEAM/IAM/EAM (top κ ∈ {1, 2} models and θ_I/√d = 0.1), which achieve RMSE only slightly higher than Global for all Θ from 2θ to 10θ (top left to bottom right). Note: the increase of RMSE and the communication reduction in the OST variants, except OST + SMA, are not highly correlated, as the error with high communication is nearly the same as the error with nearly no communication. This indicates that the EG identifies the best models for analytics based on the statistics 𝒮 and only a few communication updates from the EDs. The importance of the statistics for finding the best models, rather than simply averaging them, is reflected by OST + SMA, which cannot achieve low/comparable RMSE with the other variants, even if it increases the communication significantly. We obtain similar results for D1 over all predictive models (not shown for space limitations). Fig. 13 and Fig. 14 show the efficiency of our variants OST + IEAM and OST + SMA, HOVF, DBP (Raza et al., 2015), (Harth and Anagnostopoulos, 2018) and Global w.r.t. accuracy and communication with Θ ∈ {5θ, 7θ, 10θ}. OST + IEAM is evidently the most efficient, achieving high accuracy with the least communication, indicating the optimality of the OST model update rule and the IEAM model selection. DBP and HOVF are communication efficient, but they do not account for the dependencies among attributes apart from selective data transfer, which has a negative impact on RMSE.
OST + SMA enjoys more communication efficiency than INST + IEAM but does not perform well in RMSE in terms of model selection, which is the major drawback of model averaging. INST + IEAM achieves high quality of analytics due to the IEAM model selection component, but it is not as communication efficient as the OST + variants due to the INST update rule, whose behavior was demonstrated above. Overall, OST + IEAM is the most efficient scheme, combining the optimal OST model update rule with the IEAM error-and-input-space aware model selection component dealing with model diversity.

Conclusions
A novel, edge-centric inferential analytics methodology is introduced, contributing time-optimized model updates and diverse model selection that support communication-efficient analytics at the edge. This is achieved by optimally disseminating only knowledge/sufficient statistics instead of raw data, while the methodology introduces knowledge-driven model selection obtaining high analytics quality. Comparative assessment against baselines and schemes from the literature over real datasets evidenced its benefits in edge computing.
Our future agenda includes the vision of data thinning, relevance filtering, and model refining at the network edge, dramatically reducing not only the amount of data that needs to be transmitted for further processing but also the relevant data required for query-driven analytics tailored to analysts' and applications' needs, thus involving the human in the loop.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

…and immediately updates the model to the EG. Based on the Markov inequality and Z > 0 by definition, we obtain that F_Z(Θ) > 1 − 𝔼[Z]/Θ; thus, for Θ it must hold that F_Z(Θ) ∈ (1 − 𝔼[Z]/Θ, 1) and then 𝔼[Z] < Θ.
□

Appendix A.7. Proposition 2

Proof. Consider the k-th dimension change indicator I(Δw_{k,ℓ}) = 1 if the k-th dimension of the closest prototype w_ℓ significantly changes, i.e., Δw_{k,ℓ} > ε, due to the update rule upon a pair (x, y); 0 otherwise, for any arbitrary ε > 0. We define the indicator I(Δu_ℓ) in a similar way. The probability of changing k out of d + 1 dimensions of the change vector Δ(w_ℓ, u_ℓ), based on the corresponding indicators, follows a binomial form for k = 0, …, d + 1. Given that the ℓ-th input/error prototype is adapted (Algorithm 1), the expected magnitude of change then involves the sum of the reciprocals of binomial coefficients (Belbachir et al., 2011), whose asymptotic expansion is given in (Yang et al., 2010). □

3: If a subset ℱ′ of ℱ with k elements is found, whose elements have an average k′ such that ⌊k′⌋ = s/k or ⌈k′⌉ = s/k, Then return ℱ′.

4: end for
□

Appendix A.9. Corollary 1

Proof. Consider the errors e = |ŷ − y| and e′ = |ŷ′ − y|. In order to show that the problem of finding the minimum subset ℱ′ with e′ = e is NP-hard, it suffices to show that finding the minimum subset ℱ′ ⊂ ℱ of cached models such that ŷ = ŷ′, subject to the constraint in Problem 2, is NP-hard. Consider the sets 𝒴₀ = {⌊ŷ_i⌋}_{i=1}^n and 𝒴₁ = {⌈ŷ_i⌉}_{i=1}^n with ŷ_i > 0, ∀i. Since MSAP, which deals with integers, is NP-hard by Theorem 3, MSAP with (𝒴₀, ⌊ŷ⌋) and (𝒴₁, ⌈ŷ⌉) is also NP-hard. □

LM and AR-RLS Incremental Model Update: The AR model is x_t = φ_1 x_{t−1} + ⋯ + φ_N x_{t−N} + ε_t, where the window length N is determined by the Akaike Information Criterion (AIC), which penalizes model complexity (Burnham and Anderson, 2002). To estimate the coefficients φ, the ED i adopts Recursive Least Squares (RLS), which allows a dynamic update of a least-squares fit, as used in the LM. The least-squares solution to a system of equations Ab = y, where A ∈ ℝ^{N×(d−1)}, y ∈ ℝ^N and b ∈ ℝ^{d−1} are the regression coefficients to be estimated, is given by the solution of A^⊤Ab = A^⊤y. Hence, we need only the projections P = A^⊤A and p = A^⊤y, requiring O(d²) space to keep the model up to date. When a new embedding vector arrives with y_{N+1} and x_{N+1}, we update P ← P + x_{N+1}x_{N+1}^⊤ and p ← p + y_{N+1}x_{N+1}. The coefficients b are incrementally updated in O(d), where G is initialized to δI with δ > 0 and I the identity matrix. In the case of AR-RLS, the solution vector b consists precisely of the AR coefficients φ, i.e., b = [φ_1, …, φ_{d−1}]^⊤ ∈ ℝ^{d−1}.
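The projection bookkeeping above can be sketched as follows: only P = A^⊤A and p = A^⊤y are stored, each new pair (x, y) applies the rank-one updates P ← P + xx^⊤ and p ← p + yx, and b solves Pb = p. A 2×2 solve is inlined for brevity; the fitted data and the small ridge initializer are illustrative.

```python
def rls_update(P, p, x, y):
    """Rank-one update of the normal equations: P += x x^T, p += y x."""
    for i in range(len(x)):
        p[i] += y * x[i]
        for j in range(len(x)):
            P[i][j] += x[i] * x[j]

def solve2(P, p):
    """Cramer's rule for the 2x2 system P b = p."""
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    return [(p[0] * P[1][1] - p[1] * P[0][1]) / det,
            (p[1] * P[0][0] - p[0] * P[1][0]) / det]

# Fit y = 2*x1 + 1 with x = [x1, 1] (second entry carries the offset)
P = [[1e-6, 0.0], [0.0, 1e-6]]   # tiny ridge so P starts invertible
p = [0.0, 0.0]
for x1 in [0.0, 1.0, 2.0, 3.0]:
    rls_update(P, p, [x1, 1.0], 2.0 * x1 + 1.0)
b = solve2(P, p)
print(round(b[0], 3), round(b[1], 3))  # -> 2.0 1.0
```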

RBF and AR-RBF Incremental Model Update:
The ED i adopts a computationally efficient incremental learning methodology (Schwenker et al., 2001) for the RBF parameters upon reception of a pair (x, y), in the case of regression, or of an embedding vector u, in the case of time series. Based on this methodology, with update complexity O(dM) over M neurons, the incremental update rules with updating rate η ∈ (0, 1) adjust the centers c_{im} and the weights a_{im} as given in (B.3) and (B.4). We obtain the time-series coefficients for AR-RBF by replacing in (B.3) and (B.4) x = [u_{t−1}, …, u_{t−d+1}] and y = u_t from the embedding vector u, m ∈ [M].
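A hedged sketch of such incremental RBF learning follows, in the spirit of (Schwenker et al., 2001): on each pair (x, y) the winning centre is dragged towards x and the output weights take an error-driven step. The exact rules in (B.3)-(B.4) may differ; scalar inputs and a rate η = 0.1 are used for brevity.

```python
import math

def rbf_predict(centers, weights, x):
    # Gaussian RBF: h(||x - c||) = exp(-0.5*(x - c)^2), scalar inputs for brevity
    return sum(a * math.exp(-0.5 * (x - c) ** 2) for c, a in zip(centers, weights))

def rbf_update(centers, weights, x, y, eta=0.1):
    err = y - rbf_predict(centers, weights, x)
    m = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
    centers[m] += eta * (x - centers[m])        # drag the winning centre towards x
    for j in range(len(weights)):               # error-driven weight step
        weights[j] += eta * err * math.exp(-0.5 * (x - centers[j]) ** 2)

centers, weights = [0.0, 1.0], [0.0, 0.0]
for _ in range(40):                             # repeated pairs (0.5, 1.0)
    rbf_update(centers, weights, 0.5, 1.0)
print(round(rbf_predict(centers, weights, 0.5), 2))  # -> 1.0
```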

Appendix C. Upper-bound Probability of Model Adaptation
Based on Algorithm 1 (Line 15), the ED i decides on adapting the local model f_i given a classified familiar input-output pair (x, y), should the prediction error exceed the (adjusted) error tolerance threshold θ_O. This event, coined here A = {ModelAdaptation}, occurs with probability P{A} = P(‖x − w_k‖ > θ_I)P(e > θ_O), given that w_k is the closest prototype to input x, i.e., the pair is classified as not familiar given the familiarization threshold θ_I, and the model prediction error exceeds θ_O. Let F_e(x) = P(e ≤ x) be the cumulative distribution function of the model prediction error e. Then, P{A} = (1 − P(‖x − w_k‖ ≤ θ_I))(1 − P(e ≤ θ_O)) = (1 − F_e(θ_O))(1 − P(‖x − w_k‖ ≤ θ_I)). Based on Markov's inequality, we obtain that P{A} ≤ (1 − F_e(θ_O)) 𝔼[‖x − w_k‖]/θ_I, which is an upper bound on the model adaptation probability.
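A quick numeric illustration of the bound P{A} ≤ (1 − F_e(θ_O)) 𝔼[‖x − w_k‖]/θ_I follows; all numbers are assumed values, chosen only to show that as prototypes converge (𝔼[‖x − w_k‖] → 0) the bound, and hence the adaptation rate, vanishes.

```python
def adaptation_bound(err_cdf_at_tol, mean_proto_dist, theta_i):
    """Markov-based upper bound: (1 - F_e(theta_O)) * E[||x - w_k||] / theta_I."""
    return (1.0 - err_cdf_at_tol) * mean_proto_dist / theta_i

# Assumed values: F_e(theta_O) = 0.9, E[||x - w_k||] = 0.05, theta_I = 0.5
print(round(adaptation_bound(0.9, 0.05, 0.5), 3))  # -> 0.01

# As the prototypes converge, the bound shrinks towards zero
print(round(adaptation_bound(0.9, 0.005, 0.5), 4))  # -> 0.001
```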
Upon convergence of the input-error prototypes we obtain that 𝔼[‖x − w_k‖] → 0, which derives from the fact that P(‖x − w_k‖ ≤ θ_I) → 1, as proved in Appendix A.2. Hence, in this case, the upper bound on the model adaptation probability tends to zero, indicating that the rate of adaptation decreases as long as the input-error vectorial space is quantized to minimize both the EQE and EPE. At this stage, the model does not require adaptation, which is anticipated since the ED i classifies the input-output pairs as familiar. This indicates the capability of Algorithm 1 to ensure minimization of the expected computational overhead on the EDs via the proposed hetero-associative inference mechanism. However, in a dynamic environment, concept drifts over the input-output distribution are anticipated. Hence, the upper bound on the probability of model adaptation can be envisaged as the model adaptation rate over a specific time horizon. A candidate mechanism to handle this context is to monitor a descriptive statistic of the most recent familiar input-output pairs, e.g., a moving centroid of the N_i recent familiar (x, y) pairs received at ED i. Given a significant change in this descriptive statistic (e.g., detected by the CuSum algorithm (Basseville and Nikiforov, 1993)), the local model can be triggered to adapt/retrain based on these recent familiar input-output vectors, which is expected to reduce the inherent complexity of adapting on every familiar pair received. This is left for our future research agenda.

Dr Christos (Chris) Anagnostopoulos is a Lecturer (Assistant Professor) in Distributed and Pervasive Computing in the School of Computing Science at the University of Glasgow. His expertise is in the areas of network-centric pervasive & adaptive systems and in-network information processing in large-scale distributed computing networks. He has received funding for his research from the EC/H2020, UK EPSRC, and industry. Dr Anagnostopoulos is coordinating (Principal Investigator) the projects EU H2020/GNFUV and EU H2020 Marie Sklodowska-Curie (MSCA)/INNOVATE, and is a co-PI of the EU PRIMES and UK EPSRC CLDS projects. Dr Anagnostopoulos is an author of over 140 refereed scientific journal/conference articles. He is leading the Essence: Pervasive & Distributed Intelligence theme within the Knowledge and Data Engineering Systems Group (IDA Research Section). He is also an associate member of the Networked Systems Research group (NETLAB). Before joining Glasgow, Dr Anagnostopoulos was an Assistant Professor at the Ionian University and an Adjunct Assistant Professor at the University of Athens and the University of Thessaly. He has held postdoctoral positions at the University of Glasgow and the University of Athens in the areas of in-network context-aware mobile computing systems. He holds a BSc, MSc, and PhD in Computing Science from the University of Athens. He is an associate fellow of the HEA and a member of the ACM and IEEE.