Material Measurement Units for a Circular Economy: Foundations through a Review

Long-term availability of minerals and industrial materials is a necessary condition for sustainable development as they are the constituents of any manufacturing product. To enhance the efficiency of material management, we define a computer-vision-enabled material measurement system and provide a review of works relevant to its development with particular emphasis on the foundations. A network of such systems for wide-area material stock monitoring is also covered. Finally, challenges and future research directions are discussed. As the first article bridging industrial ecology and advanced computer vision, this review is intended to support both research communities towards more sustainable manufacturing.


Introduction
In contrast to the current linear economy, which can also be described as "make, take, dispose", the circular economy aims to break the link between economic growth and consumption of finite resources [1]. In a circular economy, materials are recirculated, reduced, reused, recycled and recovered, and the overall goal is "to keep products, components and materials at their highest utility and value, at all times" [1]. A key benefit of a circular economy is its contribution to environmental protection; within an EU context, this concerns specifically the carbon and biodiversity targets that are at the core of the European Green Deal [2,3,4] and its ambition to move to a clean, circular and sustainable economy.
A circular economy could also bring benefits in terms of jobs. A recent EU study showed that a net increase in jobs and GDP of 700,000 and 0.5%, respectively, could be achieved by 2030 through moving to a more circular economy [5]. Such a transition would support the United Nations Sustainable Development Goals (SDGs), including SDG11 Sustainable Cities and Communities, SDG12 Responsible Production and Consumption, and SDG13 Climate Action [6]. However, despite the obvious benefits, there is still much progress to be made. The EU produced 2.6 billion tonnes of waste in 2016, while in 2017 almost one quarter of waste (24%) in the EU-28 was sent to landfill; values vary across member states, with five states landfilling over 70% of municipal waste [7]. The OECD has predicted that global materials use will be more than double the 2011 value by 2060 [8], while global annual waste generation is expected to increase by 70% by 2050 [9].
A key question is therefore: how do we move from our current linear economy to a circular economy for materials? Initiatives are needed on many fronts, but central to the transition is measurement and tracking, as we cannot manage what is not measured. Consider the water network as an analogy. Clean water can be tracked from extraction, through the treatment stages and distribution network, and finally at the end user. Wastewater can also be measured as it travels in the network to the treatment works before final sludge disposal and release of effluent. Measurement of water throughout its cycle can have a positive impact on resource efficiency, as leaks and hotspots can be identified and better management practices developed [10]. Like water, materials are a state of matter. The hypothesis arises that if we can track water and use this information to improve resource efficiency, then we can also track materials through their life cycle, from material extraction to manufacturing, retail, the consumer and finally to waste, reuse or recycling.
Materials, however, are heterogeneous, do not flow through pipes like water and so cannot be measured and tracked in the same way. New systems are therefore required, but existing research on the topic is limited. Previous reviews of automation for materials have focused on waste management, with only two such reviews found by the authors. To aid the planning and design of sustainable new systems, Hannan et al. [11] reviewed technologies in solid waste monitoring and management systems and developed classifications to enable selection of ICTs for a particular problem. Gundupalli et al. [12] reviewed automated sorting of source-separated municipal solid waste for recycling with a view to supporting system designers and identifying areas for future research. The paper concluded that there is a need for robotic systems to deal with mixed wastes in large landfill sites in developing countries. The paper also highlighted that traditional person-based managing of wastes is highly labour intensive and can be dangerous and expensive.
These previous reviews made interesting contributions, but focused only on technologies for the waste sector and did not consider the holistic life cycle of materials as is required for a circular economy, which is about more than just waste. A further gap in previous research is the lack of detailed exploration of the potential of computer vision to support the transition to a circular economy. With the overall goal of improving the management of natural resources, the aim of this paper is to provide a guide for the design of a material monitoring sensor network enabled by computer vision that could be used across the entire manufacturing life-cycle, from raw material extraction to manufacturing, retail, consumers, waste sorting, recycling and landfill.
Our contributions. The main contributions of this review are the following.
I. We define a computer-vision-enabled system to monitor stocks of materials and we provide a review of works relevant to its development with particular emphasis on the foundations. Then, we provide a guide towards the design of a material monitoring sensor network.
II. This is the first article, review or otherwise, that bridges the gap between computer vision and the holistic perspective on manufacturing supply chains that is at the core of industrial ecology and circular economy. Indeed, as we will show, while previous works have proposed vision systems for waste management or material recognition, we differ from the former by looking at manufacturing materials regardless of their life-cycle stage and from the latter by putting material recognition within the context of natural resource management.
III. We cover the latest advances in computer vision, from the theory to implementation aspects and related challenges.
IV. Finally, we delineate five research directions which may inspire both computer engineers with environmental concerns and industrial ecologists or circular economists interested in exploiting the latest advances in computer vision.
The paper is organized as follows. Section 2 covers the background literature on computer vision, Section 3 explains the methodology used to select the references included in this review which is provided in Section 4; then Section 5 summarizes the outcome of this review and discusses the main challenges and future research directions; finally Section 6 concludes.

Background Literature on Computer Vision
Computer vision is a research area concerned with making useful decisions about real physical objects and scenes based on sensed images [13]. It is a subfield of artificial intelligence and consists of designing a signal-to-symbol converter [14]: cameras provide signals (i.e. measurements) about the physical world and the computer vision model converts them into symbolic representations, e.g. the word/symbol "cat" if the image depicts a cat.
Research in computer vision spans more than forty years [15]. State-of-the-art techniques can be divided into two broad categories: hand-designed feature methods (i.e. classical machine learning) and representation learning methods (i.e. deep learning). For the foundations of the field, the reader should refer to well-established books such as [15] for an updated and general treatment, [16] for a focus on the first category of techniques and [17,18] for a focus on the second category. This section provides the general concepts of the two broad categories. The main difference between hand-crafted feature and representation learning methods is depicted in Fig. 1: the former require an expert to design the algorithm/model capturing the characteristic features of the image of interest, whereas the latter assume that the computer learns them during a training phase; as a consequence, the former require expert domain knowledge but fewer computing resources than the latter. Because deep representation learning maps the input images directly to the output symbols, the second category is also referred to as the end-to-end learning approach [19].

Hand-Crafted Feature Methods
One of the simplest methods is based on bag-of-words (BoW) [15], also called bag-of-features, which identifies a set of keywords (i.e. features) for each image analogously to text documents that are described by the word content. Then, a test image is assigned to the class with the closest word/feature composition. One of the first BoW recognition systems was proposed in [20] (see also [21]): the image patches are detected using Harris affine detectors [22], which in turn are used to compute the scale invariant feature transform (SIFT) descriptors [23]; then, a histogram of visual words is used as the input vector to a machine learning classifier. Other types of detectors and descriptors are compared in [24].
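As a rough illustration of the bag-of-words idea described above, the following Python sketch maps local descriptors to their nearest visual word, builds a normalized histogram and assigns the class with the closest histogram. It is a minimal sketch under stated assumptions: the two-dimensional descriptors, the three-word vocabulary, the reference histograms and the function names are all invented for illustration, and the L1 nearest-histogram rule stands in for the machine learning classifier used in [20].

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Normalized histogram of visual words for one image.

    descriptors: (n, d) array of local descriptors (stand-ins for SIFT vectors).
    vocabulary:  (k, d) array of visual words, assumed to be learned beforehand.
    """
    # Squared Euclidean distance from every descriptor to every visual word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # index of the nearest word for each descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

def classify(hist, class_histograms):
    """Return the class whose reference histogram is closest in L1 distance."""
    return min(class_histograms,
               key=lambda c: np.abs(hist - class_histograms[c]).sum())

# Toy data (hypothetical values, not taken from any dataset).
vocabulary = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
class_histograms = {"plastic": np.array([0.8, 0.1, 0.1]),
                    "glass": np.array([0.1, 0.1, 0.8])}
test_descriptors = np.array([[0.1, 0.9], [0.0, 1.1], [0.9, 0.1], [0.1, 1.0]])

test_hist = bow_histogram(test_descriptors, vocabulary)
label = classify(test_hist, class_histograms)
```

Note that the histogram discards where in the image each descriptor was found, which is the simplification that part-based models later remove.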
Bag-of-words models are the simplest because they do not consider the geometric relationships between different parts and features [15]. While this makes them particularly efficient, higher inference accuracy is provided by part-based models which focus on the geometric relationships between the constituent parts of the object [25]. More details on part-based modeling can be found in [26].
An approach even more accurate than part-based modeling takes into account the context in which the object with its constituent parts occurs [27]. Combinations of part-based and context models in the same vision system have also been proposed [28,29].
As visible in Fig. 1, the final block in the pipeline is the classifier. One of the simplest classification algorithms is the k-nearest neighbors which consists of finding the k training samples closest to the new sample and evaluating its class knowing the class of the neighbors [15]. A library with nearest neighbors algorithms for large training datasets is presented in [30].
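The k-nearest neighbors rule described above can be sketched in a few lines of Python. This is an illustrative brute-force version, assuming Euclidean distance and majority voting; the feature vectors and labels are made up, and the library in [30] addresses the large-dataset case that this sketch ignores.

```python
import numpy as np
from collections import Counter

def knn_predict(x, train_x, train_y, k=3):
    """Predict the class of x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(train_x - x, axis=1)  # distance to every training sample
    nearest = np.argsort(dists)[:k]              # indices of the k closest samples
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D feature vectors with their material labels.
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
train_y = ["glass", "glass", "glass", "metal", "metal", "metal"]

prediction = knn_predict(np.array([0.95, 1.0]), train_x, train_y, k=3)
```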
The k-nearest neighbors is a non-parametric approach since it does not define a model of learned parameters from the training set. A simple parametric classification algorithm is multiclass logistic regression (despite the name, this method is not for regression), which learns a linear model and applies the softmax function to the model output to give the probability of the class $C_i$ given the input feature vector $x$, i.e. $p(C_i \mid x)$ [31].
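To make the softmax step concrete, the sketch below computes the class probabilities of a linear model for one feature vector. The weight matrix, bias and input are arbitrary illustrative values, not learned parameters, and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Map a vector of scores to probabilities that sum to one."""
    z = z - z.max()  # shift for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / e.sum()

def class_probabilities(x, W, b):
    """Class probabilities of a linear model: W is (classes, features), b is (classes,)."""
    return softmax(W @ x + b)

# Illustrative parameters for three classes and two features.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
p = class_probabilities(np.array([2.0, 0.5]), W, b)
predicted_class = int(p.argmax())
```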
In some cases there are multiple possible surfaces that correctly divide the training samples into their classes. In these cases kernel support vector machines (SVMs) define the decision boundary as the one that maximizes the distance between the training set classes [31]. A review of kernel-based methods for computer vision can be found in [32].
Another approach consists of using decision trees having a graph structure. The key idea of this approach is to divide the complex classification task into simpler tests that are hierarchically organized [15]. For example, assume we have an image of an outdoor garden; to classify it as "outdoor garden" the problem can be split in two subsequent steps: the first answering "Is there sky at the top?" and, if true, the second answers "Is the bottom part green?" [33].
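The garden example can be written as two nested tests. The sketch below is only a toy: the colour rules (mean blue dominating the top third, mean green dominating the bottom third), the function names and the fallback labels are assumptions made for illustration and are not taken from [33].

```python
import numpy as np

def sky_at_top(img):
    """Test 1: is the top third of the RGB image predominantly blue?"""
    r, g, b = img[: img.shape[0] // 3].reshape(-1, 3).mean(axis=0)
    return b > r and b > g

def bottom_is_green(img):
    """Test 2: is the bottom third of the RGB image predominantly green?"""
    r, g, b = img[-(img.shape[0] // 3):].reshape(-1, 3).mean(axis=0)
    return g > r and g > b

def classify_scene(img):
    """Apply the two tests hierarchically, as in a depth-two decision tree."""
    if not sky_at_top(img):
        return "other"
    if bottom_is_green(img):
        return "outdoor garden"
    return "other outdoor scene"

# Toy 6x4 RGB images: blue on top of green, and a uniform gray image.
garden = np.zeros((6, 4, 3))
garden[:3, :, 2] = 1.0
garden[3:, :, 1] = 1.0
indoor = np.full((6, 4, 3), 0.5)
```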

Representation Learning Methods
The most used computer vision methods belonging to this second category are based on convolutional neural networks (CNNs) [34,18]. CNNs consist of a network architecture designed to perform computations emulating multiple connected layers of neurons in a fashion similar to the neural network of a human brain. Most CNN-based computer vision systems require five components: the data, the neural model, the training algorithm, the cost function (aka "loss function") and the performance metric [17].
I. Data. Representation learning methods are algorithms that generate a nonlinear map between input and output data. The data contain the information about the task being addressed and the neural model has to be trained to fit the specific data. For example, if we need a vision system that recognizes the digits from 0 to 9, the input data are the images of the digits and the output is an integer that uniquely identifies the type/class of an input image (see [35] for a MATLAB tutorial). A vision algorithm sees each image as a matrix whose elements specify the position and the color of each pixel. If the neural model is trained properly, it will classify the images of 0-9 digits with high accuracy. This is analogous to the way we learned such a classification: when we were born, an image depicting a zero had no meaning; subsequently, at school, we were trained to associate the picture of a digit with a quantity, e.g. zero apples, zero toys.
II. Neural model. A neural model is an artificial emulator of a biological neural network. A single artificial neuron computes a linear combination of the input signals using the network weights and then applies a nonlinear function (called "activation function") that produces a nonlinear mapping between the input signal and the neuron output. While the model of a single artificial neuron is simple, complex models can be achieved when multiple neurons are connected in a neural network. A neural network is a stack of multiple layers and each layer is a set of neurons. A shallow neural network becomes a deep one by stacking multiple layers. The weights define the intensities of the connections between the neurons. The architecture of the neural model is designed before training and the weights are usually initialized with random values. Subsequently, the values of the weights change during training in order to adapt the neural network to the data. The training algorithm discussed below decides how to change the weights during training in order to fit the data.
III. Training algorithm. The training algorithm (aka "optimizer") defines how to update the weights during the training taking into account the data along with the loss function being optimized. Hence, it optimizes the network weights for the specific task. In general, the best training algorithm is the one that provides the best weight values in the shortest time. For an introduction to the different types of training algorithm that have been developed, the reader could refer to [17], Chapter 8.
IV. Loss function. The loss function is a mathematical description of the goal of the neural model. Recalling the example of digit classification mentioned above, the loss function employed is cross-entropy (detailed in [31], Chapter 4), which measures the classification error committed by the model during training. Therefore, minimizing the cross-entropy function, in practice, means minimizing the classification error of the model. Since loss functions usually have multiple local minima, there is no guarantee that the optimizer will find the global optimum.
V. Performance metric. Once the training is complete, the weights of the network are adapted to the data according to the loss function. The performance metric is used to evaluate the accuracy of the trained model. Recalling the example of the digits, a performance metric could be the percentage of correct classifications made by the model when processing a sequence of digit images. The performance metric changes if the task differs from classification, e.g. mean squared error for regression problems or average precision in object detection [36].
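The five components can be seen working together in a deliberately tiny example. The sketch below is an assumption-laden illustration rather than a CNN: the data are three synthetic clusters of 2-D points, the "neural model" is a single linear layer with softmax, the optimizer is plain gradient descent with a hand-picked step size, and accuracy is measured on the training data itself for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# I. Data: three synthetic 2-D clusters standing in for image features.
centres = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
labels = np.repeat(np.arange(3), 50)
inputs = centres[labels] + 0.5 * rng.standard_normal((150, 2))
targets = np.eye(3)[labels]  # one-hot encoding of the class labels

# II. Neural model: one linear layer followed by softmax, random initial weights.
W = 0.01 * rng.standard_normal((2, 3))
b = np.zeros(3)

def forward(x):
    z = x @ W + b
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# IV. Loss function: mean cross-entropy between predictions and targets.
def cross_entropy(p, t):
    return -np.mean(np.sum(t * np.log(p + 1e-12), axis=1))

# III. Training algorithm: gradient descent with a fixed learning rate of 0.1.
losses = []
for step in range(500):
    p = forward(inputs)
    losses.append(cross_entropy(p, targets))
    grad_z = (p - targets) / len(inputs)  # gradient of the loss w.r.t. the scores
    W -= 0.1 * inputs.T @ grad_z
    b -= 0.1 * grad_z.sum(axis=0)

# V. Performance metric: fraction of correctly classified samples.
accuracy = np.mean(forward(inputs).argmax(axis=1) == labels)
```

In a real system the metric would be computed on held-out images, and the single layer would be replaced by a deep architecture.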
By definition, a CNN has at least one neural layer that performs the convolution operation [17]. An example of a CNN architecture is depicted in Fig. 2: training images are given as input and, once the end-to-end learning is completed, the resulting model gives the conditional probabilities $p(C_i \mid x)$, where $C_i$ is the i-th class (e.g. car, bicycle) and $x$ is the input image. Therefore, the features are embedded into the model parameters defined through the optimization of the cost function. A key reason that has made CNNs particularly successful for computer vision over other neural network architectures is that they accept a two-dimensional input and, through convolutions, perform two-dimensional operations. Hence the pixels of the input image are processed preserving their original relative position. Different CNN architectures have been proposed over the years [37]. For example, in [38] several local regions of the input image are first identified, then a large CNN learns the features of each local region, and finally the content of each region is classified using a linear SVM per class. Subsequently this CNN architecture has been sped up in [39,36].
Figure 2: Example of the architecture of a simple convolutional neural network (convolutional layers for feature learning, followed by fully connected and softmax layers for classification into classes such as car, truck, van and bicycle): from the input image the features are learned with hierarchical levels of abstraction using multiple convolutional and other types of layers such as ReLU and pooling (gray blocks); those operations transfer the local information of a layer into the next (blue blocks). Finally, the softmax layer converts the features into classes processing the output of one or more fully-connected layers (last blue blocks at the end of the pipeline) [17].
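The way a convolutional layer preserves the relative position of pixels can be illustrated with a direct implementation of the sliding-window operation. The sketch below is a simplified single-channel version without padding or stride; as in most deep learning libraries, it computes a cross-correlation (the kernel is not flipped). The image and kernel values are invented for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over a 2-D image and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 5x5 image with a vertical edge between columns 2 and 3.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# A horizontal-difference kernel that responds to vertical edges.
kernel = np.array([[-1.0, 1.0]])

feature_map = conv2d_valid(image, kernel)
```

The response is non-zero only in the column where the edge lies, so the feature map keeps the spatial layout of the input.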
The convolutional layers of a basic CNN can process images of arbitrary size, but the size of their outputs (also called feature maps) depends on the input image size. As a consequence, a trained CNN may have an architecture which is suitable for processing images of a fixed size (say 256 × 256), but may not be adaptable to process images of a different size (say 128 × 128). Therefore, [40] proposes placing a layer after the last convolutional layer in order to make it possible for the network to process different image sizes.
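One way to obtain a fixed-size output from feature maps of varying size is to pool over a fixed grid of bins whose extent scales with the input. The sketch below shows this idea with max pooling; it is an illustrative assumption and is not claimed to be the specific layer proposed in [40]. The function name and grid size are hypothetical.

```python
import numpy as np

def pool_to_grid(feature_map, grid=2):
    """Max-pool a 2-D feature map of any size (at least grid x grid) to grid x grid."""
    h, w = feature_map.shape
    rows = np.linspace(0, h, grid + 1).astype(int)  # bin boundaries along the height
    cols = np.linspace(0, w, grid + 1).astype(int)  # bin boundaries along the width
    out = np.empty((grid, grid))
    for i in range(grid):
        for j in range(grid):
            out[i, j] = feature_map[rows[i]:rows[i + 1], cols[j]:cols[j + 1]].max()
    return out

small = pool_to_grid(np.arange(16.0).reshape(4, 4))
large = pool_to_grid(np.arange(35.0).reshape(5, 7))
```

Both inputs yield a 2 × 2 output, so any layer that follows can assume a fixed input size.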
As discussed for Fig. 2, typically the output of the model is the conditional probability $p(C_i \mid x)$. The fully convolutional network in [41] instead gives such a probability for every pixel, i.e. the output is a two-dimensional matrix giving the class of each pixel. Subsequently, the region-based classification approach cited above (i.e. [38]) has been combined in [42] with a fully convolutional network.
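The per-pixel output can be pictured as one score map per class, turned into probabilities and then into a class index at every pixel. The sketch below assumes the network has already produced such score maps; the array values and the function name are invented for illustration.

```python
import numpy as np

def pixel_class_map(scores):
    """scores: (num_classes, H, W) per-pixel class scores.

    Returns the per-pixel class probabilities and the (H, W) map of class indices.
    """
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=0, keepdims=True)          # softmax over the class axis
    return probs, probs.argmax(axis=0)

# Hypothetical scores for two classes on a 2x3 image.
scores = np.array([[[2.0, 0.0, 0.0],
                    [2.0, 2.0, 0.0]],
                   [[0.0, 1.0, 1.0],
                    [0.0, 0.0, 3.0]]])
probs, class_map = pixel_class_map(scores)
```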
A more complete object classification system is proposed in [43], where for a given input image depicting a scene, the model infers both the class of every detected object and a bounding box around each of them to locate its spatial position. Specifically, the model divides the input image into a grid and, stating the problem as a regression task, for each grid cell it predicts multiple bounding boxes, the confidence for those boxes and the object class probabilities. The authors initially define a smaller CNN and then, once it is pre-trained, they convert the model to perform detection by adding four convolutional layers and two fully connected layers and by increasing the input resolution of the network from 224 × 224 to 448 × 448. Subsequently, this algorithm has been sped up and made more accurate in [44].
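To visualize what a grid-based detector predicts, the sketch below stores, for each grid cell, several boxes with a confidence value followed by the class probabilities, and then reads out the most confident box of one cell. The tensor layout (five numbers per box, then the class probabilities), the ranking of boxes by confidence, the combined score and all numeric values are assumptions made for illustration; they are not guaranteed to match the exact encoding of [43].

```python
import numpy as np

def decode_cell(cell, num_boxes, num_classes):
    """Return (box, score, class_index) for the most confident box of one grid cell.

    Assumed layout of `cell`: num_boxes * [x, y, w, h, confidence], then class probabilities.
    """
    boxes = cell[: num_boxes * 5].reshape(num_boxes, 5)
    class_probs = cell[num_boxes * 5: num_boxes * 5 + num_classes]
    best = boxes[:, 4].argmax()                # box with the highest confidence
    score = boxes[best, 4] * class_probs.max() # hypothetical combined score
    return boxes[best, :4], float(score), int(class_probs.argmax())

# A 2x2 grid, two boxes per cell and three classes: 2 * 5 + 3 = 13 values per cell.
grid_size, num_boxes, num_classes = 2, 2, 3
prediction = np.zeros((grid_size, grid_size, num_boxes * 5 + num_classes))
prediction[0, 1] = [0.5, 0.5, 0.2, 0.3, 0.9,   # first box and its confidence
                    0.4, 0.4, 0.1, 0.1, 0.2,   # second box and its confidence
                    0.1, 0.7, 0.2]             # class probabilities

box, score, class_index = decode_cell(prediction[0, 1], num_boxes, num_classes)
```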
For a more comprehensive treatment of CNNs the reader should refer to [18,17] for the basics, [37] for a review of several variants and [45] for a brief review of the most popular deep learning algorithms for computer vision (not just CNNs).

Methods
This review seeks to gather the relevant literature for researching and developing the computer-vision-based technology detailed in the next section. The references were selected by following the map of keywords/concepts shown in Fig. 3, where Computer Vision has a central position since it defines the basis. The orange and blue arrows connect the primary and secondary concepts/keywords used in this review, respectively. Computer Vision is connected to Deep Learning on the right side since we emphasize vision algorithms based on representation learning rather than on hand-crafted features. At the bottom, it is connected to "CV + food" because food is a particular type of manufacturing product, hence works focused on both computer vision and food can be adapted to focus on other manufacturing products (see next section for details). On the left side, Computer Vision is connected to "Network of multiple units" since a key idea discussed in the next section is to design a network of multiple material measurement devices sending data to a central database; single units could be implemented on trucks, bins, smartphones, sorting plants or industrial robots. The final aim of such a network is to improve mapping and quantification of materials in a target area.

Material Measurement Unit
The monitoring system is called a Material Measurement Unit (MMU) and is defined as follows.
Definition 1. An MMU is a complex sensor that, through a computer vision algorithm, receives images of objects as input (e.g. RGB images, X-ray images, depth images) and provides as output information about the material composition of the object. The fundamental output measurements are (1) the class of material and (2) the mass of material.
Essentially, an MMU is a converter from object images to material measurements such as the class and the mass. Figure 4 provides an overview of the MMU context and purpose: the inputs to the system are "Images" as in the green rectangle at the bottom. These are collected or generated through the "Sources" indicated by the orange arrows on the right side; the images are processed by the MMU internal "Model" (i.e. the second green rectangle); the outputs of the model are "Material measurements" having the "Types" specified by the orange arrows; the fourth green rectangle mentions a "Sensor network", which is realized if multiple MMUs are implemented on different platforms and interconnected; the final "Purpose" of such a sensor network is monitoring the material stocks and flows for more sustainable natural resources management. For example, such a technology could provide further input data to material flow analysis (MFA) studies [46,47,48] or extend existing material flows and stocks databases such as the Yale stocks and flows database (YSTAFDB) [49]. Summarizing, the whole pipeline in Fig. 4 reads as follows from the bottom to the top: an MMU processes single images or video frames recorded through cameras to determine the type and the mass of materials depicted on the images or frames by using a computer vision algorithm; multiple MMUs could be deployed in different locations; the material locations could be provided by the GPS of the device implementing the MMU; the final goal of this distributed network of units is providing real-time material mapping and quantification data for MFA studies to improve the material flow circularity of a target urban area.
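The data produced by a network of MMUs can be pictured as a stream of records, each holding the material class, the mass and the location reported by one unit. The sketch below is a hypothetical data model: the record fields, unit identifiers, coordinates and the aggregation function are assumptions for illustration, not a specification of the system.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class MaterialMeasurement:
    """One output record of a (hypothetical) MMU."""
    unit_id: str      # which unit produced the measurement
    material: str     # fundamental output (1): class of material
    mass_kg: float    # fundamental output (2): mass of material
    latitude: float   # location from the GPS of the hosting device
    longitude: float

def stock_by_material(measurements):
    """Aggregate the mass reported by all units, per material class."""
    totals = defaultdict(float)
    for m in measurements:
        totals[m.material] += m.mass_kg
    return dict(totals)

# Made-up measurements from two units.
measurements = [
    MaterialMeasurement("truck-01", "glass", 1.5, 53.35, -6.26),
    MaterialMeasurement("truck-01", "plastic", 0.4, 53.35, -6.26),
    MaterialMeasurement("bin-07", "glass", 0.5, 53.34, -6.25),
]
stocks = stock_by_material(measurements)
```

A central database receiving such records could then answer how much of each material is present in a target area.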
An MMU can be seen as made up of three components, each one dedicated to a specific task:
I. Component 1: material recognition
II. Component 2: object recognition
III. Component 3: volume or mass estimation.
Given that the density of a material is typically a known parameter [50], the volume estimate of a product evaluated by Component 3 yields an estimate of the mass through mass = density × volume. When processing an object with hidden parts such as a mobile phone, Component 1 would recognize the plastic of the case and the glass of the screen, but not the internal electronics, whereas Component 2 could recognize the specific model of a phone and read the corresponding full list of materials from a database. If instead waste plastic packaging is being processed, the complex shape of the damaged packaging makes Component 1 preferred to Component 2 because it focuses on the texture of plastic ignoring the unpredictable shape of the damaged object. The next three subsections are ordered considering the scale-up of the system: Subsection 4.2 focuses on Component 1 covering works from the computer vision literature on material recognition; Subsection 4.3 adds Components 2 and 3; finally Subsection 4.4 proposes a distributed monitoring system exploiting multiple MMUs. Details on the topics covered by each subsection are captured on the left side of Fig. 4.
Figure 4: Overview of the MMU system and its purpose: an MMU receives images as input and seeks to measure material stocks in a desired location. Multiple MMUs could be connected as a sensor network and provide real-time material mapping and quantification data to MFA studies. The corresponding sections of the review are indicated along the left side (a).
(a) Circular economy image sourced from: https://www.portoprotocol.com/circular-economy-as-a-way-of-increasing-efficiency-in-organizations/
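The conversion from an estimated volume to a mass is a single multiplication once the material class is known. The sketch below assumes a small lookup table of approximate densities; the table entries are rough illustrative values and the function name is hypothetical.

```python
# Approximate densities in kg per cubic metre (illustrative values only).
DENSITY_KG_PER_M3 = {
    "glass": 2500.0,
    "aluminium": 2700.0,
    "PET": 1380.0,
}

def estimate_mass(material, volume_m3):
    """mass = density x volume, using the density of the recognized material."""
    return DENSITY_KG_PER_M3[material] * volume_m3

# Example: Component 1 reports "glass" and Component 3 reports 0.0001 cubic metres.
mass_kg = estimate_mass("glass", 1e-4)
```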

Computer Vision Focusing on Materials or Waste
To date, typical applications of material recognition in robotics and manufacturing are in the enhancement of robotic grasping of objects or product visual quality assessment [51,52]. Instead, within the context of resources management for material flow circularity, an autonomous or semi-autonomous recognition of the types of materials as in Fig. 4 could improve the mapping of resources by answering the question "what types of materials are in a target area?". One of the three components of an MMU is the material recognizer, hence here we discuss previous works on computer vision for material or waste recognition.
In [53] CNNs are trained to recognize the traits of materials using weakly-supervised learning. Particularly relevant is the result that their system is able to segment the scene with masks purely based on the material appearance, which is a local attribute, hence it does not rely on the particular shape of the objects. As noted previously, this can be useful when it comes to automatically sorting trash because products thrown away have a non-standard shape caused by the damage they experience, e.g. an empty plastic bottle could be compressed to save space, a glass bottle could be broken.
Hand-crafted-feature-based material recognition is proposed in [51]; both a generative and a discriminative model are then trained from the extracted features. Moreover, to prevent the overfitting of the generative one, a greedy algorithm [54] is designed to add one feature at a time as long as the recognition rate increases. A concluding suggestion of Sharan et al. [51] is that the system accuracy could benefit from including the modeling of non-local features, e.g. object shape, correlated with the local surface appearances.
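The greedy strategy of adding one feature at a time while the recognition rate increases can be sketched as follows. The scoring function here is a stand-in: in practice it would train and validate a model on the candidate feature subset, whereas this sketch uses made-up additive gains so that it runs on its own. Feature names and numbers are hypothetical.

```python
def greedy_select(features, score):
    """Forward selection: keep adding the best feature while the score improves."""
    selected, best = [], score([])
    remaining = list(features)
    while remaining:
        # Evaluate every remaining feature when added to the current selection.
        candidate, cand_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        if cand_score <= best:
            break  # no feature increases the recognition rate any further
        selected.append(candidate)
        remaining.remove(candidate)
        best = cand_score
    return selected, best

# Stand-in recognition rate: a base rate plus invented per-feature gains.
GAINS = {"color": 0.30, "texture": 0.20, "shape": 0.05, "noise": -0.02}

def toy_recognition_rate(subset):
    return 0.10 + sum(GAINS[f] for f in subset)

selected, rate = greedy_select(GAINS.keys(), toy_recognition_rate)
```

With these toy gains the "noise" feature is rejected because adding it would lower the rate.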
The authors of [55] focus on "waste in the wild" proposing a two-stage scene segmentation to yield a binary waste detection system, i.e. whether there is waste in the scene or not. At the first stage, the full scene is segmented; at the second stage, a zoom-in is performed around the detected waste and the zoomed image is processed for a fine segmentation of the waste shape. The two-stage approach shows an improvement when compared to single-stage segmentations using the same neural models. For example, an MMU could use the accurate segmentation of the second stage as input to a material recognition or volume estimation system to return the type of material or estimate its mass. The authors of [55] emphasize that the task they have considered is binary, i.e. waste or not waste, because of the lack of available images for several classes of waste types (aka class imbalance, an issue in computer vision covered in Subsubsection 4.5.3) that makes training a segmentation system to recognize the waste type currently unfeasible. A similar problem was experienced in [56], where to address the issue a data augmentation through trash simulation is proposed: given a set of images of objects taken from the trash, i.e. pieces of trash, a new image is generated randomly combining the initial pieces of trash (say 2-6), thus simulating images taken from the trash rather than images of single pieces. Another possible approach for data augmentation is to generate new virtual images using rendering techniques as in [57] or designing a platform with public access that, through crowdsourcing, grows over time as in [58].
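The trash-simulation idea, in which a new training image is composed from a random handful of single-piece images, can be sketched with plain array pasting. This is a minimal illustration under strong assumptions: the pieces are uniform colour patches rather than photographs, they are pasted without masks, rotation or blending, and the canvas size and function name are invented; it is not the procedure of [56].

```python
import numpy as np

def simulate_trash_image(pieces, rng, canvas_size=(64, 64), min_pieces=2, max_pieces=6):
    """Paste a random number of piece images at random positions on a blank canvas."""
    canvas = np.zeros(canvas_size + (3,), dtype=np.uint8)
    n = rng.integers(min_pieces, max_pieces + 1)          # between 2 and 6 pieces
    chosen = rng.choice(len(pieces), size=n, replace=True)
    for idx in chosen:
        piece = pieces[idx]
        h, w = piece.shape[:2]
        top = rng.integers(0, canvas_size[0] - h + 1)      # keep the piece inside
        left = rng.integers(0, canvas_size[1] - w + 1)
        canvas[top:top + h, left:left + w] = piece         # later pieces may occlude
    return canvas, [int(i) for i in chosen]

# Stand-in "pieces of trash": small uniform patches of different sizes.
pieces = [np.full((8, 8, 3), 200, dtype=np.uint8),
          np.full((10, 6, 3), 90, dtype=np.uint8),
          np.full((5, 12, 3), 150, dtype=np.uint8)]

rng = np.random.default_rng(42)
image, used = simulate_trash_image(pieces, rng)
```

The returned list of piece indices can serve as the label of the simulated image.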
Recognizing materials from images depicting daily scenes is more challenging than with images taken in a laboratory because "wild" scenarios show an extremely large variation in lighting conditions, surface textures and perspectives. Hence, the authors of [52] proposed a CNN-based vision system to recognize and segment the materials of images taken from the wild achieving a 73.1% mean class accuracy. Moreover, it is shown that a fine-tuned AlexNet achieved an accuracy about 8% higher than the hand-crafted method proposed in [59] when compared for the dataset they collected for the study. A key conclusion of that work is that having a rich, well-collected, well-segmented and labeled dataset is necessary for accurate material recognition, especially in challenging environments such as real-world scenes.
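Mean class accuracy averages the per-class accuracies, so that rare classes weigh as much as frequent ones. The sketch below shows the computation on made-up labels; it is an illustration of the metric in general and does not reproduce the evaluation protocol of [52].

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred, num_classes):
    """Average of the per-class accuracies over the classes present in y_true."""
    per_class = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            per_class.append(np.mean(y_pred[mask] == c))
    return float(np.mean(per_class))

# Hypothetical labels for three material classes.
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 1, 1, 0, 2])

mca = mean_class_accuracy(y_true, y_pred, num_classes=3)
overall = float(np.mean(y_true == y_pred))
```

Here the per-class accuracies are 0.75, 0.5 and 1.0, so the mean class accuracy differs from the overall fraction of correct predictions.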
Orientation histograms such as scale-invariant feature transform (SIFT) [60] and histograms of oriented gradients (HOG) [61] are the most commonly used low-level features for object recognition. Exploiting a kernel view, the authors of [62] generalize the definition of such low-level features and give insights on how to define novel variants. Subsequently, these kernel descriptors have been used in [63] for both material recognition and object recognition to investigate whether these two recognition tasks are related or not; in particular the study found that using the outputs of an object recognizer improves the material recognizer accuracy, whereas the material recognizer does not help object recognition.
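The core of an orientation histogram is a magnitude-weighted histogram of gradient directions over an image patch. The sketch below is a heavily simplified illustration: it omits the cells, blocks, interpolation and normalization schemes that distinguish SIFT and HOG, and the bin count and function name are assumptions.

```python
import numpy as np

def orientation_histogram(patch, bins=8):
    """Histogram of unsigned gradient orientations, weighted by gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))  # derivatives along rows and columns
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % np.pi         # unsigned orientation in [0, pi)
    hist, _ = np.histogram(angle, bins=bins, range=(0, np.pi), weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist

# A patch whose intensity increases from left to right: all gradients are horizontal.
ramp = np.tile(np.arange(8), (8, 1))
hist = orientation_histogram(ramp)
```

For this ramp all the weight falls into the first bin, which covers orientations close to the horizontal axis.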
In [64] high-level material categories are learned based on low-level and mid-level features specifically designed for material recognition. The authors introduce a set of mid-level features to capture the shape, the reflectance, the microtexture aspect and the color from the image. Using all these features may cause overfitting of the training set and it is not known a priori which features are the most relevant for material recognition. Therefore, an augmented version of the latent Dirichlet allocation algorithm [65] is developed by the authors to perform a greedy feature selection. Finally the selected features are combined to build a material recognition system.
In [66] material classification is performed comparing two modeling approaches: the Varma-Zisserman classifier [64,67], which uses a bank of filters to process the image patches, and the so-called "Joint" classifier, which directly uses the source image patches instead of the filter responses generated by filtering them. The empirical comparison suggested that the "Joint" classifier is more accurate. A comparison in terms of computational time/complexity is not considered in [66], but in the context of MMUs it is an important performance metric because small platforms such as phones or microcontrollers may impose computational limits.
In [68] a CNN extracts the features by processing images of objects acquired by a camera, while another sensor measures the object weight and a third one determines whether the object is made of metal or not. Then, a neural network reads the CNN image features and other engineered features to decide if the waste item is recyclable. The dataset collected for this study is available upon request to the corresponding author. The authors of [69] based their waste recognition system on CNNs and also provided an extensive performance comparison of different architectures and training strategies (e.g. pre-training, transfer learning) both in terms of accuracy and computational time. They finally proposed a reduced-complexity CNN optimized to classify some waste items. A CNN optimized for waste classification is also proposed in [70], while the authors of [71] used deep transfer learning.
The large production of waste electrical and electronic equipment (WEEE) with only about 20% recycling rates and the environmental problems caused by a destroy-then-melt approach for WEEE material recovery have motivated Jahanian et al. [72] to develop a vision system for autonomous e-waste disassembling tasks. In particular, they considered the disassembling of smartphone circuit boards and they built their neural model using the Mask R-CNN [73] as the baseline. From a computer vision point of view, the problem is interesting because it requires the recognition of components with different geometries and sizes within a small space, e.g. the battery and the screws, while from a circular economy perspective e-waste is of primary concern as it contains significant fractions of critical raw materials [74].
Motivated by the negative impact of the contamination of waste items on recycling rates, Ibrahim et al. [75] focused on detecting the contaminating material found in municipal solid waste. This work was a collaboration between a waste management company, an automation company and a university which led to the development of a CNN-based vision system trained with over 30,000 images labeled by the waste management company staff experts. Manually labeling a large number of samples is a tedious process; hence, in some cases, only unlabeled images are available and supervised methods such as CNNs are not suitable. In this case, the model in [76], which proposes a similarity measure for material appearance, could be applied to cluster the unlabeled samples.
Subsection take-aways: several works on material recognition have been proposed seeking to improve robotic grasping; both hand-crafted-feature-based and CNN-based methods were proposed; waste detection in a laboratory setting is easier than "in the wild"; object recognition helps material recognition.

Computer Vision Focusing on Food
Food computing is the research area seeking to make machines able to process food images and extract information such as the type of food in the image, whether there is food or not in the image, how much food is in the image, what is its recipe and how many calories it contains [77]. Such information helps the machine user (e.g. the user of a mobile phone) to monitor his/her diet and modify the diet for the benefit of his/her health [77]. We observe that food is an object, i.e. a manufactured product, made of organic materials, hence research questions and challenges addressed in food computing are closely related to material computing. As we show in this section, the reader can get useful insights from food computing for developing MMUs by simply looking at food as an object, at the recipe as the object material composition, and at the food portion volume estimation as the object component volume estimation. The final goal here remains as depicted in Fig. 4: automation or semi-automation of resources mapping and quantification; a network of MMUs could provide input data of MFA studies to design circular supply chains [46,47,48] or extend existing material flows and stocks databases such as the YSTAFDB [49].

From food to material recognition
For example, the food recognition approach in [78] uses SIFT descriptors [60] to compute the most likely food types appearing in the frames of a video recording a volunteer eating in a restaurant. Similarly, considering that unused or faulty objects accumulated in private houses might be a valuable source of materials, an MMU could process a video of these objects provided by the household to estimate the type and mass of each detected material.
In [79] a SIFT-based bag-of-features model [80] followed by an SVM classifier is designed after an extensive investigation considering a dataset of 5000 food images organized in 11 classes. The final classification accuracy, of the order of 78%, could be similarly achieved optimizing the model to process images taken from trash, which are complex to classify because the objects frequently have a different shape once thrown away (e.g. a bottle deformed to save space, a package that has been damaged to extract the contents).
The authors of [81] use descriptors based on the relative geometric position of the ingredients exploiting the fact that a type of food has ingredients arranged in predictable spatial configurations, e.g. a sandwich has ingredients distributed linearly over multiple layers, a plate of salad has ingredients distributed horizontally all over the plate. A similar modeling approach could effectively exploit the predictable relative position of the components of a manufacturing product, e.g. an electrical machine is composed of a rotor inside a stator, a phone is externally composed of a screen on top of a case, books are multiple layers of sheets.
In [82] the focus was primarily on computationally simple detection models because the system was intended to be implemented on mobile phones for a real-time food recognition application. The authors describe how the application works and its performance considering two implementations: the first one uses a bag-of-features model with SURF features [83], the second uses a Fisher vector model [84] with HOG [61]; both use the extracted image features as input for a linear SVM classifier. Similarly, a mobile phone application for material measurement could be used to process images or videos of unused and faulty products accumulated in the user's house; then, for example, products made of critical raw materials [85,86,87] could be collected in agreement with the householder for their recycling/re-manufacturing.
Based on the observation that food items often have ingredients distributed in slices (i.e. layers), the authors of [88] propose using "slice" convolutional kernels in a CNN [89] to improve the model classification accuracy. The authors also point out that such an accurate model to date requires memory and computational costs too high to be implemented on devices with limited resources (e.g. on mobile phones as in [82]). The idea of adapting the neural layers to the type of target items can be seen as a method to embed a-priori knowledge in the layers architecture; an alternative approach could be adding a-priori knowledge through fine-grained classes able to catch minor differences between target items (e.g. ravioli vs. dumplings, mobile phone of brand A vs. brand B) by formulating a multi-task loss function as proposed in [90].
Transfer learning [91] has been shown to improve the food classification accuracy of neural models compared to models trained from scratch [92,93]. Similarly, the features learned by a network trained on datasets with images of different types of objects could be fine-tuned (either the whole network or just part of it) on smaller datasets of images of products whose materials are of particular interest, e.g. critical raw materials [85,86,87]. To exploit transfer learning, over the years, large neural models have been developed, trained on large datasets, made publicly available and ready to use (see Subsubsection 4.5.2 for details).
Figure 5: An example of the system proposed in [97] adapted to measuring material compositions instead of food calories.
Data augmentation [94] could be applied to increase the number of images available for training. An example application for food recognition is proposed in [95], where new training images are generated rotating, translating and rescaling the original ones. Similarly, as in [96], data augmentation could be used with images of trash pieces to accurately classify their material composition (e.g. paper, plastic).

From food portion to material volume estimation
Given the similarity between processing food and non-food items, below we discuss relevant papers on food calories or food volume estimation.
In [97] the food volume is estimated requiring a single RGB image using the CNN architecture of [98] to work as a virtual depth camera; then, the depth map is converted into a voxel representation; semantic segmentation [99] is performed to identify the pixels corresponding to food and finally, in combination with the voxel representation, the volume of food is estimated. This approach, as pointed out by the developers, simply requires a single image collected "from the wild", hence it is particularly flexible; however, the whole system is quite complex as it combines multiple tasks, each one with its own complexity and with the accuracy/robustness of the whole system depending on the accuracy/robustness of all its components: open-world recognition (i.e. detecting food items from a generic scene), depth measurements from a single RGB image, 3d voxel representation from a 2d image, and scene segmentation. As pointed out by the authors, their system requires further development. Our interest in their system is its application to measure materials. An example of how the system could be adapted to our case is illustrated in Fig. 5.
In [100] a generative adversarial network [101] is used to map the input image depicting a food scene into the corresponding, pixel-by-pixel, energy content. The result is that their model reads the RGB food image and returns as output the energy (i.e. calories) at each pixel, e.g. a pixel without food has zero calories, a pixel belonging to broccoli corresponds to an energy weight lower than a pixel belonging to meat; hence, the energy weight can be seen as an energy density in J/m³. In common with [97], this approach requires a single input image; a difference is in the fact that here the real size of the food is reconstructed using a marker of known size included in the food image rather than using a neural model as a virtual depth camera. Inspired by this work, an MMU could be implemented training a generative model to map the input RGB image of an object to the corresponding pixel-by-pixel density of material in kg/m³; then, the summation of all the weight-densities overlapping each mask of the semantically segmented image gives an estimate of the mass of each labeled material.
An approach based on stereo vision is proposed in [102], hence it requires two images of the food item taken from two different views; after a test on the best compromise between accuracy and efficiency, SURF is chosen for feature extraction; the system requires a reference object placed next to the food to reconstruct the real sizes, and the food should be placed inside an elliptical, flat plate, i.e. bowls are not permitted. Compared to [97], this work has the advantage of being implementable on a computer with limited performance (e.g. a mobile phone). The simplest version implemented in [97] requires the user to select the best labels and exploits the knowledge of the menu of the restaurant the dish belongs to, whereas their flexible and highly automated more complex version is, according to the authors, at a preliminary stage.
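The per-pixel idea inspired by [100] can be illustrated with a minimal sketch. It assumes, for simplicity, that the generative model output has already been rescaled so that each pixel carries a mass contribution in kg, and that a semantic segmentation gives one material label per pixel. The function name `mass_per_material` and the toy values are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def mass_per_material(mass_map, label_map):
    """Sum per-pixel mass contributions (kg) under each segmentation label.

    mass_map:  2D list of floats, assumed per-pixel mass contribution in kg.
    label_map: 2D list of the same shape; None marks background pixels.
    """
    if len(mass_map) != len(label_map):
        raise ValueError("maps must have the same number of rows")
    totals = defaultdict(float)
    for mass_row, label_row in zip(mass_map, label_map):
        if len(mass_row) != len(label_row):
            raise ValueError("maps must have the same number of columns")
        for mass, label in zip(mass_row, label_row):
            if label is not None:
                totals[label] += mass
    return dict(totals)


# Toy 2x3 example with made-up values (not measurements).
mass_map = [[0.0, 0.5, 0.5],
            [0.0, 0.25, 1.0]]
label_map = [[None, "plastic", "plastic"],
             [None, "plastic", "metal"]]
estimate = mass_per_material(mass_map, label_map)
```

In this sketch the accuracy of the per-material estimate depends on both the predicted map and the segmentation masks, which mirrors the dependency between components discussed for [97].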
Note that [102] focuses on food portion volume estimation without evaluating the calories/energy content. The approach could be adapted to estimate the volume of the components of a manufacturing product, which, knowing the density of the detected materials, could then be converted into masses. Such an MMU could run on a mobile phone. Bearing in mind the analogy between volume estimation of food and the volume estimation of manufacturing products to approximate their mass, the interested reader should refer to [77] for further reading on food volume or calories estimation.
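The conversion from estimated volumes to masses can be sketched as follows, under the assumption that each detected component is made of a single material of known density. The function `component_masses` and the example volume are hypothetical; the density of aluminum (about 2700 kg/m³) is a standard handbook value.

```python
def component_masses(components, densities_kg_m3):
    """Convert estimated component volumes into masses (mass = density x volume).

    components:      dict component name -> (material name, volume in m^3)
    densities_kg_m3: dict material name -> density in kg/m^3
    """
    masses = {}
    for name, (material, volume_m3) in components.items():
        if volume_m3 < 0:
            raise ValueError(f"negative volume for component {name!r}")
        masses[name] = densities_kg_m3[material] * volume_m3
    return masses


# Approximate handbook density; the volume below is a made-up example.
densities = {"aluminum": 2700.0}
components = {"case": ("aluminum", 2.0e-5)}  # 20 cm^3 expressed in m^3
masses = component_masses(components, densities)
```

Any error in the volume estimate propagates linearly to the mass, so the volume estimation step is the critical one in such a pipeline.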

From food recipes to material composition of products
An output that MMUs should provide is the list of materials making a target product. In general, we see two approaches to making the system capable of providing such information: (1) embedding it inside the mathematical model during the supervised training phase so that the model learns it (e.g. segmenting the scene labeling the materials as in [52]); (2) providing the MMU with access to a list of materials stored in a database (e.g. as in [97], where the recipe of a detected food item is retrieved from the restaurant menu). Both approaches require the MMU developer to know the material composition of the product of interest. Hence, below we list five methods to collect information about the recipes of manufacturing products, some of which are used in food computing [77] to collect food recipes:
• cookbook-like, i.e. looking at books describing the manufacturing process and material composition of the target product such as the ones in Table 1;
• websites, e.g. the manufacturer website providing the technical specifications of its products, websites collecting material compositions similar to the ones created for recipes (e.g. Allrecipes, RecipeSource);
• research papers reporting the material composition of specific products such as the ones in Table 2;
• performing a chemical composition analysis of the product of interest;
• documentaries such as "How It's Made" [103].
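Approach (2) above, in which the MMU retrieves the material list from a database once a product has been recognized, can be sketched as a simple lookup. The database content, the product label and the mass fractions below are placeholders invented for illustration, not measured compositions; `materials_for` is a hypothetical helper.

```python
# Hypothetical composition database: product label -> mass fraction per material.
# The fractions are placeholders, not real product data.
COMPOSITION_DB = {
    "example_product": {"plastic": 0.6, "metal": 0.3, "glass": 0.1},
}


def materials_for(product_label, total_mass_kg, database=COMPOSITION_DB):
    """Return the estimated mass of each material in a recognized product.

    Returns None when the product is not in the database, so the caller can
    fall back to another method (e.g. direct material segmentation).
    """
    fractions = database.get(product_label)
    if fractions is None:
        return None
    return {material: share * total_mass_kg for material, share in fractions.items()}


known = materials_for("example_product", 2.0)
unknown = materials_for("unlisted_product", 2.0)
```

The quality of such a lookup depends entirely on how the database is populated, which is why the collection methods listed above matter.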
Considering that the recipe of a food item is a list reporting the types and masses of materials/ingredients that compose the desired food type, research in food ingredient recognition could be useful. For example, [130] is one of the first works focusing on ingredient recognition rather than food category recognition, which was the main research trend at that time. The authors proposed a multitask learning technique to recognize, at the same time, both the food type and its ingredients because the features learned to solve one task improve the recognition accuracy in the other task and vice versa and therefore it improves the robustness of the whole system. The authors also pointed out and investigated two key design issues which are valid for MMUs as well: first, "given a deep neural network architecture to be trained in a multi-task fashion, to what extent should the two tasks share network layers?"; and second, "should the tasks be solved in series, i.e. the output scores of a task are the input to the other one, or in parallel?". To answer these questions, four different neural architectures were derived by modifying the VGG 16-layer network [131]. Along with the multi-task learning, a region-wise ingredient recognition approach is proposed in [132] which consists of dividing the input image into many small regions and performing ingredient recognition on each single region. Hence, with region-wise learning, the system also localizes the ingredient spatial distribution over the image along with the list of ingredients. Further works on ingredient recognition can be found in [77]. 
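The region-wise idea of [132] can be sketched in a few lines: the image is divided into small non-overlapping regions and a classifier is applied to each one, so that the output is a grid of labels rather than a single label. This is only an illustration of the tiling step; the threshold-based `toy_classifier` is a stand-in for a trained model and is not the method of [132].

```python
def split_into_regions(image, region_h, region_w):
    """Yield (grid_row, grid_col, patch) for non-overlapping regions of a 2D image."""
    height, width = len(image), len(image[0])
    for top in range(0, height, region_h):
        for left in range(0, width, region_w):
            patch = [row[left:left + region_w] for row in image[top:top + region_h]]
            yield top // region_h, left // region_w, patch


def region_wise_labels(image, region_h, region_w, classify):
    """Apply `classify` to every region and return a dict (grid_row, grid_col) -> label."""
    return {(r, c): classify(patch)
            for r, c, patch in split_into_regions(image, region_h, region_w)}


def toy_classifier(patch):
    """Placeholder classifier: thresholds the mean intensity of the patch."""
    values = [value for row in patch for value in row]
    return "bright" if sum(values) / len(values) > 0.5 else "dark"


image = [[0.9, 0.9, 0.1, 0.1],
         [0.9, 0.9, 0.1, 0.1]]
label_grid = region_wise_labels(image, 2, 2, toy_classifier)
```

For an MMU, the label grid would localize materials over the image, which is the spatial information needed before any per-material mass estimate.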
Subsection take-aways: computer vision on food recognition, food portion estimation and ingredient recognition can be adapted to focus on non-food products; since mass = density × volume, volume estimation is related to mass estimation; both hand-crafted-feature-based and CNN-based methods have been proposed in food computing; a challenge in volume estimation is the accurate extraction of 3d information from a single 2d image; while stereo vision requires two input images, a main challenge is requiring just a single image without reducing accuracy.

Towards a Sensor Network for Material Stock Monitoring
Definition 2. A manufacturing network is a set of locations/buildings connected by the exchange of material, e.g. raw material reservoirs, manufacturers, shops, houses, waste sorting centers, recycling centers, landfills. An analogy between water networks and manufacturing networks can be seen considering both an intuitive and a physical explanation. The intuitive explanation is that both networks are made of nodes (i.e. compartments) that exchange materials (e.g. water, aluminum, plastic) over time and space; the physical explanation is based on the force-voltage physical analogy, also known as Maxwell's analogy [133], as detailed in Table 3. Note that the two networks have a different effort variable because a fluid flows within a water network, whereas materials (including products, which are combinations of materials) "flow" within a manufacturing network. Another difference is that a hydraulic network confines the fluid in pipes, whereas a manufacturing network moves the materials using transport systems, e.g. trucks, airplanes, ships.
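The intuitive side of the analogy, nodes exchanging material over time, can be sketched as a stock update in which every flow removes material from one node and adds it to another. The sketch assumes conservation of mass within the network; node names follow the examples in Definition 2, while the quantities are made-up.

```python
def step_stocks(stocks, flows):
    """Apply one time step of material flows between nodes.

    stocks: dict node -> stock in kg
    flows:  list of (source, destination, amount in kg), applied in order
    """
    updated = dict(stocks)
    for source, destination, amount in flows:
        if amount < 0:
            raise ValueError("flow amounts must be non-negative")
        if updated[source] < amount:
            raise ValueError(f"node {source!r} cannot supply {amount} kg")
        updated[source] -= amount
        updated[destination] += amount
    return updated


# Made-up stocks (kg) and flows for a four-node network.
stocks = {"manufacturer": 100.0, "retailer": 20.0, "houses": 50.0,
          "recycling_center": 0.0}
flows = [("manufacturer", "retailer", 30.0),
         ("retailer", "houses", 10.0),
         ("houses", "recycling_center", 5.0)]
new_stocks = step_stocks(stocks, flows)
```

In a monitored network, MMU measurements would play the role of the observed stocks and flows that such a balance is checked against.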
The physical analogy between water and manufacturing networks suggests that MMUs could be used as a sensor network to monitor the flow of materials, similarly to the system proposed in [134] for water networks. Multiple platforms equipped with MMUs could form a sensor network for material stock monitoring. An example of the system is shown in Fig. 6 considering the design of an MMU sensor network for a city using Belfast as an example. Below we list four types of platform upon which an MMU could be implemented to realize an MMU sensor network.

MMU in trucks and bins
In [135,136,137] a computer vision system is proposed to process internal images of bins. In particular, in these works: first, the camera is installed on the truck (i.e. a smart truck) and used by a worker when the truck approaches the bin; second, the image processing returns information only about the waste level, not the waste type and mass as an MMU would do. A smart truck is also proposed in [138,139] for urban street waste mapping and cleanliness assessment (refer to [11] for a review of the information and communication technologies used for waste management).
The use of multiple sensors for smart bins has been considered over the years and the interested reader is referred to the review paper by Soni et al. [140], which organizes and details the state-of-the-art in 2017. This review also proposes a framework for a waste management system based on smart bins leveraging the cloud and wireless communication to create a network of smart bins. Therefore, two aspects are particularly relevant in the context of our paper: first, Soni et al.'s network system based on smart bins is a sub-network of the system we propose here for material stock monitoring; and second, integrating into smart bins the computer-vision-enabled material measurements provided by MMUs could improve the accuracy of a smart bin network.
Figure 6: Example of the MMU sensor network introduced in Fig. 4 that could be implemented in the city of Belfast, Northern Ireland: material measurements from multiple units are collected in a central database referred to as "MMU data collector" in this figure, which can extend current material stocks and flows databases such as the YSTAFDB [49] for MFA studies on the implementation of circular material flows such as [46,47,48]; a GPS could provide the location of the measured materials at different points as introduced in Fig. 5. Nodes shown in the figure: airport, port, recycling/sorting center, retailer, manufacturer, houses.
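A minimal sketch of what the central "MMU data collector" could store is given below, assuming each unit reports a material type, a mass and a GPS position. The record layout, the field names and the sample values are hypothetical; the coordinates are simply approximate coordinates of Belfast used as an example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One material measurement reported by an MMU (hypothetical record layout)."""
    unit_id: str
    material: str
    mass_kg: float
    latitude: float
    longitude: float


def stock_by_material(measurements):
    """Aggregate the reported masses per material across all units."""
    totals = defaultdict(float)
    for measurement in measurements:
        totals[measurement.material] += measurement.mass_kg
    return dict(totals)


# Made-up reports from three units located around Belfast.
reports = [
    Measurement("bin-01", "plastic", 1.5, 54.597, -5.930),
    Measurement("truck-07", "plastic", 2.5, 54.601, -5.925),
    Measurement("phone-42", "aluminum", 0.5, 54.590, -5.940),
]
city_stock = stock_by_material(reports)
```

Keeping the position with every record is what would allow the aggregated stock to be mapped over the target area rather than reported as a single total.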

MMU in autonomous sorting systems
A vision-based sorting system can only process the types of material that it has been trained to detect; in general, using a larger and richer training set improves the adaptability of the system to other materials. In fact, if a new class of material needs to be detected, samples belonging to that class must be learned by the autonomous sorter. In contrast, techniques based on mechanical and magnetic principles can sort only a specific type of material according to its physical/chemical properties (e.g. magnetic drums, air separators, triboelectrostatic separators) [12]. Examples of works integrating computer vision in autonomous sorting systems have been proposed in [141] for demolition waste, in [142] for electronic waste, in [143] for pomegranate arils, in [144] for plastic granulate, in [145] for municipal solid waste, in [146] for rock mixture composition and in [147] in a patent application. The low use of advanced computer vision based on deep learning suggests that a significant improvement of autonomous sorting systems is possible.

MMU in mobile phones
As far as we know, the only work proposing a mobile phone application for waste detection is [148]: the pre-trained model AlexNet [149] is fine-tuned on a garbage-focused dataset using a GPU and then simplified to be implemented on smartphones; after an analysis of the image processing time required by the different components of the system, the model with the best compromise between classification accuracy, image processing time and memory size is chosen.
As discussed in Subsection 4.3, computer vision systems proposed for food classification and calories estimation are related to material classification and mass estimation, respectively. In [97] the system prototype runs on mobile phones and requires the user to provide a single image, whereas [150,102] need two images. An interactive application is proposed in [82], whereas the authors of [151] delegate the most complex tasks such as food recognition to a cloud server instead of to the phone CPU.
Implementing MMUs on mobile phones is particularly challenging because the computationally expensive task of estimating the mass from images is necessary, whereas it may be avoided in automated sorting facilities or bins through the use of weight scales; moreover, mobile phones have limited computing performance compared to the hardware that can be used in sorting facilities, e.g. GPUs. On the other hand, a mobile phone has the advantage of being portable and cheap. Potentially any owner of a mobile phone could collect material measurements through an MMU mobile application or send images/videos to a central server running an MMU.
Subsection take-aways: MMUs can be implemented on different platforms to map and quantify the material stocks in a target area; the GPS provides the material location; to date vision systems are widely used for robotic applications (e.g. inspection, sorting) and not for resources mapping; implementing vision systems on a mobile phone poses constraints on the computing performance, but has the advantages of portability and affordability.

Getting started
The first step to using or developing deep learning algorithms is selecting an appropriate programming language. Currently there are two main choices: MATLAB and Python. MATLAB has a toolbox dedicated to deep learning named Deep Learning Toolbox [152] which provides two development environments: a script-based one and the Deep Network Designer. The former is pure coding, therefore is more flexible, but less user-friendly. In contrast, the latter is particularly easy to work with as it has an intuitive user interface. The use of the Deep Learning Toolbox requires a license. Python libraries, in contrast, are open source. The main Python libraries for deep learning are PyTorch [153] and TensorFlow/Keras [154]. Python users can run on a free GPU service through Google Colaboratory (aka "Colab"). Usually, the latest algorithms are developed in Python and subsequently integrated by MathWorks into its toolbox; the advantage of using the latest algorithm is usually at the cost of a less documented and less user-friendly code.

Public datasets, pre-trained models and transfer learning
In general, the model of an MMU can be defined in two ways: using a pre-trained model or training a model from scratch. Usually the model reaches a good accuracy if it is at least fine-tuned with images depicting the domain of application. Hence, in Table 4 we list the source paper and the download website of several publicly available datasets containing images of manufacturing products or waste items that could be useful to train/fine-tune MMU models. If the choice is to use pre-trained models rather than training from scratch, the links to pre-trained models available in some machine learning libraries are: MATLAB [165], PyTorch [166] and TensorFlow [167]. Guidance on how to exploit a pre-trained model can be found in the area of research known as transfer learning [91], which typically deals with transferring into a new model the machine knowledge/experience contained in another model previously trained on a dataset different from the one of interest.
Transfer learning. In transfer learning, the task to be solved with the new model (e.g. recognizing materials) is referred to as target task/domain, whereas the task solved with the previously trained model and from which the knowledge has to be transferred is referred to as source task/domain (e.g. recognizing common objects). There are two typical situations in which transfer learning is particularly useful. One situation is when the developer of the vision system has no or very few images of the target domain, which occurs when the task is very specific (e.g. recognizing laptops of a particular brand); hence, training a neural model from scratch with just those samples yields an inaccurate model. The other situation is when one needs a large neural model to solve a complex target task, but due to time constraints its training is not practically feasible [91]. In general, the following key considerations should be kept in mind about transfer learning:
• the best accuracy possible is achievable by training from scratch a model using many samples of the target domain, which means not performing any knowledge transfer;
• the knowledge transfer effectiveness is higher when the target and the source domains are strongly related and decreases with the reduction of the source-target similarity (e.g. images of cats are more similar to images of dogs than to images of plastic bags);
• when transferring features from the layers of a pre-trained CNN, the features from the first layers contain general visual traits and therefore are more transferable to different domains, whereas the deepest layers are more optimized for the source task [168];
• transferring knowledge from sources unrelated to the target task may cause the negative transfer effect [169].

Lack of training images and class imbalance
When designing a vision system, one may experience a lack of images or the class imbalance problem, which consists of having some classes with a large number of samples and other classes with few samples. Both the lack of training samples and class imbalance result in inaccurate models [94]. Some transfer learning techniques have been proposed to create a larger training dataset exploiting different source domains [91] and these techniques could also be adapted to address class imbalances considering only specific target classes.
Table 4 (partial; the dataset name of the first recovered row and the heading of the last column are missing): [162], material recognition, N/A; GINI [148], object recognition, Yes; CUReT [163], material recognition, N/A; ImageNet [164], material/object recognition, N/A; DTD [59], texture recognition, No; recybot [72], object recognition, Yes.
Data augmentation. Data augmentation techniques are used in computer vision specifically to extend the training dataset size [94]. The key idea behind data augmentation is to generate new images that are variants of the original training samples by adding to them particular effects such as rotations, translations, scale variations, lighting variations, occlusions. Thus, even though the initial set of images depicts only a few scenes, the set is enriched by adding properly altered samples. Along with traditional hand-crafted augmentation techniques that generate the aforementioned types of effects, recent advances in deep learning led to the development of further methods. In particular, the ones based on generative adversarial networks (GANs) were shown to be particularly effective for their computational efficiency and quality of results [94] (see [170] for a recent review on GANs for computer vision). Since GAN-generated samples are not as predictable as the ones generated using traditional methods (e.g. rotations), an effective way to extend the size of the training set is to use both traditional and non-traditional approaches and then collect together their outputs.
To assess the quality of the synthetic samples, a visual Turing test was performed in [171] with medical images by asking two expert radiologists to distinguish between original and altered samples. The two radiologists labeled the GAN-generated images with 62.5% and 58.6% accuracy, respectively. Augmenting the training dataset requires that extra storage space is available and this aspect is particularly critical when very large datasets are enlarged (e.g. millions of images). Alternatively, one may implement an on-line data augmentation, which generates the new samples during the model training. While this second strategy can save memory, it results in a longer training time [94].
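The difference between storing augmented samples and generating them on-line can be sketched with a generator that produces one randomly transformed copy of each training image on demand, so that no augmented image is written to disk. This is a minimal illustration of traditional augmentations (rotation, flip, translation) on images represented as nested lists; the function names are hypothetical and a real pipeline would typically rely on a deep learning library.

```python
import random


def rotate90(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in image]


def translate_right(image, shift, fill=0):
    """Shift pixels right by `shift` columns, padding the left side with `fill`."""
    width = len(image[0])
    shift = min(shift, width)
    return [[fill] * shift + row[:width - shift] for row in image]


def online_augment(dataset, rng):
    """Yield one randomly transformed copy of each (image, label) pair.

    Samples are generated lazily, so memory use stays constant at the cost of
    recomputing the transformations at every pass over the data.
    """
    transforms = [rotate90, flip_horizontal, lambda img: translate_right(img, 1)]
    for image, label in dataset:
        yield rng.choice(transforms)(image), label


dataset = [([[1, 2], [3, 4]], "paper"), ([[5, 6], [7, 8]], "plastic")]
augmented = list(online_augment(dataset, random.Random(0)))
```

Because the transformed images are recomputed at every epoch, the storage saving is paid for with extra training time, which is the trade-off noted above.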

Computing platforms
Deep learning techniques are permitting the design of machines more accurate than ever for solving tasks such as classification of images or regression, which are tasks relevant for MMUs. However, deep learning machines are also more computationally demanding than traditional machine learning. Since high performance computers are not portable, an active area of research is investigating the most efficient strategies to exploit large neural models on portable, low performance devices such as sensored microcontrollers and smartphones [172]. The two opposite implementation paradigms are cloud computing and edge computing (optimal solutions may be a compromise between them depending on the specific situation). Cloud computing consists of performing the computation on large and high performance computers that exchange data with smaller devices through the Internet. To have an idea of the size of such high performance machines, a Microsoft data center is 11.5 times the size of a football field [173]. With this approach, the portable device mainly transfers the measurements from the acquisition site to the cloud and reads the reply once the computation is completed. In contrast, edge computing seeks to perform all the required computations where the measurements are acquired, i.e. on the portable device. If the choice is to implement the MMU using cloud computing, popular platforms providing cloud services are: Amazon Web Services (AWS) [174], Microsoft Azure [175], Google Cloud [176] and IBM Cloud Services [177]. While cloud computing has the advantage of providing the maximum computing resources and, therefore, speed of execution, it has three main disadvantages which motivate the decentralization of the workload from cloud servers to edge devices: latency, scalability and privacy.
I. Latency: moving the data from the edge device to the cloud is a time consuming activity which may compromise the system performance in real-time applications; for example, sending and processing a camera frame for a computer vision task could take more than 200 ms end-to-end on Amazon Web Services [178].
II. Scalability: cloud platforms provide many servers to satisfy the computing demand of their users, but a large number of connected devices may overload the network and the data center resulting in issues such as communication delays.
III. Privacy: the exchange of data between different devices could compromise privacy if sensitive information is improperly used or intercepted.
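The latency concern can be made concrete with a small check of whether an end-to-end processing time fits a real-time budget. The sketch assumes that "real-time" means each frame must be processed before the next one arrives, which is one possible definition among others; the function names and the frame rates are illustrative.

```python
def frame_budget_ms(frames_per_second):
    """Time available per frame, in milliseconds, at a given frame rate."""
    if frames_per_second <= 0:
        raise ValueError("frame rate must be positive")
    return 1000.0 / frames_per_second


def fits_real_time(latency_ms, frames_per_second):
    """True if one frame can be fully processed before the next one arrives."""
    return latency_ms <= frame_budget_ms(frames_per_second)


# 200 ms is the end-to-end cloud figure quoted above; frame rates are examples.
cloud_at_30_fps = fits_real_time(200.0, 30)
cloud_at_5_fps = fits_real_time(200.0, 5)
```

Under this assumption a 200 ms round trip is compatible only with low frame rates, which is one reason for moving part of the workload to the edge device.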
If the choice is to develop MMU systems minimizing the use of the cloud, the following hardware and software are particularly suitable.
• TensorFlow Lite is an open source machine learning library developed by Google optimized for smartphones and microcontrollers.
• PyTorch Mobile is another open source machine learning library for edge computing and is developed by Facebook.
• The Nvidia Jetson product line consists of development kits that integrate GPUs on a microcontroller and have primarily been designed to speed up neural network algorithms on portable devices.
• Two microcontrollers known for their ease of use are Raspberry Pi and Arduino; in particular, the development team of the latter is currently working to integrate machine learning functionalities and the Arduino Nano 33 BLE Sense is the recommended board to get started.
Subsection take-aways: publicly available datasets are listed in Table 4; the links to public pre-trained models in MATLAB, PyTorch and TensorFlow/Keras are provided; a GPU speeds up CNN-based algorithms compared to a CPU (both training and inference); for Python users, Google Colaboratory may be a valid option to access a ready-to-use GPU; data augmentation techniques can extend the training dataset size; the Nvidia Jetson product line provides GPUs on microcontrollers.

Challenges
The design of a material stock monitoring system that accurately and in real time gives the types, masses and positions of materials in a target area involves multiple challenges. In general, our recommendation is to begin working on single devices and then connect multiple units only when the material measurements are of satisfactory accuracy. This size-increasing rationale is indeed followed in Subsections 4.2, 4.3 and 4.4, in that order: starting from material recognition, which is Component 1 of an MMU, we then add object recognition, volume estimation and, at the very end, the idea of networked units that send material stock information to a central data collector. A two-task learning process for both object and material recognition can be guided by previous works on food computing [130,132], whereas how to design a three-task learning process that also involves mass estimation is an open question. A first prototype of such a multi-component vision system is proposed in [97] for food computing, but, as pointed out by the authors, there are many challenges for computer vision researchers, especially when it comes to implementing the system on small devices such as smartphones. Some examples of these challenges are recognition of the material from small regions of the image (i.e. fine-grained recognition), recognition in images taken in the open world rather than in a laboratory setting, depth estimation from a single RGB image and minimization of latency for real-time applications. The challenges in computing accurate material measurements therefore lie in computer vision, which is currently a very active and advancing research topic as it is one of the main areas of application of deep learning. At the same time, edge computing research [172] should proceed to permit the implementation of high performance vision systems on small devices. In this way, an MMU could easily run on a smart bin, a smart truck and a smartphone.
Once the vision algorithm of single units is able to provide reliable material measurements, the main challenges will be in the network communication system to transfer the measurements in real-time from different nodes.
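To illustrate what the networked stage could involve, the sketch below shows one possible way for units to report measurements to a central data collector. It is a minimal example under stated assumptions and not a design taken from the reviewed works: the `Measurement` schema, the JSON encoding and the `Collector` class are all hypothetical, and the transport layer (e.g. the network protocol) is deliberately left out.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class Measurement:
    """One material measurement reported by a single MMU (hypothetical schema)."""
    unit_id: str      # identifier of the reporting unit, e.g. a smart bin
    material: str     # recognized material type
    mass_kg: float    # estimated mass of that material
    lat: float        # position of the unit
    lon: float
    timestamp: float  # acquisition time in seconds


def encode(measurement):
    """Serialize a measurement to a JSON string for transmission."""
    return json.dumps(asdict(measurement))


def decode(payload):
    """Rebuild a measurement from its JSON string."""
    return Measurement(**json.loads(payload))


class Collector:
    """Central collector keeping the latest report per (unit, material) pair."""

    def __init__(self):
        self._latest = {}

    def receive(self, payload):
        m = decode(payload)
        key = (m.unit_id, m.material)
        old = self._latest.get(key)
        # Ignore stale messages that arrive after a newer report.
        if old is None or m.timestamp >= old.timestamp:
            self._latest[key] = m

    def stock_by_material(self):
        """Total estimated mass of each material across all units."""
        totals = defaultdict(float)
        for m in self._latest.values():
            totals[m.material] += m.mass_kg
        return dict(totals)
```

Keeping only the most recent report per unit and material is one simple way to tolerate delayed or out-of-order messages; a real deployment would also need to address the scalability and privacy issues discussed earlier.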
Along with the size-increasing rationale described above, another way to begin is by prioritizing critical materials, where the precise definition of "critical" may depend on the region or country of interest. For example, municipal electronic waste and electronic devices that are kept in private houses but are no longer in use (e.g. due to being faulty or out of fashion) are a valuable source of rare earth elements. Hence, first prototypes of MMUs may focus on electrical and electronic items to improve the management of these critical non-renewable resources. Whether it is for electrical and electronic items or other materials such as waste plastics, the use of networked MMUs could make an important contribution to the transition to a circular economy, a barrier to which is ineffective collection and sorting of wastes [46]. Previous research by the authors has highlighted concerns over the ability of the public to properly sort waste streams [179], as well as the need for better flow of data between manufacturers and waste collectors to reduce costs [46]. Networked MMUs could improve understanding of both waste sorting and overall material flows, which, if implemented as part of a suite of measures, could support the changes needed for a circular economy.
Importance of benchmarking. Regardless of the preferred research direction, benchmarking the resulting models in a standardized way permits their performance to be effectively assessed; benchmarking is standard practice among developers of classification systems, which makes it possible to rank models based on chosen metrics [180,181]. To advance the development of MMUs, examples of benchmarking metrics are:
• the model accuracy in waste item classification
• the model accuracy in material classification
• the model accuracy in volume or mass estimation
• the model computational complexity (e.g. seconds needed to process an image)
• the model memory storage requirements (e.g. its size in MB).
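The metrics listed above could be computed along the following lines. This is a minimal sketch with hypothetical function names; in particular, the use of the mean absolute percentage error for volume or mass estimation and the approximation of model size from the parameter count are our assumptions, since the text does not prescribe specific formulas.

```python
import time


def accuracy(predicted, actual):
    """Fraction of correct labels (waste item or material classification)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def mean_absolute_percentage_error(estimated, actual):
    """Mean relative error of volume or mass estimates, in percent.

    The true values are assumed to be strictly positive.
    """
    if len(estimated) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = (abs(e - a) / a for e, a in zip(estimated, actual))
    return 100 * sum(errors) / len(actual)


def seconds_per_image(model, images):
    """Average wall-clock time that `model` (any callable) needs per image."""
    start = time.perf_counter()
    for image in images:
        model(image)
    return (time.perf_counter() - start) / len(images)


def size_mb(num_parameters, bytes_per_parameter=4):
    """Approximate model size in MB, assuming 32-bit parameters by default."""
    return num_parameters * bytes_per_parameter / 1e6
```

For example, `accuracy(["pet", "glass"], ["pet", "paper"])` scores one correct label out of two. Timing results depend on the hardware, so they are only comparable across models when measured on the same device.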

Future Research and Development
Five main future research and development paths are identified and summarized below: the first path essentially consists of implementing the most advanced recognition systems on platforms such as bins, mobile phones or sorting centers; the second, third and fourth paths are concerned with improving the three recognition systems already implemented on specific platforms; the last path consists of adapting systems from one platform to another (e.g. from mobile phones to smart trucks). Note that while developing an MMU mobile app is of practical use, doing so also addresses several fundamental challenges in computer vision research as pointed out by Myers.
I. The systems developed in [53] and [51] for material recognition are based on CNNs and hand-crafted features, respectively, and are not implemented on specific platforms. Therefore, their implementation on mobile phones, smart bins and sorting centers could be investigated. Subsequently, the material recognizer could be combined with an object classifier to improve the system accuracy, as done in [63] or, for food items, in [130,132], leveraging the analogy between ingredients (i.e. ingredient recognition) and the material composition of non-food items (i.e. material recognition). Two preliminary questions arise with respect to the data-intensive CNN-based approach: "Do I train the network from scratch?" and "Are the publicly available datasets sufficiently rich for the target application?". If the answer to the second question is negative, a valuable contribution to the field would be the development and publication of a new dataset for this purpose, as done in [72]. Alternatively, the lack of training images could be mitigated by collaborating with a company that has collected them, as done in [75].
II. The mobile phone application of [97] for food calories estimation is, according to the authors, at a preliminary stage. Their system is highly automated, but complex as it involves RGB map estimation, 3D voxel representation, open-world recognition and scene segmentation. Transferring the target application from food calories to material stock monitoring could result in a promising MMU.
III. While the system mentioned in the previous point is based on CNNs, the approach in [102] is based on hand-crafted features and is therefore less computationally demanding. In general, [97] could be seen as a more challenging research path that will be ready for deployment later than the approach of [102]; however, the latter appears less promising in terms of both accuracy and flexibility.
IV. The mobile phone application of [148] could be improved, for example, by using a more advanced neural architecture with a similar computational complexity. Moreover, a high performance central server could communicate with the phone and perform the most demanding tasks, i.e. exploiting cloud computing instead of edge computing.
V. The systems mentioned in the previous three points consider mobile phones. Their implementation in smart trucks could be investigated.
To help the reader with developing the five lines of research or identifying different ones, Table 5 summarizes selected works covered in this review. The selected works (32 in total) are also organized in the contingency matrix of Fig. 7, which has the conference/journal of the paper along the x-axis and the keywords extracted from the paper abstracts along the y-axis. The matrix shows that "convolutional neural network" and "material recognition" have the highest frequencies (9 and 10, respectively); the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) has the highest frequency among the x-axis terms. Note that in the field of Computer Vision and Machine Learning, conferences such as CVPR or ICCV have outputs of higher or equivalent quality compared to journals, e.g. IEEE Transactions on Pattern Analysis and Machine Intelligence or IEEE Transactions on Image Processing (with frequencies 2 and 1, respectively). This is the opposite of other research fields, in which journals have higher impact than conferences. The keyword "trash classification", which has a frequency of 4, has the highest number of mentions in the selected papers published in IEEE Access (2 papers in total). The term "optimized densenet121" has the highest number of mentions in Resources, Conservation and Recycling, which is a journal more focused on the application of Computer Vision; indeed, DenseNet-121 was previously proposed at CVPR with no regard for the specific domain of application [183]. Finally, it should be noted that the majority of the journals/conferences of the 32 selected works are more oriented to Computer Vision research than to Circular Economy. This is because, as in the title, the review emphasizes the foundational aspects of MMUs. However, it does so by covering these foundational aspects within the context of the target area of application, that is, Circular Economy. The authors' main goal is to make the manuscript accessible and of interest to both research communities.
Figure 8 shows the approaches used in the selected works, as specified in the second column of Table 5, as a function of the year of publication. The diameter of a circle is proportional to the number of articles with the same approach-year pair. Hand-crafted features are the dominant approach until 2014, whereas the two approaches overlap in 2015 and 2016; after 2016, the dominant approach is feature learning, which is mainly underpinned by deep learning.

Conclusions
To improve the management of natural resources, this paper first defined a computer-vision-based complex sensor and then reviewed the works related to its technological foundations. A network of such sensors for wide-area material stock monitoring was also discussed. Accurate estimation of the mass of materials by processing real-world images is the most challenging task, especially if the vision system is implemented on edge devices with limited computing performance such as microcontrollers and smartphones. Five future research directions have been outlined. We hope to attract the interest of computer vision researchers and engineers concerned about the human ecological footprint, as well as that of industrial ecologists, circular economists and environmental researchers and engineers looking for new ways to use the latest advances in computer vision.

Figure 7. Contingency matrix of the selected works in Table 5. Journal/conference names are on the x-axis and keywords extracted from the paper abstracts are on the y-axis. The numbers along the opposite axes indicate the frequency of each term. The rectangle sizes result from the frequency values while the rectangle colors define the correlation between each x-y pair, e.g. a dark red rectangle means that the abstracts of the papers published at the journal/conference along the x-axis often mention the keywords along the y-axis ("often" is with respect to the average value). In contrast, a dark blue rectangle means rare mentions with respect to the average.

Figure 8. Approaches (hand-crafted or learned features) used in the selected works in Table 5 as a function of the year of publication. The diameters of the circles are proportional to the number of articles with the same approach-year pair.