Computer Vision for Inventory Management



Research Need
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [1]. It suffers from several setbacks, the biggest of which is the need for large amounts of data. It is not always possible to collect and label large datasets for deep learning applications in industry. Training a model on this data also requires significant processing power and takes a lot of time. Because of this, there is a pressing need to reduce the amount of training data, time, and computational resources required to train and run computer vision models. While various machine learning techniques have helped to ease this problem, combining a number of these techniques could prove to be key to making computer vision more accessible in many industry applications.
As computer vision becomes more widely accepted in industry, it is important for models to leverage available techniques to ensure that they are usable on smaller, less powerful devices. One important example of this can be found in inventory management systems that may seek to leverage IoT devices to keep track of various types of stock.
Such a system may involve the use of object classification to verify that inventory is properly accounted for using images or video streaming. There are many techniques that can be leveraged, such as transfer learning, to reduce training requirements and increase accuracy. It may also be possible to intelligently grow a dataset over time, instead of collecting all the necessary data before training a network. These techniques could provide a more immediate path for making computer vision accessible to companies for use on low resource devices.

Thesis Objectives
This project seeks to explore the possibility of implementing a computer vision model that can be used in an inventory management system and to address the various needs that such a system would require. This can be done by leveraging modern transfer learning techniques for reduced training requirements and improved accuracy. A method for intelligently growing a dataset over the lifetime of a computer vision system could also reduce the initial data requirements while further increasing the accuracy of a network. I postulate that when a computer vision model is implemented using a combination of modern machine learning techniques and continually trained on a dataset grown using the proposed method, the network can be effectively deployed on IoT devices for use in the inventory management space. This solution will be compared to existing solutions.

Machine Learning
Machine learning is a subset of artificial intelligence intended to provide machines with the ability to learn from data and perform specific tasks without being given explicit instructions. Machine learning systems are designed to learn by example.

Unsupervised Machine Learning
There are two major types of machine learning. The first is unsupervised learning.
Unsupervised machine learning is a type of machine learning in which the system is not given labeled data or any sort of manual supervision. It can be used to find previously unknown patterns in such unlabeled data. Unsupervised learning can consist of techniques including clustering, anomaly detection, certain kinds of neural networks [2], and approaches for learning latent variable models [3]. To test the hypothesis proposed in this thesis, we will focus on the other form of machine learning, supervised learning.

Supervised Machine Learning
Supervised learning is a method in which the system learns to map inputs to outputs based on existing input-output pairs. Supervised learning consists of techniques including support vector machines, linear regression, decision trees, and neural networks [4]. There are several issues to consider when using supervised learning [5]: the bias-variance tradeoff, functional complexity, dimensionality of the input space, and noise in the output values. The bias-variance tradeoff refers to the tradeoff between the bias and variance of a learning algorithm [6]. Bias refers to the error from erroneous assumptions in the learning algorithm. High bias can result in underfitting.
Variance refers to the error from sensitivity to small fluctuations in the training set. High variance can result in overfitting [7]. The prediction error of a learned classifier is directly related to the sum of the bias and variance of a learning algorithm. A learning algorithm with low bias needs to be flexible enough to fit the data well, but not so flexible that it fits each training set differently and has too high a variance. The issue of functional complexity refers to the amount of training data that is available relative to the complexity of the function. If the function is simple, then a learning algorithm with high bias and low complexity will be able to learn it from a small amount of data. If the function has high complexity, then it can only be learned by an algorithm with low bias and high variance, using very large amounts of training data [8]. The problem of dimensionality of the input space arises when the input feature vectors of a model have very high dimensions. When this occurs, the learning problem may be difficult even when only a few of these features are relevant. The many additional dimensions can confuse the learning algorithm and cause high variance [9]. High dimensionality often requires the classifier to be tuned to have low variance and high bias. In addition to manually removing dimensions, there are dimensionality reduction techniques, including feature selection, that can be used to alleviate this problem.
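The tradeoff can be illustrated with a minimal sketch that is not part of the original experiments. It assumes NumPy, a hypothetical sine-shaped target function, and made-up noise levels, and it fits polynomials of increasing degree to the same small training set. Training error can only fall as the degree rises, whereas the error on held-out points need not follow, which is the underfitting and overfitting behavior described above.

```python
import numpy as np


def true_fn(x):
    """Hypothetical target function used only for this illustration."""
    return np.sin(np.pi * x)


def fit_errors(degree, x_train, y_train, x_test, y_test):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse


rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 15)
x_test = np.linspace(-0.95, 0.95, 50)
y_train = true_fn(x_train) + rng.normal(0.0, 0.2, x_train.shape)
y_test = true_fn(x_test) + rng.normal(0.0, 0.2, x_test.shape)

# Degree 1 is a high-bias model; degree 9 is a low-bias, high-variance model.
errors = {d: fit_errors(d, x_train, y_train, x_test, y_test) for d in (1, 3, 9)}
for degree, (train_mse, test_mse) in errors.items():
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Comparing the two columns for each degree gives a rough picture of where a model of this family stops underfitting and starts fitting noise.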

Computer Vision
The field of computer vision originated in the late 1960s as a subset of artificial intelligence. The goal was for scientists to mimic the way that humans see things. It was believed that processing data from images would allow machines to gain a high-level understanding of what the images contained. As the field grew throughout the 1980s, researchers focused primarily on mathematical models and techniques such as edge and contour detection in order to analyze images [10]. It was not until the late 2000s that computer vision saw the shift towards machine learning that is so popular today. This type of computer vision is a subset of machine learning in which computers attain some high-level understanding from images and videos using statistical or neural network-based models that are trained on some set of labeled or unlabeled data [11]. Computer vision tasks can range from industrial applications such as inspecting items on an assembly line to research tasks such as teaching machines to comprehend the world around them.

Convolutional Neural Network (CNN)
In recent years, the CNN has become one of the most popular models for computer vision. CNNs are a form of multilayer perceptron that uses the mathematical operation known as a convolution instead of general matrix multiplication in at least one of their layers [12]. The neurons of a CNN are arranged in three dimensions. The difference between a CNN and a standard neural network is depicted below (Figure 2-1).
The three main types of layers seen in a CNN are the convolutional layer, pooling layer, and fully connected layer [13]. The convolutional layer is responsible for computing the output of neurons that are connected to local regions of the input, computing the dot product between their weights and a region they are connected to in the input volume.
The pooling layer will perform a down sampling operation along the spatial dimensions.
An example of a pooling layer is shown below (Figure 2-2). The fully connected layer computes the class score. Each neuron in the fully connected layer is connected to all the neurons in the previous volume. CNNs have a distinct advantage compared to feed-forward neural networks when it comes to images. They can successfully capture the spatial and temporal dependencies of an image, something that many other neural networks struggle to do. This is done through the application of relevant filters. The CNN architecture performs a better fitting of the image dataset due to the reduction in the number of parameters involved and reusability of weights [14]. In other words, the network can be trained to understand the complexities of image data particularly well.
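The convolutional and pooling operations described above can be sketched in a few lines. This is a minimal illustration assuming NumPy, a single-channel image, no padding, and a stride of one. The function names and the edge filter are hypothetical, and production frameworks use far more optimized implementations.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Dot product of the kernel with each local region (no padding, stride 1).

    As in most deep learning frameworks, the kernel is not flipped, so this
    is technically a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def max_pool2d(feature_map, size=2):
    """Downsample by taking the maximum over non-overlapping size x size windows."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    trimmed = feature_map[:h, :w]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))


image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # simple vertical-edge filter
features = conv2d_valid(image, kernel)     # 6x6 input -> 4x4 feature map
pooled = max_pool2d(features)              # 4x4 feature map -> 2x2
```

The same nine kernel weights are reused at every position of the image, which is the weight sharing that keeps the parameter count low.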

Requirements
In most cases, computer vision models require very large amounts of data to find the relevant features when trying to recognize an object. Without substantial data, models will not be able to attain a degree of certainty high enough to be practical in an industry setting. Today, many large, well-labeled datasets are available for public use, but there are many scenarios where not enough data is publicly available. This has led researchers to explore several ways to reduce the amount of labeled data that must be obtained before training a model. Similarly, processing large amounts of data takes a lot of time and can be very resource intensive [15]. It may not always be feasible to spend weeks training a model. The devices that require the use of computer vision may not always have the resources necessary to train on huge datasets [15]. It is because of problems such as these that researchers have spent so much time developing ways to improve computer vision.
One of the methods that they have discovered to combat these issues is transfer learning [16].

Data Cleaning and Data Preprocessing
In many machine learning systems, a data preprocessing or cleaning step is employed before building the model. Such a step is typically needed because datasets are messy or come from a variety of sources. Some form of data preprocessing and data cleaning may be required to feed data through a model at all. These steps may also be needed to reduce the complexity or increase the accuracy of a model, and may include standardization or feature extraction. In the context of computer vision, image resizing, color conversion, and geometric transformations may be used to better assist the training of a model.
Techniques such as converting colors to grayscale can reduce the complexity of a model.
De-texturizing, edge enhancement, and salient edge maps can all help to improve the accuracy of a model. Flipping and rotating images can be used to expand an existing dataset of images. When choosing which cleaning and preprocessing techniques to use for a system, it is important to consider the situation in which it will be used. For example, removing the background of an image may be useful for reducing complexity in order to train a model, but will not be beneficial if the model will then be used as a security camera outside someone's home [17]. These techniques need to be tailored to the dataset and problem that is to be solved.
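As a small illustration of how such preprocessing reduces input complexity, the sketch below converts an RGB image to grayscale and downsamples it. It assumes NumPy and images stored as height x width x 3 arrays. The BT.601 luma weights and the nearest-neighbor downsampling are common choices assumed here for illustration, and the function names are hypothetical.

```python
import numpy as np


def to_grayscale(rgb):
    """Collapse an H x W x 3 RGB array to H x W using ITU-R BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3] @ weights


def downsample(image, factor):
    """Nearest-neighbor downsampling: keep every factor-th pixel along each axis."""
    return image[::factor, ::factor]


rgb = np.zeros((4, 4, 3))
rgb[..., 0] = 255.0              # a pure red test image
gray = to_grayscale(rgb)         # one channel instead of three
small = downsample(gray, 2)      # a quarter of the pixels
```

Grayscale conversion cuts the number of input values by a factor of three, and halving each side cuts it by a further factor of four. Whether that loss of information is acceptable depends on the deployment setting, as noted above.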

Minimizing Required Resources
Computer vision constraints are even more noticeable when introduced in mobile and IoT architectures due to the enormous compute requirements [18]. Studies have been conducted in a range of areas in order to leverage computer vision models in mobile and IoT architectures [19]-[21], and manufacturers are even making changes to mobile systems to meet the increased computational demands of such use cases [18].
Unfortunately, these changes cannot be made overnight, and companies want to find ways to implement machine learning in their workflows now. One method of making machine learning more accessible for use with mobile and IoT devices is to develop new architectures that require less compute power. An in-depth analysis of various computer vision models has shown the significance of modifying architectures to optimize computational resources [20]. Architectures such as ShuffleNet, MobileNet, and NasNet-A-Mobile all make huge strides in reducing compute requirements while maintaining acceptable degrees of accuracy [20]. Another method is using various machine learning techniques to improve computer vision models. Techniques like transfer learning have made it possible to significantly reduce training requirements for machine learning models [16]. Fortunately, this area of research is ripe for exploration and could prove key in crossing the computational gap between industry standard machine learning techniques and current mobile and IoT systems.

MobileNet
MobileNets are a class of efficient machine learning models for mobile and embedded visual applications [21]. While convolutional neural networks tend to be given more data and made to be more complex models, MobileNet opts to create an architecture that is more efficient with respect to size and speed.
MobileNets are built on an initial full convolutional layer, followed by depthwise separable convolutions. It is important to note the difference between a standard convolution and the depthwise convolution used in a MobileNet. A visual representation of the difference between a standard convolutional filter, a depthwise convolutional filter, and a pointwise convolutional filter mentioned later in this paper is shown below (Figure 2-3). When expressing a depthwise separable convolution as a two-step process, a reduction in computational cost can be achieved, as shown in Equation 2-5. Since MobileNets use 3 × 3 depthwise separable convolutions, they incur eight to nine times less computational cost than standard convolutions and show only a slight reduction in accuracy.
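The eight-to-nine-fold figure can be checked with a short calculation. The sketch below counts multiply-accumulate operations for one layer using the cost expressions from the MobileNet paper [21]. The variable names and the example layer size (512 input channels, 512 output channels, a 14 x 14 feature map) are assumptions chosen for illustration, not values taken from this thesis.

```python
def standard_conv_cost(kernel, in_ch, out_ch, fmap):
    """Multiply-accumulates for a standard kernel x kernel convolution."""
    return kernel * kernel * in_ch * out_ch * fmap * fmap


def depthwise_separable_cost(kernel, in_ch, out_ch, fmap):
    """Multiply-accumulates for a depthwise step followed by a 1 x 1 pointwise step."""
    depthwise = kernel * kernel * in_ch * fmap * fmap
    pointwise = in_ch * out_ch * fmap * fmap
    return depthwise + pointwise


# Hypothetical layer dimensions, used only for illustration.
kernel, in_ch, out_ch, fmap = 3, 512, 512, 14
standard = standard_conv_cost(kernel, in_ch, out_ch, fmap)
separable = depthwise_separable_cost(kernel, in_ch, out_ch, fmap)

ratio = separable / standard    # simplifies to 1/out_ch + 1/kernel**2
speedup = standard / separable
print(f"cost ratio {ratio:.4f}, reduction factor {speedup:.2f}")
```

Because the ratio simplifies to 1/N + 1/k^2 for N output channels and a k x k kernel, the saving approaches a factor of nine for 3 × 3 kernels as the number of output channels grows.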
Each layer of a MobileNet is followed by a batch normalization layer and ReLU nonlinearity layer except for the final fully connected layer, which feeds into a SoftMax layer for classification. Down sampling is handled with strided convolutions in the depthwise convolutions and the first layer. A final average pooling layer reduces the spatial resolution to one before the fully connected layer. The full architecture for a MobileNet can be found below (  to increase the efficiency of the model. General matrix multiply (GEMM) functions are often used to implement convolutions, but frequently require an initial reordering in memory in order to map the convolution to a GEMM. This can be seen in the Caffe machine learning package [23]. The pointwise convolutions used by MobileNet do not require this reordering in memory. As MobileNets spend roughly 95% of their computation time in these pointwise convolutions, they can run much more efficiently than other computer vision models, making them ideal for mobile devices.
MobileNetV2 is a mobile architecture that further improved the state-of-the-art  (Figure 2-4). The architecture of MobileNetV2 contains an initial full convolution layer with 32 filters, followed by 19 residual bottleneck layers. ReLU6 is used as the nonlinearity for its robustness with low-precision computation, and a 3 × 3 kernel size is used because it is standard in modern networks. As was the case in the original MobileNet architecture, MobileNetV2 can be tweaked using tunable hyperparameters. A key difference is that for multipliers less than one, a width multiplier is applied to all layers but the very last convolutional layer to improve performance of smaller models. Assuming the size of the input domain is $|x|$ and the size of the output domain is $|y|$, then the memory required to compute $F(X)$ can be as low as

$$|s^2 k| + |s'^2 k'| + O(\max(s^2, s'^2)) \qquad \text{(Eq. 2-12)}$$
This is allowed by using the constraints that the inner transformation, including and NasNet-A.

Transfer Learning
Transfer learning is a technique motivated by the ability to intelligently apply knowledge learned previously to solve a new problem faster or with better results. This allows the domains, tasks, and distributions used in training and testing to be different [25]. Transfer learning comes in many forms. One of the most common forms is to use a model that has already been trained on a large dataset, such as ImageNet, remove the last fully connected layer, and add two new adaptation layers to the network. During training, all the layers except for the two new layers remain fixed. In doing this, the pre-trained network serves as a feature extractor for the new network. A network trained on a dataset with millions of images and thousands of categories, like ImageNet, can provide a new network with many useful features and help to obtain optimal results. Studies have been conducted to attempt to determine how the chosen architecture and dataset of the source domain will affect the transferability of a model [26]. In scenarios where the new domain is far removed from that of the pre-trained network, this type of transfer learning may not be beneficial. In the case of a significant gap between the source and target domains, it may be possible to use a subset of transfer learning, known as domain adaptation, to minimize the domain gap between two datasets [27], [28]. Domain adaptation aims to leverage an existing network to improve a network in a different, but somewhat related, target domain. While many transfer learning techniques have been explored, a combination of techniques could yield even better results with an even greater reduction in the labeled data required to train such models.

Inventory Management Systems (IMS)
Inventory systems suffer from several problems. Some of the more common issues include stock outs, excess inventory, misplaced inventory, and employee errors [29].

Stock Outs and Excess Inventory
Stock outs are shortages in inventory. They are often caused by inaccurate records or poor forecasting in the inventory system. Stock outs can result in delays in product availability. The solution currently being used to reduce stock outs is to implement accurate trigger points that can determine when to purchase more materials [29].
Excess inventory is the result of companies not using inventory after a purchase.
It can cause additional costs to the organization through storage costs and funds tied up in unused stock [29]. Much like stock outs, excess inventory can currently be mitigated using an electronic IMS. The IMS can provide useful information to the purchasing department in order to forecast when purchases should be made.

Misplaced Inventory and Employee Error
Misplaced inventory can occur when the system does not store quality information.
The system may not store the details of inventory's location. Misplaced inventory can result in wasted time and cause late deliveries to customers [29]. One current solution to avoid misplaced inventory is to physically count and organize all material as soon as a shipment arrives. This material can then be put in a spreadsheet and inserted into an electronic IMS [30]. Another solution is to integrate bar-code technology into an IMS [31]. While an electronic IMS can certainly help to mitigate misplaced inventory, it is not a perfect solution. An electronic IMS cannot account for user error [30].
Employee errors can lead to inaccuracy in inventory records, failure to purchase new material, or an excess in inventory. Materials can accidentally be misplaced, thrown away, or broken. Clerical errors can also occur [30]. Currently, the only solution for employee error is to train employees on specific inventory systems that they will be using [29].

CHAPTER 3

Data Collection Experiment
For initial proof of concept testing, a dataset needed to be generated to simulate the use case that the model needed to solve. To do this, thousands of images needed to be generated of objects in some sort of container (Figure 3-1). To create this dataset, an experiment was conducted using a guitar amplifier positioned with the speaker pointing straight up toward the ceiling. A container that held some combination of plastic spoons and forks rested on top of the amplifier. Next, a camera connected to a computer was attached to a stand that hung above the container. A lamp was placed above the container to maintain a controlled lighting level. A script was run on the computer to take pictures of the container once every second while the speaker was playing a looped audio file that could shake the objects in the container enough to create varied images for training of a model. After initial tests were done to train the model using the data collected, an issue was found with the angle at which the camera was hanging above the container. Because of this issue, the dataset needed to be cleaned. Images collected in the categories of spoons with a fork and forks with a spoon had to be checked for times when the foreign object was out of view of the camera. These "dirty" images were removed so as to avoid discrepancies in testing. Upon cleaning the data collected, the model was able to obtain substantial accuracy when trying to place the images into the specified categories.

Transfer Learning Experiments
An initial transfer learning experiment was done using a convolutional neural

Data Reduction Experiments
An experiment was created to measure the data reduction possible when using transfer learning of a model previously trained on ImageNet to classify a new dataset.
The experiment went as follows. First, the model was trained using the full dataset of forks and spoons from the data collection experiments and results were collected from 5 different training and testing runs. These results included accuracy at several steps in the training process, the final confusion matrix, and some of the images that were misclassified. Next, the dataset was cut in half and the results of training and testing were recorded for 5 separate runs. Based on those results, a call would be made to again halve the training set or add half of the removed data back to the set, emulating a sort of binary search. This continued until the required data to properly train the model could be determined. As this test was done on a dataset that includes variations of objects in the original ImageNet dataset, it was assumed that a much smaller subset of the collected data would be required.
It was found that the original training set of 10274 images could be reduced to a training set of 500 images while still maintaining an accuracy of 94% (only 1-2% lower than the original network). In general, images containing predominantly forks (or entirely forks) were almost never classified as containing predominantly (or entirely) spoons. The inverse was not the case.
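The halving procedure described above can be sketched as a binary search over the training-set size. This is a minimal sketch under stated assumptions and not the script used in the experiment. The `evaluate` callback, the stopping tolerance, and the toy accuracy function are all hypothetical, and the search assumes that the full dataset meets the target accuracy.

```python
def find_min_training_size(full_size, evaluate, target_accuracy, runs=5, tolerance=100):
    """Binary-search for a small training-set size that still meets a target accuracy.

    `evaluate(size)` is a hypothetical callback that trains on `size` images
    and returns the test accuracy of one run. The full dataset is assumed
    to meet the target.
    """
    low, high = 0, full_size  # `high` is always a size known to be sufficient
    while high - low > tolerance:
        mid = (low + high) // 2
        mean_accuracy = sum(evaluate(mid) for _ in range(runs)) / runs
        if mean_accuracy >= target_accuracy:
            high = mid  # enough data: try halving again
        else:
            low = mid   # too little data: add back half of what was removed
    return high


def toy_evaluate(size):
    """Stand-in for training and testing a model; not a real measurement."""
    return 0.95 if size >= 500 else 0.80


minimum = find_min_training_size(10274, toy_evaluate, target_accuracy=0.94)
print(minimum)
```

Averaging over several runs at each size, as in the experiment, reduces the chance that one unusually good or bad run sends the search in the wrong direction.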

Data
In order to train a computer vision system for inventory management, a method for creating a dataset is required. As mentioned in Section 2.5.1, a method was proposed to collect data for use in testing this thesis. A guitar amplifier was placed with its speaker facing upward and a camera was fastened above the amplifier facing directly downward.
A lamp was placed above the amplifier to maintain a constant source of lighting. In order to create a more consistent background behind the collected objects, a solid black container was placed atop the amplifier. The objects that were to appear in the images were then placed inside the container. The amplifier was turned on and the volume was raised to a level high enough to cause a vibration of the objects in the container. Images were taken every three seconds. This was a long enough time to ensure a substantial degree of change between images. Once a substantial number of images for a set of objects were collected, the objects in the container were swapped for a new set of objects and the process was then repeated. It is also suggested that images with different quantities of objects be collected for a more robust dataset. Upon completion of data collection, all images were checked for significant blur or other issues with the image.
Next, images were flipped and rotated in order to create more unique images to use in training the network. An example of flipped and rotated images can be seen below (Figure 3-2). Once all of this was complete, the newly created dataset was used to train computer vision models.

Figure 3-2:
Image after each of the preprocessing methods.
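The flipping and rotating step can be sketched as follows. This is a minimal illustration assuming NumPy and images stored as arrays. The particular set of transformations is an assumption chosen to give four variants per image, since the exact flips and rotation angles applied by the preprocessing script are not listed here.

```python
import numpy as np


def augment(image):
    """Return the original image plus flipped and rotated variants.

    The choice of these three transformations is an assumption made for
    illustration.
    """
    return [
        image,
        np.fliplr(image),  # horizontal flip
        np.flipud(image),  # vertical flip
        np.rot90(image),   # 90-degree counter-clockwise rotation
    ]


image = np.arange(12).reshape(3, 4)  # stand-in for a 3 x 4 grayscale image
variants = augment(image)
```

Because the camera looks straight down at the container, flipped and rotated views are plausible views of the same scene, which is what makes this kind of augmentation reasonable for this setup.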

Hardware
While the neural network ultimately needs to be implemented on low-resource devices, training the network on a low-resource device is not necessarily practical.
Instead, all networks were trained on a desktop computer with an Intel Core i9-9900K CPU with 16 GB of RAM. The trained networks were then saved and exported for use on low-resource devices. The low-resource device chosen for this experiment is a Raspberry Pi 3 with a 1.4 GHz 64-bit quad-core processor and 1 GB of SDRAM.

Choice of Base Model
The neural network chosen for this experiment uses the MobileNetV2 images over 1000 categories [33]. It should be robust enough for the purposes of this thesis.

Transfer Learning
In order to perform transfer learning, the feature vector selected in Section 3.3.1 was used. Atop the feature vector of the pre-trained model, a dense fully connected layer was added. From there, the cross entropy of the probability distribution was taken, and the mean was found. The cross-entropy mean was then used as the loss for a gradient descent optimizer in order to train the model. A SoftMax layer was used in order to transform the logits vector into a vector of probabilities that was used to determine the prediction made by the new network. With the pre-trained model's feature vector frozen, the new network was trained on a dataset created using the method laid out in Section 3.1. The trained neural network was then saved and exported for use in the low-resource environment. Once there, the network architecture, weights, and biases were loaded and run in a production scenario.
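The training step described above can be sketched with NumPy alone. This is a simplified stand-in and not the code used in this thesis. The feature vectors are synthetic placeholders for the output of the frozen pre-trained model, and the learning rate, step count, and function names are assumptions. Only the new dense layer is updated, using the mean cross-entropy as the loss for plain gradient descent.

```python
import numpy as np


def softmax(logits):
    """Turn a batch of logits vectors into vectors of probabilities."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def train_head(features, labels, num_classes, lr=0.1, steps=200):
    """Train one dense layer on frozen features with mean cross-entropy loss."""
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    losses = []
    for _ in range(steps):
        probs = softmax(features @ weights + bias)
        loss = -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
        losses.append(float(loss))
        grad_logits = (probs - one_hot) / n  # gradient of the mean cross-entropy
        weights -= lr * features.T @ grad_logits
        bias -= lr * grad_logits.sum(axis=0)
    return weights, bias, losses


# Synthetic stand-in for feature vectors from a frozen pre-trained network.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 30)
features = 2.0 * np.eye(3, 8)[labels] + rng.normal(0.0, 0.3, size=(90, 8))

weights, bias, losses = train_head(features, labels, num_classes=3)
predictions = softmax(features @ weights + bias).argmax(axis=1)
```

Since the feature extractor is never updated, only the small weight matrix and bias vector of the new layer need to be stored alongside the pre-trained model when exporting to the low-resource device.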

Classification Methods
In order to address the challenge of detecting foreign objects in a container, two methods were explored. Both methods attempted to detect the foreign objects in the form of a classification problem but differed in complexity and robustness. The first method classified exactly which foreign object, if any, was in the container. This method could be more easily considered a classification problem in that the model was explicitly trained on what the foreign object looked like. The first method required a larger list of categories and posed problems when trying to scale. In fact, just identifying a single foreign object at a time would require n! number of categories where n is the number of unique objects in the inventory system. This method also required a larger dataset to accommodate these categories. The second method explored was to identify whether there is a foreign object in the container, but not what the object is. This method required only 2n categories. On the surface, this method may seem simpler. The problem that may arise with this method is that the network will be expected to learn that any items in the container other than the one that is supposed to be there are foreign. Because of the variety of foreign objects that can exist as the model scales, the network may struggle to define a foreign object.

Growing a Dataset
Implementing a computer vision system within an inventory management system poses a few distinct data problems. The first of these issues is that many images are needed for the initial training of a network. The inventory of such a system could also change over time. Additionally, variables such as location or lighting may change, affecting the images that the vision system will be using for classification. In order to combat these issues, a method is proposed to grow the dataset over time and retrain the network as it grows. In the proposed system, some of the images that the system takes while in production are selectively added to the dataset that we initially created. These

Transfer Learning
In order to test the benefits of transfer learning to our mock computer vision system, we trained our model with and without transfer learning while tweaking several variables including batch size, total epochs, learning rate, and activation function. We

Growing Dataset
In order to properly assess the potential benefits of intelligently growing a dataset, a dataset was expanded over the course of three weeks. The images added to the initial dataset were based on the confidence value of the model's prediction. The threshold that was used is a confidence value of 0.8. This threshold was selected to allow a significant number of images to be added to the dataset over time. It is subject to change based on the dataset being used. All images that led to a lower confidence value than the threshold were saved and added to the dataset. A variety of networks were trained on the resulting dataset at the end of each week and the accuracy during the training was recorded.
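The selection rule can be sketched as follows. This is a minimal illustration assuming NumPy and a model that outputs one probability vector per image. The function name and the example probabilities are hypothetical, and only the 0.8 threshold comes from the text. How the retained images are labeled is outside the scope of this sketch.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.8  # threshold reported in the text


def select_for_dataset(probabilities, threshold=CONFIDENCE_THRESHOLD):
    """Return indices of images whose top-class confidence is below the threshold."""
    confidence = probabilities.max(axis=1)
    return np.flatnonzero(confidence < threshold)


# Made-up model outputs for four production images (rows sum to one).
probabilities = np.array([
    [0.95, 0.03, 0.02],
    [0.55, 0.40, 0.05],
    [0.10, 0.79, 0.11],
    [0.80, 0.15, 0.05],
])
selected = select_for_dataset(probabilities)
```

Raising the threshold retains more images per week at the cost of adding examples that the model already handles well, which is why the value may need to change from one dataset to another.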
An additional dataset was created to test each model's ability to generalize. This dataset will be referred to as the generalization dataset. The generalization set needed to include a wider variety of lighting conditions. It also needed to include additional objects that were not included in the initial dataset. The images in this new dataset were entirely independent of the original dataset. This way none of the images in the new dataset were seen by any of the models during training.
The accuracy on the generalization dataset allowed for a better metric to test how well the system can identify objects that are placed in a wrong container. This accuracy was collected for a series of networks. The change in accuracy over the course of growing the dataset was recorded, and the average change in this accuracy was calculated from week to week. In order to consider the proposed method of intelligently growing a dataset a success, the average change in accuracy on the generalization dataset from week to week needed to show a consistent increase.
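The success criterion can be expressed as a short calculation. The sketch below is an assumption about how the week-to-week change might be computed. It reads "a consistent increase" as every weekly change being positive, and the accuracy values are placeholders, not results from this thesis.

```python
def weekly_changes(accuracies):
    """Differences in accuracy between each pair of consecutive measurements."""
    return [later - earlier for earlier, later in zip(accuracies, accuracies[1:])]


def average_weekly_change(accuracies):
    """Mean of the week-to-week changes in accuracy."""
    changes = weekly_changes(accuracies)
    return sum(changes) / len(changes)


# Placeholder accuracies on the generalization dataset: initial, then weeks 1-3.
accuracies = [0.70, 0.74, 0.77, 0.81]
changes = weekly_changes(accuracies)
average = average_weekly_change(accuracies)
consistent_increase = all(change > 0 for change in changes)
```

The average alone can hide a week in which accuracy fell, so checking each individual change is the stricter reading of the criterion.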

Data
A set of 6000 images was collected using the method outlined in Section 3.2. The images were checked for significant blur that may have occurred. They were then flipped and rotated using a preprocessing script. This resulted in a dataset of 24000 images evenly distributed across five types of objects, each type of object with foreign objects, and a set of empty containers. The final dataset was used for the transfer learning tests. It was also used as the initial dataset for intelligently growing a dataset. For the purpose of testing the benefits of transfer learning and growing a dataset, the models will only be attempting to classify images as being empty, containing the desired object, or containing foreign objects. Some example images and class labels from this dataset can be seen below (Figure 4-1).

Figure 4-1:
Example images and labels from the initial dataset.

Transfer Learning
Training Duration To begin testing the benefits of introducing transfer learning to the proposed computer vision system, models were trained with a batch size of 10, a learning rate of 0.01, and the classification layer used the Scaled Exponential Linear Unit (SELU) activation function. The number of epochs that a given model was trained on ranged from 100 to 3000. Over the course of 30 models, the average accuracy increased from 93.33% to 95.05%. To obtain this accuracy, a 70:30 split was used. This will continue to be the case throughout the remainder of this thesis.
The accuracy of the model using transfer learning begins to stabilize around 1200 epochs (Figure 4-2). The model that doesn't use transfer learning still sees dips in accuracy at 2000, 2500, and 3000 epochs. The standard deviation in accuracy from 1200 to 3000 epochs is 0.42% with transfer learning and 0.76% without it. This is not a very large difference, but it is worth noting. Models trained for 1200 or more epochs averaged a 1.73% increase in accuracy when using transfer learning, while models trained for fewer than 1200 epochs averaged an increase of 2.74%.

Figure 4-2:
The accuracy of networks with and without transfer learning.

Activation Function of Classification Layer
The next variable to be tested was the activation function used in the classification layer. To do this, the previous experiment was repeated three additional times. Each time, the batch size and learning rate were held constant, while the accuracy of the network was tested every 100 epochs from 100 to 3000 total epochs. The activation function in the classification layer was changed in turn to the Linear, Rectified Linear Unit (ReLU), and Leaky Rectified Linear Unit (Leaky ReLU) activation functions. The models using the SELU activation function in the classification layer benefitted the most from the use of transfer learning (

Increase in Data
Over the course of three weeks, an initial dataset of 24000 images was grown to a total of 28448 images. In that time, new lighting conditions and object types were presented to the model. Only the images that led to a low confidence value were retained to be added to the dataset. The first week of growing the dataset led to a selection of 1132 images to be added. The second week resulted in an additional 1220 images. The final week of the experiment resulted in an additional 2096 images that were added to the dataset.
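The retention rule described above can be sketched as follows. This is a hedged illustration only: it assumes that the confidence value is the highest class probability produced by the model, and both the threshold of 0.80 and the function name `select_low_confidence` are hypothetical.

```python
import numpy as np

# Hypothetical cut-off: predictions below this confidence are kept.
CONFIDENCE_THRESHOLD = 0.80

def select_low_confidence(probabilities, threshold=CONFIDENCE_THRESHOLD):
    """Return the indices of images the model was unsure about.

    probabilities: array-like of shape (n_images, n_classes) holding the
    class probabilities for each image. Confidence is assumed to be the
    largest probability in each row.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    confidence = probabilities.max(axis=1)
    return np.flatnonzero(confidence < threshold)

# Made-up probabilities for four images over the three classes
# (empty, desired object, foreign object).
probs = [
    [0.97, 0.02, 0.01],  # confident: not retained
    [0.55, 0.40, 0.05],  # unsure: retained
    [0.10, 0.15, 0.75],  # unsure: retained
    [0.05, 0.90, 0.05],  # confident: not retained
]
keep = select_low_confidence(probs)

# The weekly selections reported in this section sum to the final size.
final_size = 24000 + 1132 + 1220 + 2096
```

Keeping only the low-confidence images focuses the new data on the lighting conditions and object types that the current model handles poorly, rather than on cases it already classifies with high confidence.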
In addition to the five types of objects in the initial dataset, eight additional types of objects were introduced into the dataset. Because these new objects made up only some of the 4448 images added while the dataset was grown, the new objects are not represented equally with the existing types of objects.
The generalization dataset comprises 6848 images distributed among the thirteen types of objects, each object with additional foreign objects, and a set of empty containers. The distribution of images in this set is more even than in the dataset grown over time.
Test Accuracy
The first metric that can be observed over the course of growing the dataset is the change in test accuracy from week to week. To gain insight into this change, a series of models was trained using a batch size of 10, a learning rate of 0.01, one of four different activation functions in the classification layer, and a number of epochs ranging from 50 to 3000. A total of 240 models were trained. On average the series of networks resulted in a 0.15% gain in test accuracy after the first week ( week.
When the test accuracy is analyzed based on the activation function used in the classification layer, it becomes apparent that the slight drop in accuracy over the course of three weeks holds across all the models. There is some variation in which week leads to the greatest loss of accuracy, but the overall change is consistent.

Accuracy on Generalization Dataset
The next metric that was gauged from week to week was the accuracy on the generalization dataset. The same models used to collect test accuracy were used to measure accuracy on the generalization dataset. After one week of collecting data to grow the dataset, there was a 0.04% growth in the average accuracy across all networks. After the second week, an additional 1.26% growth was witnessed. At the end of the third week of data collection, a further 2.24% growth in average accuracy on the generalization dataset was measured. This amounts to a total growth of 3.54% in average accuracy after three weeks (Table 4-3).
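The three-week figure follows directly from the weekly gains reported above. The short sketch below reproduces the arithmetic using only those reported values; the variable names are chosen for illustration.

```python
from itertools import accumulate

# Weekly growth in average generalization accuracy, in percentage
# points, as reported for weeks one, two, and three.
weekly_gain = [0.04, 1.26, 2.24]

# Running total after each week, rounded to two decimal places to
# avoid floating-point noise.
cumulative_gain = [round(total, 2) for total in accumulate(weekly_gain)]

# Total growth after three weeks.
total_gain = cumulative_gain[-1]
```

Each weekly gain is larger than the one before it, which matches the requirement that accuracy on the generalization dataset show a consistent increase.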
As with the results for test accuracy, the activation function used in the classification layer seems to have little impact on the amount by which the accuracy on the generalization dataset increased over the course of three weeks. The increase in accuracy over the course of three weeks stayed around 3.5%. The final accuracy after three weeks of growing the dataset stayed between 65% and 67% (Figure 4-3). The increase in accuracy over time means that the dataset is growing in a way that allows the

Transfer Learning
Based on the data collected, it appears that transfer learning had a positive effect on the accuracy of the various models that were tested in this thesis. Although the increase in accuracy was only about 1.71%, it was consistent across the models.
Additionally, the tendency for the models that used transfer learning to stabilize faster indicates that training can be stopped sooner. This is significant because the training duration for those models can be shortened without much impact on accuracy.

Accuracy on Generalization Dataset
The accuracy on the generalization dataset of each variation of the computer vision model was collected along with the test accuracy. As the accuracy on the generalization dataset is the accuracy of the model on a set of images that contains many more lighting conditions and unknown objects, it wasn't expected to be as high as the test accuracy. The generalization dataset used for this thesis included 9 additional objects that were not present in the initial dataset. The generalization dataset also included many images in which the various foreign objects were partially covered or obscured in order to make them harder to detect.
The generalization dataset had no impact on the training of the models themselves. As a result, the standard deviation of accuracy on the generalization dataset is greater than that of the test accuracy. Where the standard deviation of the test accuracy ranged from 2.21% to 3.76% over the course of growing a dataset, the standard deviation of the accuracy on the generalization dataset ranged from 5.36% to 6.26%. The test accuracy of models tends to stabilize over time, while that was not the case for the accuracy on the generalization dataset.

Long-term Implications
The benefits of growing a dataset can be seen over the course of the three weeks in which the dataset was grown. This is enough data to show a trend, but it is hard to be sure that this trend will hold over the entire lifecycle of the system. Continuing these tests over months or years could help to better gauge the long-term implications of growing a dataset.

Sensitivity and Specificity
While the proposed method of testing the impact of growing a dataset is relevant, other metrics should be explored in the future. The degree of bias and variance in the model was not captured in the tests performed in this thesis, but both are still quite important. A good way to gain insight into them is to measure sensitivity and specificity. Sensitivity is the probability that something that is actually positive is tested "positive". In the context of this thesis, this would be the probability that an image containing a foreign object is identified as containing one. Specificity, on the other hand, is the probability that something that is actually negative is tested "negative", which here means that an image without a foreign object is identified as not containing one. If the dataset is grown over several weeks, a positive change in sensitivity and specificity would be a sign that the model is getting better at differentiating between images with and without foreign objects.
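Both metrics can be computed from the counts of correct and incorrect predictions. In the standard definitions, sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP), where a "positive" is taken here to be an image containing a foreign object. The sketch below is a minimal illustration under that assumption; the label strings, function name, and example predictions are hypothetical and are not results from this thesis.

```python
def sensitivity_specificity(y_true, y_pred, positive="foreign"):
    """Compute sensitivity and specificity for one positive class.

    Every label other than `positive` (for example "empty" or "object")
    is treated as negative.
    """
    tp = fn = tn = fp = 0
    for truth, prediction in zip(y_true, y_pred):
        if truth == positive:
            if prediction == positive:
                tp += 1  # foreign object correctly flagged
            else:
                fn += 1  # foreign object missed
        else:
            if prediction == positive:
                fp += 1  # false alarm
            else:
                tn += 1  # correctly not flagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Hypothetical labels for eight images.
y_true = ["foreign", "foreign", "foreign", "foreign",
          "object", "object", "empty", "empty"]
y_pred = ["foreign", "foreign", "foreign", "object",
          "object", "foreign", "empty", "empty"]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

Computing the pair after each week of growing the dataset would show whether the model is improving at catching foreign objects, at avoiding false alarms, or at both.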

CHAPTER 6
CONCLUSIONS AND FUTURE WORK

Conclusions
The data collection method and preprocessing techniques outlined in this thesis proved to be capable of creating a large initial dataset that can be used to train a deep neural network. Based on the results of the transfer learning experiment in this thesis, it can be concluded that transfer learning had a positive impact on the proposed computer vision system by increasing the accuracy of the model. The results gathered in this thesis also support the idea that intelligently growing a dataset is a viable method for adapting to changing needs of a computer vision system.
Ultimately, we are left with an outline for creating a complete computer vision system capable of detecting misplaced inventory using classification. The system can be run on low-resource devices such as Raspberry Pis and can be improved over time as its needs change. The system can be scaled for as many types of inventory as are needed.
The long-term implications of growing the dataset used for training this system can't be concluded at this time, but we have enough data to conclude that the proposed method for growing a dataset can have positive effects on the system.

Future Work
Training on Low-Resource Devices
The possibility of training networks on the same low-resource devices that they are eventually run on would be an interesting route for future exploration. A single Raspberry Pi certainly wouldn't have enough compute power to train a model, but a series of low-resource devices could be turned into a cluster that may be usable for training. Similarly, something along the lines of an NVIDIA Jetson may be capable of training the networks used in this thesis and is certainly less powerful than the machines currently being used for training.

Object Detection
The classification methods used in this thesis are a reasonable way to address the issue of foreign objects, but a method along the lines of object detection could be used as well. It may be possible to look for individual objects and identify all the objects that aren't meant to be in a container. Testing the effects of growing a dataset on other tasks, such as object detection, could prove useful. These types of tasks could also benefit from the ability to gradually introduce new lighting conditions or types of objects.

Other Machine Learning Architectures
While MobileNetV2 was the architecture chosen for this thesis, there are many other machine learning architectures that could have been used. Other types of neural networks could benefit more from intelligently growing a dataset. In theory, the proposed method of growing a dataset is applicable to any image dataset. There is no reason to believe that other neural networks couldn't benefit from this technique as well.

Long-term Benefits of Growing a Dataset
As mentioned in the discussion section of this thesis, it could prove valuable to test the concept of intelligently growing a dataset over a much longer period to better reflect the benefits seen later in the lifetime of a computer vision system.

Sensitivity and Specificity Testing
As discussed in Section 5.4, sensitivity and specificity could prove to be additional metrics capable of showing the benefit of intelligently growing the dataset.