Distinguishing artefacts: evaluating the saturation point of convolutional neural networks

Prior work has shown that Convolutional Neural Networks (CNNs) trained on surrogate Computer Aided Design (CAD) models are able to detect and classify real-world artefacts from photographs. Such applications support the twinning of digital and physical assets in design, including rapid extraction of part geometry from model repositories, information search \& retrieval, and identifying components in the field for maintenance, repair, and recording. The performance of CNNs in classification tasks has been shown to depend on training data set size and the number of classes. Where prior works have used relatively small surrogate model data sets ($<100$ models), the question remains as to the ability of a CNN to differentiate between models in increasingly large model repositories. This paper presents a method for generating synthetic image data sets from online CAD model repositories, and investigates the capacity of an off-the-shelf CNN architecture trained on synthetic data to classify models as class size increases. 1,000 CAD models were curated and processed to generate large-scale surrogate data sets, featuring model coverage at steps of 10$^{\circ}$, 30$^{\circ}$, 60$^{\circ}$, and 120$^{\circ}$. The findings demonstrate the capability of computer vision algorithms to classify artefacts in model repositories of up to 200 models; beyond this point the CNN's performance is observed to deteriorate significantly, limiting its present ability to support automated twinning of physical to digital artefacts. A match is, however, more often found in the top-5 results, showing potential for information search and retrieval on large repositories of surrogate models.


Introduction
Recent trends in digital design, such as Twinning [6] have underlined the value of rapid, or fully integrated, synchronisation between physical and digital domains to accelerate processes and enhance analytic capability.
The prototyping process comprises vital iteration between a multitude of physical and digital models [5,14,16], in which both physical and digital states must be captured, aligned, and replicated across domains (i.e. updating of CAD models).
A key technical challenge thereby lies in creating a relationship between physical and digital states, such that each may be recognised as a counterpart of the other, whilst maintaining this alignment across multiple versions of a prototype. By automating the detection of physical models and searching/aligning against a digital counterpart, scope exists to reduce process time and cost via physical/digital transition.
Detection has previously utilised physical tagging (e.g. QR codes and barcodes) or direct scanning (e.g. photogrammetry). Where these methods typically require modification of the physical prototype or generation of new digital models, recent works have demonstrated the ability of Convolutional Neural Networks (CNNs) to extract and learn distinguishable features of an artefact for classification, enabling rapid association with existing data, such as retrieving CAD models by taking a photograph of their real-world counterpart [4].
However, the performance of a CNN is dependent on quantity and quality of training data [1], and the number of classes between which it must distinguish [2]. Where photographic training data is often sparse in prototyping, implementing a CNN becomes a significant challenge [11], thus the real-world viability of using CNNs for design twinning is not known.
This paper investigates (a) the performance and implementation challenges associated with CNN use at increasing scale, and (b) the viability of 'off-the-shelf' CNN architectures for artefact classification. We consider an artefact to be a designed object, whose form can be distinguished and classified.
The paper proceeds to present related works in the field of CNN use in design (Section 2), followed by a methodology for testing CNN scalability (Section 3). Results are then reported (Section 5), and a discussion follows with respect to CNNs and their ability to support twinning activities in design (Section 6). The paper concludes by detailing the key findings from the study.

Related work
CNNs in design: The application of CNNs to design is a rapidly emerging field of inquiry with many potential impacts across Engineering Design. For example, [17] trained a CNN on multi-view renders of 3D geometry and used the resulting CNN to classify other 3D geometry. A potential use case for this is the matching of similar parts across product families with a view to reducing the part variety in an organisation's supply chain. [10] have sought to augment depth-mapped images with Neural Networks to develop voxel-based approximations of objects within a scene, with potential application to design by providing a means to describe the locations in which products are used and deployed. [4] seeks to democratise design by using a CNN as an information search and retrieval tool for large model repositories, such as Thingiverse and MyMiniFactory. The CNN enables users to simply take a photo of the item that they wish to print and returns the closest matching result in the repository's dataset. [3] recently demonstrated CNNs emulating mathematical and user perceptions of shape and form. These could be used to check for conformance to brand identity, as well as for potential infringements on others' designs. A CNN could also twin user feedback into the design process, acting as market feedback by providing real-time assessment of designs as a designer works on their product. While such examples demonstrate the value of CNNs in design, there remain questions around their performance, and challenges in their implementation.
Surrogate models for dataset generation: The challenge of acquiring datasets, with thousands of images required per artefact, has previously been a limiting factor in exploring the utility of CNNs in large-scale classification tasks. With the context of design prototyping requiring sufficient data to distinguish between 100s of prototype or component versions, this presents a systemic obstacle of particular importance.
Recent work has shown the viability of synthetic image dataset generation, whereby a surrogate CAD model is processed and rendered with computer graphics software to generate two-dimensional artefact representations [13,4]. Image composition, for example lighting, surfaces, background, and appearance can additionally be tuned to replicate real-world environmental features [18,11,12].
This method presents an opportunity to leverage existing virtual models for large-scale classification, whereby a CNN is trained on synthetic images from a surrogate model data set; thus mitigating the need for 'real-world' photographs, and prior restrictions in data acquisition [4]. To-date however, the effect of model repository size on the efficacy of this approach has not been investigated. Where CNN detection accuracy will typically decrease as repository size increases, the extensive training sets producible via surrogate models give scope to substantially increase performance.

Methodology
The study followed a five-step computational process to determine the scaling behaviour of a CNN, for the purpose of twinning physical artefacts with their digital counterparts in large repositories.

Dataset curation
The data set used consisted of 1,049 STL files collected from the model sharing website MyMiniFactory.com. MyMiniFactory is a website that allows users to upload, share, and sell 3D models, and also provides an API to programmatically interact with the 3D model database. Between June 10th and June 15th 2020, the API was queried using a Python script with the search terms 'spare part', '3d printer', and 'accessibility'. Additionally, a filter was applied to restrict results to those under a Creative Commons free-to-re-use license. A total of 1,514 files were downloaded, consisting predominantly of STL files (1,112 files) alongside a range of CAD files. A small number of the 1,112 STL files were corrupted and removed, leaving a data set of 1,049 usable files.

Surrogate model curation
Where prior work has used relatively small datasets (< 100 models), this work leverages surrogate models based on existing CAD geometry to enable CNN use in the design scenario.
The open source 3D computer graphics software Blender (version 2.8) was used to create photo-realistic renders of the artefacts. Blender is widely used throughout the computer games and film industry to create and render 3D graphics and animation. It also provides a Python library, which, in this work, allowed for scripted model loading, creation of lighting, camera positioning, and image rendering. Model size was also normalised via re-scaling of the bounding box, such that the artefact could be represented fully in a 540px x 540px rendered output.
Camera positions were set using longitude and latitude angles, allowing the camera to be rotated around the model (see Fig. 1). To generate a range of images, the longitude and latitude angles were stepped at 10°, 30°, 60°, and 120°, resulting in four data sets of 684, 84, 24, and 6 renders/images per artefact respectively. A summary is provided in Table 1.
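The camera placement described above can be sketched as follows. This is a minimal illustration, not the authors' exact Blender script: the function name, sphere radius, and spherical-to-Cartesian conversion are assumptions, but the enumeration of longitude/latitude steps reproduces the per-artefact image counts in Table 1.

```python
import math

def camera_positions(step_deg, radius=2.0):
    """Enumerate camera positions on a sphere around the model.

    Longitude sweeps a full 360 degrees; latitude spans pole to pole
    inclusive. `radius` (scene units) is an illustrative assumption.
    """
    positions = []
    for lon in range(0, 360, step_deg):
        for lat in range(-90, 91, step_deg):
            lon_r, lat_r = math.radians(lon), math.radians(lat)
            x = radius * math.cos(lat_r) * math.cos(lon_r)
            y = radius * math.cos(lat_r) * math.sin(lon_r)
            z = radius * math.sin(lat_r)
            positions.append((x, y, z))
    return positions

# One render per camera position yields the data set sizes in Table 1:
# 10° -> 684, 30° -> 84, 60° -> 24, 120° -> 6 images per artefact.
counts = {step: len(camera_positions(step)) for step in (10, 30, 60, 120)}
```

In Blender, each position would be assigned to the scene camera (pointed at the model's origin) before invoking a render.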

CNN selection
This study utilised an AlexNet 'off-the-shelf' CNN architecture. The network, designed for a 1,000-category classification capacity, established prominence in computer vision research with a breakthrough Top-5 performance in the 2012 ILSVRC contest, significantly outperforming prior architectures. Through its substantial documentation in the literature and relative ease of use, AlexNet remains a widely used CNN architecture with diverse applications. Specific to this context, [4] showed AlexNet to outperform other top classification architectures, GoogleNet and ResNet, achieving 94.9% accuracy. AlexNet's architecture consists of 62 million parameters and a 1000-way output classifier [8]. It comprises eight layers; the first five are convolutional and the remaining three are fully connected (Fig. 3). The final layer, a 1000-way softmax classifier, outputs a probability distribution across 1000 class labels. The network input is a 3-channel RGB image of dimensions 227 x 227 pixels. To reduce over-fitting, regularisation measures including dropout and data augmentation are introduced. The CNN was recreated using TensorFlow 2.0 and the Keras deep learning API. An Nvidia 8GB RTX 2060 GPU was used for training.
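The spatial dimensions of AlexNet's feature maps follow from the standard convolution output-size formula. As a quick sanity check on the published layer settings [8], a sketch:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output width/height of a square convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# AlexNet's first stages on a 227 x 227 input:
s = conv_out(227, kernel=11, stride=4)   # conv1 -> 55 x 55
s = conv_out(s, kernel=3, stride=2)      # max-pool -> 27 x 27
s = conv_out(s, kernel=5, pad=2)         # conv2 (padded) -> 27 x 27
```

The same arithmetic propagated through all five convolutional stages yields the feature map that the three fully connected layers flatten and classify.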

CNN preparation
Three steps were taken to prepare the CNN following guidance in the literature and early experimentation with the surrogate data. These are:
• Forming the data pipeline and applying image augmentation prior to parsing through the CNN.
• Configuring the hyperparameters.
• Training the CNN.

Data pipeline and Augmentation
From the Keras data preprocessing utilities, a pipeline was defined to generate batches of tensor image data with real-time augmentation. The CNN was fed images from the surrogate data set using the flow-from-directory method, automatically inferring class (artefact name) from the sub-directory structure. 30% of images per class were partitioned into a validation subset for training. Additionally, parallel image data generators were created for the training and validation data subsets, allowing training data to be generated with augmentation and data shuffling whilst preserving artefact representations in the validation generator [15]. This process is shown in Fig. 4.
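The per-class 30% partition can be sketched in plain Python (Keras performs the equivalent internally when a validation split is configured; the class name and file names below are illustrative):

```python
import random

def split_per_class(files_by_class, val_fraction=0.3, seed=42):
    """Partition each class's images into training and validation subsets.

    Splitting within each class preserves every artefact's representation
    in both subsets, mirroring the parallel generators described above.
    """
    rng = random.Random(seed)
    train, val = {}, {}
    for cls, files in files_by_class.items():
        shuffled = files[:]
        rng.shuffle(shuffled)
        n_val = int(len(shuffled) * val_fraction)
        val[cls] = shuffled[:n_val]
        train[cls] = shuffled[n_val:]
    return train, val

# e.g. 84 renders per artefact (30° data set) -> 59 training, 25 validation
renders = {"bracket": [f"render_{i}.png" for i in range(84)]}
train, val = split_per_class(renders)
```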

Hyperparameters
To improve training stability and CNN generalisation, a mini-batch size of 32 was used to estimate the error gradient [9]. A learning rate of 1e-06 was applied to scale the value by which weights were updated in back-propagation. These values were found to be effective with the batch size and the generated synthetic image data. An epoch count of 200 was defined to observe training performance over time and allow sufficient space for model convergence.
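The roles of the two hyperparameters can be illustrated with a minimal gradient-descent step on a single weight (a sketch only; the study's actual optimiser was Adam, described in the next section):

```python
def sgd_step(weight, per_example_grads, lr=1e-6):
    """Update one weight from a mini-batch of per-example gradients.

    The mini-batch mean estimates the true error gradient; the learning
    rate scales how far the weight moves along that estimate.
    """
    grad_estimate = sum(per_example_grads) / len(per_example_grads)
    return weight - lr * grad_estimate

# A batch of 32 gradient samples nudges the weight by lr * mean(grads).
batch = [0.5] * 32
new_w = sgd_step(1.0, batch)
```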

Training
CNN parameters are trained with weights and biases initialised on the active data set. 'Adam' is used as the optimisation algorithm, and categorical cross-entropy is set as the loss function [7]. This approach is elected over transfer learning with pre-trained networks to explore causality between our surrogate model data structures and the CNN's performance. Additionally, hyperparameters are kept constant across data sets and training iterations.
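Categorical cross-entropy compares the network's softmax output against a one-hot class label; a minimal sketch of the loss (the example probabilities are illustrative):

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss between a one-hot label and a predicted probability distribution.

    `eps` guards against log(0) for zero-probability predictions.
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# Correct class predicted with probability 0.7 -> loss = -ln(0.7) ~ 0.357
loss = categorical_cross_entropy([0.0, 1.0, 0.0], [0.1, 0.7, 0.2])
```

Minimising this loss pushes probability mass onto the correct artefact class.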

Evaluation
CNN performance was evaluated using Keras metrics on three of the five generated data sets (30°, 60°, 90°), measuring classification accuracy and loss for training and validation steps, Top-5 categorical accuracy, and training time. Initial test data showed performance to be inconclusive for the 120° data set, and the 10° data set to be computationally inefficient; these were therefore omitted from the study. Results from the candidate data sets were logged and graphed to visualise performance across the dimensions of interest: accuracy, number of classes, model coverage (number of images per class), and training time.

Results
This section reports the results from our study into a CNN trained on a CAD surrogate model dataset. These have been reported with respect to training performance and classification accuracy.

Training Performance
Training time (Table 2) shows that higher quantities of artefacts in a given training set correlate with longer CNN training times. Time is observed to increase exponentially across data sets, with the extent of this growth limited by data set size.
A uniform transition exists from exponential to linear growth in training time for each of the surrogate data sets. The 30° set (featuring the highest number of training images) transitions to more linear time scaling earlier in the class range (200 classes), whilst this behaviour is observed later (400 classes) in the 60° and 90° data sets; suggesting the experimental set-up reaches a performance cap at between 16,000 and 4,800 images.

Classification accuracy
In contrast to training time, classification accuracy decays almost exponentially as the number of classes per data set increases, and shows a positive correlation between training image quantity and classification accuracy. Thus, accurate classification of physical artefacts can only be achieved with a small number of classes.
Top-5 performance (by which the true artefact is named in the top 5 predictions) follows the slope of Top-1 accuracy, although its gradient indicates a slower, more linear decay in classification performance. At 1,000 classes, the CNN was still able to place the matching model in the Top-5 predictions 75% of the time. This is a promising indication that CNNs could support design information search & retrieval applications. It is worth noting, for illustrative purposes, that the average car features 30,000 components.
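Top-5 accuracy counts a prediction as correct when the true class appears among the five highest-scoring outputs; a sketch with illustrative class names and scores:

```python
def top_k_correct(scores, true_class, k=5):
    """True if `true_class` ranks among the k highest-scoring classes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_class in ranked[:k]

# The true artefact need not be the single best guess to count under Top-5.
scores = {"bracket": 0.30, "hinge": 0.25, "knob": 0.20,
          "clip": 0.15, "cap": 0.06, "gear": 0.04}
top1 = top_k_correct(scores, "cap", k=1)   # False: "cap" is not rank 1
top5 = top_k_correct(scores, "cap", k=5)   # True: "cap" is rank 5
```

Averaging this check over a test set gives the Top-5 categorical accuracy reported above.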

Discussion and future work
This study has explored the performance properties of CNNs trained on surrogate models for future application to Engineering Design. Particularly, this work has shown the potential of adapting existing 'off-the-shelf' CNN architectures to support Design activities. This is of significance as it demonstrates that engineering organisations can use existing, general purpose AI/ML architectures rather than requiring expert consultancies to develop bespoke solutions.
Results have shown that CNNs could be applied to automated twinning between digital and physical assets if the number of artefacts to be classified remains low (< 10). However, CNN performance can quickly deteriorate; this should therefore be considered when selecting suitable applications for a CNN.
Top-5 results, however, show stronger performance, and suggest potential for CNN use as a rapid Search & Retrieval tool for design information. Implementation could occur in a Search & Retrieval tool that does not require the bespoke multi-faceted search strategies typical in engineering, which arise from the challenges of describing physical artefacts, their shape, and form. This could be significant in democratising and speeding up information search processes, as design engineers would be a mere photograph away from useful search results that take them from the physical to the digital domain. With these promising results, future work could investigate the development of bespoke CNN architectures for the activities described. In particular, it would be interesting to apply Information Search & Retrieval's F-score metric to the CNN and compare it to existing engineering design search strategies. Optimisation in preparing and training the CNN is also an area to explore, ensuring the development of computationally efficient and sustainable architectures that support design activities. This would consider the number of renders required per artefact, the development of a suitable scene with appropriate lighting to accentuate artefact features, and the application of multiple scenes.
Further work into architecture performance drop-off could yield some interesting insights into how CNNs confuse artefacts, with confusion matrices providing a means to explore this. Is it the case that the confusion matrix of a CNN would be comparable to a similarity matrix produced by humans? This paper has shown that there remains a wealth of research to be done to apply Machine Learning in the domain of Design, and further assess how it supports, or even hinders design processes.

Conclusion
Where prior work has shown the potential of CNNs to support Engineering Design activities, this paper has taken the next step in this journey by examining how CNNs scale with increasing class sizes of surrogate CAD models. The results have important implications in determining whether CNNs can be deployed for twinning and/or information search & retrieval activities.
The results show that existing 'off-the-shelf' CNN architectures could be re-trained and successfully deployed to twin between physical and digital domains if the number of models is low (< 10). The results also demonstrate the potential for CNNs to support Information Search & Retrieval activities, with the CNN able to return a correct match within the Top-5 across 1,000 model classes. This creates the opportunity to use a single photograph to effectively retrieve virtual models of physical artefacts from large corpora.
Together, these results demonstrate the viability of CNN use in a design context, the effectiveness of the novel surrogate model training approach, and scope for future opportunities that may be realised.