Introduction

The manufacturing sector is facing unprecedented challenges driven by several factors: escalating volumes of tools and raw materials, complex inventory management, high work-in-progress counts, supply chain disruptions, and the need for skilled and proficient human interaction. These factors not only decelerate production rates but also contribute to an uptick in labor costs, putting pressure on manufacturers to devise innovative solutions to maintain competitiveness. To mitigate these challenges, manufacturers are actively pursuing the implementation of digital transformation practices across their diverse operations to reduce labor costs while maintaining production standards (Zhang et al., 2022). To ensure production sustainability, manufacturers must prioritize the adoption of digital transformation tools and technologies to automate and improve business processes, supply chain efficiency, and agility (Oztemel, 2020a, 2020b). Digital transformation is revolutionizing the manufacturing industry by integrating connected components within the factory setting. This integration facilitates flexible production (Dey et al., 2021), remote manufacturing orders (Xia et al., 2021), real-time monitoring (Li et al., 2022), predictive/prescriptive maintenance (Him et al., 2020), and automated inspection (Sacco, 2019), among many other advancements. This shift towards digital transformation is driven by the need for increased agility, facilitated by recent advancements in cyber infrastructure such as microdevices connected through the Internet of Things (IoT), cloud storage, data analytics, and Artificial Intelligence (AI).

The adoption of these advancements has made digital transformation the cornerstone of product and process development, allowing manufacturers to deliver effective digital experiences that meet the evolving needs of customers for both standard and customized products. A study by the General Electric digital transformation team showed that 87% of business leaders consider digital transformation a top priority (Yousif, 2021). These sought-after digital transformation tools are designed to handle large amounts of data, enabling virtual product and process planning from the modeling and simulation phases through to real-time fabrication.

Furthermore, digital transformation applications can be employed to accurately predict Key Performance Indicators (KPIs) such as reliability, performance, flexibility, and maintainability in manufacturing facilities. These digital models are supported by data gathered from integrated sensors, which is then utilized to predict system performance and failures, evaluate different scenarios, enhance efficiency, and reduce energy consumption and cost (Gautam Dutta, 2021). The emergence of digital transformation technologies allows the use of raw data from various processes, including simulation, fabrication, assembly, testing, fault detection, supply chain, and cybersecurity, in the creation of smart products. This comprehensive data utilization not only improves product design and lifecycle but also bolsters overall manufacturing efficiency, as presented in Fig. 1. In engineering, the use of digital transformation tools like computer-aided design/manufacturing (CAD/CAM) software, finite element analysis (FEA), and computational fluid dynamics (CFD) simulations is critical for virtual product and process planning. State-of-the-art digital transformation practices involve the convergence of these engineering tools with computational techniques, creating a synergistic system for improved efficiency. Simultaneously, in computer science, machine learning (ML) algorithms and big data analytics support digital transformation by enabling predictive/prescriptive maintenance and real-time monitoring of manufacturing processes. The fusion of these fields culminates in the digital twinning of the entire process, bridging the gap between physical and digital spaces in manufacturing. The ultimate goal of these digital transformation applications is to convert an automated manufacturing foundation into a high-efficiency autonomous facility.

Fig. 1 Digital Transformation Process

While the adoption of digital transformation technologies is a top priority among manufacturers, their integration is hindered by several barriers and challenges, including heterogeneous company structures, lack of return on investment and visibility, and a shortage of a skilled and qualified labor force (Matthew, 2021). Digital transformation transcends the mere integration of tools within the manufacturing landscape. It involves a multifaceted array of processes that extend beyond the production lines and productivity optimization. As such, embracing digital transformation requires a comprehensive approach. This approach must leverage new technologies and techniques such as cloud computing, machine learning, AI, IoT, and mixed reality. In addition, it is important to note that leveraging this digital revolution does not diminish the role of human factors in a workflow but rather emphasizes and enhances them (Yousif, 2021). Manufacturing operations function as a Complex Adaptive System (CAS), characterized by multiple interacting agents that adapt and evolve in response to changes in their environment (Monostori, 2008). Manufacturing operations are considered a CAS due to their intricate components—machines, materials, and humans—that collaborate to produce the final products. Digital transformation represents a paradigm shift in the way individuals and organizations perceive technical challenges. This new perspective combines systemic problem-solving strategies, cost reduction, product creation, and decentralization to approach problems with innovative methods, mindsets, and an intense drive for improvement.

Implementing digital transformation promises reductions in cost and time while enhancing overall quality. An effective digital transformation implementation extends beyond merely employing state-of-the-art technologies; it also necessitates the training and education of all users. Essential key technologies for a successful digital transformation program include cyber-physical systems, IoT, digital twins, virtual and augmented reality, cloud computing, and simulations. Despite these advancements, five primary challenges arise during implementation: dealing with legacy systems, control, integration, communication, and cybersecurity. Once these challenges are resolved, digital transformation enables event understanding, ultimately aiming to achieve autonomous machine corrections and laying the foundation for a holistic smart factory.

In the past, manufacturers responded to customers' increasing demands for a broader variety of products, better service, and shorter delivery times by implementing strategies such as 5S (a workplace organization method) and Lean Six Sigma (a process improvement methodology), and by expanding their labor force and procuring more production lines and machinery. However, these solutions are now confronted with industry-wide labor shortages in the United States, and the desire for business expansion is often hampered by space constraints within manufacturing facilities. With limited resources such as skilled labor and land space, manufacturers are increasingly pushed to innovate. To meet these evolving demands, manufacturers are applying digital transformation initiatives, such as automation and data analysis, to their production and final assembly systems. The production of diverse, complex products introduces a higher probability of fault occurrence. To minimize fault occurrence and production line stoppages, it is crucial for production lines to become more flexible by transitioning from semi-automated and automated systems to fully autonomous ones. One promising strategy to achieve this flexibility involves integrating computer vision algorithms, enabling production lines to avoid immediate stoppages and maintain autonomy.

Starting with machine vision for automated visual inspection, these algorithms can perform fault detection, object classification, and defect localization/segmentation, paving the way for autonomous machine correction. As their capabilities grow, they can be extended to other applications within the manufacturing facility, such as safety and packaging. These computer vision applications contribute to an approach of 'event understanding,' assisting in decision-making within production systems and paving the way towards a holistic autonomous event-understanding approach.

The paper is structured into five sections: Section two provides a literature review on digital transformation processes and products in manufacturing, discussing capabilities such as automated inspection, fault detection, and deep learning implementations; it also delves into the algorithms behind computer vision systems and the goals of creating more cost-effective digital transformation tools. Section three presents the gap assessment within the digital twin application and outlines the components of the proposed systems. Section four presents the digital twin application's results, capabilities, and the impacts it can have on a manufacturing facility. Section five presents the conclusion, future work, and acknowledgements.

Literature review

This literature review section starts with an examination of the various products resulting from the digital transformation process, namely digital models, digital shadows, and digital twins. It then elucidates the reasons behind the shift from manual to automated inspection, discussing the significance of this transition for contemporary manufacturing processes. A comparative analysis is undertaken between automated inspection and Automated Visual Inspection (AVI), emphasizing their strengths in detecting faults and their potential role in enhancing operational efficiency.

Furthermore, the section presents a comprehensive overview of the diverse applications of AVI across multiple industrial sectors, supplementing this discussion with an exploration of previous case studies to elucidate its real-world implementations and efficacy.

The latter part of the paper introduces a cost-effective, open-source, and accurate digital twin application designed for conducting AVI. This proposed solution is specifically aimed at fault detection and autonomous machine correction, offering a potential pathway to overcome the prevalent challenges in implementing digital transformation as identified in the literature.

The intent is to underscore the value of AVI in driving the future of manufacturing processes, while also mitigating the complexities associated with the adoption of digital transformation strategies in the manufacturing sector.

Differentiating digital models, shadows, and twins: defining products of digital transformation

In the discourse of digital transformation, terms such as digital models, digital transformation, and digital twins are often used interchangeably, leading to confusion. It is important to delineate that digital transformation represents a process applied to various components—be they machines, processes, or systems. The desired outcome of this application determines which product—a digital model, digital shadow, or digital twin—is created as a result of the digital transformation process (Kirkpatrick, 2022). In simpler terms, the process of digital transformation, based on its application and intended need, yields one of these three distinct digital representations. It is crucial to realize that not every situation necessitates the sophistication and comprehensive data integration of a digital twin; sometimes, the needs of a project can be sufficiently met with a digital model or digital shadow. These three products differentiate themselves in the level of data integration between the physical system and its digital counterpart. Each represents a unique stage of the digital transformation process, offering different levels of interaction and synchronization with the physical entity it represents. The digital model is a digital representation of the physical object that lacks any form of automatic data transfer between the digital and physical objects. Digital models might include, but are not limited to, CAD models, simulation models of planned factories, mathematical models such as programming scripts, and machine learning algorithms. It is essential to note that behavior changes in the physical system do not influence the digital model, and vice versa, as illustrated in Fig. 2. Advancing to the next level, a digital shadow provides a more dynamic interaction. It is a digital representation of a physical object with a one-way automated data flow from the physical to the digital object: any changes in the physical system are mirrored in the digital shadow, while the reverse does not occur. Finally, the most sophisticated form is the digital twin, characterized by its bidirectional automated data integration. In this case, any change in the physical system influences the state of the digital object, and vice versa (Werner Kritzinger, 2018). The relationships among these three categories are visualized in Fig. 2.

Fig. 2 Digital Transformation Products

Understanding the differences between these concepts is crucial as it shapes how data is utilized and controlled within the digital transformation process. For instance, consider a manufacturing unit: a digital model can be used for initial design and planning, capturing the blueprint of the machinery. On the other hand, a digital shadow, tracking real-time data of the machinery, can provide insights into the current operational performance and potential problems. Then there's the digital twin, which not only mirrors the real-time status but also allows for scenario testing and potential remote control, offering real-time insights and adaptability in manufacturing processes. Thus, the depth of integration chosen will align with the specific needs of the digitally transformed system.
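The distinction can be made concrete in terms of data flow. The following minimal Python sketch (illustrative only; the class and method names are our own and do not come from any specific framework) models the three levels of integration described above:

```python
class DigitalModel:
    """Static representation: no automatic data flow in either direction."""
    def __init__(self, cad_geometry):
        self.state = cad_geometry  # updated only by manual edits

class DigitalShadow(DigitalModel):
    """One-way flow: physical sensor readings update the digital state."""
    def on_sensor_update(self, reading):
        self.state = reading  # physical -> digital, automatic

class DigitalTwin(DigitalShadow):
    """Two-way flow: the digital object can also actuate the physical system."""
    def __init__(self, cad_geometry, plc_client):
        super().__init__(cad_geometry)
        self.plc = plc_client  # hypothetical handle, e.g., an OPC UA or MQTT client

    def expected_state(self):
        return "nominal"  # placeholder for a simulation or ML prediction

    def evaluate_and_correct(self):
        # Digital -> physical: push a corrective command when the mirrored
        # state deviates from the expected one.
        if self.state != self.expected_state():
            self.plc.send_command("correct")
```

Only the twin closes the loop: the shadow merely mirrors the physical system, while the twin's evaluate-and-correct step feeds decisions back to it.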

The transition to automated visual inspection in modern manufacturing

With the rapid expansion of manufacturing and mass production, traditional manual vision inspection methods struggle to keep up with emerging market demands. While manual vision inspection remains relevant in certain contexts, it is progressively overshadowed by its limitations. These include substantial time consumption, labor-intensiveness, inspector fatigue, and high costs, exacerbated by the requirement for experienced inspectors (Billie, 2019) (Schlosser et al., 2022a, 2022b). Human inspection workers should be limited to two-hour shifts, as errors may arise from eye strain, leading to weakened visual discrimination and reliability. Human inspection can also increase production cost, as some inspection tasks are time-consuming (Prieto, 2002). As industries across the United States grapple with labor shortages while aspiring to business expansion, one effective solution is reallocating inspection labor to areas necessitating more skilled operators, thereby aligning with principles of smart manufacturing. With the advent of modern manufacturing techniques that facilitate high-speed production, there is an urgent need for rapid, autonomous inspection systems. Such systems not only accommodate the fast-paced nature of modern manufacturing but also enable the inspection of complex parts with greater efficiency (Prieto, 2002) (Kuo et al., 2019). This call for efficiency and precision gave rise to automated inspection, a technology that has found broad application across various research fields. Automated inspection systems, harnessing tools such as laser sensor machines for taking measurements (Blais, 2003), not only streamline the inspection process but also elevate its accuracy and repeatability. They are capable of performing tasks such as alignment orientation before the measurement stage (Brysch et al., 2022) and identifying physical object properties such as geometric fidelity, texture, holes, color, and brightness with precision and consistency. Crucially, automated inspection systems can detect various types of defects—including cracks, size discrepancies, body defects, corrosion, depth variations, misalignments, and contrast issues—that might be overlooked in manual inspection due to human error (Wong, 2017). In light of these advantages, the shift from manual to automated inspection represents a key milestone in the digital transformation of manufacturing, improving both productivity and product quality. It exemplifies how digital transformation, through automated inspection, can overcome the human-centric limitations of traditional inspection methods, fostering a more robust and efficient manufacturing sector.

Automation has improved productivity and communication between devices. It has enhanced overall process efficiency by reducing human effort, which leads to less human error in major tasks such as manual inspection (Reem Ashima, 2021) (Sun et al., 2016). Different industry disciplines have applied automated inspection in various areas such as bridge infrastructure (Yeum & Dyke, 2015), carbon fiber composites (Christopher Sacco, 2020) (Sacco C. R., 2019) (Sacco 2018) (Meister et al., 2021), cables (Supriya, 2021) (Klinieam, 2021), machines (Hussien, 2012), oil and gas pipelines and tanks (Shukla 2013) (Shukla 2016), material quality (Kwak et al., 2000), different products in the agricultural and food industry (Tadhg Brosnan, 2002) (Misra, et al., 2020), and vehicles (Livanos, 2018) (Schäfer, 2016). In the previous decade, automated inspection relied on contact devices, requiring the part to be stationary and carefully placed. The major disadvantage of automated contact-sensor inspection, however, is that it may damage the product surface (Prieto, 2002) (Wong, 2017). In the current decade, research focus has shifted to Automated Visual Inspection (AVI), also known as machine vision (Billie F. Spencer, 2019). Automated visual inspection can be applied without physical contact with the products. This reduces the need to stop the production line and does not require precise positioning, which is beneficial for the inspection of parts fabricated from soft materials (Lin, 2019). Furthermore, the outbreak of COVID-19 has prompted significant advancements in the application of digital technologies, including computer vision, to improve safety in manufacturing. Automation and computer vision techniques have the potential to enhance safety planning, hazard management, and training efforts within the manufacturing industry, especially in the context of post-pandemic safety regulations (Khandelwal, 2020) (Guo, 2017) (Guo, 2018). Automated visual inspection is essential in the manufacturing process and on the production line, as vision inspection can detect defects that affect product appearance, helping prevent product recalls. Moreover, these product defects can have an impact on the machines, robots, and tools on the production line (Konrad, 2019) (Singh & Desai, 2022). These capabilities, together with the ability to reduce human labor, are why this paper focuses on further enhancing the capabilities of automated visual inspection.

Underlying techniques and adaptability of automated visual inspection

The rise of Automated Visual Inspection (AVI), an inspection approach employing machine vision techniques, signifies a revolution in quality control applications and production lines. Due to its versatility, AVI can be integrated into manufacturing systems of varying magnitudes, ranging from small-scale printed circuit boards (PCBs) to large-scale aircraft (Wong, 2017). This flexibility underscores the broad applicability of AVI. Techniques such as optical microscopy (Campos Rubio, 2008) (Simon Gaugel, 2016), digital photography (Schlosser et al., 2022a, 2022b) (Aminzadeh & Kurfess, 2019), and 3D scanning and thermal imaging underpin the functionality of AVI, each with its unique advantages and constraints. Despite its powerful ability for complex defect analysis and extensive use in fields requiring precise material analysis such as material science, nanotechnology, and biomedical research, optical microscopy is limited in its application for swift, large-scale inspections in manufacturing and assembly facilities due to its inherently small field of view and limited scope (Simon Gaugel, 2016) (Andrew Hrechuk, 2018). Digital photography is the preferred industry-based imaging method due to its balance between field of view and resolution, with visual inspection devices widely used in production lines to reduce costs, improve robustness, and increase yield rates (Szu-Hao Huang, 2015). Other techniques like 3D scanning and thermal imaging also complement AVI systems, adding to the robustness and versatility of automated inspection. 3D scanning offers spatial data of the inspected object, allowing the detection of defects that could be missed in 2D imaging (Khanzadeh, et al., 2018). Thermal imaging detects faults based on temperature variations, making it useful in industries such as electronics and construction (Manjeshwar et al., 2005). The selection of each method in AVI is guided by the specific inspection requirements, contributing to its adaptability and effectiveness in quality control across industries. Computational algorithms play a crucial role in reducing manufacturing costs, improving robustness, and enhancing yield rates (Szu-Hao Huang, 2015), ushering in a new era of streamlined and cost-effective manufacturing.

Moreover, while discussing AVI, the terms "machine vision" and "computer vision" are frequently encountered and used interchangeably (Sun et al., 2016). Computer vision is a broad field encompassing image processing, analysis, understanding, and sensor technology advancement, drawing from various disciplines such as mathematics, physics, computer science, and artificial intelligence. Machine vision, however, focuses on applying computer vision technology in industrial and manufacturing environments, primarily for automating processes like inspection and quality control. This multidisciplinary integration is integral to its expansive nature and continuous evolution (Davies, 2012). Given this backdrop, it becomes apparent why AVI—a practical application of machine vision—is gaining increasing prominence in the field of fault detection across various industries. By integrating computer vision techniques with manufacturing processes, AVI offers a powerful, adaptable tool for quality control, ensuring product integrity and ultimately enhancing overall operational efficiency.

AVI has seen considerable progress with advancements in diverse applications and techniques. These include image representation, which translates visual information into algorithm-friendly formats, and template matching, which specializes in locating identical or similar image portions matching predefined templates for effective object detection in consistent environments. Moreover, pattern classification algorithms have made significant strides in enhancing the AVI landscape. Techniques such as Support Vector Machines (SVM), Decision Trees, and Neural Networks have emerged as key contributors, adept at categorizing complex patterns into predefined classes, thereby handling variations in object appearances more efficiently. Collectively, these techniques and advancements form the backbone of contemporary AVI, marking significant progress in the field (Szu-Hao Huang, 2015) (Wang Peng 2021) (Zvezdan Lončarević, 2021). For instance, AVI was employed to ensure the defined quality parameters of automobile bearing seals (K. Ashwini, 2021), demonstrating its potential in the automobile industry. AVI has also proven its versatility in complex inspection scenarios, as demonstrated by its use in inspecting defects on curved surfaces, such as those found in aerospace or automotive components (Wang Peng 2021). The food packaging industry has also benefited from AVI to identify defects such as relaxation, wrinkles, and bubbles in carton packets (Yongdong 2014). In the field of infrastructure maintenance, a fully automated system for live power line inspection was developed. This system, incorporating image and signal data processing, automatically detects defects or abnormal conditions, serving as a clear demonstration of AVI's adaptability (Pagnano, 2013). The influence of AVI on manufacturing was exemplified by detecting fabrication defects in hard disk drive media mass production (Chow, 2012) and by an innovative computer vision algorithm for detecting damage in carbon fiber composites (Caggiano, 2017). Additionally, in composite material inspection, an autonomous system was developed for detecting damage around drilled holes in composites using a deep, highly efficient fully convolutional network algorithm (Ali Maghami, 2021). AVI was also integrated into a robot-based automated thermoplastic manufacturing process with inline nondestructive testing inspection to investigate the effect of welding spots (Armin Huber, 2019). These studies collectively underscore the diverse applications and continuous evolution of AVI, highlighting its indispensable role in varied industries and the potential for further advancements (Table 1).

Table 1 Takeaways and Contributions of the Literature Review

Despite these marked advancements, AVI faces significant challenges, particularly in real-time fault detection. Such obstacles extend beyond mechanical components to industrial systems such as electrical and hydraulic systems (Klinieam, 2021) (Cemenska, 2015) (Badmos et al., 2020). These systems are subjected to unpredictable environmental conditions and potential damage due to factors such as acoustic emissions, occasional shocks, poor power quality, vibrations, and supply imbalance. In response to these challenges, AVI emerges as a paramount solution, its advanced capabilities and adaptability positioning it at the vanguard of industrial problem-solving.

The remarkable growth of AVI in recent years, particularly in product defect detection, signifies its inherent ability to bolster system reliability, safety, energy efficiency, and cost-efficiency. By facilitating early fault detection and diagnosis, AVI serves as a crucial response to industrial challenges, thus reinforcing its substantial importance in the contemporary manufacturing landscape. This growth is attributed to the heightened attention from regulatory entities, such as the Food and Drug Administration (FDA), which aim to ensure consumer and patient safety (Ruiz Fernández, 2020). AVI will become a crucial aspect of industries such as pharmaceuticals, food, and medical devices, where product defects can have severe consequences for both consumers and manufacturers. The incorporation of AVI into these sectors aims to reduce product recalls, thereby curtailing potential damage to a company's reputation and moderating costs. Through the deployment of computationally efficient algorithms, AVI systems promise inspections marked by high accuracy and reliability, leading to augmented product yields and diminished manufacturing costs. The ongoing research and future advancements in AVI bode well for the continual optimization and dependability of product inspection processes.

AVI Algorithms

Evolution of AVI algorithms and manufacturing systems

AVI algorithms have undergone significant advancements to address the intricate requirements and challenges associated with inspection processes in manufacturing environments. The balance between manual feature engineering and actual performance often depends on the capabilities of the underlying algorithm. To address these challenges, deep neural networks have gained popularity, particularly Convolutional Neural Networks (CNNs). CNNs have been groundbreaking in applications related to industrial safety and monitoring (Ren, He, Girshick, & Sun, 2016). The key strength of CNNs lies in their capacity to automate object detection by leveraging patterns within large datasets. Consequently, they excel in identifying defects in structures, recognizing unsafe behavior, and even discerning precise postures of workers on-site (Ding et al., 2018) (Valero et al., 2016). Hence, they feature prominently in product defect applications.

CNN-based methods distinguish themselves with their ability to automatically extract features, eliminating the need for manual feature engineering. This allows them to be highly adaptable for diverse and dynamic environments such as industrial sites (Girshick, 2015) (Redmon & Farhadi, 2017). Building on that, Fang et al. introduced the "Improved Faster Regions with Convolutional Neural Network Features (IFaster R-CNN)" approach, which not only excels in real-time object recognition but also surpasses the performance of existing state-of-the-art methods in object detection (Fang et al., 2018).

Beyond IFaster R-CNN, several widely used deep learning methods for object detection include Mask R-CNN (He et al., 2017), Faster R-CNN (Jiang, et al., 2021), YOLO (Terven & Cordova-Esparza, 2023), and the Single Shot Multibox Detector (SSD) (Liu, et al., 2016). Each technique offers its own strengths and capabilities. For instance, while Faster R-CNN is renowned for its accuracy, it may face challenges in detecting small-scale objects due to limitations in its Region Proposal Network (RPN) (Kim et al., 2016). Addressing this, an enhanced version of Faster R-CNN has been proposed to optimize object detection accuracy across different scales by incorporating a broader range of box areas and aspect ratios (Ren et al., 2016).

Within the application scope of this paper, these deep learning mechanisms have been utilized in different product defect inspection use cases. An automated quality visual inspection mechanism was designed using a CNN to extract features (Wang et al., 2018). This model was tested on a benchmark dataset and surpassed the accuracy and detection speed of other methods. R-CNN has also been explored in the context of wood veneer defect detection, classifying and locating defects with 98.70% accuracy (Shi, Li, Zhu, Wang, & Ni, 2020). Following that evolution, implementations of Faster R-CNN can also be found in applications such as detecting defects in ground and polished faucets (Liu et al., 2021) and in wheel hub manufacturing (Sun et al., 2019).

Beyond the development of algorithms, manufacturing control systems have progressed beyond the automation standards defined within Industry 3.0. This is due to the additional capabilities required of these systems to maintain production that can meet market demands. Some of these capabilities include fault-tolerant systems (Djordjevic, et al., 2023) (Stojanovic, 2023) (Song et al., 2023) (El Kalach et al., 2023) and flexible systems (Patel et al., 2018). Technologies such as the Semantic Web and adaptive dynamic programming have been used to realize fault-tolerant systems, while multi-agent systems (Jarvis et al., 2001) and different ML algorithms (Alam et al., 2020) can be utilized to achieve flexible systems.

Despite these advances, no consensus has yet emerged on a standard framework for deploying such algorithms and technologies. As such, this paper works towards proposing a framework capable of closing the loop between the complex computational requirements of higher-level decision-making and the operational technology.

An overview of the algorithms behind AVI applications in industry

The concept of automated inspection has a rich history in manufacturing. Automated visual inspection requires a deep understanding of machine learning, which we explore in this section. The different types of learning discussed in this section are shown in Fig. 3. The first type of machine learning technique is supervised learning. Supervised learning requires a training dataset with labeled inputs and their corresponding outputs. During training, the model learns the relationship between the inputs and outputs and eventually becomes able to infer the output for new inputs based on the learned relationship. This is a widely used technique for visual inspection of products in a consistent environment. Within the context of manufacturing, supervised learning can be used for anomaly detection in laser powder-bed fusion metal additive manufacturing (Aminzadeh & Kurfess, 2019) (Mahmoudi et al., 2019) and for object detection in industrial applications (Ge, et al., 2020). Another subsection of machine learning is unsupervised learning, which attempts to detect patterns not previously outlined, with minimal human supervision. Unsupervised learning uses machine learning algorithms to cluster unlabeled data and identify hidden patterns. Within manufacturing, unsupervised image data analysis has been used for in-situ monitoring of selective laser melting (Grasso et al., 2016), and an unsupervised learning algorithm was implemented to detect surface defects on textured surfaces (Mei et al., 2018). Reinforcement learning, another branch of machine learning, also has applications in the manufacturing industry. Reinforcement learning relies on an agent learning correct decisions by accumulating the largest reward. Within this paradigm, reinforcement learning can be used for job scheduling in a smart manufacturing plant (Xia et al., 2021).

Fig. 3 Overview of ML Techniques Used in AVI
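The contrast between the supervised and unsupervised paradigms can be illustrated with a minimal sketch (our own example using scikit-learn, not the paper's implementation; the feature data is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))       # e.g., flattened image descriptors
labels = (features[:, 0] > 0).astype(int)   # "defective" vs "non-defective"

# Supervised: learns the mapping from labeled inputs to outputs, then
# infers the output for a new input.
clf = RandomForestClassifier().fit(features, labels)
prediction = clf.predict(features[:1])

# Unsupervised: groups unlabeled data to expose hidden structure,
# with no output labels provided.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(features)
```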

With AI, and specifically machine learning, experiencing such prominence, many companies have begun to create platforms and products to easily integrate learning techniques into manufacturing environments. Google Cloud's Visual Inspection AI is one such platform, with a guided UI and easy labeling of input images; it can be used for assembly or cosmetic inspection. Another platform is the IBM Maximo Application Suite: Visual Inspection Mobile. Similar to Google's product, it is simple to use and train models on, with the added feature of being able to deploy models on portable devices (iOS/iPadOS devices). Siemens and Capgemini have also released their own tool, Inspection AI for Manufacturing, which runs on the MindSphere platform. These products from major companies exemplify the willingness to integrate visual inspection in the manufacturing industry.

Literature assessment

Digital transformation's implementation in manufacturing often poses a daunting challenge, largely attributed to its technologically intricate nature. A startling revelation from a study uncovered that, out of seventy-five manufacturing firms that ventured into digital transformation, a scant 5% managed to garner significant business outcomes (Marco Lansiti, 2016). Despite an extensive average implementation period of three years, the results did not meet expectations. Several factors contributed to this limited success rate, including challenges in processing and storing large quantities of data and the high costs associated with accessing digital transformation tools. Digital transformation necessitates the use of advanced software and sensors, making it financially untenable for many firms. To overcome this financial barrier, the authors developed the proposed application using cost-effective, open-source alternatives to encourage companies that are looking to implement digital transformation without a high start-up cost.

Previous research has investigated fault detection during the manufacturing process using machine vision, statistical models, and machine learning algorithms that identify different faults using various historical data sets (Dutta, 2018) (Saidy, 2020) (Zou, 2000) (Lo, 2019). While real-time fault detection is an essential component, it is only an initial step in the solution process. The true challenge resides in the ability to limit the impact of these faults on the manufacturing process, which can cause significant disruptions lasting from several hours to days, even with effective fault identification (Klinieam, 2021). To tackle this issue, advanced computer vision algorithms for fault detection have been integrated into digital transformation initiatives. This enhances the system's diagnostic capabilities and paves the way for autonomous machine correction, leading to a more efficient and effective manufacturing process.

Proposed model and case study

Gap assessment

This paper presents a case study of a digital twin application designed to optimize an assembly line's quality control process by leveraging computer vision technology. The proposed application aims to detect assembly faults in the end-product—a four-part model rocket—and trigger autonomous corrections. The application is powered by multiple artificial neural network algorithms trained for image classification and segmentation, acting as a quality officer overseeing the assembly process.

Building upon the conceptual distinctions between digital models, shadows, and twins as discussed in the literature review, the methodology leverages these insights to implement a digital twin for optimizing the quality control process in an assembly line. The proposed application was implemented in the Future Factories Laboratory at the University of South Carolina's McNair Aerospace Research Center. This lab houses a fully automated assembly line with five industrial robotic arms controlled through a single Programmable Logic Controller (PLC). The assembly line also has an industrial edge device connected to the PLC to send and receive data to cloud platforms. More details about this manufacturing system can be found in (Harik, et al., 2024).

The end-product of our assembly line is a model rocket composed of four distinct components: a rocket base, two body parts, and the rocket tip. The purpose of the framework introduced in this paper is to inspect the end-product, ensuring that the assembly of these components is flawless. If a fault is detected, the system will intervene and correct the error, thus minimizing the need for human interaction and ensuring an uninterrupted manufacturing operation. To demonstrate the practical application of a digital twin, a framework is implemented at the Future Factories Laboratory that exemplifies the automated bidirectional data flow and autonomous adaptability essential to such systems. A visual representation of this framework is depicted in Fig. 4, which details the operational workflow and highlights how real-time data flow, capture, and analysis inform and trigger immediate corrective actions within the manufacturing process, exemplifying a true digital twin in action. The application's architecture facilitates a two-way automated flow of data between the physical assembly line and its digital counterpart. This interactivity is pivotal, enabling the system to autonomously adapt and execute corrections in real-time based on the data captured and analyzed through computer vision technology. Such adaptability distinguishes the proposed application as a digital twin application, transcending the capabilities of digital models and shadows by allowing for an iterative feedback loop between the digital and physical domains.

Fig. 4 Operational Workflow of the Digital Twin in Quality Control

The operational dynamics of our digital twin are predicated on the seamless integration of real-time data capture, analysis, and action. Upon identifying a discrepancy or fault within the physical system, the cameras transmit data to the digital twin, which then autonomously evaluates whether adjustments to the production process are necessary. This process exemplifies the bidirectional data flow critical to digital twin functionality, enabling the system to not only mirror the physical assembly line but also influence its operations based on digital insights.

The essence of our digital twin's real-time synchronization lies in its capacity for immediate response to physical system changes. This responsiveness is ensured through a protocol where the PLC, upon executing an assembly command, engages in a two-way communication with the digital twin via the edge device. This mechanism allows the system to determine, within moments, the necessity of corrective actions, thereby minimizing downtime and maintaining operational efficiency. The introduction of an IIoT alarm management system further refines this process by setting a threshold for autonomous decision-making based on the accuracy of the digital twin's predictions, thereby balancing automation with the need for human oversight when necessary.
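The thresholding logic described above can be sketched as follows (a minimal illustration; the function names, threshold value, and messaging interfaces are our own assumptions, not the actual Future Factories implementation):

```python
CONFIDENCE_THRESHOLD = 0.90  # illustrative value; tuned per deployment

def handle_inspection(prediction_label, confidence, plc, alarm_system):
    """Decide between autonomous correction and human escalation."""
    if prediction_label == "defective":
        if confidence >= CONFIDENCE_THRESHOLD:
            # High-confidence fault: trigger the corrective paths autonomously.
            plc.send_command("run_correction_paths")
        else:
            # Low-confidence prediction: raise an IIoT alarm for human review.
            alarm_system.raise_alarm(
                message="Low-confidence defect prediction; operator review required",
                confidence=confidence,
            )
    else:
        plc.send_command("continue_assembly")
```

The single threshold is the balancing point between automation and human oversight: predictions above it act on the physical system directly, while those below it defer to an operator.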

Through its sophisticated architecture and operational dynamics, the application not only meets but exemplifies the criteria for a digital twin as defined in the digital transformation discourse. Unlike digital models and shadows, our system's ability to act autonomously, adapt in real-time, and directly influence the physical manufacturing process affirms its classification as a digital twin. This implementation underscores the practical application of digital twin technology in enhancing manufacturing processes, providing a tangible example of how theoretical concepts from the literature are translated into operational reality.

Workflow stages of the proposed digital twin application

The digital twin operates in three stages, as illustrated in Fig. 5. The system initiates when the edge device sends a data tag stating that the last step in the assembly process has been completed. The servers access an RTSP stream of the assembly cell—captured by the overhead cameras—where an object detection model validates that there is a completed assembly, as illustrated in Fig. 6. The first stage utilizes a convolutional neural network (CNN) performing image classification with an accuracy of 98%. This classification identifies whether the model rocket is correctly assembled or has faulty components.

Fig. 5 Proposed Model Flowchart

Fig. 6 Object Detection Model Inferences
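As an illustration of this stage-one pipeline, the following minimal Python sketch grabs a frame from the assembly-cell camera and classifies it (the model path, class names, and RTSP handling are illustrative assumptions, not the system's actual identifiers):

```python
import cv2
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("rocket_classifier.h5")  # hypothetical path
CLASS_NAMES = ["correct_assembly", "faulty_assembly"]       # simplified labels

def classify_latest_frame(rtsp_url):
    """Grab one frame from the assembly-cell camera and classify the rocket."""
    cap = cv2.VideoCapture(rtsp_url)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError("Could not read frame from RTSP stream")
    # Preprocess to the network's expected input: 224x224 RGB, scaled to [0, 1].
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (224, 224)).astype(np.float32) / 255.0
    probs = model.predict(resized[np.newaxis, ...])[0]
    return CLASS_NAMES[int(np.argmax(probs))], float(np.max(probs))
```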

If the rocket is properly constructed, it moves forward in the assembly line without any further changes. Conversely, if a faulty component is detected, the system proceeds to the second stage, where an image segmentation algorithm pinpoints the specific location of the faulty component within the rocket and conveys this information to the Programmable Logic Controller (PLC) through the edge device.

The final stage involves the use of assembly robotic arms to rectify the assembly error. These arms disassemble until they reach the faulty part of the rocket and replace the faulty component(s) with the correct ones. The goal is to ensure that this adjustment process occurs in real-time, minimizing disruptions to the manufacturing process.

It is worth noting that the 98% accuracy of the image classification algorithm was achieved through extensive training and fine-tuning of the model. Our choice of neural network algorithms for both image classification and segmentation was guided by their proven effectiveness in previous studies, hence their adoption for this case study.

Dataset and algorithm for stage one: object detection and image classification

In the first stage of the digital twin application, we employed a Convolutional Neural Network (CNN) trained on a robust, multiclass dataset comprising seven distinct classes. These classes were formulated based on varying color configurations of the assembled rockets, encapsulating the potential diversity of the assembly line output.

Image data was systematically collected from overhead cameras strategically positioned within the assembly cell. We utilized a Python script integrated with OpenCV, an open-source computer vision library, optimizing our data acquisition to closely mimic the real-time operational conditions of the assembly line. This procedure fortified the authenticity and validity of our training data.
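A data-collection script of the kind described above might look like the following (a hedged sketch; the camera index, capture pacing, and file layout are assumptions rather than the actual acquisition setup):

```python
import os
import time
import cv2

def collect_images(camera_index, class_name, out_dir="dataset", n_images=100):
    """Capture training images from an overhead camera, one class at a time."""
    os.makedirs(os.path.join(out_dir, class_name), exist_ok=True)
    cap = cv2.VideoCapture(camera_index)
    for i in range(n_images):
        ok, frame = cap.read()
        if not ok:
            break
        path = os.path.join(out_dir, class_name, f"{class_name}_{i:04d}.png")
        cv2.imwrite(path, frame)
        time.sleep(0.5)  # pace captures to mimic assembly-line timing
    cap.release()
```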

To maintain the integrity of our model and ensure its effective performance, we subjected the images to an extensive preprocessing pipeline. We designed this pipeline with numerous precautions, including a two-tiered shuffling process. This operation was essential to avert potential biases and imbalance during the partitioning of the data into training, validation, and testing sets. This meticulous shuffling process safeguarded the randomness of our data distribution, thus curtailing overfitting and augmenting the generalizability of our trained model.

We further enhanced our preprocessing steps by reducing the resolution of each image from (450 × 600, 3) to (224 × 224, 3). This adjustment condensed model training time and expedited computational speed, aligning with our goal to implement an autonomous, real-time decision-making application. An in-house application was developed for labeling the dataset, catering specifically to our needs. It singled out the object of interest within each image, ignoring extraneous elements, and converted the image into an XML formatted file. This approach preserved data confidentiality and cut back on the need for paid services, reducing the operational costs for manufacturers.
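The shuffling, partitioning, and downscaling steps described above can be sketched as follows (an illustration under stated assumptions: the split ratios, seed, and two-tier interpretation of the shuffle are ours):

```python
import random
import cv2

def preprocess_and_split(image_paths, train=0.7, val=0.15, seed=42):
    """Two-tiered shuffle, train/val/test split, and downscaling to 224x224."""
    rng = random.Random(seed)
    rng.shuffle(image_paths)   # first shuffle: across the whole dataset
    n = len(image_paths)
    splits = {
        "train": image_paths[: int(n * train)],
        "val": image_paths[int(n * train): int(n * (train + val))],
        "test": image_paths[int(n * (train + val)):],
    }
    for subset in splits.values():
        rng.shuffle(subset)    # second shuffle: within each partition
    # Downscale from the original (450x600, 3) capture to (224x224, 3).
    resize = lambda p: cv2.resize(cv2.imread(p), (224, 224))
    return {name: [resize(p) for p in paths] for name, paths in splits.items()}
```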

Due to the insufficient volume of real-time images and the difficulty in obtaining assembly defect data during the manufacturing process for model training, the authors devised an image augmentation tool to provide a more efficient method of data gathering that is less time-consuming. This tool expanded the dataset by manipulating image properties such as noise, blur, and brightness, or a blend thereof, artificially increasing the number of images for each class. This not only broadened the dataset but also diversified the types of scenarios that the model could learn from, enhancing the model’s learning capacity and, in turn, its accuracy.
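The augmentation tool's transformations can be sketched as follows (illustrative parameter values; the actual noise levels and brightness offsets used in the tool are not specified in the text):

```python
import cv2
import numpy as np

def augment(image, rng=np.random.default_rng()):
    """Generate augmented variants via noise, blur, brightness, and a blend."""
    noisy = np.clip(
        image.astype(np.int16) + rng.normal(0, 15, image.shape).astype(np.int16),
        0, 255,
    ).astype(np.uint8)
    blurred = cv2.GaussianBlur(image, (5, 5), 0)
    brighter = cv2.convertScaleAbs(image, alpha=1.0, beta=40)  # brightness offset
    combined = cv2.GaussianBlur(brighter, (5, 5), 0)           # a blended variant
    return [noisy, blurred, brighter, combined]
```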

A Convolutional Neural Network (CNN) was specifically chosen for its exceptional capability to extract salient features from low-resolution images. The chosen CNN model, shown in Fig. 7, comprises three convolutional layers and one fully connected dense layer, amounting to 11,014,056 trainable parameters. The convolutional layers efficiently reduce the dimensionality of the original images, thereby decreasing computational complexity. Notably, the RGB values of each pixel were normalized from their original 0–255 range to a 0–1 scale during preprocessing. This normalization is vital, as CNNs are susceptible to large deviations in data. In essence, each step of our process, from data acquisition to preprocessing, was designed and implemented to construct a high-accuracy, reliable image classification model.

Fig. 7 Architecture of the CNN
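A Keras sketch of an architecture of this shape is shown below. The filter counts, kernel sizes, and pooling choices are assumptions picked to match the described structure (three convolutional layers, one dense layer, 224×224×3 input, seven classes); the exact configuration yielding the reported 11,014,056 parameters may differ:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(224, 224, 3)),      # RGB values pre-normalized to [0, 1]
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),   # fully connected dense layer
    layers.Dense(7, activation="softmax"),  # seven assembly classes
])
# The paper reports binary cross-entropy as its training loss; this
# seven-class sketch uses the categorical equivalent instead.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```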

Dataset and algorithm for stage two: image segmentation

The second operational stage of the proposed system is responsible for identifying the defective component(s) within the assembly after the first stage classifies it as defective. This is achieved through the implementation of a Mask Region-based Convolutional Neural Network (Mask R-CNN), a high-performance image segmentation model that detects, delineates, and classifies individual assembly components as functional or defective.

The Mask R-CNN was selected for its capabilities and efficiency in image segmentation tasks, its proven robustness in handling complex images and objects, and its distinctive ability to output pixel-wise masks of objects. It employs a Residual Neural Network (ResNet) as its underlying architecture. This backbone network leverages batch normalization and the rectified linear unit (ReLU) activation function to accelerate the training process and increase the stability of the model. The Mask R-CNN accepts an image in tensor format as input. In the first phase, the Region Proposal Network (RPN) proposes regions of interest (RoI) that could potentially contain an object and passes them on to the subsequent phase. In the second phase, the Mask R-CNN concurrently classifies the proposed RoIs, fine-tunes the bounding box coordinates, and generates a binary mask encircling the rocket part, as shown in Fig. 8. A successfully trained Mask R-CNN yields three outputs for each detected object: a classification, a bounding box, and a mask that highlights the object within the bounding box, as illustrated in Fig. 9.

Fig. 8 Mask R-CNN Architecture

Fig. 9 Input Image (left), Mask R-CNN Output (right)
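For illustration, the three-output structure is visible in a generic Mask R-CNN inference call. The sketch below uses torchvision's pretrained implementation as a stand-in; the work itself trained its model with the TensorFlow Object Detection API, and the placeholder tensor merely mimics an assembly image:

```python
import torch
import torchvision

# Pretrained Mask R-CNN with a ResNet-50 FPN backbone (illustrative stand-in
# for the model trained on the rocket-assembly dataset).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 600, 450)  # placeholder tensor for an assembly image
with torch.no_grad():
    outputs = model([image])[0]

# Each detected object yields the three outputs described above:
labels = outputs["labels"]   # classification per detection
boxes = outputs["boxes"]     # bounding-box coordinates (x1, y1, x2, y2)
masks = outputs["masks"]     # pixel-wise masks, one per detection
scores = outputs["scores"]   # confidence of each detection
```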

The training of the Mask R-CNN was facilitated using the TensorFlow Object Detection API, utilizing the GPU acceleration services of Google Colaboratory. A comprehensive dataset comprising 500 images of defective rocket assemblies was curated, labeled, and formatted into the Common Objects in Context (COCO) format using the same in-house annotation tool as outlined in Sect. 3.2. To enhance the dataset's richness, 193 images out of the total were subjected to augmentation techniques including the addition of noise, blur, and brightness modification, either singularly or in combination. This process aided in broadening the model's exposure to variations, improving its robustness. The dataset images were distributed across different defective categories: 275 images had defective body parts, 113 had a defective base piece, and 112 had a defective tip. Additionally, to evaluate the model's performance effectively, the dataset was split into training and testing subsets, with 80% and 20% of images respectively.

This in-depth discussion presents the rationale, process, and intricacies behind the second-stage dataset and image segmentation algorithm, providing insights into our approach and the technical considerations that influenced our research decisions.

Performance evaluation indicator selection

The selection of accuracy and binary cross entropy loss as performance evaluation indicators for the CNN aligns with standard practices in evaluating classification tasks. Accuracy provides a straightforward measure of the model’s overall correctness in classifying images, while binary cross-entropy loss offers insights into the model’s optimization process and convergence behavior.

Similarly for the Mask R-CNN model, monitoring the total loss serves as a comprehensive measure of the model’s performance, capturing both the localization accuracy and segmentation quality. Total loss in the case of the Mask R-CNN, while not as straightforward as accuracy, provides the model trainer with a holistic view of the model’s effectiveness in instance segmentation tasks. This is done through the aggregation of classification, bounding box regression, and mask segmentation losses.

Furthermore, the use of confusion matrices complements these quantitative metrics by providing detailed insights into the model’s performance across all labels the segmentation model generates, offering an easier-to-digest view of model performance than quantitative metrics such as mean average precision (mAP) and mean average recall (mAR) in situations such as this, where there are numerous classes in the data.
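Building such a confusion matrix is straightforward with scikit-learn (a minimal sketch; the label names and the placeholder prediction lists are illustrative, not results from our test set):

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# y_true / y_pred would hold the per-component ground-truth and predicted
# labels gathered from the segmentation model's test set.
y_true = ["base_ok", "body_defective", "tip_ok", "body_ok"]
y_pred = ["base_ok", "body_defective", "tip_defective", "body_ok"]

labels = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=labels)
ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.show()
```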

Stage three: machine correction

The third operational stage in our proposed system primarily deals with machine correction of the assembled rockets. This stage is triggered once the convolutional neural network (CNN) classifies a fully assembled rocket as defective. If such a classification occurs, the Mask R-CNN comes into play to identify and classify the individual components of the assembly as either functional or defective. Figure 10 visually represents the subsequent corrective procedures. Every step required for the replacement of a defective component, such as piece removal and defect discarding, is defined as a separate path within Siemens Process Simulate, the robotic arms' programming software. This modular path creation strategy anticipates the potential increase in unique combinations of assemblies as more color options are added to the rockets. Therefore, despite the proliferation of assembly combinations, the number of corrective paths remains manageable and efficient.

Fig. 10 Proposed Application Workflow

As outlined in Sect. 3.4, each detected object within an image yields three outputs from the image segmentation model: a classification, a bounding box, and a mask. Each of these outputs plays a crucial role in discerning the corrective measures for the assembly. The classification validates the type and order of the pieces using the bounding boxes and masks to ensure appropriate alignment and sequence. Once verified, the defective parts, identified through the object classifications, are systematically replaced.

The corrective procedure is an algorithmic process which proceeds in three steps: removal of all pieces situated above the defective part, replacement of the defective part, and reassembly of the rocket. The output of this algorithm is a set of path IDs that are stored in the robot controllers and scheduled by the digital twin and the edge device through a data pipeline. Consequently, the defective assemblies undergo real-time automated correction. Figure 10 illustrates the decision-making and workflow processes that consolidate the independent models into the proposed application.
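The three-step corrective algorithm can be sketched as follows (the function and path-ID naming are illustrative; the actual paths reside in Siemens Process Simulate and the robot controllers):

```python
def plan_correction(components, defective_index):
    """Return the ordered path IDs to fix the assembly.

    components: part names ordered bottom-to-top,
                e.g., ["base", "body1", "body2", "tip"].
    defective_index: position of the faulty part reported by the Mask R-CNN.
    """
    paths = []
    # Step 1: remove every piece stacked above the defective part (top-down).
    for i in range(len(components) - 1, defective_index, -1):
        paths.append(f"remove_{components[i]}")
    # Step 2: discard the defective part and place a replacement.
    paths.append(f"replace_{components[defective_index]}")
    # Step 3: reassemble the removed pieces in their original order.
    for i in range(defective_index + 1, len(components)):
        paths.append(f"place_{components[i]}")
    return paths

# Example: a defective lower body part ("body1") in a four-part rocket.
print(plan_correction(["base", "body1", "body2", "tip"], 1))
# ['remove_tip', 'remove_body2', 'replace_body1', 'place_body2', 'place_tip']
```

The returned path IDs correspond to pre-programmed motions scheduled through the digital twin and the edge device, which is what keeps the number of corrective paths manageable as assembly combinations grow.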

The present approach for automated correction relies on predefined paths for defect rectification. While this approach ensures automation, it lacks flexibility and efficiency in dealing with unpredicted defects. To address this issue, we are working on integrating an AI processing layer that utilizes Human in the Loop (HITL) and dynamic grasping algorithms. The implementation of HITL would convert the system from being merely automated to a more flexible, autonomous foundation capable of efficiently dealing with new and unpredicted complications in the assembly line. HITL algorithms come into play when the computer vision system is unable to resolve a task independently, requiring human intervention for novel path creation, retraining, and testing purposes.

Advantages of the proposed method

While advancements in image segmentation algorithms offer potentially faster processing times and higher accuracies, the practicality of implementing such methods remains a challenge due to the scarcity of accessible open-source resources and the complexity of integration. Mask R-CNN, with its robust library support and straightforward implementation process, stands out as not only a technically viable option but also a pragmatically superior choice. Given these considerations, Mask R-CNN was selected for its proven effectiveness in balancing the competing demands of detection accuracy, computational efficiency, and resource optimization. Its deployment in this investigation is a testament to the method's adaptability and performance in challenging industrial environments. In comparison, the only other algorithm that parallels the capabilities of Mask R-CNN is YOLOv8, a more recent development released in 2023. It is important to note that YOLOv8 emerged after our model had already been trained and implemented in the facility. YOLOv8 is recognized for its speed, owing to its single-pass approach to object detection. However, this streamlined process may lead to a trade-off in accuracy compared to methods like Mask R-CNN. YOLOv8's approach can be advantageous in scenarios where speed is a critical factor, but for applications where precision is paramount, as in our case, the detailed segmentation capability of Mask R-CNN offers a more suitable solution. This distinction underscores the importance of choosing an algorithm that aligns with specific operational needs, balancing speed and accuracy according to the task at hand.

The proposed method utilizes an initial Convolutional Neural Network (CNN) for image classification, streamlining the inspection process by swiftly clearing non-defective rockets through holistic image assessment. This approach accelerates the review of non-defective assemblies, enhancing efficiency. A subsequent layer of inspection involves the Mask R-CNN, which transcends mere error detection by performing pixel-level segmentation and enabling detailed, physics-based root cause analysis. This dual-model strategy aligns with the ultimate goal of the system to be able to detect and identify various defects such as cracks or deformations, making it suitable for deployment across diverse manufacturing applications.

The initial stage of model training and testing, albeit formidable, pales in comparison to the intricacies of deployment and real-time application monitoring. Herein, the methodologies of Computer Vision algorithms are crucial as they transform visual data into a quantitative array, paving the way for the subsequent phases of model training—where the algorithm is tuned to the expected scenarios—and model testing—where the precision and confidence of the algorithm's predictions are scrutinized. Notwithstanding, the true test of this system’s mettle lies in its performance amid anomalous and unprecedented events during the execution and deployment phase. This is the juncture where two primary deficiencies of the model could manifest: the encounter with scenarios beyond its training, necessitating vigilant monitoring to recalibrate actions of low predictive confidence, and the practical constraints of continuous system oversight in a bustling manufacturing environment, where any misstep could precipitate injury or disrupt production. To address these critical issues, the proposed application adopts a multifaceted approach, integrating a sophisticated array of machine learning techniques, each contributing to a composite, robust system. Instead of relying on a singular algorithmic approach, be it supervised learning or reinforcement learning, the system is empowered by an ensemble of algorithms. These are structured as sublayers, each with a distinct yet interrelated role in the decision-making hierarchy.

The foundational layer of our system employs supervised learning algorithms, well suited to pattern recognition and calibrated with a dataset reflecting a wide spectrum of operational scenarios, offering high-confidence predictions from historical data. Building upon this, an additional machine learning layer gives the system the flexibility to adjust to new and evolving conditions through real-time environmental interaction, enhancing performance while minimizing disruptions. Human-in-the-loop (HITL) interventions are integral to this architecture and are first engaged when the base layer produces a low-confidence prediction, enabling human expertise not only to correct the decision but also to enrich the training dataset. The second HITL interface connects human oversight with the adaptive machine learning layer: should the system's autonomous correction be flagged for low confidence, this interface enables human engineers to provide targeted feedback. This intervention recalibrates the system's decision-making processes, fostering a continuous learning loop that refines the model's predictive acuity over time. In this setup, human expertise complements algorithmic precision, ensuring that each layer's potential is maximized while its limitations are mitigated. Through this multi-layered approach, the model not only achieves heightened precision on known issues but also gains the resilience and adaptability to manage novel events, counterbalancing the potential for false positives and negatives.
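The sketch below illustrates, under assumed data structures and thresholds, how the two HITL interfaces could be wired: uncertain base-layer predictions are queued for human labeling, operator verdicts on flagged autonomous corrections are logged for recalibration, and corrected samples re-enter the training pool.

```python
# Conceptual sketch only; the queues, threshold, and function names are
# illustrative assumptions, not the deployed implementation.
review_queue = []      # interface 1: low-confidence base-layer predictions
feedback_log = []      # interface 2: operator verdicts on autonomous corrections
training_pool = []     # corrected samples re-enter the next training cycle

def flag_base_prediction(image, label, confidence, threshold=0.95):
    """Interface 1: uncertain base-layer outputs go to a human labeler."""
    if confidence < threshold:
        review_queue.append((image, label, confidence))

def record_correction_feedback(correction_id, operator_verdict):
    """Interface 2: engineers confirm or override flagged autonomous
    corrections; the verdicts recalibrate the adaptive layer over time."""
    feedback_log.append((correction_id, operator_verdict))

def absorb_human_label(image, corrected_label):
    """Human corrections enrich the dataset, closing the learning loop."""
    training_pool.append((image, corrected_label))
```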

Our method, based on the integration of CNN and Mask R-CNN, is theorized to surpass existing methods in several key areas. Literature suggests that similar AI-driven approaches have demonstrated superior accuracy in object detection and segmentation tasks, a cornerstone of our method's functionality. Additionally, the efficiency of our approach is accentuated by its capacity to handle diverse and complex assembly tasks with minimal human intervention, a significant advancement over more traditional, labor-intensive quality control methods.

Results

This section describes the process used for training and testing our model, considering various scenarios to assess its performance and accuracy. Our experimental setup features an overhead camera installed within the assembly cell, which constantly monitors the assembly endpoint. After each rocket assembly is completed, the camera captures an image that is instantly evaluated by our models. Based on this evaluation, the system decides whether to progress with the assembly process or initiate a corrective measure.

Turning to the specifics of model training, we first discuss the first-stage model, the Classification Algorithm. A crucial factor influencing this model's performance is the number of epochs in the training process. After experimenting with various values, the optimal count was found to be 125 epochs before the implementation of the labeling technique. With the introduction of the labeling tool, we made significant strides in training efficiency, reducing the epoch count from 125 to a mere 5 and decreasing the batch size from 16 to 5 for each epoch.

As depicted in Fig. 11, the integration of the labeling tool played a pivotal role in optimizing time management during the training and testing phases. This marked a critical step towards the evolution of our system, now capable of making real-time decisions.

Fig. 11: The Classification Algorithm Epochs Comparison Before and After Labeling

In machine learning, epochs represent the number of times the entire training dataset is processed by the learning algorithm, while batch size refers to the number of training examples utilized in one iteration. Properly setting these parameters is crucial for balancing model convergence against computational efficiency during training. With too little training, the model underperforms on accuracy and loss metrics; with too much, the model can overfit, performing well only on the training data and underperforming when deployed in the system. While the chances of overfitting can be minimized by including diverse angles, backgrounds, and positions in the dataset, and by image augmentation that introduces noise, blur, and other effects, it is important to include an early-stopping criterion in the model trainer to cut training short once improvements in accuracy and loss become negligible. In the case of our image classification algorithm, we trained the model with a batch size of 4, and training ended after 6 epochs, as shown in Fig. 12. The researchers chose a batch size of 4 rather than a more conventional 16, 32, or 64, not because of memory constraints, but because smaller batches improve the model's ability to adapt to evolving datasets, as we plan to do.
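As a hedged reconstruction of this setup, the snippet below shows a Keras training loop with a batch size of 4 and an early-stopping callback. The network architecture, patience value, and placeholder data arrays are assumptions standing in for the actual configuration and dataset.

```python
import numpy as np
from tensorflow.keras import Sequential, layers
from tensorflow.keras.callbacks import EarlyStopping

# Placeholder arrays standing in for the labeled rocket images (assumption).
x_train = np.random.rand(80, 224, 224, 3)
y_train = np.random.randint(0, 2, size=(80, 1))
x_val = np.random.rand(20, 224, 224, 3)
y_val = np.random.randint(0, 2, size=(20, 1))

# Minimal stand-in classifier; the real CNN architecture is not shown here.
model = Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),        # pass/fail output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = EarlyStopping(
    monitor="val_loss",            # stop once validation loss stops improving
    patience=2,                    # assumed patience window
    restore_best_weights=True,     # roll back to the best checkpoint
)

model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=50,                     # generous ceiling; early stopping cuts it short
    batch_size=4,                  # small batches ease adaptation to evolving data
    callbacks=[early_stop],
)
```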

Fig. 12: Image Classification Model Fit Generator

Following this process optimization, we undertook an initial experiment to gauge the performance of the proposed system. This involved 16 distinct color defect detection scenarios, leveraging a dataset of 500 representative images. The primary objective was to critically evaluate the effectiveness, precision, and real-time defect detection capabilities of our digital twin application. To facilitate this, we strategically partitioned the dataset, reserving 80% for model training and allocating the remaining 20% for testing and validation purposes. This deliberate division of data ensured a comprehensive and rigorous evaluation of the model's performance.
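A minimal sketch of this 80/20 partition, assuming the 500 images are referenced by a path list with matching labels (both invented placeholders here):

```python
from sklearn.model_selection import train_test_split

# Placeholder path list and labels standing in for the 500 collected images.
image_paths = [f"images/rocket_{i:03d}.png" for i in range(500)]
labels = [i % 2 for i in range(500)]   # invented pass/defect labels

train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels,
    test_size=0.20,                # 20% reserved for testing and validation
    stratify=labels,               # keep defect scenarios balanced across splits
    random_state=42,               # reproducible partition
)
```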

Figure 13 displays a number of test images and the image classification model's corresponding defect predictions. A threshold was defined and connected to an IoT alarm management system, which triggers alarms that require action. If the prediction confidence is 95% or higher, the system acts autonomously without disturbing or breaking the process. However, if the confidence is lower than 95%, a notification containing the picture and its analysis is sent to the responsible personnel for human intervention, as shown in Fig. 14.
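The decision rule can be summarized in a few lines, as in the sketch below; the notification function is a hypothetical stand-in for the IoT alarm transport.

```python
CONFIDENCE_THRESHOLD = 0.95

def handle_prediction(image_path: str, label: str, confidence: float) -> str:
    if confidence >= CONFIDENCE_THRESHOLD:
        return "autonomous"            # continue without interrupting the process
    payload = {"image": image_path, "label": label, "confidence": confidence}
    notify_personnel(payload)          # hypothetical IoT notification hook
    return "escalated"

def notify_personnel(payload: dict) -> None:
    # Stand-in for the IoT alarm system (e.g., a message-broker publish
    # or email alert in the real deployment).
    print(f"ALERT: human review requested -> {payload}")
```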

Fig. 13: Classification Results of Different Scenarios

Fig. 14: IoT Notification System

The image classification model achieved 100% accuracy, with loss calculated using the log loss function, as displayed in Fig. 12. The first model's performance was evaluated by plotting the training accuracy, training loss, validation accuracy, and validation loss, with binary cross-entropy (the log loss function) used to calculate loss. The training accuracy and loss were plotted on one graph and the training and validation loss on another; these plots can be seen in Fig. 11.
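For illustration, the log loss metric can be computed with scikit-learn on a toy set of labels and predicted probabilities; the values below are invented for the example.

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]                    # 1 = defective, 0 = pass
y_pred = [0.98, 0.03, 0.91, 0.99, 0.08]     # predicted P(defective)
print(log_loss(y_true, y_pred))             # ~0.05; lower means better calibration
```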

Confusion matrices were generated to visualize the accuracy of the trained models and to understand which classifications and parts the models struggle with. A confusion matrix assigns each prediction a model makes to a box representing an instance in which an object, labeled along the y-axis, was given a classification, labeled along the x-axis. Figure 15 shows a confusion matrix displaying the accuracy of the image classification model on a dataset of 500 images.
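The counts and row-wise percentages reported in such a matrix can be derived as in the following sketch, where the label arrays are small invented stand-ins for the 500-image test run.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array(["pass"] * 4 + ["defect"] * 4)       # illustrative ground truth
y_pred = np.array(["pass", "pass", "defect", "pass",   # illustrative predictions
                   "defect", "defect", "defect", "pass"])

cm = confusion_matrix(y_true, y_pred, labels=["pass", "defect"])
rates = cm / cm.sum(axis=1, keepdims=True) * 100       # row-wise percentages
print(cm)      # counts: how often each ground truth got each classification
print(rates)   # rate at which each ground truth got each classification
```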

Fig. 15: Image Classification Algorithm Confusion Matrix

In each box, there is a number and a percentage. The number (on top) corresponds to the number of times the ground truth was given that classification, and the percentage corresponds to the rate at which the ground truth was given that classification. For example, the lower-left box corresponds to a correct assembly (pass) being classified as defective; this incorrect classification occurred 5 times while testing the model on the dataset, meaning the model incorrectly classified 4.63% of passing rockets. If the image classification algorithm classifies the rocket as defective, the image segmentation algorithm identifies which piece(s) are not in place for further machine correction, as shown in Fig. 16. Figure 17, another confusion matrix, was made using predictions the image segmentation model produced on a group of 500 images, breaking down the model's performance.

Fig. 16: The Image Segmentation Algorithm Predictions Capabilities

Fig. 17: Image Segmentation (Mask R-CNN) Confusion Matrix

Each rectangle on the confusion matrix represents a component of the assembly being given a classification; a correct detection has the same classification as the ground truth. Each box shows a number, the number of times that classification was assigned to that ground truth, and a percentage, the percent of the time that ground truth was given that classification in the dataset. For example, the top-left box shows how often a "defective piece 1" (rocket tip) is classified correctly: the number inside the box (94) is the number of times a defective rocket tip was correctly classified, while the percentage (100%) is the percentage of the time that defective piece 1 was classified correctly.

For a detection to be classified as valid, the model must have a confidence above 60% in the prediction and an IoU score above 0.75. The bottom row and far-right column (labeled F/N) capture any predictions that do not meet these pre-determined criteria. The 15 detections in the F/N column are predictions that did not meet the confidence threshold to count as valid detections, and the 4 detections in the F/N row are predictions whose intersection-over-union score was too low to match them with an annotated object in the image.
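These validity criteria can be expressed directly, as in the sketch below; boxes are assumed to be axis-aligned (x1, y1, x2, y2) tuples, and the function names are illustrative.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_valid_detection(confidence: float, pred_box, gt_box) -> bool:
    # Both thresholds come from the criteria described above.
    return confidence > 0.60 and iou(pred_box, gt_box) > 0.75
```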

While the first confusion matrix is useful for understanding which parts the model identifies more or less reliably, the simplified binary confusion matrix in Fig. 18 gives a better sense of the performance of the model as a whole. This binary confusion matrix classifies each prediction the model made as a true positive, true negative, false positive, or false negative, where true positives and true negatives are correctly classified parts, and false positives and false negatives are incorrectly classified parts. The numbers of false positive and false negative detections are the same in this case because, when one piece is falsely classified, it counts as a false positive for one classification and a false negative for its true classification.

Fig. 18: Mask R-CNN Binary Confusion Matrix

The choice of training parameters for the Mask R-CNN was more limited than for the CNN, as the hardware constraints Google Colaboratory assigns to a user limited the batch size to 2 or less. The epochs for the Mask R-CNN were set to 50, but the early stopping method ended training at roughly 10,000 steps (40 epochs).

Before training the Mask R-CNN, many parameters could be tuned beyond the training parameters mentioned earlier. Some notable parameters (the IoU threshold, maximum detections per class, and maximum total detections) were adjusted to fit the qualities of the system. For instance, every rocket is made of 4 different components, so the model is capped at one detection per class and four total detections, and the IoU threshold for training was set at a commonly used 0.75. For additional parameters, such as the learning rate, the researchers opted to retain the pre-tuned settings, given the pre-trained model's established performance on its original tasks. By leveraging transfer learning with a pre-trained model, the researchers aimed to capitalize on the learned representations and avoid unnecessary adjustments that could disrupt the model's performance.
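For reference, these tuned settings can be summarized as a plain dictionary; this is a hedged summary, and the keys in the actual training pipeline configuration may differ.

```python
mask_rcnn_config = {
    "batch_size": 2,                 # ceiling imposed by Colab GPU memory
    "epochs": 50,                    # early stopping ended training near 40
    "iou_threshold": 0.75,           # commonly used matching threshold
    "max_detections_per_class": 1,   # one instance of each rocket component
    "max_total_detections": 4,       # a rocket has exactly four components
    # Learning rate and related optimizer settings were retained from the
    # pre-trained model to preserve its transfer-learned representations.
}
```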

The accuracy of the image segmentation model was calculated using the total loss method, as illustrated in Fig. 19, which plots the average total loss for each set of 100 steps. After 10,000 steps of training, the Mask R-CNN yielded a total loss of 0.1362. The total loss is a composite objective function that the machine learning model aims to minimize by fine-tuning the weights of features. As the model trains, the loss function decreases and eventually stabilizes at a near-zero positive value. To delve deeper into the mechanics of the total loss function, it is essential to understand its constituent components, which are defined in Eq. (1) (Wang et al., 2021).

Fig. 19: Mask R-CNN Total Loss

$${L}_{all}={L}_{cls}+{L}_{mask}+{L}_{box}$$
(1)

Here, \({L}_{cls}\) is the classification loss of the prediction box, \({L}_{mask}\) is the average binary cross-entropy loss, and \({L}_{box}\) is the regression loss of the prediction box. \({L}_{cls}\), \({L}_{mask}\), and \({L}_{box}\) are determined using Eqs. (2), (3), and (4).

$${L}_{cls}\left({p}_{i},{p}_{i}^{*}\right)=-lb[{p}_{i}{p}_{i}^{*}+(1-{p}_{i})(1-{p}_{i}^{*})]$$
(2)

In this equation, \({p}_{i}\) is the predicted probability that anchor point \(i\) is the target, \({p}_{i}^{*}\) is the predicted value of the real area label, and lb represents the log loss function.

$${L}_{mask}=-\frac{1}{x}{\sum }_{i}[{x}_{i}^{*}\log \left(p\left({x}_{i}\right)\right)+\left(1-{x}_{i}^{*}\right)\log \left(1-p\left({x}_{i}\right)\right)]$$
(3)

Here, \(x\) is the number of pixels, \({x}_{i}^{*}\) is the true category label of the pixel, and \(p({x}_{i})\) is the predicted probability of that pixel belonging to a certain category.

$${L}_{box}\left({t}_{i},{t}_{i}^{*}\right)=R\left({t}_{i}-{t}_{i}^{*}\right)$$
(4)

In this equation, \(R\) is the robust loss function, \({t}_{i}\) represents the parameterized coordinate vector of the predicted bounding box, and \({t}_{i}^{*}\) is the ground-truth coordinate vector.
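Putting Eqs. (1) through (4) together, a small NumPy sketch of the composite loss might look as follows, assuming a smooth-L1 form for the robust loss \(R\), which is common for box regression but is an assumption here.

```python
import numpy as np

def l_cls(p, p_star):
    """Eq. (2): binary log loss of the prediction box (lb = log base 2)."""
    return -np.log2(p * p_star + (1 - p) * (1 - p_star))

def l_mask(p, x_star):
    """Eq. (3): average binary cross-entropy over the mask pixels."""
    p, x_star = np.asarray(p), np.asarray(x_star)
    return -np.mean(x_star * np.log(p) + (1 - x_star) * np.log(1 - p))

def l_box(t, t_star):
    """Eq. (4): robust loss over box coordinates (smooth-L1 assumed for R)."""
    d = np.abs(np.asarray(t) - np.asarray(t_star))
    return np.sum(np.where(d < 1, 0.5 * d**2, d - 0.5))

def l_all(p, p_star, mask_p, mask_x, t, t_star):
    """Eq. (1): composite objective minimized during training."""
    return l_cls(p, p_star) + l_mask(mask_p, mask_x) + l_box(t, t_star)
```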

Evaluating integration and computational efficacy: advancing smart manufacturing solutions

In the culmination of this research, an extensive assessment of the solution's integration and computational efficiency within a manufacturing setting reveals promising results from both an implementation and computational perspective. The seamless incorporation of our system into the existing operational infrastructure, aligning with both physical and digital realms of the manufacturing environment, underscores full compatibility with the current hardware and software configurations, indicating a high degree of integration feasibility. This alignment has been instrumental in enhancing operational workflows without the disruptions typically associated with manual intervention processes. The scalability of our solution, evidenced through its contribution to a broader initiative aimed at virtualizing labor inspection and control, showcases its adaptability across diverse applications including safety protocols and packaging. This adaptability not only facilitates a transformative reallocation of skilled labor but also amplifies the solution's scalability. User-centric design principles have ensured that the solution is accessible and requires minimal training, thereby promoting a smooth transition for the workforce and fostering widespread acceptance.

The processing time of our system is well-tuned to meet the demands of the manufacturing environment, delivering real-time or near-real-time responses during testing phases. This efficiency is crucial for maintaining the flow of operations and for the immediate rectification of identified issues. From a computational resource standpoint, the model's training was conducted on a Google Colab notebook, which negates the need for costly computational infrastructure. Opting to label data and train the model in-house represents a significant cost reduction compared to alternatives such as purchasing software licenses or rule-based camera systems, which may not offer the same level of efficiency.

Moreover, the economic viability of the solution becomes increasingly apparent when applied across various stages and operations within the facility. This assessment will extend beyond mere financial metrics, encompassing both tangible and intangible benefits introduced by computer vision, such as time savings and quality improvements. In manufacturing, where cost considerations and return on investment are paramount in decision-making, this solution stands out as not just a technically viable option but also a financially sound investment. The ongoing analysis will provide further insights into the economic impact, reinforcing the feasibility of implementing computer vision within industrial settings for enhanced safety management and operational efficiency.

Conclusively, our investigation affirms the practicality and financial soundness of implementing computer vision in manufacturing, marking a significant stride towards revolutionizing industrial safety and efficiency. This endeavor not only highlights the technical viability of our approach but also sets a new benchmark for smart manufacturing solutions, steering the industry towards a future where safety and efficiency are paramount.

Conclusion

Digital transformation is set to play a pivotal role in the manufacturing industry, making manufacturing processes more efficient through automation. Modern inspection mechanisms exemplify this trend, adopting new inspection elements that replace traditional human inspection with automated visual inspection (AVI). In that regard, this paper focused on the implementation of a cost-effective, open-source AVI technique for fault detection and an automated correction methodology. This implementation was deployed on a physical manufacturing system capable of assembly processes through multiple robotic arms. The overall model detects defects in the assembled product and determines what corrective actions need to be made. To build on the findings of this study, future research should focus on refining the control mechanisms for autonomous corrective actions performed by robots.

While the proposed computer vision system operates in a controlled environment with intentionally introduced defects that simulate a manufacturing setting, it is paramount to identify and address the uncertainties present in both the computer vision system and the assembly line. As with most computer vision solutions, internal uncertainty in the model's accuracy is expected, and this system is no exception. External uncertainties related to lighting conditions and camera variations were considered during the data collection stage and minimized during the image augmentation process. Additionally, the authors worked with a relatively small pool of parts with intentionally introduced defects, minimizing the non-parametric uncertainties related to variations in defect characteristics; in a real manufacturing environment, however, the variation of defects and their characteristics would be a large source of uncertainty. The planned HITL integration is designed to mitigate these uncertainties by having an operator verify or edit low-confidence predictions, improving the model's performance over time.

Future work

Acknowledging the dynamic nature of technology and the continuous evolution of machine learning methodologies, we remain open to exploring and integrating newer algorithms that may offer improvements in efficiency, accuracy, or ease of use. The advent of models such as YOLOv8, with potential advantages in processing speed, presents an intriguing avenue for future research. However, given the recent introduction of these technologies and the considerations around open-source availability and implementation complexity, our commitment to Mask R-CNN remains well-founded given the current technological landscape and project requirements. Because YOLOv8 and other advanced segmentation approaches were released in 2023, subsequent to our model's training, the authors are considering a future implementation of and comparison with YOLOv8 on the same manufacturing dataset. Future work may also include comparative analyses with other emerging techniques, assessing their viability and performance in the specific context of our industrial applications.

In future endeavors, the authors outlined several projects aimed at enhancing the efficiency and robustness of our system. One potential approach is the integration of Human-in-the-Loop (HITL) algorithms, utilizing artificial intelligence processing techniques. In addition, further capabilities can be added by monitoring the confidence levels of the segmentation process and alerting technicians when confidence falls below a threshold. Another planned improvement is expanding the system's capability to identify a broader range of defects, including smaller issues such as cracks and deformations in the manufactured parts; this expansion aims to mimic the diverse defects that arise in more demanding manufacturing operations. To achieve a more detailed inspection, we plan to incorporate object detection, image classification, and instance segmentation tasks from multiple camera angles at once. This approach is essential to ensure comprehensive coverage and accurate identification of defects, and it aligns with our commitment to continually improve the system's performance.

Moreover, we are considering the development of a unified model. This model would require a large-scale dataset built for object detection, image segmentation, and other computer vision tasks, including key-point detection on rocket parts and the completed assembly from a multitude of camera angles and backgrounds, yielding a robust model capable of performing numerous computer vision tasks at once. The goal is to streamline the training process for new models, not only for this project but also for other computer vision tasks undertaken by our research team, by employing transfer learning to reduce the amount of data collection and training required. This unified model approach is anticipated to contribute to a more effective and versatile system design in the future.