Real-time hollow defect detection in tiles using on-device tiny machine learning

This study addresses the challenge of subsurface defect detection in floor tiles for quality control in residential construction. To overcome the limitations of traditional inspection methods and the complexities associated with existing artificial intelligence (AI)-based approaches, we have developed the AI diagnostic Stick (AID-Stick), a novel tool designed to advance the field of tile defect detection. This innovative tool integrates an embedded machine-learning framework, leveraging convolutional neural networks and tiny machine learning techniques. The AID-Stick utilizes spectrogram, Mel-frequency cepstral coefficient, and Mel filterbank energy for real-time, on-microcontroller unit diagnostics of auditory signals from tile tapping tests. Our methodology effectively utilizes these acoustic features in distinguishing between intact and subsurface hollow defective tiles. The study’s findings, revealing a notable validation accuracy of 97% and a real-world accuracy of 81.25%, showcase a promising improvement over traditional methods. The AID-Stick’s practicality, cost-effectiveness, and user-friendly design make it potentially beneficial for small-to-medium enterprises and economically constrained markets. Furthermore, this research opens avenues for future enhancements in embedded AI systems, with potential applications extending beyond the construction industry to other domains requiring non-destructive testing. This work not only contributes to the field of industrial quality control but also to the development of intelligent diagnostic tools, paving the way for future innovations in automated inspection technologies.


Introduction
Floor tile inspection is paramount in residential construction before finalizing any property transaction. Far from being a mere formality, this process is a crucial safeguard against the considerable expenses of rectifying substandard construction, which can amount to as much as 4% of the total contract value [1]. This underscores the necessity for a comprehensive and detailed evaluation of new homes, focusing primarily on the quality of ceramic tiling.
Tile integrity is traditionally assessed by manual tapping to ensure structural soundness and visual appeal [2]. Though basic, these tests are proficient at detecting voids, which indicate potential structural flaws within the tile assembly [3,4]. The acoustic reverberations collected from these tests yield a preliminary yet insightful evaluation of the tiles' integrity.
Nevertheless, the accuracy of these methods can be compromised by various factors, such as the tools' composition, the tile materials' properties, and the inspector's expertise. To address these variables, the industry has turned towards artificial intelligence (AI), which has revolutionized the approach to detecting tile defects.
Within the field of AI, deep learning has established itself as a formidable tool for classifying auditory signal data. Utilizing neural networks with multiple hidden layers, this approach endows deep learning with a substantial capacity for effective data classification. Deep learning includes several approaches, such as deep neural networks [18], recurrent neural networks (RNNs) [19], generative adversarial networks, and convolutional neural networks (CNNs) [20,21]. Among these, CNNs have been increasingly recognized for their efficacy, especially in their application to acoustic diagnostics. CNNs, as specialized forms of multilayer perceptrons, excel in recognizing patterns within multi-dimensional data [22]. Their effectiveness is enhanced through training, which involves fine-tuning the network's weights to optimize performance. This training process may also include employing shared weights, a strategy that improves efficiency and reduces the number of necessary parameters, thereby conserving memory.
Sound recognition within CNNs commences with data framing, followed by the application of the fast Fourier transform, which transforms the data into a frequency spectrum. These spectra, compiled over time, create a visual spectrogram for the CNN input. The CNN's convolutional layers then apply filters to extract features, which are further simplified by pooling layers [23,24]. A fully connected layer transforms the multi-dimensional data into a one-dimensional (1D) format, with the softmax function bolstering classification accuracy [25].
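As an illustrative sketch (not the on-device implementation), the framing, FFT, and log-power steps described above can be expressed in NumPy; the frame length, hop size, and test tone are hypothetical example values:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Frame the signal, window each frame, and take the FFT log power.

    Illustrative sketch only; frame_len and hop are arbitrary example values.
    """
    window = np.hamming(frame_len)                      # reduces spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))      # frequency spectrum per frame
    return np.log(spectrum ** 2 + 1e-10)                # log power spectrogram

# A 1 kHz tone sampled at 16 kHz concentrates energy near bin 16 (1000/16000 * 256).
tone = np.sin(2 * np.pi * 1000 * np.arange(16000) / 16000)
S = spectrogram(tone)
print(S.shape)  # (n_frames, frame_len // 2 + 1)
```

Stacking these per-frame spectra over time yields the two-dimensional image that serves as the CNN input.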
Acknowledging the limitations of current methodologies, such as the need for multiple devices and the high costs associated with AI-based diagnostics [7,8,13], the research community has explored the potential of tiny machine learning (TinyML), a technology that combines machine learning with microcontroller units (MCUs) [33-37]. TinyML optimizes models for devices with limited computational resources, enabling the deployment of machine-learning models on a smaller scale.
In light of these challenges, this paper introduces a novel system architecture to detect subsurface tile defects, such as hollow spots, utilizing an embedded machine-learning framework. The proposed solution employs the AI diagnostic Stick (AID-Stick), which combines three acoustic feature-extraction methods, namely the spectrogram, Mel filterbank energy (MFE), and Mel-frequency cepstral coefficients (MFCC), with CNNs and TinyML techniques. This combination is engineered to offer a cost-effective, precise, portable, and real-time fault detection solution. The AID-Stick, an application-specific integrated device, is tailored to operationalize the machine-learning model in real-world scenarios, representing a significant advancement in construction quality assessment.

Methodology
The methodology depicted in figure 1 presents the systematic architecture for detecting subsurface anomalies, such as hollow defects in floor tiles, using an embedded machine-learning framework. This process is divided into five sequential stages for a thorough approach to the challenge.
The first stage begins with collecting and curating auditory signals from a standardized tapping test on the tiles. Human expertise is integral at this stage, offering insights that significantly contribute to the subsequent analytical process.
In transitioning to the second phase, feature engineering, we utilize the advanced capabilities of the Edge Impulse platform, tailored for edge-based machine-learning development. This phase transforms raw audio data into a structured format suitable for AI by creating spectrograms, extracting MFCCs, computing MFE, and applying normalization techniques, ensuring the data is primed for modeling.
The third phase focuses on designing and optimizing a CNN. This involves iterative model refinement and evaluation against performance metrics, bridging theoretical design and practical application.
The refined model is embedded into a system infrastructure in the fourth step and subjected to rigorous simulation routines. The model's predictive capabilities are tested, and its robustness and reliability in an operational environment are confirmed through systematic verification protocols.
Finally, the deployment phase culminates the methodology, where the AID-Stick, an application-specific integrated device, operationalizes the model in real-world scenarios. This stage represents a significant leap towards automated diagnostics, highlighting the system's practicality in enhancing the integrity assessment of floor tiles.

Data processing
As illustrated in figure 1, the data processing segment is the second phase of our system's architecture, immediately following the initial collection of auditory signals. In this phase, feature engineering, the raw audio data are transformed into a form where salient features, particularly the MFCCs, can be extracted for subsequent AI modeling.
The process begins with a preprocessing step that amplifies the high-frequency parts of the audio signal through pre-emphasis, setting the stage for clearer feature extraction. The signal is then divided into overlapping frames in a process called frame blocking, which allows for a detailed analysis of the audio characteristics. To reduce spectral leakage, each frame is smoothed by passing it through a Hamming window, thus preparing it for the Fourier transform. The Fourier transform transitions the time domain into the frequency domain, where the power spectrum is computed by squaring the magnitudes of the frequency components. By the following application of the Mel filterbank, these components are mapped onto the Mel scale, a scale that closely resembles human auditory perception. Subsequently, the energy in the Mel filterbank, referred to as MFE, is extracted through a logarithmic transformation of these Mel-scaled frequencies, refining the data for the AI's interpretive processes.
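The Mel-scale mapping and filterbank-energy steps described above can be sketched as follows. This is a standard construction using the common 2595·log10(1 + f/700) Mel formula; the filter count, FFT size, and sampling rate are illustrative assumptions, not the parameters used on the device:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, f_low=0.0, f_high=None):
    """Triangular bandpass filters spaced evenly on the Mel scale."""
    f_high = f_high or fs / 2.0
    # Critical band edges: evenly spaced in Mel, converted back to FFT bins.
    mel_edges = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bin_edges[m - 1], bin_edges[m], bin_edges[m + 1]
        for k in range(left, center):                     # rising edge of triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                    # falling edge of triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def log_mel_energies(power_spectrum, fbank):
    """MFE: logarithm of the Mel-filtered spectral energy."""
    return np.log(fbank @ power_spectrum + 1e-10)

# 40 triangular filters over a 257-bin power spectrum (illustrative parameters).
fbank = mel_filterbank(40, 512, 16000)
print(fbank.shape)  # (40, 257)
```

Applying `log_mel_energies` to each frame's power spectrum yields the MFE feature matrix that is fed to the model.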
MFE employs a non-linear scale in the frequency domain, referred to as the Mel scale, which is predominantly utilized to recognize non-speech sounds discernible by the human ear. The MFE feature-extraction process is similar to that of spectrograms but includes two additional stages: the amplitude is squared after the spectrum map is acquired, and a triangular filter is then applied on the Mel scale to extract the frequency bands. The bandwidth and the quantity of features extracted are regulated by three parameters: the number of filters, the lowest frequency, and the highest frequency. The Mel filterbank constitutes a collection of triangular bandpass filters; the number of filters is comparable to the number of critical bands, and the central frequency of each band is expressed on the Mel scale. The filterbank is designed to mirror the human auditory system's response, exhibiting linearity below 1000 Hz and logarithmic growth above 1000 Hz, given that its bandwidth is approximately 100 Hz. The following equations establish the relationship between the Mel frequency, Mel(f), and the standard frequency, f [38]:

Mel(f) = f,  for f < 1000 Hz,  (1)

Mel(f) = 2595 log10(1 + f/700),  for f ⩾ 1000 Hz.  (2)

Then, the mth filter function of the Mel filterbank, Hm(k), is defined as follows:

Hm(k) = 0,  for k < k_b(m-1) or k > k_b(m+1),
Hm(k) = (k - k_b(m-1)) / (k_bm - k_b(m-1)),  for k_b(m-1) ⩽ k ⩽ k_bm,
Hm(k) = (k_b(m+1) - k) / (k_b(m+1) - k_bm),  for k_bm < k ⩽ k_b(m+1).  (3)

Here, m ranges from 1 to the total number of filters, k_bm stands for the critical frequency band edges, and k is the kth coefficient of the discrete Fourier transform spectrum comprising K points. The critical band edges, k_bm, can then be calculated using the formula:

k_bm = (K / F_s) f_mel^(-1)( f_mel(f_low) + m [f_mel(f_high) - f_mel(f_low)] / (M + 1) ),  (4)

where f_mel is the function converting frequency to Mel frequency, F_s is the sampling rate, M is the total number of filters, and f_low and f_high are the lower and upper cutoff frequencies of the filterbank, respectively. The inverse Mel frequency function, f_mel^(-1), is given by:

f_mel^(-1)(m) = 700 (10^(m/2595) - 1).  (5)

The triangular bandpass filter is applied to smooth the spectrum, reducing the influence of harmonics and highlighting the formant frequencies, which are critical for distinguishing different sounds and speech patterns. The flowchart in figure 2 illustrates the sequential steps in the feature extraction process using the MFE method for sounds from tapping solid and hollow tiles and background noise. Building on the MFE, the discrete cosine transform (DCT) is applied to the logarithmic MFEs to derive the MFCCs. These coefficients are a collection of features that have become a standard in speech recognition applications. The MFCC process uses the Mel scale and the DCT to capture the cepstral representation of the audio signal. Typically, the first 13 coefficients are preserved for their significant role in audio recognition tasks. The coefficients are calculated using the DCT equation [39]:

C(l) = sum_{m=1}^{M} s(m) cos( pi l (m - 0.5) / M ),  l = 1, 2, ..., L,  (6)

where s(m) is the logarithmic energy passed through the mth triangular band filter, M is the number of triangular filters, and L is the number of MFCCs (L < M), commonly set to 13 for audio recognition.
These coefficients, from C(1) to C(13), serve as the feature values for the analyzed time frame. Moreover, the figure alludes to the creation of spectrogram features, which visually depict the frequency spectrum of a sound signal over time and are obtained by logging the power spectrum. These spectrograms complement the MFCCs, providing a robust feature set for the AI to interpret the audio data.
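The DCT step that converts log Mel filterbank energies into MFCCs can be sketched generically as follows; L = 13 matches the common choice the text cites, and the 40-filter input in the test is an illustrative assumption:

```python
import numpy as np

def mfcc(log_energies, n_coeffs=13):
    """DCT of the log Mel filterbank energies, keeping the first L coefficients.

    log_energies: s(m), the log energy from each of the M triangular filters.
    """
    M = len(log_energies)
    m = np.arange(1, M + 1)
    # C(l) = sum_m s(m) * cos(pi * l * (m - 0.5) / M), for l = 1..L
    return np.array([np.sum(log_energies * np.cos(np.pi * l * (m - 0.5) / M))
                     for l in range(1, n_coeffs + 1)])
```

A useful sanity check: a perfectly flat energy distribution carries no cepstral shape, so all coefficients come out (numerically) zero.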
Proceeding with the analysis, figure 3 offers a visual comparison of acoustic properties for floor tiles in different integrity states. The spectrogram for intact tiles (figure 3(a)) showcases a defined and consistent frequency pattern, reflective of uniformity and sound installation. In contrast, the spectrogram for defective tiles (figure 3(d)) exhibits a disordered frequency pattern, hinting at potential flaws such as cracks or issues with the installation process.
In observation of the MFEs over time, intact tiles (figure 3(b)) demonstrate a consistent energy distribution across the Mel-frequency bands, suggesting a reliable acoustic response attributed to the material's quality and proper installation. In contrast, the MFEs for defective tiles (figure 3(e)) show an erratic distribution, with certain bands displaying abnormal energy levels, indicative of irregularities within the tile or its placement.
Moreover, the MFCCs for intact tiles (figure 3(c)) exhibit uniform and stable coefficients over time, reflecting a consistent acoustic signature and structural integrity. In comparison, the MFCCs for defective tiles (figure 3(f)) present a coefficient variation, which can signal inconsistencies in the acoustic properties, potentially due to defects or improper installation.
The application of spectrogram, MFE, and MFCC as inputs for a CNN is validated by their alignment with the network's learning paradigms. Spectrograms provide a time-frequency representation of audio signals, enabling CNNs to identify patterns across temporal and spectral dimensions [40]. This dual analysis is critical for detecting acoustic discrepancies between intact and defective floor tiles. Additionally, the visual nature of spectrograms complements the CNN's image recognition capabilities and helps discern the complex patterns within the audio data [41].
Derived from the Mel scale, MFE echoes the human auditory system's response, which offers a perceptually relevant and computationally efficient representation for CNN processing. The reduction in dimensionality with MFE provides a focused input feature set for CNNs, potentially improving learning efficiency. MFCCs contribute a robust feature set, adept at capturing the essential characteristics of sound signals and resistant to noise interference [42]. The cepstral representation underlying MFCCs effectively captures the formant structure of audio signals, which differentiates the subtle acoustic signatures of intact versus defective tiles. The compactness of MFCCs eases the computational load on CNNs and aids in a more effective training process, especially with large datasets.
Integrating these acoustic features with the CNN architecture establishes the foundation for an auditory analysis system adept at detecting nuanced differences in sound. Such precision ensures floor tiles are accurately classified by their structural integrity and density.

The CNN model
This study uses CNNs, a key component of our AI framework, for feature extraction and data classification. Renowned for their effectiveness in image recognition, CNNs employ a series of convolutional and pooling layers to process input data. Through learnable filters, the convolutional layers detect local features in the data, while the pooling layers downsample and reduce data dimensionality. This dual approach enhances the model's capacity to handle variations in input data and creates translationally invariant features. Figure 4(a) illustrates the application of filters in the convolutional layers, and figure 4(b) demonstrates the downsampling function of the pooling layers.
The feature data dimensions for the spectrogram, MFE, and MFCC are 6237 (63 × 99), 3960 (40 × 99), and 650 (13 × 50) points, respectively. Our CNN architecture, detailed in figure 5, includes an input layer (left), followed by two 1D convolution layers with a kernel size of three and rectified linear unit (ReLU) activation, and two pooling layers with a kernel size of three and a stride of two (middle). The data first pass through the convolution layers, where local features are extracted, and then through the pooling layers, where the data dimensionality is reduced. This process is repeated to enhance feature detection. The architecture concludes with a flattening layer with a 25% dropout rate to prevent overfitting and a softmax output layer for final classification (right). The 1D convolution layers process the data along the time axis, extracting temporal features, while the pooling layers downsample the information, creating translationally invariant features.
ReLU was chosen as the activation function for its efficiency in facilitating quick convergence during gradient descent, outperforming functions such as Sigmoid and Tanh. This efficiency stems from ReLU's operational simplicity, where it activates only for positive inputs and effectively reduces computational complexity. However, ReLU has its limitations: it does not activate for negative input values, leading to a scenario where neurons can become inactive, hindering updates to the model's parameters. This characteristic of ReLU also restricts the ability to increase the learning rate significantly, as it may not be zero-centered, and negative inputs result in no activation. To counter these limitations and enhance the neural network's learning process, the Adam optimizer was implemented, with a learning rate set to 0.0005. The Adam optimizer dynamically adjusts the learning rate during training, effectively optimizing the network parameters and minimizing the loss function, thereby addressing the challenges posed by ReLU's inactivity with negative inputs.
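To make the layer dimensions concrete, the following NumPy sketch mimics the conv (kernel 3, ReLU) and pool (kernel 3, stride 2) stages on an MFCC-sized input. The filter counts and random weights are illustrative assumptions, and the flatten, dropout, softmax, and Adam training stages are omitted:

```python
import numpy as np

def conv1d(x, kernels):
    """Valid 1D convolution along the time axis, followed by ReLU.

    x: (time, channels); kernels: (n_filters, 3, channels).
    """
    T = x.shape[0] - 2                       # kernel size 3, no padding
    out = np.zeros((T, kernels.shape[0]))
    for t in range(T):
        out[t] = np.tensordot(kernels, x[t:t + 3], axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)              # ReLU: zero for negative inputs

def maxpool1d(x, size=3, stride=2):
    """Downsample along time, giving a degree of translation invariance."""
    steps = 1 + (x.shape[0] - size) // stride
    return np.stack([x[i * stride:i * stride + size].max(axis=0)
                     for i in range(steps)])

# MFCC input: 50 time steps x 13 coefficients (650 points, as in the text).
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 13))
h = maxpool1d(conv1d(x, rng.standard_normal((8, 3, 13))))   # conv -> pool
h = maxpool1d(conv1d(h, rng.standard_normal((16, 3, 8))))   # repeated once
print(h.shape)  # (10, 16) for this input
```

Tracing the shapes (50 → 48 → 23 → 21 → 10 along time) shows how the repeated conv/pool stages condense the feature map before flattening and classification.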

Deployment of hollow defective inspection algorithm on AID-Stick for tile analysis
Figure 6 depicts an algorithmic flowchart for evaluating the structural integrity of floor tiles by detecting hollow spaces, using visual indicators for real-time status updates. The algorithm initiates with a green LED lighting up for a precise 0.5 s interval, signaling the start of the data collection and analysis phase. During this phase, acoustic data are recorded from the tile and analyzed computationally to assess the possibility of a hollow. The algorithm's decision-making is binary, hinging on a probability threshold: a value <0.75 indicates no hollow, negating the need for LED indication; conversely, a probability ⩾0.75 signals a hollow, activating a red LED to provide a clear visual alert. This red LED serves as a critical diagnostic output, confirming the detection of a defect. Following this, the system enters a 3 s latency period, a designed temporal buffer allowing for necessary actions before proceeding to the next inspection cycle. This structured approach ensures a systematic and efficient evaluation of tile integrity, integrating both auditory signal processing and visual feedback.
Algorithm 1 outlines the pseudo-code for the inspection process. The green and red LEDs and microphone are built into the AID-Stick to facilitate inspection. The system categorizes audio signals into three distinct classes: hollow tile, solid tile, and background noise. The inspection cycle, detailed in lines 1-4, is timed by the illumination of the green LED for a predetermined interval. Once the green LED extinguishes, the microphone records acoustic data for a specified duration. This audio is then analyzed to infer the probability of the sound corresponding to one of the categories. Line 7 details the 'Time_Frame', which refers to the length of audio segments processed individually for inference. If the probability of a hollow equals or exceeds 0.75, the red LED is triggered for one second to provide a clear notification. Subsequently, the recorded audio is deleted to free up memory, ensuring the inspection process can proceed promptly.
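The decision logic above can be sketched as a Python simulation; `record_audio`, `classify`, and `set_led` are hypothetical stand-ins for the AID-Stick's microphone capture, TinyML inference, and LED control, since the actual device runs as MCU firmware rather than Python:

```python
import time

THRESHOLD = 0.75          # hollow-probability threshold from the flowchart
GREEN_INTERVAL = 0.5      # seconds the green LED stays lit before capture
RED_INTERVAL = 1.0        # red LED notification time
LATENCY = 3.0             # temporal buffer before the next inspection cycle

def inspect(record_audio, classify, set_led, sleep=time.sleep):
    """One inspection cycle; returns True when a hollow is flagged."""
    set_led('green', True)                   # signal start of data collection
    sleep(GREEN_INTERVAL)
    set_led('green', False)
    audio = record_audio()
    probs = classify(audio)                  # e.g. {'hollow': p1, 'solid': p2, 'noise': p3}
    hollow = probs['hollow'] >= THRESHOLD
    if hollow:                               # defect detected: clear visual alert
        set_led('red', True)
        sleep(RED_INTERVAL)
        set_led('red', False)
    del audio                                # free memory before the next cycle
    sleep(LATENCY)
    return hollow
```

Wiring in stub callables makes the threshold behavior easy to exercise without hardware, which mirrors how the decision rule was validated here.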

Experiment overview
A series of six experiments was conducted to assess the precision and effectiveness of audio signal processing models in detecting hollow defects within floor tiles. The first experiment evaluated the accuracy of the spectrogram, MFCC, and MFE models using validation and test sets. In the subsequent three experiments, these models' performance was assessed at threshold levels of 0.85, 0.75, and 0.65 on floor tile specimens labeled (a) to (c) in figure 7(f), each with hollow defects at different locations.
The fifth experiment compared these models to select the most suitable one for integration with the AID-Stick, a device engineered for automated tile defect detection. Finally, the sixth experiment deployed the AID-Stick, equipped with the spectrogram model at a 0.75 threshold, on polished and quartz floor tiles to evaluate its performance, thereby validating the models and the AID-Stick's practical application.

Design of the AID-Stick
The AID-Stick is a specialized tool developed for this study to enhance the practical application of our research in detecting indoor tile defects. While any MCU compatible with TinyML can be adapted for this purpose, the AID-Stick is custom-designed to improve user accessibility and facilitate empirical testing.
The design process utilized 3D drafting software to create a handle accommodating a power battery, an operation control switch, and a voltage step-up module for power regulation. The design files were subsequently converted into STL format for 3D printing, followed by assembly, as depicted in figure 8. Prominently featured on the AID-Stick is a retractable impact rod, seamlessly integrated with the handle and designed for user convenience and ease of operation.
The core of the AID-Stick is the Arduino Nano 33 BLE Sense microcontroller, selected for its compact dimensions, 45 mm × 18 mm, ideal for portable applications. It operates on the Nordic nRF52840 chipset, which is anchored by an Arm Cortex-M4 core and features an integrated on-board microphone. This microphone adeptly captures acoustic signals generated from tile tapping, eliminating the need for an external microphone.
The AID-Stick operates on an 18650 lithium battery with a nominal discharge voltage range of 3.2-3.7 V. The range can escalate to ∼4.1-4.2 V when the battery is fully charged and can decrease to ∼2.7-3 V as the battery nears depletion. To align with the AID-Stick's optimal operating voltage of 3.3 V and to mitigate the battery's voltage fluctuations, a DC-DC step-up module is incorporated. This module stabilizes the output at a consistent 5.1 V, ensuring the device's operational stability and optimal LED performance. The inclusion of a power switch serves dual purposes: it conserves energy when the device is inactive and reduces the frequency of battery replacements, thereby enhancing the AID-Stick's sustainability.

Preparation of floor tile specimens and data collection
The preparation of floor tile specimens for model training and empirical validation is detailed in figures 7(a)-(d), starting with the construction of wooden molds depicted in figure 7(a). These molds were sized at 42.5 cm × 42.5 cm × 5.2 cm. For cement mortar specimens, the dimensions were established at 40 cm × 40 cm × 4 cm. The mortar mixture was prepared with water, fine aggregate, and cement in a ratio of 0.6:3:1, weighing 2.1 kg, 10.5 kg, and 3.5 kg, respectively.
The mortar mixture, once prepared, was thoroughly blended in a mechanical mixer to ensure uniform consistency. The blend was then introduced into the molds and compacted to ensure optimal density and structural integrity (figure 7(b)). Next, hollow brick models were embedded into the surface using a trowel to create level and smooth finishes. These models were subsequently removed after the mortar was partially set, forming the intended hollow voids within the specimens (figure 7(c)).
Tiles were then placed on the prepared mortar bed. Each tile was properly tapped with a rubber mallet to ensure solid adhesion and prevent potential voids, as shown in figure 7(d). This step was essential to replicate real-world conditions where tiles are firmly attached to their base. After setting the tiles, the specimens underwent a curing process and were subsequently demolded. This produced a series of tile samples, some with hollow defects and others solid, ready for the training and testing phases of the study. Figure 7(e) showcases these samples, distinguishing between defective and intact tiles.
A dataset of 2500 audio samples was compiled for this study to support the machine-learning-based detection of hollow tiles using the AID-Stick. The samples were divided into 'Intact' for solid tiles, including 1300 samples, and 'Defective' for hollow tiles, comprising 1200 samples (table 1).
For machine learning, the dataset was allocated into training and testing sets. A total of 1968 samples, with 1011 'Intact' and 957 'Defective', were designated for training. The training set, making up the majority of the dataset, is instrumental in enabling the model to learn the distinguishing features between intact and defective tiles. The remaining 532 samples, consisting of 289 'Intact' and 243 'Defective', were reserved for testing. The testing set is crucial in evaluating the model's performance, particularly its ability to accurately identify hollow tiles in new and previously unseen data.
In summary, the dataset's division into 1968 training samples and 532 testing samples reflects a well-considered approach in machine learning. This structure not only facilitates the model's learning but also ensures a thorough assessment of its ability to accurately identify hollow tiles, underscoring the study's focus on precision and practical applicability.

Performance analysis of acoustic feature-extraction models-spectrogram, MFE, and MFCC-in tile defect detection

For the spectrogram feature (figures 9(a) and (b)), both the loss and accuracy curves show a rapid initial learning phase followed by stabilization. This feature demonstrates effective generalization, evidenced by validation metrics closely tracking training metrics. However, a minor discrepancy between training and validation accuracy suggests slight overfitting.
In the case of the MFE feature (figures 9(c) and (d)), a more gradual reduction in loss is observed, alongside a smoother but slower increase in accuracy. This indicates a gradual and steady learning curve, emphasizing consistent and reliable progress over time.
The analysis of the MFCC feature (figures 9(e) and (f)) presents the most closely convergent loss curves, signifying its efficacy in providing informative data representation for the model. Nevertheless, a slight divergence in accuracy towards the end of the training could be due to the initial overfitting stages.
Overall, the model displays a promising learning trajectory across all features. While there are no overt signs of significant overfitting, except for minor divergences, it is inferred that further tuning could refine the model's performance, possibly achieved through adjustments in the model or by applying advanced regularization techniques.
This research employed a supervised machine-learning approach for classifying floor tiles into hollow and solid categories. The effectiveness of our classification model was evaluated using a confusion matrix, as outlined in table 2. This matrix is instrumental in calculating four critical performance metrics-accuracy, precision, recall, and F1 score-which are standard measures for assessing classification models.
Accuracy indicates the model's overall ability to correctly classify the data, while precision measures the proportion of true positive (TP) predictions against all positive predictions made by the model. Recall assesses the ratio of TP predictions to the total number of actual positive instances. The F1 score, the harmonic mean of precision and recall, provides a balanced measure of the two metrics.
An ideal model is characterized by high TP and true negative (TN) rates, coupled with low false positive (FP, type I error) and false negative (FN, type II error) rates. Equations (7)-(10) detail the formulas for these performance indicators:

Accuracy = (TP + TN) / (TP + TN + FP + FN),  (7)

Precision = TP / (TP + FP),  (8)

Recall = TP / (TP + FN),  (9)

F1 = 2 (Precision × Recall) / (Precision + Recall).  (10)

This study's primary goal was to determine the effectiveness of various machine-learning models, specifically the spectrogram, MFE, and MFCC models, in identifying hollow defects in floor tiles. We evaluated these models' performance on both validation and test sets.
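The four indicators can be computed directly from confusion-matrix counts; the counts in the example below are hypothetical, chosen only to illustrate the arithmetic:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical counts: 40 TP, 50 TN, 10 FP, 0 FN.
print(classification_metrics(40, 50, 10, 0))  # accuracy 0.9, precision 0.8, recall 1.0
```

Note that with zero false negatives the recall is perfect while precision still penalizes the ten false alarms, which is exactly the trade-off the F1 score balances.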
Building upon the insights from figure 9, the accuracies of three distinct feature-extraction methods-spectrogram, MFE, and MFCC-are compared in figure 10. This comparison provides a clear perspective on each method's performance in validation and test scenarios.
The spectrogram method shows the highest accuracy for both validation and test sets, reinforcing its efficacy and reliability in feature extraction for machine-learning models. Its superior performance highlights its robustness and adaptability across varied datasets.
The MFE method, although slightly less precise than the spectrogram, still showcases commendable performance. This reflects its potential applicability in feature extraction tasks where high accuracy is essential.
Conversely, the MFCC method, despite its effectiveness, records the lowest accuracy among the three methods. This is particularly evident when examining the transition from the validation to the test set, where a noticeable decrease in accuracy is observed. This decline could hint at challenges such as overfitting or limitations in generalizing across diverse datasets, suggesting a need for cautious application in scenarios where high adaptability is crucial.
In summary, the spectrogram method particularly stands out for its adaptability and robustness, making it a preferred choice for machine-learning tasks requiring accuracy and generalizability across diverse datasets.
Continuing the deep-dive analysis initiated by figures 9 and 10, figures 11 through 13 offer an examination of the performance metrics for the three distinct models: spectrogram, MFE, and MFCC.
The performance metrics of the spectrogram model for the three tile specimens are plotted against varying threshold levels (figure 11). The precision-recall curves (left) illustrate the trade-off between precision and recall for each specimen. Specimen (a) shows a significant peak in precision at a recall threshold of 75% before experiencing a steep decline, suggesting high initial confidence in predictions that wanes with increased recall demands. For specimens (b) and (c), a more steady descent in precision is observed as recall heightens, indicative of a balanced yet less pronounced model performance. Among the three, specimen (c) retains relatively high precision across a wider recall range, although it starts from a lower point than its counterparts. The accuracy-F1 score-threshold curves (figure 11, right) shed light on another dimension of the model's performance. Specimens (a) and (b) exhibit congruent increases in accuracy and F1 score with the threshold, denoting that stricter thresholds enhance TP and TN predictions. However, these trends diverge at a 0.85 threshold for specimen (b), suggesting a performance limit beyond this point. Specimen (c) presents a contrasting scenario, with a significant peak in both metrics at a 0.75 threshold and a sharp decline at 0.85. This indicates heightened sensitivity to threshold adjustments, where an optimal precision-recall balance is achieved at a moderate threshold.
The plots collectively imply that while the model is generally robust, its performance is intricately tied to the chosen threshold, which governs the balance between recall (sensitivity) and precision. The accuracy and F1 trends in specimens (a) and (b) suggest that the model is well calibrated for these specimens, with a consistent increase in performance up to a certain threshold. However, the erratic behavior observed in specimen (c) necessitates a more nuanced approach to threshold selection, likely requiring bespoke adjustment to reconcile the precision-recall equilibrium and maintain high model accuracy.
Overall, figure 11 accentuates the importance of threshold calibration in model evaluation. It elucidates that while a universal threshold may enhance model performance for some specimens, it can severely compromise the performance for others, emphasizing the need for specimen-specific threshold optimization in precision-critical applications.
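The per-threshold metrics plotted in figures 11 through 13 can be reproduced with a small helper that thresholds classifier scores and derives precision, recall, accuracy, and F1 from the resulting confusion counts. The sketch below is generic; the scores and labels are synthetic stand-ins, not the study's data.

```python
import numpy as np

def metrics_at_threshold(scores, labels, threshold):
    """Compute precision, recall, accuracy, and F1 for one decision threshold."""
    preds = (scores >= threshold).astype(int)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    tn = int(np.sum((preds == 0) & (labels == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(labels)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, accuracy, f1

# Sweep the three thresholds used in figures 11-13 over illustrative data.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.85, 0.2])
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # 1 = hollow, 0 = intact
for t in (0.65, 0.75, 0.85):
    p, r, a, f1 = metrics_at_threshold(scores, labels, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}  "
          f"accuracy={a:.2f}  F1={f1:.2f}")
```

Sweeping such a helper over a threshold grid yields exactly the precision-recall and accuracy-F1 score-threshold curves discussed above.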
Moving to figure 12, we delve into the diagnostic performance of the MFE model, providing an intricate depiction across three distinct floor-tile specimens. This figure offers a nuanced understanding of how varying threshold levels affect the model's precision, recall, accuracy, and F1 score.
For specimen (a), the precision-recall curve demonstrates a marked decline as the threshold is increased from 0.65 to 0.75. This indicates that while the model is confident in its predictions, its accuracy in correctly labeling positive instances falters with more stringent thresholds. Alongside this, the accuracy-F1 score-threshold curve for specimen (a) displays a peak and plateau in accuracy, accompanied by a consistent decline in the F1 score. This divergence suggests potential issues such as class imbalance or the sensitivity of the F1 score to certain model behaviors that are not captured by accuracy alone.
Specimen (b) presents a pattern of relatively stable but declining precision across threshold levels, implying a more consistent but less optimal trade-off between the true positive rate (TPR) and the positive predictive value. The accuracy and F1 scores for specimen (b) mirror this decline, with the F1 score decreasing more sharply. This trend implies that the model, at higher thresholds, is becoming too stringent, resulting in a higher number of false negatives (FNs), which adversely affects both accuracy and the F1 score.
The most striking pattern is observed in specimen (c), where an initial decline in the precision-recall curve is followed by a dramatic rise at the highest threshold. This unusual pattern suggests a threshold-specific behavior where the model's performance significantly improves at a certain point, potentially indicating an optimal threshold for this particular specimen. However, the accompanying accuracy-F1 score-threshold curve reveals a contradiction: accuracy drops while the F1 score increases substantially at the highest threshold. This complex interplay of the model's predictive behavior suggests that the threshold increase significantly improves the precision-recall balance, but at the expense of correctly identifying true negatives (TNs) and false positives (FPs), thus reducing overall accuracy.
These intricate patterns seen in figure 12 underscore the non-linear and highly specimen-specific relationship between the chosen threshold and the model's performance. They highlight nuances in the data, or in the model's predictive behavior, that necessitate a considered approach to threshold tuning. The MFE model requires a customized threshold strategy for each specimen to optimize the precision-recall trade-off while maintaining high overall accuracy and a robust F1 score.
Continuing this evaluation, figure 13 focuses on the performance of the MFCC model, analyzing its behavior across three distinct floor-tile specimens (a), (b), and (c) at varied threshold levels. This figure contributes an essential layer to our understanding of the model's predictive capabilities and the impact of threshold settings on precision, recall, accuracy, and F1 score. For specimen (a), the precision-recall curve demonstrates a marked decrease in precision as recall increases, particularly noticeable between thresholds of 0.65 and 0.75. This trend indicates a reduction in the model's efficacy in accurately identifying positive instances under tighter thresholds. The corresponding accuracy-F1 score graph further illuminates this dynamic, showing a downward trend in both metrics as the threshold is raised. This divergence may indicate the model's struggle to balance sensitivity and precision at elevated threshold levels.
Specimen (b) reveals a different pattern, with a more uniform decline in precision across escalating recall thresholds. This finding is echoed in the concurrent trends of accuracy and F1 score, which peak at a mid-level threshold of 0.75 before diminishing. This behavior suggests that the model's overall performance wanes as it adopts more stringent threshold settings.
Intriguingly, specimen (c) exhibits an unconventional response, where the precision-recall curve initially decreases before surging significantly at the highest threshold. This unexpected rise in precision contrasts with a simultaneous decrease in accuracy, as observed in the accuracy-F1 score graph. Such a pattern indicates a complex interplay within the model's predictive mechanism, where an increased threshold substantially enhances precision and recall, yet potentially at the expense of overall accuracy.
These observations underscore the MFCC model's high sensitivity to threshold adjustments and distinctive performance characteristics across different specimens. While higher thresholds can enhance the precision-recall balance in some cases, they may also lead to a drop in overall accuracy. This complexity necessitates a deliberate and nuanced approach to threshold selection, tailored to the specific attributes of each specimen, to optimize model performance and achieve a harmonious balance between precision and recall. The insights from figure 13 thus highlight the need for a thoughtful application of threshold settings in practice, particularly when deploying the MFCC model in diverse and variable scenarios.
The diverse behaviors and sensitivities to threshold adjustments observed in the spectrogram, MFE, and MFCC models across different specimens, as delineated in figures 11 through 13, underscore the multifaceted nature of machine-learning model performance. This analysis brings us to a comparative evaluation of the models, considering their performance across the three specimens and the implications of their precision-recall dynamics, accuracy, F1 scores, and threshold sensitivities.

Model performance across specimens:
The spectrogram model, as explored in figure 11, consistently demonstrates higher precision across different thresholds for specimens (a) and (b) compared to the MFCC model shown in figure 13. This trend highlights the spectrogram model's robustness in feature extraction. Meanwhile, the MFE model, as depicted in figure 12, exhibits intermediate performance, positioning itself between the spectrogram and MFCC models. Notably, for specimen (b), the MFE model demonstrates optimal performance at a moderate threshold before beginning to decline.
Precision-recall trade-off:
Across all three models, a trade-off between precision and recall is evident. However, the spectrogram model maintains a more favorable balance. This is particularly apparent in its less dramatic reduction in precision with increasing recall. In contrast, the MFCC model, especially for specimen (c), displays an unusual pattern where precision increases at the highest threshold, diverging from the expected inverse relationship between these two metrics.
Accuracy and F1 score trends:
The accuracy and F1 scores of the spectrogram model are relatively stable or even increase with the threshold. This indicates that higher thresholds efficiently filter out FPs without significantly affecting the TPR. Conversely, the MFE model's peak performance in accuracy and F1 scores occurs at a moderate threshold before it begins to decline. For the MFCC model, there are varied responses. Notably, in specimen (c), the F1 score increases with the threshold, contrary to the decline in accuracy, suggesting a complex interplay of metrics at higher thresholds.

Threshold sensitivity:
The spectrogram model exhibits less sensitivity to threshold changes, maintaining a consistent performance across thresholds for most specimens. On the other hand, the MFE and MFCC models are more responsive to threshold adjustments, with their performances peaking at different thresholds for different specimens. This indicates the necessity for specimen-specific threshold optimization.

Overall suitability:
The spectrogram model, with its high precision and stable trends in accuracy and F1 score, appears most suitable for applications demanding a high TPR. The MFE model may require precise threshold adjustments to reach peak performance, serving as a balanced option between the spectrogram and MFCC models. The MFCC model, given its particular sensitivity to threshold changes, may be best suited for scenarios where precision is paramount or the cost of FPs is high.
In summary, this study implemented a uniform CNN architecture to analyze various audio features, thereby establishing a consistent environment for direct comparisons of key performance metrics such as precision, recall, accuracy, and F1 score. It was observed that different audio feature types respond uniquely to threshold level changes, which is essential in evaluating their robustness and sensitivity under various classification thresholds. The research identified the most practical combination of features and thresholds by analyzing performance indicators such as precision-recall and accuracy-F1 score-threshold curves. This choice was based on a thorough comparative analysis that considered the distinct attributes of each audio representation and their performance trade-offs at different thresholds. For example, the spectrogram model exhibited peak performance at a specific threshold, a characteristic distinct from the MFE and MFCC models. These insights are particularly valuable for hollow tile detection, suggesting that certain audio feature representations yield more effective results when optimized with suitable threshold settings. The application of a consistent CNN architecture across different audio features also highlights the adaptability of CNNs to various forms of sound data representation.
Furthermore, while each model (spectrogram, MFE, and MFCC) has unique strengths, the spectrogram model is notably robust across various specimens and thresholds. The MFE model emerges as a versatile option, though it requires precise calibration for optimal performance. In contrast, the MFCC model, sensitive to threshold adjustments, is potentially ideal for high-precision applications. Therefore, the choice of model and threshold setting should be made carefully, taking into account the application's specific needs and the data's unique characteristics.

Performance comparison of AID-Stick on different tile types
Figure 14 presents a radar chart that evaluates the AID-Stick's inspection capabilities on two categories of floor tiles: polished (blue) and quartz (green). The chart compares the tool's performance metrics, including precision, recall, F1 score, and accuracy.
Precision, which quantifies the proportion of TP identifications out of all positive classifications, shows a slightly higher value for quartz tiles. This observation suggests a marginal difference in the AID-Stick's precision between the two tile types.
For the recall metric, which evaluates the tool's ability to identify all actual positives, the radar chart indicates a slight advantage for polished tiles.This suggests a variation in the AID-Stick's effectiveness in detecting relevant instances among different tile materials.
The chart also examines the F1 score, a metric combining precision and recall, and the accuracy, reflecting the ratio of correctly identified instances. Both metrics show similar levels for polished and quartz tiles, indicating a consistent performance of the AID-Stick across these metrics for both tile types.
In summary, the radar chart demonstrates a similar performance pattern for both polished and quartz tiles across all evaluated metrics. This symmetry suggests that the AID-Stick inspects different tile materials uniformly. Minor differences in precision and recall are shown to have a minimal impact on the tool's overall effectiveness.
Equipment integration and simplification:
All previous methodologies listed in table 3 require multiple instruments, complicating the setup and potentially introducing sources of error. In this study, we diverge from this path by incorporating the AID-Stick, a unifying device designed to consolidate the detection apparatus. This simplifies the experimental setup and offers a scalable solution suitable for various operational settings, from industrial environments to consumer applications.

Cost-reduction strategies:
A cost-effective model is presented in this work, in contrast with the high costs associated with the methods of the prior studies listed in table 3. This approach aims to lower entry barriers, particularly for small-to-medium enterprises and economically constrained markets, facilitating broader technology adoption.
Reduction of operational complexity:
Our methodology significantly reduces operational complexity, shifting toward user-centric design. This reduction allows for easier integration into manufacturing workflows, potentially transforming quality assurance in the tile industry.

Enhanced portability and adaptability:
The compact design of our proposed equipment, coupled with a high degree of adaptability, distinguishes our work from previous studies, which are often restricted by the spatial and material limitations of their respective apparatus. Its portability enables in-situ diagnostics, which is particularly valuable for field inspections and quality control at installation.

Innovations in embedded system inference:
Our research introduces on-MCU inference, a feature absent from all studies referenced in table 3. This innovation allows for real-time defect detection, which is crucial for production-line integration.
Accurate and realistic performance metrics:
While studies [7, 8] report 100% accuracy in laboratory conditions, our research shows a competitive accuracy of 97% for validation, 92.48% for testing, and a real-world measured accuracy of 81.25%. These figures represent a realistic performance in practical scenarios, a disclosure often omitted in academic literature.
Convenience and usability:
Finally, our research achieves an optimal convenience factor, improving upon previous studies' 'Inadequate' ratings. This encompasses ease of deployment and operation, yielding outputs that are readily interpretable, which is vital for end-user acceptance.

Model size of traditional CNN versus TinyML-based CNN:
In the field of CNNs, there exists significant variation in model sizes, which span from relatively compact models of several hundred kilobytes to extensive architectures that exceed a gigabyte. This variation primarily stems from differences in architectural complexity and the intended applications of these models. For instance, efficiency-optimized CNNs typically occupy less than a few megabytes, indicative of streamlined designs. In contrast, traditional, more complex CNN models are often characterized by their larger size, sometimes requiring hundreds of megabytes of memory.
In stark contrast stands the domain of TinyML, which is distinguished by its focus on ultra-compact model architectures. These models, usually under a few megabytes in size, are crafted for deployment in environments with stringent power and resource constraints, such as microcontrollers. Our study concentrates on this segment, providing a comparative analysis between quantized and unoptimized TinyML-based CNN models.
In summary, CNNs are highly effective in image recognition, extracting features from images through convolutional operations. They contrast with RNNs, which are better suited to semantic recognition, analyzing sentence contexts, and extracting key phrases, and are commonly used in continuous speech recognition. In scenarios like analyzing brief knocking sounds, removing noise from the audio input and applying feature-extraction methods such as MFCC are essential; the processed data then resembles graphical features, making a CNN the appropriate model for recognition. Regarding neural network architectures for MCUs, two characteristics are critical: limited processing power and memory, and the need for energy efficiency, especially in battery-powered devices. CNNs are preferred for sound classification on MCUs because they extract features efficiently from time-frequency representations such as spectrograms, have a lower memory footprint than RNNs, and are computationally efficient enough for an MCU's limited processing capabilities. RNNs, while excellent for sequential data like speech, demand higher computational power and memory, making them less suitable for MCUs; simple feedforward networks, lacking sophisticated feature extraction and the ability to process temporal or spatial patterns, are also less effective for complex tasks such as sound classification. CNNs therefore offer a balanced solution for the unique constraints and requirements of sound classification on MCUs.
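As a rough illustration of the front end described above, the sketch below turns a synthetic tap-like signal into a magnitude spectrogram, the time-frequency "image" a small CNN consumes. The frame length, hop size, sample rate, and tone parameters are illustrative assumptions, not the AID-Stick's actual settings.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Slice a 1-D signal into overlapping frames."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: Hann-windowed frames -> |rFFT| per frame."""
    frames = frame_signal(signal, frame_len, hop) * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, frame_len // 2 + 1)

# Illustrative 1 s "tap" at 16 kHz: a decaying 1 kHz tone (hypothetical parameters).
sr = 16_000
t = np.arange(sr) / sr
tap = np.exp(-8 * t) * np.sin(2 * np.pi * 1000 * t)
spec = spectrogram(tap)
print(spec.shape)  # the 2-D time-frequency array a CNN would take as input
```

MFE and MFCC would add Mel filterbank averaging and a discrete cosine transform, respectively, on top of this same framed-FFT front end.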
In the context of limited MCU performance, using a CNN as the model architecture for tile defect recognition can maintain decent accuracy on low-efficiency hardware. However, due to hardware limitations, the CNN model has a small number of layers, and the input image size cannot be too large, which may lead to misjudgments in situations with noise or varying degrees of defects. Additionally, constrained by memory space, the model cannot recognize continuous tapping sounds and can only perform defect detection in a single-tap recognition mode.
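To see why such a constrained CNN fits in kilobytes of MCU memory, consider a back-of-the-envelope parameter count for a hypothetical two-convolution-layer classifier. The layer sizes below are illustrative assumptions, not the AID-Stick's actual architecture.

```python
# Parameter counts for standard conv/dense layers (weights + biases).
def conv2d_params(in_ch, out_ch, k):
    return in_ch * out_ch * k * k + out_ch

def dense_params(in_f, out_f):
    return in_f * out_f + out_f

# Hypothetical tiny CNN: two 3x3 conv layers, then a 2-class dense head
# fed by a flattened 16 x 6 x 6 feature map.
params = (
    conv2d_params(1, 8, 3)         # conv1: 1 -> 8 channels
    + conv2d_params(8, 16, 3)      # conv2: 8 -> 16 channels
    + dense_params(16 * 6 * 6, 2)  # classifier: hollow vs intact
)
print(params, "parameters")
print(f"int8 weights: ~{params / 1024:.1f} KB; float32: ~{params * 4 / 1024:.1f} KB")
```

Even with activation buffers added, a model of this scale sits comfortably within the tens-of-kilobytes budgets reported later for the AID-Stick.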
In conclusion, our investigation represents a multidimensional advancement in tile defect detection, addressing the limitations of previous methods and proposing a practical, user-friendly industrial solution.

Performance of deep learning audio classification models in MCU
A prior study addressed deploying deep learning models on energy-efficient MCUs with limited resources, typically under 1 MB of Flash and 512 KB of RAM [43]. The resulting model, Micro-ACDNet, developed for real-world application, requires only 303 KB of RAM for intermediate calculations, significantly below the MCU's RAM capacity. With manual optimization, this can be reduced further to under 200 KB, making Micro-ACDNet deployable on even smaller MCUs (e.g. the Nordic nRF52840 SoC with 256 KB RAM and 1 MB Flash). Importantly, Micro-ACDNet needs only 500 KB of Flash memory for model storage (see table 4), highlighting its suitability for resource-constrained MCUs.
Subsequently, the same study employed 8-bit post-training quantization via TensorFlow Lite Micro to reduce the ACDNet model size for MCU deployment [43]. This quantization process, however, decreased the model's accuracy to 71.00%. To assess the impact of different frameworks on quantization efficacy, the study also explored PyTorch, achieving a notably higher accuracy of 81.50% with an 8-bit quantized model of equivalent size. Despite this, the actual deployment utilized TensorFlow Lite Micro, guided by implementation considerations. The study highlights the possibility of deploying an alternative ACDNet version on standard MCUs at 81.5% accuracy, as detailed in table 4, indicating the potential for improved performance with different quantization approaches in resource-limited environments.
In our study, we found that the quantized TinyML-based CNN model is exceptionally efficient for audio defect recognition, utilizing only 16.8 KB of RAM and 34.7 KB of flash memory while maintaining a remarkably low latency of 1 ms. In comparison, the unoptimized TinyML model variant requires more resources, consuming 53.5 KB of RAM and 38.6 KB of flash memory, and exhibits an increased latency of 16 ms. These comparisons highlight the crucial role of model optimization in TinyML applications, particularly in minimizing memory usage and reducing latency, which is essential for optimal performance in resource-limited environments. Furthermore, a notable validation accuracy of 97% and a real-world tile-tapping test accuracy of 81.25% showcase a promising improvement over traditional tile defect detection methods.
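The resource gap between the quantized and unoptimized models reflects, in part, the roughly fourfold storage saving of storing weights as 8-bit integers instead of 32-bit floats. The sketch below illustrates a generic affine int8 quantize/dequantize round trip on a random weight tensor; it is not the quantization pipeline used in this study or in [43].

```python
import numpy as np

def quantize_int8(w):
    """Affine int8 quantization of a weight tensor (generic sketch)."""
    scale = (w.max() - w.min()) / 255.0
    zero_point = np.round(-w.min() / scale) - 128
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=1024).astype(np.float32)  # fake layer weights
q, s, z = quantize_int8(w)
w_hat = dequantize(q, s, z)
print("max abs round-trip error:", float(np.max(np.abs(w - w_hat))))
print(f"storage: {w.nbytes} B float32 -> {q.nbytes} B int8")
```

The round-trip error stays within about one quantization step, which is why 8-bit models usually lose only a little accuracy while shrinking memory and speeding up integer-only inference.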

Future research directions in tile defect detection
This research presents several approaches in tile defect detection and outlines several key areas for advancement:

Data volume expansion:
Expanding the dataset used for training the recognition algorithms has the potential to enhance their precision, leading to more accurate detection capabilities.

Granular void ratio categorization:
Refining the classification of void ratios into specific categories (e.g. <33% for mild hollowness, <66% for moderate hollowness, and up to 100% for severe hollowness) could facilitate more immediate and precise identification of critical maintenance areas.
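The proposed banding could be sketched as a simple lookup. The band edges follow the percentages above, while expressing the void ratio as a 0-1 fraction and the exact boundary handling are assumptions.

```python
def hollowness_category(void_ratio):
    """Map a detected void ratio (fraction, 0-1) to the proposed severity bands.

    Band edges follow the categories suggested above; whether a boundary
    value such as 0.33 counts as mild or moderate is an assumption.
    """
    if void_ratio < 0.33:
        return "mild"
    if void_ratio < 0.66:
        return "moderate"
    return "severe"

for ratio in (0.10, 0.45, 0.90):
    print(f"void ratio {ratio:.0%} -> {hollowness_category(ratio)} hollowness")
```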
Enhanced labeling and data collection: Incorporating a broader range of samples with diverse void ratios and hollow area proportions would improve the model's adaptability and precision in diverse conditions.

Indicative lighting optimization:
The current implementation of the sounding rod signals red upon detecting a void and remains unlit for solid materials and background noise. Optimizing the rod to provide distinct lighting indications (no light for background noise, green for solid materials, and red for hollow spaces) could enhance the clarity of results for users.
Inertial measurement unit (IMU) integration for impact precision:
Observations indicate that the force, angle, and speed of tapping considerably influence acoustic generation and the subsequent recognition results. Incorporating an IMU could enable more precise control over the dynamics of tapping (force, angle, speed), thus improving the accuracy of acoustic generation and recognition.

Force variation data collection:
Collecting data reflecting various levels of impact force would address inconsistencies caused by user-applied force variations during testing, increasing precision and reducing errors.

Material diversity in sample training:
Training on samples made of different materials would enable the system to recognize defects in various tile types.

Automated tapping mechanisms:
Exploring automated mechanisms for tapping, such as small vehicles or robots, could reduce operator-induced variability and provide more consistent results.

Conclusion and future work
This research has significantly contributed to the field of detecting subsurface hollow defects in floor tiles, focusing on enhancing the accuracy and reliability of the inspection process. The introduction of the AID-Stick marks a methodological advancement in tile inspection, utilizing a combination of spectrogram, MFE, and MFCC alongside CNNs and TinyML techniques. This innovative approach has demonstrated its effectiveness in differentiating between intact and defective tiles, thereby offering a new dimension to non-destructive testing methods in construction.
The investigation into acoustic feature-extraction models (spectrogram, MFE, and MFCC) provided valuable insights into the acoustic characteristics of floor tiles. This study showcased the AID-Stick's adaptability across varying tile materials, particularly polished and quartz tiles, underlining its potential for widespread application in the construction industry.
While this research has laid a foundation for future explorations in AI-driven tools for construction material inspection, there remain opportunities for further enhancements. Expanding the training dataset, refining classification systems, and integrating advanced technologies could further optimize the tool's effectiveness in diverse operational environments.
In conclusion, this research introduces a novel approach to tile defect detection that can benefit various stakeholders in the construction industry.The progress made here sets the stage for ongoing research to further refine and expand the capabilities of AI-based diagnostic tools in construction material inspection.

Figure 1 .
Figure 1. System architecture for detecting hollow defects in floor tiles using on-device TinyML.

Figure 2 .
Figure 2. Flowchart of feature extraction process for MFE and MFCC.

Figure 3 .
Figure 3. Acoustic analysis of floor tiles: a comparative visualization of the spectrogram (a) and (d), MFE (b) and (e), and MFCC (c) and (f) for tiles in intact versus defective conditions.

Figure 4 .
Figure 4. CNN fundamentals illustrated: (a) convolution layer, applying filters to input data to capture pertinent features within small receptive fields. (b) Pooling layer, reducing the spatial dimensions of feature maps for computational efficiency.

Figure 5 .
Figure 5. Layered structure of the CNN model: starting with data input (right), followed by feature learning through convolution and pooling layers (middle), and culminating in a classification output (left).

Figure 6 .
Figure 6. Flowchart of the tile condition inspection algorithm.

Figure 7 .
Figure 7. The six steps involved in fabricating floor tile specimens.

Figure 8 .
Figure 8. AID-Stick and its hardware components.

Figure 9
Figure 9 illustrates the training dynamics of a machine-learning model analyzed over 30 epochs, employing three different audio features: spectrogram, MFE, and MFCC. The analysis reveals distinct learning behaviors for each feature. For the spectrogram feature (figures 9(a) and (b)), both the loss and accuracy curves show a rapid initial learning phase followed by stabilization. This feature demonstrates effective generalization, evidenced by validation metrics closely tracking training metrics. However, a minor discrepancy between training and validation accuracy suggests slight overfitting. In the case of the MFE feature (figures 9(c) and (d)), a more gradual reduction in loss is observed, alongside a smoother but slower increase in accuracy. This indicates a gradual and steady learning curve, emphasizing consistent and reliable progress over time. The analysis of the MFCC feature (figures 9(e) and (f)) presents the most closely convergent loss curves, signifying its efficacy in providing informative data representation for the model. Nevertheless, a slight divergence in accuracy towards the end of training could be due to the initial overfitting stages. Overall, the model displays a promising learning trajectory across all features. While there are no overt signs of significant overfitting, except for minor divergences, it is inferred that further tuning could refine the model's performance, possibly through adjustments to the model or by applying advanced regularization techniques. This research employed a supervised machine-learning approach for classifying floor tiles into hollow and solid categories. The effectiveness of our classification model was evaluated using a confusion matrix, as outlined in table 2. This matrix is instrumental in calculating four critical performance metrics.

Table 2 .
Confusion matrix.

Figure 10 .
Figure 10. Comparative accuracy results for each feature-extraction method across validation and testing sets.

Figure 11 .
Figure 11. Performance indicators of the spectrogram model for all three types of floor-tile specimens (a), (b), (c) across all three threshold levels: (left) precision-recall curves; (right) accuracy-F1 score-threshold curves.

Figure 12 .
Figure 12. Performance indicators of the MFE model for all three types of floor-tile specimens (a), (b), (c) across all three threshold levels: (left) precision-recall curves; (right) accuracy-F1 score-threshold curves.

Figure 13 .
Figure 13. Performance indicators of the MFCC model for all three types of floor-tile specimens (a), (b), (c) across all three threshold levels: (left) precision-recall curves; (right) accuracy-F1 score-threshold curves.

Table 1 .
Details of the dataset.

Table 3 .
Comparative analysis of tile defect detection studies: methodologies, materials, equipment, and performance metrics across various studies and this work.

Table 4 .
Comparison of prior studies of deep learning deployment and performance on MCU for classification.