A Survey on Fingerprinting Technologies for Smartphones Based on Embedded Transducers

Smartphones are a vital technology: they improve our social interactions, provide us with a great deal of information, and bring forth the means to control various emerging technologies, like the numerous IoT devices that are controlled via smartphone apps. In this context, smartphone fingerprinting from sensor characteristics is a topic of high interest not only due to privacy implications or potential use in forensics investigations but also because of various applications in device authentication. In this work, we review existing approaches for smartphone fingerprinting based on internal components, focusing mostly on camera sensors, microphones, loudspeakers, and accelerometers. Other sensors, i.e., gyroscopes and magnetometers, are also accounted for, but they correspond to a smaller body of work. The output of these transducers, which convert one type of energy into another, e.g., mechanical into electrical, leaks through various channels such as mobile apps and cloud services, while there is little user awareness of the privacy risks. Miniature physical imperfections from the manufacturing process make each such transducer unique. One of the main intentions of our study is to rank these sensors according to the accuracy they provide in identifying smartphones and to give a clear overview of the amount of research that each of these components has triggered so far. We review the features which can be extracted from each type of data and the classification algorithms that have been used. Last but not least, we also point out publicly available data sets which can serve for future investigations.

also led to an increase in smartphone usage [1]. The online market and consumer data platform Statista places the number of mobile devices in 2022 at 15.96 billion, expecting 18.22 billion by 2025, out of which 20% will have 5G connectivity [2]. As expected, in this context, smartphone security and user privacy are continuously gaining importance. Last but not least, smartphones are a key technology in controlling various Internet of Things (IoT) devices that improve the quality of our life and productivity in smart homes or offices.
Nowadays, smartphones have overwhelming computational power and memory resources and are equipped with many sensors, such as camera sensors, microphones, accelerometers, magnetometers, gyroscopes, or radio frequency sensors (e.g., NFC, UWB, GPS, etc.), but also with actuators such as loudspeakers. Generally speaking, these can be referred to as transducers, i.e., devices that convert one type of energy into another: electrical into mechanical (in the case of loudspeakers) or the reverse (in the case of microphones), etc. Each transducer has unique characteristics, caused by imperfections in the manufacturing process, which can be used for fingerprinting the mobile device. However, device fingerprinting based on the unique features of the embedded transducers is not always straightforward due to various environmental conditions such as noise, temperature, etc., which can affect the fingerprint. This makes the deployment of noninteractive device authentication mechanisms, based on such fingerprints, more challenging. Consequently, there are a lot of papers addressing smartphone fingerprinting. In this survey, we analyze existing works targeting each type of transducer and we outline the various signal features that are used, the clustering methodology, and the results, also pointing out the number of devices that were used and the publicly released data sets.
Brief Depiction of Smartphone Transducers: In Fig. 1, we show a disassembled Samsung Galaxy J5, which is a commonly used mid-range smartphone. We used this device to illustrate various sensors, i.e., front/back camera, microphone, and accelerometer, and also the loudspeaker, which is technically an actuator that converts electrical energy into sound. As mentioned, both sensors and actuators, as devices that convert one form of energy into another, can be referred to as transducers. The Samsung Galaxy J5 was also used to extract data for the specific needs of this article in order to give a more accurate depiction of the statistical properties of the fingerprints. We extracted data from its camera sensors, loudspeakers, and accelerometers, and we were forced to use a Samsung Galaxy S6 for microphone data since the J5 did not have a replaceable microphone (the microphone could be replaced only together with the smartphone mainboard). In the following sections, as a practical example, to determine the distance between fingerprints collected from identical devices (also referred to as the intradistance), we use either five identical Samsung J5 phones or, alternatively, we couple different transducers to the same device. Further, to determine the distance between fingerprints collected from different devices (also referred to as the interdistance), we use several smartphones from different manufacturers.
Distribution of Works by Topic: Generally speaking, there are two main types of fingerprints: 1) software-based fingerprints and 2) hardware-based fingerprints. In this work, we are concerned with the latter, i.e., hardware-based fingerprints. This is because they use characteristics of the transducers embedded on the circuit board, which are more difficult to replace. On the one hand, this makes the fingerprint harder to forge; on the other hand, it also creates higher privacy risks, as such fingerprints can carry over between different mobile applications, use cases, and even operating system reinstalls.
Many papers published in recent years address mobile device identification based on sensor characteristics. In this work, we survey more than 130 papers. To give an accurate figure, in Table I, we list all sensor fingerprints that have been exploited so far and the number of papers covered by this survey (papers using multiple sensors are counted once for each sensor). In Fig. 2, we give an overview of the analyzed papers. Almost half of them discuss device identification based on the camera sensor, 20% of them discuss smartphone identification based on the microphone, and only 5% of them discuss smartphone fingerprinting based on the loudspeaker. About 4% of the works discuss fingerprinting based on accelerometer sensors and 8% discuss device fingerprinting based on multiple sensors, i.e., accelerometers, magnetometers, and gyroscopes. Last but not least, 14% of the analyzed papers discuss device fingerprinting based on other, less commonly used sensors, e.g., magnetometers and gyroscopes, or even battery consumption, etc.
Several surveys on smartphone fingerprinting have already been published. A study published in 2015, regarding mobile phone fingerprinting, discusses the use of the network layer, i.e., IP and Internet control message protocol (ICMP) packets, as well as the application layer, i.e., browsers or mobile apps [3]. The work also mentions some countermeasures against fingerprinting. A later work, from 2017, addresses smartphone identification based on physical fingerprints [4]. The authors survey distinct techniques for fingerprinting, starting with techniques based on signals emitted by smartphone components and processed by external systems, i.e., radio frequency, medium access control (MAC), display, and clock differences; then they pursue techniques based on sensor identification, i.e., camera sensors, microphones, and magnetometers. Finally, the authors discuss some risks and countermeasures for smartphone fingerprinting. In the same year, i.e., 2017, a study regarding fingerprinting algorithms, e.g., ratio and relational distance, K-nearest neighbor (KNN), thresholding, Gabor filters, etc., was published in [5]. A short study from 2019 analyzes research papers which focus on smartphone identification based on accelerometers, cameras, loudspeakers, and wireless transmitters [6]. One year later, in 2020, another study dedicated to smartphone fingerprinting was published in [7], investigating device identification based on various fingerprints, i.e., IMEI, MAC, serial numbers, or based on internal circuits, i.e., sensors and memory defects. Several techniques used for identification, i.e., machine learning, physical unclonable functions (PUFs), and sensor calibration, are discussed. More recently, in 2021, a survey of device fingerprinting focusing on IoT devices was published in [8]. The authors discuss data sources, techniques for device identification, application scenarios, and data sets. In Table II, we briefly compare the previous surveys.
Compared to these, our work is more focused on smartphone fingerprinting and also brings the existing data sets into the discussion. We also provide a brief experimental analysis to outline the differences between the most commonly employed sensors.
Roadmap to Our Work: In Fig. 3, we provide a graphical overview of smartphone fingerprinting technologies which can be regarded as a roadmap for the current survey. Our work is organized as follows. In Section II, we discuss the operation principles for smartphone transducers, the most commonly used features and classification techniques, performance metrics, and some application scenarios. In Section III, we briefly present some concrete experimental data for cameras, microphones, loudspeakers, and accelerometers. These topics can be retrieved from the subsections on the left side of Fig. 3. Then, the upper side of Fig. 3 shows the structure of our work with respect to smartphone transducers: Section IV addresses cameras, Section V addresses microphones, Section VI addresses loudspeakers, and Section VII addresses accelerometers. Next, in Section VIII, we survey some papers which propose device identification based on the mixed use of the previous sensors, possibly with other sensors as well. In Section IX, we discuss some countermeasures and the stability of fingerprints in front of external factors. Finally, in Section X, we conclude our work.

II. BACKGROUND
In this section, we present the sensor fingerprinting procedure, starting from the operation principles of sensors, then discuss the most common techniques for feature extraction, the classification algorithms, and metrics. Last but not least, we present some application scenarios.

A. Operation Principles for Smartphone Transducers
In what follows, we briefly discuss the operation principle for the aforementioned smartphone transducers, i.e., camera sensors, microphones, loudspeakers, and accelerometers.
1) Operation Principle of Camera Sensors: There are two commonly used types of sensors: 1) charge-coupled device (CCD) and 2) complementary metal-oxide-semiconductor (CMOS) sensors. CCD sensors are used for digital cameras and systems which need to acquire high-quality images. CMOS sensors are smaller and consume less power, so they are typically used in small-size devices, e.g., smartphones, laptops, IoT devices, etc. [10]. In Fig. 4, we depict the operation principle of a CMOS sensor. The light captured by the lens goes into a Bayer filter array which separates the light into three components: red, green, and blue. Half of the filter elements are green because the human eye is more sensitive to green, while the remaining elements are for red and blue. Finally, the light is transformed into an electrical signal by the CMOS sensor.
[Figure caption: Operation principle of MEMS microphone (redrawn based on https://www.digikey.be/nl/articles/how-mems-microphones-aid-sounddetection).]
2) Operation Principle of Microphones: Smartphones are equipped with microelectromechanical systems (MEMS) microphones due to their low power consumption, low cost, and small dimensions. In Fig. 5, we show the components of a MEMS microphone. The microphone is enclosed in a case with a small opening that facilitates the reception of sound. Inside the case, there are two main components: 1) a transducer used to convert the acoustic signal into an electrical signal and 2) an application-specific integrated circuit (ASIC) which amplifies the signal received from the transducer and implements the analog-to-digital converter (ADC) functionalities. The transducer is connected to the ASIC with a gold wire.
To improve the quality of the received sound, a special sealing material is used to hermetically isolate the microphone. The printed circuit board (PCB) of the phone is depicted on the back of the sealing material.
3) Operation Principle of Loudspeakers: In Fig. 6, we depict the main components of a smartphone MEMS loudspeaker. The loudspeaker is covered by a sieve which protects the diaphragm. The diaphragm is usually built from plastic (alternatively, it can be built from paper or aluminium) and is allowed to move by the suspension, which is made from a flexible material and anchors it to the case (also called the basket). After the diaphragm, a voice coil is present, which is fixed in the loudspeaker's case. Behind it, there is a pole and a magnet which make the voice coil vibrate, driven by the electromagnetic force, and so the diaphragm generates sound.
4) Operation Principle of Accelerometers: In Fig. 7, we depict the operation principle of MEMS accelerometers. The accelerometer contains a moving beam structure which has a fixed solid plane and a mass on springs. When an acceleration is applied, the mass moves and the capacitance between the fixed plane and the moving beam changes.

B. Frequently Used Features for Device Fingerprinting
We now give a brief summary of the most common techniques for feature extraction that facilitate smartphone identification from data produced by the aforementioned transducers.  [11], [12], [13], [14], [15], [16], [17], [18], and [19]. An exhaustive list of the features would be out of scope.

2) Features Extracted From Camera-Collected Images:
a) Fixed-pattern noise (FPN) is the noise generated by the sensor which makes some pixels brighter than the average intensity. Based on the image type, there are two types of FPN: i) dark signal nonuniformity (DSNU) [20], which appears in the absence of light (dark images), and ii) photo response nonuniformity (PRNU) [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], which appears when light is present. PRNU is the most used technique for camera identification [38].
b) The discrete cosine transform (DCT) is a common technique used to convert an image from the spatial domain to the frequency domain. In JPEG compression, the DCT is applied on 8×8 image blocks, while for decompression the inverse DCT (IDCT) is used [39]. This transformation can be used with both DSNU and PRNU [40], [41].
c) Local binary pattern (LBP) and local phase quantization (LPQ) are two other features commonly used for processing images in the scope of camera identification [42]. LBP is a local texture pattern descriptor for images. The image is split into 3×3 blocks and the center pixel is used as the threshold for the neighboring pixels [43], [44]. LPQ is a descriptor based on the blur invariance of the Fourier phase spectrum extracted from images.
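To make the LBP thresholding step concrete, the following is a minimal sketch in plain Python. It assumes a grayscale image stored as a list of rows, a sliding 3×3 neighborhood, a clockwise bit order starting at the top-left neighbor, and a "greater than or equal" comparison; these conventions and the function names are our own illustrative choices and are not taken from the surveyed works.

```python
def lbp_code(block):
    """Return the 8-bit LBP code of a 3x3 block given as a list of 3 rows."""
    center = block[1][1]
    # Neighbors in clockwise order starting at the top-left pixel
    # (an assumed convention; other orderings are equally valid).
    neighbors = [block[0][0], block[0][1], block[0][2],
                 block[1][2], block[2][2], block[2][1],
                 block[2][0], block[1][0]]
    code = 0
    for bit, value in enumerate(neighbors):
        # The center pixel acts as the threshold for each neighbor.
        if value >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Histogram (256 bins) of LBP codes computed over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            block = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(block)] += 1
    return hist
```

The resulting histogram is the kind of fixed-length feature vector that can then be passed to a classifier.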

3) Features Extracted From Audio Signals:
a) The power spectrum, i.e., the frequency-amplitude pairs obtained by applying the Fourier transform, is the most basic method used to extract the spectral estimates of the audio signal. Such features are commonly used for audio signals in the scope of loudspeaker and microphone identification [45].
b) Mel-frequency cepstral coefficients (MFCCs) are another commonly used feature for audio signals. This technique is used in several research works to extract features from human speech in the scope of microphone identification, since these coefficients are frequently employed in speech recognition [46], [47], [48], [49], [50], [51], [52]. They have also been used for loudspeaker identification [9], [53], [54]. To extract the MFCCs, the audio signal is split into windows and, for each such window, the fast Fourier transform (FFT) is computed. The Mel filter is applied to the result and the logarithm of each Mel filter output is computed, to which the DCT is finally applied, giving the MFCCs.
c) Linear frequency cepstral coefficients (LFCCs) are obtained by a technique similar to MFCC, except that a linear filter is used instead of the Mel filter [48]. Linear prediction cepstral coefficients (LPCCs) and perceptual linear prediction coefficients (PLPCs) are also used for human speech analysis [46].
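The MFCC steps described above (window, Fourier transform, Mel filter, logarithm, DCT) can be sketched for a single window as follows. This is a simplified illustration and not the implementation of any surveyed work: it uses a naive DFT instead of an FFT, omits pre-emphasis and window functions, and assumes the common Mel mapping 2595·log10(1 + f/700). The filter count, coefficient count, and all function names are hypothetical choices.

```python
import cmath
import math


def hz_to_mel(f):
    # Common Mel-scale convention (an assumption; variants exist).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum of one window (bins 0..N/2)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bins = [f * n_fft / sample_rate for f in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if left < k <= center:
                row.append((k - left) / (center - left))
            elif center < k < right:
                row.append((right - k) / (right - center))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values):
    """Unnormalized DCT-II."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n)]


def mfcc(frame, sample_rate, num_filters=12, num_coeffs=8):
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
    # A small floor avoids log(0) for filters that cover no DFT bin.
    log_energies = [math.log(e + 1e-10) for e in energies]
    return dct2(log_energies)[:num_coeffs]
```

In practice, the coefficients of all windows of a recording would be stacked (or summarized) to form the feature vector of a device.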

C. Metrics and Classification Techniques
In what follows, we give a brief summary of the most frequently used classification techniques for fingerprinting each of the previously mentioned smartphone components. Starting from some basic metrics up to deep learning, several approaches have been considered.
1) The Euclidean distance is used in [55] for loudspeaker identification. It is computed as the square root of the sum of squared differences between two samples:
$$ d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}, $$
where $a$ and $b$ are the signals from two devices expressed as vectors of $n$ samples, i.e., $a_i$ is the ith sample from signal $a$, and $b_i$ is the ith sample from signal $b$.
2) The Hamming distance defines the number of indices at which the corresponding symbols are distinct. For binary symbols, it is given as $d(s, t) = \sum_{i=1}^{n} |s_i - t_i|$, where $s$ and $t$ are signals (vectors) from two devices, $s_i$ is the ith sample from signal $s$, and $t_i$ is the ith sample from signal $t$.
3) The Mahalanobis distance is the distance between a distribution and a sampling point. It is given by
$$ d = \sqrt{(y - \mu)^{T}\, \mathrm{cov}^{-1}\, (y - \mu)}, $$
where $y$ is a vector, $\mu$ is the mean value, and $\mathrm{cov}$ is the covariance.
4) The intra and interdistances are useful in separating between devices based on established distance metrics, such as the Euclidean or Hamming distance.
a) The intrachip distance is calculated as the arithmetic mean of the distances between fingerprints extracted at different times from the same chip. While this metric can be computed for any fingerprint, it is most commonly used to evaluate PUFs, such as those based on CMOS sensors [56], [57], where the intrachip Hamming distance indicates the average number of flipped bits among the PUFs from different images. The bit error rate (BER) and the reliability can also be calculated from the intrachip Hamming distances. Following [58], the intrachip Hamming distance is
$$ \mathrm{HD}_{\mathrm{intra}} = \frac{1}{m} \sum_{j=1}^{m} \frac{\mathrm{HD}(R_i, R_{i,j})}{n} \times 100\%, $$
where $R_i$ is the correct PUF calculated from the average of all PUFs of the evaluated chip, $R_{i,j}$ is the PUF of the jth image, $n$ is the number of bits, and $m$ is the number of images.
b) The interchip distance describes the uniqueness of a PUF and is calculated as the Hamming distance between the PUFs of two distinct chips. Again, according to [58], it is obtained by averaging the normalized Hamming distance $\mathrm{HD}(R_u, R_v)/n$ over pairs of distinct chips, where $R_u$ is the PUF of the uth chip, $R_v$ is the PUF of the vth chip, $n$ is the number of bits, and $m$ is the number of images. The intra and interchip distances are used in various works, e.g., [20], [59], [60], and [61].
5) Thresholding is a known approach for image segmentation, i.e., to convert a gray-scale image into a binary one. It is also used for the classification of various sensor data. In the case of smartphone sensor fingerprinting, thresholding is mostly used within the scope of camera identification, both for feature extraction and as a stand-alone method for classification [20], [28], [36], [56], [57], [60], [62].
This approach is also used for classification when other signals are involved, such as accelerometers [19], or for various device properties [63].
6) Correlation, i.e., corr(x, y), is a function which describes a statistical relationship between two distinct variables x and y. It is computed as $\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$, where $\mathrm{cov}(x, y)$ is the covariance of x and y, $\sigma_x$ is the standard deviation of x, and $\sigma_y$ is the standard deviation of y. The correlation is used by many works for fingerprinting smartphones, such as [21], [22], [23], [24], [25], [26], [27], [62], [64], [65], [66], [67], [68], [69], and [70].
7) Classical Machine Learning Approaches:
a) Support vector machine (SVM) is a supervised machine learning algorithm which can be used to train binary or multiclass models. SVM is a common classification algorithm and, based on the literature we surveyed, appears to be more commonly used for camera sensor identification [30], [33], [43], [44], [71], [72], [73], [74], [75], [76], [77] and microphone identification [47], [48], [51], [78], [79], [80], [81], [82], [83], [84], [85], [86], [87], [88]. Occasionally, it was also used for other transducers, e.g., accelerometers [12], [13] or loudspeakers [54].
b) KNN is another commonly used supervised classification algorithm which is employed in the literature for smartphone identification based on various components, e.g., microphones [78], [79], [83], [84], [87], loudspeakers [9], [53], accelerometers [12], [13], etc. KNN usually employs the Euclidean distance between the training samples and the test samples.
c) Gaussian mixture model (GMM) is a probability function defined as a sum of Gaussian component densities. GMM is recommended for speech recognition tasks. For device sensor fingerprinting, GMM was used for microphone [46], [48], [49], [52] and loudspeaker-based identification [9], [53]. It seems to be particularly useful when the underlying signal is human speech.
d) Gaussian supervector (GSV) is an algorithm based on GMM which concatenates all the means of the features from each Gaussian component into a supervector [89]. GSV was used for microphone identification based on human speech [82], [85].
e) Random forest (RF) is an ensemble classifier algorithm that can employ different methods for classification, including AdaBoost learners, Bagged Trees, Subspace Discriminant, RUSBoost Trees, Subspace KNN, and GentleBoost. RF was used for accelerometer identification [12], [13], camera identification [40], [76], [90], loudspeaker identification [54], and smartphone recognition based on multiple sensors [16], [17], etc.
f) Decision tree is another supervised machine learning algorithm in which the data is structured as a tree whose internal nodes store the features from the data sets. Branches contain the decision rules and leaf nodes, which are the end nodes, represent the outputs. This technique was used for smartphone identification based on magnetometer [18], gyroscope [91], multiple sensors [14], [15], [92], etc.
g) Linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) are supervised machine learning algorithms based on the Gaussian distribution. LDA creates linear boundaries between classes, while QDA creates nonlinear (quadratic) boundaries between classes. LDA was used for microphone identification [93], smartphone identification based on wireless charging [14], and smartphone identification based on magnetic induction emitted by the CPU [94]. QDA was used for smartphone recognition based on accelerometer and gyroscope data [14], [15].
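As an illustration of how the simplest of these metrics and classifiers fit together, the sketch below implements the Euclidean and Hamming distances, a normalized intrachip Hamming distance (the average fraction of flipped bits, expressed in percent, which is one common way to write it), and a small KNN classifier with majority voting. The function names and the toy data in the usage note are hypothetical and are not taken from any surveyed paper.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared differences between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(s, t):
    """Number of positions at which the two sequences differ."""
    return sum(1 for x, y in zip(s, t) if x != y)


def intra_chip_distance(reference, responses):
    """Average fraction of flipped bits (in percent) between a reference
    bit string and repeated responses of the same chip."""
    n = len(reference)
    total = sum(hamming(reference, r) / n for r in responses)
    return 100.0 * total / len(responses)


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to the query.

    train is a list of (feature_vector, device_label) pairs.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

For example, with training vectors clustered around (0, 0) for a device "A" and around (1, 1) for a device "B", `knn_predict` assigns a query near (1, 1) to "B".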

D. Commonly Used Performance Criteria for Classifiers
We now give an overview of common performance metrics used in the literature. The first metrics are so commonly used that we do not point to specific papers that use them. In the next section, we will give some concrete results corresponding to these metrics. Other metrics which are rarely used include the purity [113] and the adjusted Rand index (ARI) [22], [113], [114].

E. Application Scenarios
There are many areas that can benefit from smartphone fingerprinting technologies, including device authentication, various day-by-day applications, and even forensics investigations. We discuss each of them next.
1) Authentication: Device authentication and multifactor authentication based on a transducer fingerprint can minimize user interaction and reduce the vulnerabilities caused by weak security tokens, such as passwords. The unique fingerprint may act as one factor in user (or device) authentication, which is especially important for IoT applications where devices may not have a user interface or cannot be easily accessed (e.g., they are placed in an inconvenient location) while fast and secure authentication mechanisms are needed. There are various works which use device fingerprints in the scope of authentication, as we outline next.
Generic Device Authentication: The PUFs extracted from camera sensors are proposed for authentication by using PRNU patterns [23], the DSNU, or FPN [57]. Live streaming surveillance footage is used for authentication in [61]. Microphones and loudspeakers are used in [115] for smartphone identification by exploiting the frequency response of a speaker-microphone pair belonging to two wireless IoT devices (this offers an acoustic hardware fingerprint). Audio signals with frequencies between 4 and 20 kHz, in increments of 400 Hz, are emitted by one smartphone and recorded by another, while authentication relies on the correlation of the signals. Microphone fingerprints based on ambient sounds were also proposed for authentication [116]. Accelerometer fingerprints were proposed in a Web-based multifactor authentication scheme [19]. Some works have merged data from multiple sensors, such as accelerometer, gyroscope, and camera, for robust smartphone authentication [92]. Also, acceleration, the magnetic field, orientation, gyroscope sensors, rotation vector, gravity, and linear acceleration are used in [16] to extract smartphone fingerprints for authentication in the context of Web applications. The hardware fingerprint of IoT sensors has been used for secret-free authentication in [117]. Authentication schemes for smartphones and IoT devices were also recently surveyed in [118].
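A correlation-based acceptance test of the kind described for the speaker-microphone pair can be sketched as follows. The probe frequencies follow the 4 to 20 kHz range with 400 Hz increments mentioned above; the acceptance threshold of 0.9, the way the frequency response is represented (one amplitude per probe frequency), and the function names are assumptions made only for illustration and do not reproduce the protocol of [115].

```python
import math

# Probe tones from 4 kHz to 20 kHz in 400 Hz steps (41 frequencies).
PROBE_FREQS_HZ = list(range(4000, 20001, 400))


def pearson(x, y):
    """Pearson correlation: cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)


def accept(enrolled_response, measured_response, threshold=0.9):
    """Accept the device if its measured frequency response correlates
    strongly with the enrolled one (threshold is a hypothetical value)."""
    return pearson(enrolled_response, measured_response) >= threshold
```

Because the Pearson correlation is invariant to gain and offset, a response that is merely louder or quieter than the enrolled one is still accepted, which is one reason correlation is attractive for this task.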
Specific Environments for Authentication: Some works have been more specific regarding the exact area of application. One scenario of particular interest is the vehicular environment. In [45], smartphone fingerprinting is performed from data recorded by in-vehicle infotainment units. The smartphone emits a linear sweep between 20 Hz and 20 kHz while the infotainment unit records the sounds. Also, Anistoroaei et al. [119] proposed an in-vehicle authentication protocol between the smartphone and the infotainment unit. Specific acceleration patterns in various transportation environments have also been studied in the scope of device-to-device authentication [120].

2) Specific Applications for Sensor Data:
In what follows, we show some positive use cases of sensor data but we must emphasize that exposing this data adds privacy risks for users as well. Some works have considered activity or transportation mode recognition based on accelerometer patterns [121], [122], [123]. Besides activity recognition, the accelerometer and other sensors were used for daily life monitoring and health recommendations [124]. Driving style recognition and driver behavior classification [125], [126], [127], [128], [129] is another application from which car rental services or insurance companies may benefit. The accelerometer data has been used for road condition monitoring [130], real-time pothole detection [131], or gait recognition [132]. Data from motion sensors has been also proposed for theft detection [133]. IoT sensor fingerprints are also commonly used for detecting attacks, unauthorized firmware modifications or fault diagnosis [8]. Another application mentioned in a recent survey is sensor quality control [4].
Privacy Concerns: Smartphone fingerprints can be exploited for tracking users which is a serious privacy concern. Motion sensors, i.e., accelerometers, have been used for tracking users [11], tracking metro riders [134], and detecting activities from the metro station [135]. Other works discuss preventing privacy risks for distinct data, e.g., cameras [68] or loudspeakers [9], [55]. Smartphone operating systems are increasingly concerned with the exploitation of sensor data by apps for device fingerprinting and user tracking purposes. As a consequence, additional restrictions to accessing such (meta) data are being added.
3) Forensics Investigations: A complementary topic is forensics investigations. Microphones [86], [106] and cameras [33], [77], [99], [113] have been commonly discussed in the context of forensics investigations since they can be used for finding (suspected) criminals by recognizing their smartphones based on the sounds or images recorded in connection with the respective crime [136], [137]. Anti-forensics techniques have been discussed to falsify the source of audio signals by adding specific noise [51]. Another recently emerged topic is combating the dangerous effects of AI. Machine learning techniques are already being employed to create deepfake audio or video recordings. These applications use deep learning to create very realistic recordings [138]. This technology can be used to manipulate public opinion by creating fake news or to defame public persons, which endangers national security and can be used as a tool by organized crime.¹ Deepfake detection algorithms are (currently) not very efficient at combating these effects, but source camera identification can be used to improve the results [139], [140]. By using unique fingerprints extracted from cameras or microphones, deepfakes could potentially be mitigated by creating an end-to-end trust chain to the raw sensor data.

III. BRIEF COMPARATIVE ANALYSIS OF SENSOR DATA
To bring a clearer image on the quality of data retrieved from smartphone transducers, in this section we briefly present some concrete results. As an experimental basis, we compare data from five distinct and five identical smartphones.

A. Brief Experiments With Smartphone Camera Identification
We now evaluate the interdistances for five distinct devices (Samsung Galaxy S7, Samsung Galaxy A21s, Allview V1 Viper I, LG Optimus P700, and Samsung Galaxy J5) and the intradistances for five identical devices (Samsung Galaxy J5). We select only the green channel because it has more encoding power, i.e., there are two green pixels for every red and blue pixel, and filter each image using a wiener2 filter. To extract the DSNU from each image, we compute the difference between the original image and the filtered image. The resulting noise is used to compute the Euclidean distances between devices. To clarify the computation, the distance between two distinct images is computed as
$$ d(a, b) = \sqrt{\sum_{i=1}^{4458240} (a_i - b_i)^2}, $$
where $a_i$ and $b_i$ are the DSNU coefficients extracted by the DCT (see [141] for details). The 4 458 240 values correspond to the number of coefficients that can be extracted from a 1920×2322 pixel matrix.
¹ https://lionbridge.ai/articles/deepfakes-a-threat-to-individuals-andnational-security/
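The residual-based procedure described in this subsection (denoise, subtract, compare) can be sketched as follows. Note the assumptions: a simple 3×3 mean filter stands in for the Wiener filter, the DCT step is omitted so the distance is taken directly on the noise residuals, and the image is a single channel stored as a list of rows. The sketch therefore illustrates the idea and does not reproduce the exact processing used for the figures.

```python
import math


def mean_filter3(img):
    """3x3 mean filter with edge replication (a stand-in for a Wiener filter)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), h - 1)
                    cc = min(max(c + dc, 0), w - 1)
                    total += img[rr][cc]
            out[r][c] = total / 9.0
    return out


def noise_residual(img):
    """Difference between the original image and its filtered version."""
    smooth = mean_filter3(img)
    return [[p - s for p, s in zip(row, srow)]
            for row, srow in zip(img, smooth)]


def residual_distance(img_a, img_b):
    """Euclidean distance between the noise residuals of two images."""
    ra, rb = noise_residual(img_a), noise_residual(img_b)
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(ra, rb)
                         for x, y in zip(row_a, row_b)))
```

A pixel that is consistently brighter than its neighborhood in dark images survives the subtraction and thus contributes to the fingerprint, while smooth image content is largely removed.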

1) Distinct Smartphones:
We captured 50 dark images with each device. Since devices may have different resolutions, we consider only the top left corner of each image, leading to images of equal size, i.e., 1920×2322. In Fig. 8, we show the results as a heatmap (left) and numeric values (right). The values from the main diagonal are clearly much lower than the rest, which means that devices can be easily identified.
2) Identical Smartphones: In the case of identical devices, we used the data set from [141], which contains 50 dark pictures captured by five identical Galaxy J5 cameras. To compute the distance for a single smartphone, we split the data set into two distinct data sets, i.e., one with 25 pictures chosen randomly and another one with the remaining 25 pictures. In Fig. 9, we show the results as a heatmap (left) and numeric values (right). The values from the main diagonal are lower than the rest of the values, which means again that the devices can be identified correctly with ease.
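The construction of such a distance matrix, including the random split used for the main diagonal, can be sketched as follows. We assume here that each device is represented by a list of feature vectors, that off-diagonal entries average the distances over all cross-device pairs, and that diagonal entries average the distances between the two random halves of the same device; the function names and the fixed seed are illustrative choices.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise_distance(set_a, set_b):
    """Mean Euclidean distance over all pairs (one vector from each set)."""
    dists = [euclidean(a, b) for a in set_a for b in set_b]
    return sum(dists) / len(dists)


def distance_matrix(fingerprints, seed=0):
    """fingerprints maps a device name to a list of feature vectors.

    Returns the sorted device names and the matrix of mean distances.
    """
    rng = random.Random(seed)
    devices = sorted(fingerprints)
    halves = {}
    for dev in devices:
        samples = list(fingerprints[dev])
        rng.shuffle(samples)  # random split into two halves
        mid = len(samples) // 2
        halves[dev] = (samples[:mid], samples[mid:])
    matrix = []
    for d1 in devices:
        row = []
        for d2 in devices:
            if d1 == d2:
                row.append(mean_pairwise_distance(*halves[d1]))
            else:
                row.append(mean_pairwise_distance(fingerprints[d1],
                                                  fingerprints[d2]))
        matrix.append(row)
    return devices, matrix
```

Devices are separable under this metric when each diagonal entry is smaller than every other entry in its row, which is the property read off the heatmaps.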

B. Brief Experiments With Microphone Identification
Using the public data set from [93], we evaluate the interdistances for five distinct devices (Samsung Galaxy S7, Samsung Galaxy A21s, Allview V1 Viper I, LG Optimus P700, and Samsung Galaxy J5) and the intradistances for five identical devices (Samsung Galaxy S6 smartphones). We use the live recordings of hazard lights to separate between distinct devices and the prerecorded vehicle horn sound to separate between identical devices, according to the public data set from [93]. From each recorded sound, we extract the power spectrum, which is used to compute the mean of the Euclidean distances between devices. Each file contains 4096 samples which correspond to a frequency range between 0 and 22 050 Hz at a resolution of 5.384615 Hz (which results in 4096 sampling points). Therefore, the distance between two microphone samples is computed as
$$ d(a, b) = \sqrt{\sum_{i=1}^{4096} (a_i - b_i)^2}, $$
where $a_i$ and $b_i$ are the power spectrum coefficients (amplitudes) for the two microphones represented as real numbers (floating points). The values of these coefficients were usually in the range of 0 to 70 dB.
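The relation between the number of sampling points and the stated resolution can be checked with a few lines of code. The sketch assumes that the 4096 points are equally spaced and include both endpoints of the 0 to 22 050 Hz range, which is the reading under which the stated resolution of 5.384615 Hz is obtained; the function name is a hypothetical helper.

```python
def frequency_grid(f_min, f_max, num_points):
    """Equally spaced frequency grid that includes both endpoints."""
    step = (f_max - f_min) / (num_points - 1)
    return step, [f_min + i * step for i in range(num_points)]


step, grid = frequency_grid(0.0, 22050.0, 4096)
print(round(step, 6))  # resolution in Hz
```

With these values, the step equals 22 050 / 4095, which rounds to the resolution quoted above.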
1) Distinct Smartphones: The data set in [93] contains 500 measurements with distinct devices of hazard lights sound for which we compute the mean of the Euclidean distances (between each pair of smartphones). To compute the distances for a single smartphone, we split the data set into two distinct data sets each of 250 measurements selected randomly and extract the distances between the two. In Fig. 10, we show the results as a heatmap (left) and numeric values (right). The values from the main diagonal are lower than the rest of the values. While the differences are smaller than in the case of camera sensors, the microphones can still be clearly separated.
2) Identical Smartphones: For this case, the data set in [93] contains 50 measurements with identical microphones of the same Samsung Galaxy S6 which records a car honking sound generated by a Hi-Fi system. To compute the distance for identical devices, we split the data set in two random sets of 25 measurements. In Fig. 11, we depict the mean of the Euclidean distances between each pair of smartphone microphones. Again, the devices separate clearly as the values from the main diagonal are lower than the rest of the values.
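To make the spectral comparison concrete, the sketch below computes a one-sided power spectrum in dB and the Euclidean distance between two such spectra. It is only an illustration: the data set in [93] may store spectra that were produced differently, and the naive DFT, the normalization by the signal length, and the small epsilon inside the logarithm are assumptions of this example.

```python
import cmath
import math

def power_spectrum_db(signal):
    """One-sided power spectrum in dB via a naive O(n^2) DFT.
    Suitable for short illustrative inputs only."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        power = abs(coeff) ** 2 / n
        # The epsilon avoids log10(0) for empty bins (assumption).
        spectrum.append(10 * math.log10(power + 1e-12))
    return spectrum

def spectral_distance(a, b):
    """Euclidean distance between two power spectra of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# A 64-sample sine that completes exactly 4 cycles peaks in bin 4.
tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
spectrum = power_spectrum_db(tone)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

In practice an FFT routine would replace the quadratic loop; the distance itself is unchanged.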

C. Brief Experiments With Loudspeaker Identification
Using the public data set from [45], we compute the interdistances for five distinct smartphones and the intradistances for five identical Samsung Galaxy J5 smartphones. The data set contains a linear sweep between 20 Hz and 20 kHz played by the smartphones and recorded by an infotainment head unit. The distance between the smartphones and the head unit was 1 m. To evaluate the interdistances for distinct devices (Samsung Galaxy S7, Samsung Galaxy A21s, Allview V1 Viper I, LG Optimus P700, and Samsung Galaxy J5), we performed five additional measurements with each smartphone in the same circumstances as in the data set from [45]. For each recorded sound we extract the power spectrum, which is used to compute the mean of the Euclidean distances between devices. Each file contains 1914 samples which correspond to a frequency range between 700 Hz and 11 kHz with a resolution of 5.384615 Hz. The distance between two samples is
$$d(a, b) = \sqrt{\sum_{i=1}^{1914} (a_i - b_i)^2}$$
where $a_i$ and $b_i$ are the power spectrum coefficients (amplitudes) for the two speakers represented as real numbers (floating points).
1) Distinct Smartphones: We select five measurements in a random order and compute the mean of the Euclidean distances between each pair of smartphones. To compute the distance for a single smartphone we split the data set into two equal data sets containing random samples. In Fig. 12, we show the results as a heatmap (left) and numeric values (right). Again, the values from the main diagonal are lower than the distances between distinct devices. Compared to microphones, the distances are more variable, which suggests that microphones are a better alternative for classification (still, not as good as camera sensors).
2) Identical Smartphones: The data set contains 100 measurements with identical loudspeakers for the same Samsung Galaxy J5 smartphone. To compute the distance for the same device, we randomly split the data set in two equal subsets. In Fig. 13, we depict the mean of the Euclidean distances between each pair of smartphone loudspeakers. The distance between the smartphones A and C is lower than the values from the main diagonal, which means that the loudspeaker C was misidentified as A and vice versa. This suggests that simple inter and intradistances are not enough for separating between loudspeakers. Indeed, for a better separation between two loudspeakers, the work in [45] has used two deep neural networks: 1) a BiLSTM and 2) a CNN.
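The misidentification criterion can be expressed as a simple check on the distance matrix: a device is confused with another one whenever the smallest value in its row is not on the main diagonal. The sketch below uses synthetic numbers chosen only to reproduce the A/C situation described above; they are not the values from Fig. 13, and the function name is hypothetical.

```python
def misidentified(names, matrix):
    """Return the devices whose nearest reference (row minimum)
    is a different device, mapped to that device."""
    errors = {}
    for i, name in enumerate(names):
        nearest = min(range(len(names)), key=lambda j: matrix[i][j])
        if nearest != i:
            errors[name] = names[nearest]
    return errors

# Synthetic distances for illustration only (not taken from Fig. 13).
names = ["A", "B", "C"]
matrix = [
    [5.0, 9.0, 4.0],
    [9.0, 3.0, 8.0],
    [4.0, 8.0, 6.0],
]
print(misidentified(names, matrix))  # {'A': 'C', 'C': 'A'}
```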

D. Brief Experiments With Accelerometer Identification
Now, we evaluate the interdistances for distinct devices (Samsung Galaxy S7, Samsung Galaxy A21s, Allview V1 Viper I, LG Optimus P700, and Samsung Galaxy J5) and intradistances for five identical devices (Samsung Galaxy J5). We collected data at a sampling period of 10 ms in an environment with constant vibrations. The data is scaled to have the same amplitude and also time-aligned to compute the Euclidean distance. The amplitudes on each axis are squared, summed and the square root extracted to get the overall amplitude, i.e., $a = \sqrt{a_X^2 + a_Y^2 + a_Z^2}$. The distances between devices are computed on subsets of 5000 elements. To compute the intradistance, we choose several samples, split them in four subsets of the same size, and we compute the mean of the Euclidean distances between two subsets randomly selected.
The distance is thus computed as
$$d(a, b) = \sqrt{\sum_{i=1}^{5000} (a_i - b_i)^2}$$
[...] as clear as in the case of any of the previous transducers (camera sensors, microphones, and loudspeakers).
2) Identical Smartphones: In Fig. 15, we show the results for identical smartphones as a heatmap (left) and numeric values (right). In the case of intradistances, again the values from the main diagonal are lower than the rest of the values, but the intradistances are slightly reduced. This suggests the same conclusion that accelerometer imperfections can be used to separate between devices, but likely produce a poorer separation compared to other transducers.
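The orientation-independent amplitude used for the accelerometer data, i.e., the square root of the sum of the squared per-axis values, can be sketched as below. The function names and the sample values are hypothetical and serve only to show that rotating the device does not change the result.

```python
import math

def magnitude(ax, ay, az):
    """Overall acceleration amplitude from the three axis readings."""
    return math.sqrt(ax * ax + ay * ay + az * az)

def magnitudes(samples):
    """Map a sequence of (x, y, z) readings to their amplitudes."""
    return [magnitude(x, y, z) for x, y, z in samples]

# Hypothetical readings: gravity on the Z axis versus on the X axis.
flat = magnitude(0.0, 0.0, 9.81)
on_side = magnitude(9.81, 0.0, 0.0)
```

Both readings yield the same amplitude, which is why the per-axis data can be merged before computing distances.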

E. Overall Interpretation of Heatmap Data
The previously presented heatmaps with data collected from all four sensors show significant differences. We now try to briefly clarify why this is so. Smartphone camera sensors give a significantly higher amount of information compared to other sensors, i.e., microphones, loudspeakers, or accelerometers. Concretely, the resolution of the images was 1920×2322 pixels for the cameras that we used (or we cropped the image to this size in case of higher resolutions), while each pixel encodes 24 bits of information (1 byte for each color R, G, B). This leads to a matrix of 1920×2322 bytes for each color on which we compute the Euclidean distances. That is, the Euclidean distance is computed as a sum of more than four million values and unsurprisingly leads to values in the order of hundreds of thousands, as can be seen in Figs. 8 and 9. In the case of loudspeakers and microphones, the audio signal is in the range of 20 Hz-20 kHz and we extract the power spectrum from it, which yields a vector of 1914 coefficients expressed as 24-bit floats. Therefore, when we compute the Euclidean distances, this is done over a vector of fewer than 2000 values and results in a much smaller sum compared to camera sensors, generally in the order of tens of thousands at most, as can be seen in Figs. 12 and 13. For accelerometers, the sampled data is on 24 bits (8 bits for each axis) and we choose a vector of 5000 elements. However, as done in most previous works and explained previously, we normalized the data on the three axes in order to avoid orientation issues by extracting the square root from the sum of squared accelerations, which technically reduces the 24-bit data to at most 9 bits. Therefore, the Euclidean distance is even smaller, less than 100, as can be seen in Figs. 14 and 15. Clearly, in the case of all sensors, the value of the Euclidean distances will depend on the specific inputs and the previous discussion only tries to clarify what should be expected in general.
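A rough way to see the effect of the vector length: if two vectors of n elements differ by about the same amount δ in every coordinate, their Euclidean distance is δ√n. The sketch below evaluates this for the three vector lengths mentioned above. The constant per-coordinate gap is a simplifying assumption made for illustration, and the value δ = 1 is hypothetical rather than measured.

```python
import math

# Vector lengths taken from the discussion above.
LENGTHS = {
    "camera (one color plane)": 1920 * 2322,
    "audio power spectrum": 1914,
    "accelerometer": 5000,
}

def distance_for_constant_gap(n, delta):
    """Euclidean distance between two n-element vectors that differ by
    the same amount delta in every coordinate: sqrt(n * delta^2)."""
    return math.sqrt(n * delta * delta)

for label, n in LENGTHS.items():
    # delta = 1.0 is a hypothetical per-coordinate gap, not a measurement.
    d = distance_for_constant_gap(n, 1.0)
    print(f"{label}: n = {n}, distance = {d:.1f}")
```

Under this assumption the camera vectors produce distances roughly 48 times larger than the audio spectra for the same per-coordinate gap, before the differences in value range are even taken into account.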
Another observation is that the intradistances may seem unexpectedly higher in case of the identical speakers from Fig. 13, but this is easily explainable. Smartphone loudspeakers are electromechanical devices that consist of a coil and a plastic diaphragm which may be affected over time by various environmental factors. The speakers from the data set that we used come from disassembled smartphones that had several years of use in different conditions. Aging is very likely why the interdistances vary so much between otherwise identical loudspeakers. Regarding the number of measurements, in the data set from [45] that we used, in case of different smartphone models, only five measurements were made since the differences were quite obvious and the separation immediate. In the case of identical speakers, 100 measurements were needed to make the separation clearer since the results were much closer [45]. This may also contribute to the variations.
The same information about the statistical distances is also suggestive about the effectiveness of each fingerprint type. Clearly, images are the most effective for fingerprinting due to the large amount of information that a sensor captures and because an image can be taken in an instant. Second to this are microphone and loudspeaker data, but this may require seconds or more of collected data. For example, in the experiments from [45], a sweep signal took about 10 s, in the experiments in [93], a car honking took about 1 s, hazard lights took about 2 s, wipers took about 3 s, etc. Accelerometers seem to be the least effective as previous works used 30 s [11] or 3 s per sample [12], etc. Regarding the efficiency of the fingerprinting process, it is worth mentioning that some scenarios may call for high efficiency. One such example is the advertisement ecosystem, where users may access the websites only for brief moments of time and a fast response is needed in order to create unique user profiles and recognize them. Aspects related to the advertisement ecosystem are mentioned in various fingerprinting works like [3], [11], [13], [142], [143], [144], and [145]. Other apps may not require a fast fingerprint extraction since they have access to sensor data for prolonged periods of time, such as various e-health, social media, or communication apps.

IV. MOBILE DEVICE IDENTIFICATION BASED ON CAMERA SENSORS
In this section, we survey works on device identification from camera sensors. In Fig. 16, we show an overview of the camera identification techniques and the number of works published over the years. Almost half of the surveyed papers use machine learning algorithms, including deep learning techniques. About 17% of these works propose PUFs, while 35% use other techniques, e.g., thresholding, correlation, etc. The past three years account for more than half of the publications we survey. In Table III, we compare the features, classifiers, results, number of devices,

A. PUF-Based Approaches
PRNU noise is used in [64] to build a PUF from camera sensors. The authors validate their proposed method using 320 images from nine cameras and use the correlation function as classifier. In terms of results, they obtain an FRR between $1.36 \times 10^{-1}$ and $4.41 \times 10^{-14}$ depending on the applied correction factor and JPEG compression. PRNU is also used in [23] for camera identification. The noise is removed from the images by applying a high-pass filter and then the high frequencies are used to obtain the camera fingerprints. The authors use 14 cameras, i.e., one digital single-lens reflex (DSLR) and 13 smartphones, to validate the approach and the resulting correlation for full images is between 0.0022 and 0.02. A different approach based on dust spots from images captured by DSLR cameras is proposed in [146]. Dust spots are detected using the shape properties and a Gaussian identity loss model. For the experiments, the authors use four cameras and, to cluster them, a confidence value based on occurrence, smoothness, and shift validity metrics for each dust spot is computed. The identification reaches 99.1% accuracy.
Specific PUFs for distinct CMOS sensor technologies are proposed in the literature. A PUF for 65-nm CMOS sensors using hardware changes is proposed in [56]. A thresholding technique is used to validate the method and results are obtained at temperature fluctuations between 0 °C and 100 °C with a uniqueness of 50.12% and a reliability of 100%. Another PUF based on FPN is proposed in [57]. To validate the results, five chips of 180 nm camera sensors are used and for clustering the thresholding approach is applied. At temperature variations between 15 °C and 115 °C, the uniqueness is 49.37% and the reliability 99.80%. Zheng et al. [148] proposed an event-driven PUF for 1.8-V 180-nm CMOS sensors based on dynamic vision sensor (DVS). At temperature fluctuations between -35 °C and 115 °C the uniqueness is 49.96% and the reliability is between 96.3% and 99.2%. Another PUF for 180-nm CMOS sensors based on DVS is discussed in [61]. A reliability greater than 98% is obtained at temperature variations between -45 °C and 95 °C. An optical PUF for 65-nm CMOS sensors based on FPN is proposed in [60]. The experiments are performed on 14 CMOS sensors and to validate the method thresholding and 1-D autocorrelations are used. The authors obtain an interchip Hamming distance of 49.81% and intrachip Hamming distance of 0.251%.
A PUF for smartphone CMOS sensors based on DSNU is proposed in [20]. The image is denoised after which the DCT is applied, high frequencies are extracted and then the IDCT is applied. Finally, the thresholding method is applied to remove bright pixels. The approach is validated on five identical sensors from two distinct smartphones and the obtained interchip Hamming distance is between 46% and 54% while the intrachip Hamming distance is lower than 10%. A PUF based on camera sensor SRAM is proposed in [59]. The average intrachip Hamming distance is 0.51% and the average interchip Hamming distance is 49.95% for 20 devices.
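The uniqueness and reliability figures quoted above are commonly derived from fractional Hamming distances between PUF responses: uniqueness from the interchip distance (ideally 50%) and reliability from the intrachip distance (ideally 0%, i.e., 100% reliability). The sketch below follows these usual definitions; the surveyed works may use refined variants, and the function names and bit strings are hypothetical.

```python
from itertools import combinations

def hamming_fraction(a, b):
    """Fraction of bit positions in which two responses differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def uniqueness(responses):
    """Mean interchip fractional Hamming distance in percent.
    Takes one response per chip; the ideal value is 50."""
    pairs = list(combinations(responses, 2))
    return 100 * sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)

def reliability(reference, repeated):
    """100% minus the mean intrachip fractional Hamming distance between
    a chip's reference response and its repeated measurements."""
    intra = sum(hamming_fraction(reference, r) for r in repeated) / len(repeated)
    return 100 * (1 - intra)

# Hypothetical 4-bit responses, far shorter than any real PUF output.
chips = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 1]]
remeasured = [[0, 0, 1, 1], [0, 0, 1, 0]]
```

Repeated measurements are typically taken under varying temperature or voltage, which is why the works above report reliability over a temperature range.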

B. Machine Learning Approaches
A significant number of papers addressing identification with machine learning techniques use the SVM classifier. The lens radial distortions are used in [71] as features for the SVM classifier. For three cameras the SVM classifier reaches an accuracy of 91%. Also, the multiclass SVM is used in [43], but the features are extracted based on LBP. The average accuracy reaches 98% for 18 cameras. PRNU and the wavelet transform are the features used by the SVM classifier in [30]. The average accuracy reached for 14 camera models from five manufacturers is 87.214%. LPQ and LBP are also used as input for the SVM classifier in [72]. For 14 camera models, the accuracy is between 98.13% and 100%. SVM with radial basis kernel is used in [73]. In the experiments, three distinct cameras are used, and the overall prediction accuracy is greater than 99%. Also, in [74], an accuracy of 99.01% is reached for eight camera models using the SVM classifier. For the green and red channels of the images, the authors extract an I-Vector using the LBP. A coupled feature representation is used as input for the SVM classifier in [75]. For 27 cameras, the identification accuracy reaches 87.6%. Weber's and LBP (WLBP) features are discussed in [44]. The features are translated into a vector which is again used as input for the SVM classifier. This method reaches 99.52% accuracy for nine cameras.
Also, deep learning algorithms are used in several research works. CNN, AlexNet and GoogleNet are used in [95] for camera identification. The images are first filtered using a high-pass filter and then deep learning algorithms are applied. For 33 cameras the accuracy is 91.9% in the case of CNN, 94.5% in the case of AlexNet, and 83.5% in the case of GoogleNet. A CNN based on features extracted using the LBP and LPQ is proposed in [42]. For ten camera models, the accuracy is between 84.1% and 99.5%. In [96], the images are split into k patches using sliding windows and the extracted features are used as input for a CNN. With this approach, the authors reach an average accuracy close to 100% for 74 cameras. CNNs were also used for source camera identification in [97]. Yang et al. [98] built a content-adaptive CNN (CA-CNN). The detection accuracy achieved is between 89.56% and 97.37% for 74 cameras. A method for source camera identification using images from Facebook is proposed in [112]. The authors propose a deep learning neural network based on an existing ResNet50 network. The network is tested with photographs from five cameras which are uploaded to Facebook and then downloaded back. The maximum classification accuracy was 96%.
A CNN is used in [99] to extract the noise of the images. For 125 cameras, the F1-score is between 0.205 and 0.444 and the average precision is between 0.144 and 0.399. Transfer learning and CNN are used in [76] for feature extraction while for camera identification, machine learning algorithms, i.e., SVM, logistic regression (LR), and RF, are used. In the experiments, five cameras are classified with SVM as a final layer with 98.82% RANK-1 accuracy. With RF 97.16% RANK-1 accuracy was reached, while with LR, 98.57% RANK-1 accuracy was reached. The RANK-5 accuracy was 100% for all the involved classifiers. Ding et al. [100] used a multiscale high-pass filter (HPF) to remove the noise from the images. The authors use the multitask learning approach based on CNN and ResNet for camera clustering. This approach reaches 84.3% accuracy for 125 devices. In [77], a vector which contains statistical-descriptor, color filter array (CFA), and CNN-derived features is used as input for multiple classifiers: Weibull-calibrated SVM (WSVM), decision boundary carving (DBC), specialized SVM (SSVM), SVM with probability of inclusion (PISVM), and open-set nearest neighbors (OSNNs). The top-left corner of the images is used as input for a CNN in [101]. For 74 devices, the accuracy is between 0.943 and 0.961 for the same smartphone model and between 0.98 and 0.994 for the same brand. The accuracy unfortunately drops to 0.475 when a pool of 74 devices is used.
PRNU features and classification using CNN are discussed in [34]. In [33], a combination of PRNU and noise-print extracted by a CNN is used as feature, while for classification, the results from three classifiers are used: 1) SVM; 2) likelihood-ratio test (LRT); and 3) Fisher's linear discriminant (FLD). A maximum accuracy of 0.952 is reached with SVM. In [35], PRNU extracted from images is used as input for a neural network based on ResNet101 and SVM. For 28 devices, this approach reaches an accuracy of 99.58%. A neural network based on CNN, namely, EfficientNet, is discussed in [151]. For 23 000 images captured by 27 smartphone cameras, this neural network reaches a 99.1% accuracy. CNN and RemNet are used in [102]. This approach reaches a 97.59% accuracy for 18 distinct cameras. The use of the Ensemble classifier based on the demosaicing residual features extracted from the CFA filter is discussed in [152]. The authors reach an average accuracy of 98.14% for the identification of 68 cameras. Also, in [103], the demosaicing approach for feature extraction is discussed. For clustering, a CNN is used which reaches an accuracy greater than 91% on 35 devices for WhatsApp images and 95% for YouTube scenes. Different pretrained CNNs, i.e., GoogleNet, SqueezeNet, Densenet201, and Mobilenetv2 are discussed in [37]. For 4500 images captured by 18 smartphones, the authors reach an F1-score greater than 91%. Features extracted using patchwise mean, variance scoring and K-means clustering are discussed in [111]. For classification, a Res2Net is used which reaches 92.62% accuracy for 74 cameras. A multiscale content-independent feature fusion network (MCIFFN) is discussed in [153].

C. Other Approaches
Adaptive thresholding is used in [62] for camera identification. For 74 cameras, the authors obtain an intercorrelation between 0.1 and 0.45 and intracorrelation between 0.46 and 0.7. Behare et al. [26] discussed camera identification based on PRNU using correlation. The experiments are done on 800 images from the Dresden database containing 25 distinct cameras.
Sensor pattern noise (SPN) and correction are discussed in [66] for camera identification. For clustering, the authors proposed an alternating direction method of multipliers (ADMM) and spectral clustering. For 31 cameras, they obtain an F1-score between 0.90 and 0.97. PRNU and the locally adaptive DCT (LADCT) are used in [28] for camera identification. The authors use two data sets: their own data set with 13 cameras, for which they obtain an FNR between 5.46% and 21.27% and an FPR between 0.48% and 1.77%, and the Dresden data set with ten cameras for which they obtain an FNR between 0.93% and 14.11% and an FPR between 0.10% and 1.74%. SPN extracted from the green channel using a HPF is discussed in [147]. For five cameras, an FNR of 53% and an FPR of 10.75% were obtained. Also, in [36], SPN and PRNU are used to cluster 34 camera models. The features extracted from PRNU are used as input for a hierarchical search using MapReduce in [29]. For 1174 cameras, a mean precision of 91% was obtained. The features extracted using the linear dependencies among SPN are used in [114] for camera identification using large-scale sparse subspace clustering. For 107 cameras, the precision is 0.92, recall is 0.88, the F1-score is 0.92, and ARI is 0.88. PRNU is also used in [22], [24], [25], [27], [31], and [32].
Rouhi et al. [113] used the SPN approximation for feature extraction while for classification, they use Markov clustering and a newly proposed hybrid clustering algorithm. For a data set with 35 smartphones, the precision is 0.997, the recall is 0.765, the F1-score is 0.866, the ARI is 0.863, and the purity is 0.997. A ranking index for the quality of each fingerprint is used in [149] to cluster cameras. For 10 960 images captured by 53 cameras, the precision is almost 1, the recall is between 0.65 and 0.85, and the F1-score is between 0.7 and 0.9.
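Since several of the works above report precision, recall, and F1-score together, it may help to recall how they relate. The sketch below uses the standard definitions (the F1-score is the harmonic mean of precision and recall) and checks them against the values reported for [113]; the function names and the confusion counts in the helper docstrings are illustrative only.

```python
def precision(tp, fp):
    """Share of positive decisions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that are found."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

# Precision and recall as reported for [113] in the text above.
print(round(f1_score(0.997, 0.765), 3))  # 0.866
```

The result matches the reported F1-score of 0.866, and it also shows how a near-perfect precision can hide a noticeably lower recall.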
In [41], using DCT, the low frequencies of SPN are removed from the images and the peaks are suppressed using the spectrum equalization algorithm high-frequency (SEA-HF). For 14 594 images from 57 cameras, the TPR is 88.54%. Spatial-domain averaged (SDA) frames are used in [67]. The peak-signal-to-noise ratio (PSNR) is used in [65] for camera identification. PRNU obtained using the maximum likelihood estimator is used in [21] for feature extraction from images. For six devices, with a FAR fixed at $10^{-5}$, the FRR is between $9.6 \times 10^{-2}$ and $8.4 \times 10^{-15}$. In [68], a method based on Gaussian blurring and removing the least significant bit (LSB) from images is proposed. The authors obtain a correlation lower than 0.075 for 11 787 images captured with 48 cameras.
The work in [154] and Gupta et al. [155] surveyed some works focused on camera source identification.

D. Data Sets for Camera Identification
The most commonly used data sets for camera identification are enumerated as follows. 11) The work in [90] uses the data sets for ear biometrics from the following works: IITD-I [162], AMI [163], WPUT [164], and AWE [165] to test a wavelet-based camera identification method.

V. SMARTPHONE IDENTIFICATION BASED ON MICROPHONES
In this section, we survey works addressing mobile device identification based on their microphones. Table IV compares the features, classifiers, results, the number of devices and whether the data sets used in these works are public. We discuss them in detail in what follows. In the results column from Table IV, we generally refer to the accuracy reported by the works. As already stated, some works did not report the accuracy of their method and in this case, we refer to other metrics as presented in the table or in the accompanying text.

A. Microphone Identification Based on Synthetic Sounds
Distinct music genres, i.e., metal, pop, techno and instrumental, as well as sine waves and white noise are used in [78]. The Fourier coefficients are extracted from the recorded sounds and distinct classifiers are applied, i.e., NB, multiclass SVM, decision trees and KNN. This approach was tested with seven microphones and the highest accuracy was 93.5%. Ambient noise generated by a fan cooler is used for microphone identification in [69]. The authors use interclass cross correlation for clustering eight commercial microphones based on 24 recordings and reach a 100% correct classification. Indoor sounds, outdoor park areas, and street noises are used in [79] for microphone identification. One-class classification algorithms, i.e., Gaussian model (GM), GMM, KNN, PCA, and incremental SVM (ISVM) are used to identify five microphones. In terms of results, the recall is between 0.774 and 0.859 for indoor measurements, between 0.7354 and 0.885 for park noise, and between 0.206 and 0.784 for street noise. This was improved, using a representative instance classification framework (RICF) proposed by the authors, to get a recall between 0.741 and 0.874. In [81], a method based on FFT features extracted from ambient noise is discussed. For 21 devices, the maximum accuracy achieved is 96.72% with the SVM classifier.
Sine waves at 1 and 2 kHz are used in [83]. For the classification of 32 smartphones, the authors use SVM, KNN, and a CNN. They test the proposed approach at distinct signal-to-noise ratio (SNR) levels. The accuracy for a 20-dB SNR is 96% for the 1-kHz wave and 96.8% for the 2-kHz wave, while for 10-dB SNR, the accuracy drops to 67.27% for 1 kHz and 82.75% for 2 kHz. Also, the work in [84] uses sine waves at 1 kHz and the SVM, KNN, and CNN classifiers. For 34 smartphones, at 10-dB SNR, the accuracy reaches 80% for CNN, 40% for SVM, and 10% for KNN. In [87], in addition to the 1-kHz sine wave, a pneumatic hammer and gunshot sounds are also used. Hafeez et al. [168] generated 80 sine waves in the range of 100 Hz-8 kHz and then used an artificial neural network with a single layer which achieves 100% accuracy for six commercial microphones.
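To illustrate what testing at a given SNR level involves, the sketch below generates a 1-kHz sine wave and adds white Gaussian noise scaled so that the signal-to-noise power ratio matches a target value in dB. This is a generic construction, not the procedure of [83] or [84]: the 44.1-kHz sampling rate, the Gaussian noise model, and all function names are assumptions made for the example.

```python
import math
import random

def sine_wave(freq_hz, duration_s, rate_hz=44100):
    """Unit-amplitude sine wave (the sampling rate is an assumption)."""
    n = round(duration_s * rate_hz)
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

def mean_power(signal):
    return sum(x * x for x in signal) / len(signal)

def add_white_noise(signal, snr_db, seed=0):
    """Add Gaussian noise whose power is signal_power / 10^(snr_db / 10)."""
    rng = random.Random(seed)
    noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]

def measured_snr_db(clean, noisy):
    """SNR in dB recovered from a clean signal and its noisy version."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10 * math.log10(mean_power(clean) / mean_power(noise))

clean = sine_wave(1000, 0.1)
noisy_20db = add_white_noise(clean, 20)
noisy_10db = add_white_noise(clean, 10)
```

Going from 20 dB to 10 dB multiplies the noise power by ten, which is consistent with the marked accuracy drop reported above.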
Ambient sounds from distinct places, e.g., bus, food court, kids playing, metro, restaurant, etc., are used in [116]. The authors extract 15 features from the time and frequency domains, e.g., RMS, ZCR, low energy rate, spectral centroid, etc., and apply three binary classifiers in cascade. This approach was tested on 12 smartphones from two distinct models and the TPR reached 81% for one model and 98% for the other.
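Two of the time-domain features named above, RMS and ZCR, are simple enough to sketch directly. The definitions below are the common textbook ones; the exact framing and normalization used in [116] are not stated here, so treating a frame as a plain list of samples and normalizing the ZCR by the number of adjacent pairs are assumptions.

```python
import math

def rms(frame):
    """Root mean square amplitude of a frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:])
        if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)
```

For instance, the frame [1, -1, 1, -1] has an RMS of 1.0 and a ZCR of 1.0, while a frame of strictly positive samples has a ZCR of 0.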

B. Microphone Identification Based on Human Speech
Three classifiers, i.e., radial basis functions neural network (RBF-NN), MLP, and SVM are used in [80] for smartphone microphone identification using the MFCC coefficients extracted from the human speech of 12 males and 12 females recorded with 21 smartphones. The highest accuracy, i.e., 97.6%, was reached with RBF-NN. The work in [46] uses GMM and the highest accuracy reached is 99.58%. The features they use are the LPCC, PLPC, and MFCC coefficients extracted from the speech of four speakers recorded with 16 microphones. Also, in [47], the MFCC coefficients extracted from human speech are used with the SVM classifier to cluster 26 smartphones. The accuracy achieved was 90%. The SVM classifier was optimized with the sequential minimal optimization (SMO) algorithm. In [48], MFCC and LFCC with GMM and SVM are used to cluster 14 smartphones. The achieved accuracy is 98.39%. In the case of 16 devices, by using the GMM and the MFCC coefficients extracted from human speech, the highest reported accuracy is 99.27% in [49]. In [82], GSV and MFCC are used to extract the features from human speech. For clustering, the SVM classifier is used and an error rate between 2.08% and 7.08% is reported for 14 devices.
Audio signal characteristics, such as mean, standard deviation, crest factor, dynamic range, and autocorrelation are used in [70] to fingerprint two identical microphones. Li et al. [50] used a neural network and Gaussian SVM for the identification of 21 smartphones based on their microphones. The features extracted with MFCC from human speech were used as input for the classifiers. The reported accuracy reaches 88.1%. A band energy descriptor is proposed in [167] as classifier. This approach reaches 96% accuracy for 170 devices which record human speech. In [104], 40 smartphones are identified with the highest achieved accuracy of 99% based on human speech using CNN. The voice from 25 speakers is used in [85]. GSV and the sparse representation-based classifier (SRC) reach an accuracy between 78.17% and 85.58% for four microphones. Human speech is also used in [51], [52], [86], [105], [106], and [166].
A distinctive approach based on electrical network frequency (ENF) analysis is proposed in [88]. For seven devices, the TPR is above 60%.

C. Data Sets Used for Microphone Identification
The following data sets for microphone identification are publicly available.
1) TIMIT [169] is a speech database for voice recognition which contains 6300 sentences from 630 speakers, i.e., ten sentences from each speaker (439 males and the rest females). Recordings from this data set were also replayed and recorded by various works for smartphone recognition, e.g., [48], [49], and [82]. There are also several reissues of the TIMIT data set, such as TIMIT-RSD [86] which recaptured the data set with 24 smartphones.
2) MOBIPHONE [80] is a speech database which contains recordings made with 21 smartphones. For each smartphone there are 12 males and 12 females who read ten sentences. The speakers are selected from the TIMIT database.
3) T-L-PHONE [48], [166] contains speech recorded with 14 mobile phones from six brands.
4) SCUTPHONE [170] contains speech recorded with 15 distinct mobile phones from six brands.
5) Ahumada [171] contains speech recorded by six devices from 150 males and 150 females.
6) CRC-SD [86] contains speech recorded by 24 smartphones from seven brands (six males and six females).
7) KSU-DB [172] is a speech database which contains 136 speakers (68 males and 68 females) recorded with four devices in three environments.
8) Live recordings [166] contain 10-min speech from a single speaker, recorded with 14 smartphones.
9) The microphone fingerprinting data set from [93] contains 19 200 samples with 16 different and 16 identical devices that record various automotive-specific sounds, e.g., car honk, tires, wipers, hazard lights, etc.

VI. SMARTPHONE IDENTIFICATION BASED ON LOUDSPEAKERS
In this section, we survey some works which discuss mobile device identification based on their loudspeakers. In Table V we compare the features, classifiers, results, number of devices and data sets that are used for smartphone identification based on loudspeakers. Compared to camera sensors and microphones, there are far fewer papers addressing this topic.
Zhou et al. [55] fingerprint 50 identical smartphones based on a cosine wave between 14 and 21 kHz, with an increment step of 100 Hz, emitted by each loudspeaker. The smartphones are identified using the Euclidean distance and an error rate around $1.55 \times 10^{-4}$% is reached. Berdich et al. [45] fingerprint 28 smartphone loudspeakers, out of which 16 are identical loudspeakers placed in the same smartphone case, using a linear sweep signal between 20 Hz and 20 kHz which is recorded by an in-vehicle head unit. In this work, the roll-off characteristics of the power spectrum are used. For classification, a linear approximation as well as machine learning algorithms, i.e., KNN, RF, and SVM and deep learning algorithms, i.e., CNN and BiLSTM, are used (the latter two deep neural networks are the main subject of the investigation). An accuracy between 95% and 100% is achieved for identical smartphone speakers. In this work, the authors also analyzed the influence of the volume level and the speaker orientation angle in the fingerprinting process. For four distinct smartphones, the experiments are also done at 50%, 75% and 100% volume level, and the authors observe that the fingerprints for each smartphone are clustered around the volume level, but the smartphones can still be clearly identified. The same behavior was observed in the case of experiments for distinct loudspeaker orientation, i.e., 0°, 90°, and 180°.
A total of 15 features in the time and frequency domain, i.e., RMS, ZCR, low energy rate, spectral centroid, spectral entropy, spectral irregularity, spectral spread, spectral skewness, spectral kurtosis, spectral rolloff, spectral brightness, spectral flatness, MFCCs, chromagram, and tonal centroid are used in [9] and [53]. The features were extracted from three types of sounds, i.e., instrumental, song, and human speech. For classification, the authors use KNN and GMM classifiers. The experiments are done for both distinct and identical smartphones. In [9], for 15 identical smartphones the authors reach a 93% accuracy, while for 19 smartphones (identical and distinct), they achieve a 98.8% accuracy using the MFCC coefficients extracted from human speech. In [53], for 52 smartphones out of which at most 15 are identical, the authors achieved a 100% F1-score when they used the MFCC coefficients from each signal (instrumental, song and human speech) with the KNN classifier. When GMM on MFCC is used for instrumental sounds, the F1-score is 100%, while in the case of human speech and songs, the F1-score is 99.6%. From the 15 time and frequency-domain features used in both these papers, MFCC leads to the best results. MFCC and sketches of spectral features (SSFs) extracted from human speech are used in [54]. Machine learning algorithms, i.e., SVM and RF, as well as deep learning algorithms, i.e., CNN and BiLSTM are used to cluster 24 smartphones. The authors achieved a maximum accuracy of 99.29%.
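The two kinds of stimulus mentioned above, a stepped tone between 14 and 21 kHz and a linear sweep between 20 Hz and 20 kHz, can be sketched as follows. This only illustrates the signal shapes: the 44.1-kHz sampling rate, the unit amplitude, the sine phase, and the function names are assumptions, and the sweep duration is a free parameter.

```python
import math

def stepped_tones(start_hz=14000, stop_hz=21000, step_hz=100):
    """Frequencies of a stepped tone, both end points included."""
    return list(range(start_hz, stop_hz + 1, step_hz))

def linear_sweep(f0_hz, f1_hz, duration_s, rate_hz=44100):
    """Linear chirp whose instantaneous frequency rises from f0 to f1.
    The phase is the integral of the frequency:
    2*pi*(f0*t + 0.5*k*t^2), with k = (f1 - f0) / duration."""
    n = round(duration_s * rate_hz)
    k = (f1_hz - f0_hz) / duration_s
    samples = []
    for i in range(n):
        t = i / rate_hz
        samples.append(math.sin(2 * math.pi * (f0_hz * t + 0.5 * k * t * t)))
    return samples

tones = stepped_tones()
sweep = linear_sweep(20, 20000, 0.01)
```

With a 100-Hz step the stepped tone covers 71 frequencies, and both stimuli stay below the 22.05-kHz Nyquist limit of the assumed sampling rate.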

B. Data Sets for Loudspeaker Identification
To the best of our knowledge, there is currently only a single public data set for smartphone identification based on loudspeakers, which corresponds to the work in [45]. The data set contains linear sweep signals played by 28 smartphones (16 identical and 12 distinct) and recorded by a vehicle head unit at 1-m distance. A total of 2900 measurements are made public.

VII. SMARTPHONE IDENTIFICATION BASED ON ACCELEROMETERS
In this section, we survey several works which discuss device identification based on accelerometer sensors. Interestingly, while there are many papers which discuss device pairing based on data collected from accelerometers, only a few works focus on smartphone fingerprinting based on accelerometers. In Table VI, we compare the features, classifiers, results, and the number of devices that are used.

A. Time and Frequency-Domain Features for Accelerometer Fingerprinting
Besides smartphone identification based on their microphone (which was addressed previously), the authors from [145] also discuss smartphone identification based on accelerometer sensors. The measurements are collected when the smartphone is kept at a constant velocity or when it is in a resting position, and the first sample from each measurement is considered the smartphone fingerprint. With this approach, only 15.1% of devices were correctly identified.

B. Data Sets for Accelerometer Fingerprinting
Bojinov et al. [145] reported a public website which holds accelerometer-related data; however, the website was not accessible at the time of writing this article. Dey et al. [11] also reported another data set, but the link was again not functioning at the time of this writing.

VIII. OTHER SENSORS AND TECHNOLOGIES FOR FINGERPRINTING
In this section, we briefly present other sensors which have been used for fingerprinting, as well as some combined approaches that used multiple sensors.

A. Other Sensors: Magnetometers and Gyroscopes
The time and frequency-domain features extracted from magnetometer sensors are used in [18] for smartphone fingerprinting. For classification, the SVM, KNN, and Bagged Tree classifiers were used. This approach reached an F1-score between 61.3% and 90.7% for nine smartphones. A more recent work [91] uses the gyroscope resonance for smartphone fingerprinting. Ten features based on resonance, e.g., resonance peak, position of resonance peak, etc., are extracted and used as input for decision trees and regression tree classifiers to cluster 20 smartphones and five gyroscope sensors. The highest accuracy reached with this approach was 96.5%.

B. Combined Approaches Based on Multiple Sensors
Rather than using single transducers, several research works discussed smartphone fingerprinting from multiple sensors. We address them separately in this section. Most of these works start from analyzing individual sensor data and then combine several sensors to improve the identification rate. In Table VII, we compare the features, classifiers, results, and the number of devices used in the literature for smartphone identification based on multiple sensors. We detail each of these works in what follows.
Amerini et al. [92] used data extracted from accelerometers, gyroscopes, and cameras for smartphone identification. For accelerometers and gyroscopes they extract ten time-domain features and 11 frequency-domain features, while for cameras they use the PRNU. In terms of classification, decision trees are used and ten smartphones are clustered with an F1-score greater than 75% for combined data from accelerometer, gyroscope, and camera. Data extracted from accelerometers, gyroscopes, magnetometers, and microphones are used in [17]. The authors extract several features for each sensor. In the case of accelerometers and magnetometers, they again extract ten time-domain and 11 frequency-domain features for the normalized signals, while in the case of gyroscopes, they extract the same features for each axis. For microphones, they generate sine waves between 100 and 1300 Hz and, for each signal, the value of the dominant frequency is considered as a feature. The classification was done using the NB and RF machine learning algorithms and, for ten devices, the authors reach an F1-score of 90% for the combined data.
Combined accelerometer and gyroscope data is also used in [13], [14], [15], [173], and [174]. Das et al. use 25 time and frequency-domain features in [13] and [14], while in [15] they use 26 features. Several machine learning algorithms are used in these works, including SVM, NB, KNN, decision tree, QDA, and bagged decision trees. In [173], entropy features are extracted from the collected data and used as input for the SVM classifier. For three devices, the authors reach an accuracy greater than 90%. A multidimensional balls-into-bins model is proposed in [174] to extract the features from the collected data and then a multi-LSTM network is used to cluster the devices. For 117 devices from 77 users, this approach reaches an accuracy higher than 98.8%. Acceleration, magnetic field, orientation, gyroscope, rotation vector, gravity, and linear acceleration are used in [16]. Five sensor combinations are discussed: 1) individual accelerometers; 2) accelerometers and gyroscopes; 3) all
Zhang et al. [144] and [175] proposed a new method, called factory calibration fingerprinting, that is able to bypass existing protections against tracking users based on motion sensor data. They extract data from gyroscopes and magnetometers in [144] and from accelerometers, gyroscopes, and magnetometers in [175]. Their work involves distinct Android and iOS devices. The fingerprint is generated based on a gain matrix (squared Euclidean 2-norm function) of the data processed by computing the difference between two consecutive axes and the estimated value of the ADC.

C. Other Technologies for Device Fingerprinting
Now, we enumerate additional device fingerprinting technologies, some of which are based on other components while others are based on software (which are not part of the main scope of this work, therefore, the list is not exhaustive). In Table VIII, we compare the features, classifiers, results, and the number of devices used in the literature for smartphone identification using these different approaches.
Chen et al. [182] proposed a technique based on battery power consumption. Distinct tasks with different power consumption rates are run on the smartphones, e.g., heavy file writing and reading, computations with large numbers, broadcast transfer, etc. Time and frequency-domain features are extracted from the recorded power consumption and an unsupervised learning algorithm is applied to cluster the smartphones. The accuracy in identifying the phone was higher than 86% for 15 smartphones. Mobile devices are identified based on wireless charging fingerprints in [183]. The clock oscillator and the power receiver are used to extract the features, which are then used in the SVM, AdaBoost, decision tree, KNN, and LDA classifiers. This approach reaches 97.9% accuracy for 52 devices.
Another interesting approach for device fingerprinting, based on magnetic induction signals radiated by the CPU, is discussed in [94]. The authors measure the CPU magnetic induction when the CPU load is at 100%, as the inductor from the DC/DC converter of the CPU may produce high magnetic induction at high currents. They use 90 devices for the experiments (20 smartphones and 70 laptops) and, to validate this approach, ten machine learning algorithms are used, i.e., LR, NB, KNN, LDA, QDA, decision tree, SVM, ExtraTrees, RF, and gradient boosting. The authors report a maximum accuracy of 99.9%. Peripheral input timestamps are used in [107] for device identification. The authors rely on two public data sets, from Dhakal et al. [185] and Palin et al. [186]; the peripherals include keyboards and mice connected via USB, and collection was done automatically on a Web-based platform which evaluates typing skill. For classification, the FPNET CNN is used and a maximum accuracy of 97.36% was achieved for 76 768 mobile devices and 151 483 desktop devices. Capacitive screen fingerprints are used in [178] for smartphone recognition. RMS and MFCC features are computed from the signature segmentation extracted from the voltage consumption. For classification, the authors use the KNN and GMM classifiers and reach an F1-score of 100% for 14 smartphones.
ICMP timestamp requests, from which the device clock skew is extracted, are proposed in [176] for smartphone fingerprinting. Ten minutes of collected ICMP timestamps are sufficient to distinguish between five smartphones, as their oscillator skews differ by several parts-per-million (ppm). The slope of the clock skew is computed by solving a linear programming minimization problem. The network traffic from popular apps, e.g., Facebook, WhatsApp, Skype, Dropbox, etc., is used in [177]. Distinct features, e.g., packet size, packet ratio, number of outgoing packets, byte ratio, etc., are extracted. For classification, KNN and SVM are used on 14 devices with an F1-score of 100%. Khodzhaev et al. [180] discussed an approach based on the performance of the transmission control protocol (TCP). For classification, KNN is used and, for three distinct devices, this method reaches only 75% accuracy.
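The clock-skew estimation mentioned above can be illustrated with a small linear program. The text states only that the slope is obtained through linear programming; the specific formulation below (a line that upper-bounds all observed offsets while minimizing the total vertical distance to them, which is common in the clock-skew literature) is an assumption, as are the function name `estimate_skew` and the synthetic data. With time in seconds and offsets in microseconds, the slope is expressed in ppm.

```python
from itertools import combinations

def estimate_skew(times, offsets, eps=1e-9):
    """Fit the line alpha*t + beta that lies on or above every (t, offset)
    point and minimizes the summed vertical distance to the points.

    For a two-variable linear program the optimum is attained at a vertex,
    i.e., a line through two of the points, so a brute-force search over
    point pairs is sufficient for small inputs. Returns (alpha, beta).
    """
    n = len(times)
    sum_t, sum_o = sum(times), sum(offsets)
    best = None
    for i, j in combinations(range(n), 2):
        if times[i] == times[j]:
            continue
        alpha = (offsets[j] - offsets[i]) / (times[j] - times[i])
        beta = offsets[i] - alpha * times[i]
        # Feasibility: the candidate line must upper-bound all observations.
        if all(alpha * t + beta >= o - eps for t, o in zip(times, offsets)):
            cost = alpha * sum_t + n * beta - sum_o
            if best is None or cost < best[0]:
                best = (cost, alpha, beta)
    if best is None:
        raise ValueError("need at least two samples with distinct times")
    return best[1], best[2]

# Synthetic example (assumed values): true skew of 50 ppm, a constant offset
# of 100 us, and non-negative network delays that lower the observed offsets.
times = list(range(10))                          # seconds
delays = [0, 30, 12, 45, 8, 60, 22, 5, 17, 0]    # microseconds
offsets = [50 * t + 100 - d for t, d in zip(times, delays)]

skew_ppm, intercept_us = estimate_skew(times, offsets)
print(skew_ppm, intercept_us)
```

The brute-force search is cubic in the number of samples; for longer captures a convex-hull-based solver or a general LP library would be the natural replacement.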
The device configuration and parameters are used for smartphone fingerprinting in [142]. The authors discuss 29 features of the Apple iOS platform, e.g., device name, language settings, installed applications, played songs, etc., and extract them from 8,000 distinct devices. The SVM classifier reaches an accuracy of 97% for this approach. In [179], 38 features are used: 1) hardware related, e.g., name, device model, manufacturer, storage capacity, etc.; 2) OS related, e.g., kernel information, Android version, etc.; and 3) user-setting related, e.g., time-zone, hour format, data format, ringtone, notification, etc. A fingerprint matching algorithm (FMA) and a fingerprint association algorithm (FAA) are used to select the relevant features and then the NB classifier is applied to cluster the devices. For 2239 devices, they reach an F1-score of 99.46%. Similar features are also used in [63], but here a thresholding method is used for clustering and an accuracy of 99.97% is reached for 815 devices.
In [184], a method for smartphone fingerprinting based on the radio frequency emitted by Bluetooth is discussed. The authors achieved a test accuracy between 96.9% and 99.2% using SVM and between 96.5% and 99.6% using a neural network classifier for 27 smartphones. Device identification based on remote GPU fingerprinting is proposed in [187]. The authors use 26 smartphones and 62 desktops/laptops and obtain a maximum accuracy of 95.8%. Vastel et al. [181] showed that it is possible to detect countermeasures for browser fingerprinting by using the inconsistencies that these countermeasures introduce and that, besides spotting the altered fingerprints, the original fingerprint values can also be recovered.

IX. COUNTERMEASURES AND STABILITY IN FRONT OF EXTERNAL FACTORS
In this section, we discuss countermeasures against fingerprinting and the resilience of fingerprints to external factors that can change them over time.

A. Countermeasures
Smartphone fingerprints can also be used by malicious apps to infringe on users' privacy. This is a very serious concern and we cannot end our survey without mentioning it along with some countermeasures. Briefly, to combat these attacks, several countermeasures can be implemented: 1) adding noise to the sampled data (which is also commonly referred to as obfuscation); 2) calibrating the sensors so that differences become negligible; 3) restricting the access to sensors' data; or 4) lowering the sampling fidelity. These approaches can also be combined. We discuss them in what follows.
Adding Noise (Obfuscation): A simple method to modify the smartphone fingerprints is to add noise. This approach does not affect the smartphone functionality [4] and it is not expensive in terms of computations and power consumption. The addition of noise has also been discussed in [93] within the scope of microphone identification. This work considers various types of sounds, e.g., traffic, train, barrier, etc., and reports that the accuracy drops below 50% at an SNR below a specific threshold, e.g., -40 dB for car horn and -20 dB for car tires, so that microphone identification no longer works. Also, Baldini and Amerini [83] analyzed the influence of additive white Gaussian noise (AWGN) at distinct SNR levels and the accuracy drops below 50% at an SNR of 0-5 dB. The work in [45] also shows that, in the case of loudspeaker identification, the volume can influence the fingerprints.
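As an illustration of noise-based obfuscation, the sketch below adds white Gaussian noise to a signal so that a chosen SNR (in dB) is obtained, using the relation noise power = signal power / 10^(SNR/10). The 440 Hz tone, the 16 kHz sampling rate, the 5 dB target, and the helper names are assumptions chosen for the example and are not taken from the cited works.

```python
import math
import random

def signal_power(x):
    """Mean squared amplitude of the signal."""
    return sum(v * v for v in x) / len(x)

def add_awgn(x, snr_db, rng):
    """Return x plus zero-mean Gaussian noise scaled to the target SNR (dB)."""
    noise_power = signal_power(x) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in x]

def measured_snr_db(clean, noisy):
    """SNR actually obtained, computed from the injected noise."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10 * math.log10(signal_power(clean) / signal_power(noise))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_awgn(clean, snr_db=5.0, rng=rng)
print(round(measured_snr_db(clean, noisy), 2))
```

The measured value fluctuates slightly around the target because the noise is a finite random sample.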
Sensor Calibration: Calibration is generally used to increase the precision of measurements performed by various sensors, but it was also proposed as a countermeasure against sensor fingerprinting. Most commonly, it is proposed for accelerometers and gyroscopes. For example, the calibration of accelerometers and gyroscopes is discussed in [15] as a countermeasure against sensor fingerprinting. Notably, some works have managed to fingerprint accelerometers and gyroscopes even if factory calibrations were performed [144], [145], [175]. To prevent this and make fingerprinting infeasible, two of these works propose rounding the factory-calibrated sensor output to the nearest multiple of the nominal gain [144], [175].
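The rounding countermeasure can be sketched in a few lines. The value of `NOMINAL_GAIN` and the sample readings below are hypothetical placeholders, since the nominal gain depends on the sensor model; only the rounding rule itself comes from the description above.

```python
def quantize_to_gain(value, nominal_gain):
    """Round a calibrated reading to the nearest multiple of nominal_gain."""
    if nominal_gain <= 0:
        raise ValueError("nominal_gain must be positive")
    return round(value / nominal_gain) * nominal_gain

NOMINAL_GAIN = 0.061  # hypothetical resolution per ADC step
readings = [0.1234, -0.4871, 9.8066]
quantized = [quantize_to_gain(r, NOMINAL_GAIN) for r in readings]
print(quantized)
```

After rounding, readings that differ only by a device-specific calibration residue smaller than half a step collapse to the same value, which is what removes the fingerprint.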
Restricted Access to Device Peripherals and Data: Implementing policies that control the access rights of other applications to sensor data is another countermeasure proposed in [3] and also discussed in [4]. It may also be worth recalling here that malicious apps with access to the microphone can allow the interception of the phone's PIN code [188]. This shows how serious the implications of giving access to such peripherals are. Notably, smartphones also control various IoT devices in our homes, exposing even more data about their owners. With this in mind, the work in [189] discusses a mobile-cloud framework with fine-grained permission authorization for IoT. A privacy risk assessment for mobile applications, which considers permissions and information flow leakage, is presented in [190].
Lowering Sampling Fidelity: Lowering the sampling rate can also be a countermeasure and it may also increase battery life (especially in the case of data collected from motion sensors). Data filtering and reducing the sampling rate can hide some of the features so that fingerprinting is no longer possible. The Android platform already considers risks related to fingerprinting by sensor sampling and has limited the access for applications since Android 12 (API level 31). For a sample rate higher than 200 Hz (or about 50 Hz for direct, raw sensor data), apps need to be granted a new permission called HIGH_SAMPLING_RATE_SENSORS. Note that this is declared as a normal-level permission and is therefore granted automatically, but it can be used to determine which apps potentially access higher sample rates [191]. As a further mitigation, motion sensors (including the accelerometer) are always rate limited, even for apps holding this permission, if the microphone has been turned off by the user. Finally, Android 10 introduced a UI element in the form of the Sensors Off quick tile that can be used to disable app access to all sensors, including microphone, camera, and motion sensors (with the exception of phone calls, which can still use the microphone). However, this UI element needs to be enabled through developer options and is therefore not targeting end-users at the time of this writing [192]. On Apple iOS, apps seem to be able to use Core Motion to request sample rates as high as the hardware supports [193]. Apple recommends as best practice to avoid using accelerometers or gyroscopes outside of active gameplay [194]. To the best of our knowledge, there seem to be no automatic limitations at the time of this writing.
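The effect of a lower sampling rate can be emulated offline by averaging blocks of consecutive samples. The sketch below is only an illustration of reduced fidelity and does not reproduce how Android enforces its rate limit; the synthetic 400 Hz accelerometer trace, the factor of 8, and the function name `downsample` are assumptions.

```python
def downsample(samples, factor):
    """Average non-overlapping blocks of `factor` samples (drops a short tail)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]

# Hypothetical one-second accelerometer trace captured at 400 Hz, reduced to 50 Hz.
trace_400hz = [9.81 + 0.01 * ((i * 7) % 5 - 2) for i in range(400)]
trace_50hz = downsample(trace_400hz, factor=8)
print(len(trace_400hz), "->", len(trace_50hz))
```

Block averaging acts as a crude low-pass filter, so high-frequency components that carry part of the device-specific signature are attenuated along with the reduction in rate.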
It is also true that these countermeasures are not always applicable, or that using them is highly inconvenient. For example, sometimes sampling restrictions cannot be applied, as in the case of gaming applications that require the maximum sampling rate from accelerometers for better accuracy. Reducing the sampling rate of accelerometers also has an impact on physical activity monitoring apps [195]. Regarding camera sensors, photograph editing software may require access to the raw image data (that may contain even more phone-related artifacts) for optimal performance. As expected, all countermeasures come at a price.
One important factor that seems to be omitted by most works is the stability of the samples over time. To the best of our knowledge, only the work from [15] evaluates the stability of the samples by collecting data about one month apart. Concretely, accelerometer and gyroscope data is collected at an interval of 37 days and the F-score, which was 100% for data collected during the same day, drops to between 88% and 92% for different days. Further evaluations may be needed to assess whether samples are stable in the long run. Zhang et al. [144] also relied on the sensor factory calibration file, which is stored in the nonvolatile memory and should not change over time. Other works assess the stability of hardware fingerprints in the case of different electronic components. For example, the magnetic signals from the CPU are used in [94] and the authors show that they do not change over the course of two days and in distinct locations. Fingerprinting the GPU from data collected via JavaScript is proposed in [187] and the fingerprints are shown to be stable during 24 days of experimentation. The stability of clock-based fingerprinting is also discussed in [196], where measurements are performed two months apart.

X. CONCLUSION AND FUTURE DIRECTIONS
There is a very large number of works that address smartphone identification based on the physical fingerprints of their embedded transducers, mainly cameras, microphones, loudspeakers, and accelerometers. The most consistent body of works which we surveyed was concerned with camera fingerprints. This is somewhat natural as users nowadays commonly upload photographs on various websites, making them very easy to collect. Also, a lot of samples and features can be extracted from images and there are several public data sets dedicated to research works. Fewer works used microphones and there are only a few works which use loudspeakers. Device fingerprinting based on audio signals, from microphones and loudspeakers, may have attracted less research because, although this kind of data is easy to analyze, it may be more difficult to collect. For microphones, there are several public data sets (the majority of them target speech recognition and crime-related investigations) which were also used for device identification based on microphones, while for loudspeaker identification a single public data set is available. In the case of accelerometers, the number of works strictly dedicated to fingerprinting is also somewhat limited, despite the fact that accelerometers were so commonly employed for device-to-device authentication. There are also only isolated attempts at using gyroscopes and magnetometers for fingerprinting. Regarding accuracy, it seems that camera sensors provide the best fingerprint, with many of the works from Table III in our survey reporting an accuracy close to 100%. This happens because CMOS sensors collect large amounts of information due to the ever-increasing resolution of modern cameras. Next to camera sensors, microphones and loudspeakers may be a reliable source, with a reported accuracy generally between 90% and 100% according to Tables IV and V from our survey. Accelerometers seem to have a lower accuracy for fingerprinting, which according to Table VI in our survey is between 58.7% and 95%.
As future research directions, there are several gaps that need to be covered. As outlined previously, there is only a very limited number of works that have addressed sample stability over time and this happened only over a small period of one month [15]. The use of multiple sensors can also be considered for improving the reliability of the fingerprinting process over time, since various sensors may be unevenly affected by wear and tear. Running the experiments over extended time periods and using a larger number of devices in the field may be considered by OEMs or large app developers with a considerable install base (but it is generally out of reach for nonprofit academic research). Last but not least, incremental learning, a well-known machine learning method which continuously updates the existing model as new data becomes available, may be one way to address this problem by ensuring an up-to-date trained model for the device. Also, almost all of the existing works have dealt with closed-world models in which only devices coming from a limited set are to be identified. There are only a few works [12], [197] which address open-world scenarios, which are more relevant for practice since the methodology is also tested against devices that were not part of the training data set. Related to this, the use of one-class classification, which requires a single device in the training data set and later separates it from the rest in the testing data set, is of significant interest. Most of the papers so far tried to separate between multiple devices that were already learned, while only a few works explicitly used one-class classifiers [12], [77], [79], [133]. The selection of specific inputs that give a more accurate classification for the transducers is also one possible area of investigation. It is well known that certain inputs can yield a better response in the case of PUFs, e.g., the RowHammer PUF [198]. As previously stated, in the case of CMOS sensors, dark images seem to give a better response [141], while for loudspeakers, a sweep signal offers a more complete characterization [45]. Other works have considered inputs which are more realistic for practice, such as human speech in the case of microphones, or music in the case of loudspeakers. Finding specific inputs for which the transducer gives the most specific response is one possible area for future investigations.
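To illustrate the one-class setting discussed above, the following sketch enrolls a single device from a handful of feature vectors and later accepts or rejects unseen samples based on their distance to the enrolled centroid. The class `OneClassVerifier`, the two-dimensional feature vectors, and the `margin` parameter are hypothetical; the cited works use their own one-class classifiers, and this threshold rule is only a simple stand-in for them.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class OneClassVerifier:
    """Accepts a sample if it lies within a radius of the enrolled centroid."""

    def __init__(self, enrolled_samples, margin=1.5):
        self.center = centroid(enrolled_samples)
        spread = max(distance(s, self.center) for s in enrolled_samples)
        # The radius is the largest enrollment distance, widened by a margin.
        self.radius = margin * spread

    def accepts(self, sample):
        return distance(sample, self.center) <= self.radius

# Hypothetical feature vectors measured on the single enrolled device.
enrolled = [[0.50, 1.20], [0.52, 1.18], [0.49, 1.22], [0.51, 1.19]]
verifier = OneClassVerifier(enrolled)

same_device = verifier.accepts([0.505, 1.20])   # sample close to the enrolled ones
other_device = verifier.accepts([0.80, 0.95])   # device never seen during enrollment
print(same_device, other_device)
```

Because no other device is needed at training time, the same rule applies unchanged in an open-world scenario, at the cost of having to tune the margin against false accepts and false rejects.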
There is also a significant number of works that use other technologies instead of transducers, such as software fingerprinting, ICMP timestamps, OS, TCP, battery consumption, wireless charging, capacitive touchscreens, CPU magnetic field, and the input from various peripherals. These works were covered only briefly here and do not form the main target of our survey. We may consider an in-depth analysis of them as future work.