A User Location Reset Method Through Object Recognition in Indoor Navigation System Using Unity and a Smartphone (INSUS)

Abstract: To enhance user experiences of reaching destinations in large, complex buildings, we have developed an indoor navigation system using Unity and a smartphone called INSUS. It can reset the user location using a quick response (QR) code to reduce the loss of direction of the user during navigation. However, this approach needs a number of QR code sheets to be prepared in the field, causing extra loads at implementation. In this paper, we propose another reset method to reduce loads by recognizing information of naturally installed signs in the field using object detection and Optical Character Recognition (OCR) technologies. A lot of signs exist in a building, containing texts such as room numbers, room names, and floor numbers. In the proposal, the sign image is taken with a smartphone, the sign is detected by YOLOv8, the text inside the sign is recognized by PaddleOCR, and it is compared with each record in the Room Database using the Levenshtein distance. For evaluations, we applied the proposal in two buildings at Okayama University, Japan. The results show that YOLOv8 achieved an mAP@0.5 of 0.995 and an mAP@0.5:0.95 of 0.978, and PaddleOCR could extract text in the sign image accurately with an average CER lower than 10%. The combination of YOLOv8 and PaddleOCR decreases the execution time by 6.71 s compared to the previous method. The results confirmed the effectiveness of the proposal.


Introduction
Nowadays, indoor navigation is increasing in importance around the world due to the growth of large, complex buildings such as shopping malls, campuses, airports, and libraries [1]. These buildings often lack proper signs and staff assistance to guide new guests to their destinations [2]. In such situations, visitors often need to wander for a while to find their destinations. An indoor navigation system using a smartphone is therefore useful as an efficient and user-friendly solution [3]. It allows a guest to reach the destination in the building without confusion or wasted time.
Previously, we developed an indoor navigation system called INSUS using Unity and a smartphone to guide users in unfamiliar indoor environments [4]. Unity is a game engine development environment that allows the integration of a phone's sensors and an augmented reality (AR) software development kit (SDK) to build an AR application [5]. Besides its function as an integration tool, Unity supports the development of multi-platform applications, enabling the same system to be deployed on Android and iOS [6][7][8]. INSUS employs AR technology to overlay directional navigation guides on the smartphone screen, providing intuitive navigational assistance. It utilizes the smartphone's gyroscope sensor and the simultaneous localization and mapping (SLAM) algorithm to accurately detect and map the user's location.
To reduce loss of direction during navigation, INSUS employs the user location reset method using a quick response (QR) code. First, a user needs to scan a QR code that encodes the location information to set the initial position. Then, SLAM calculates the user's location by measuring the distance from feature points captured by the smartphone's camera across frames. Finally, the gyroscope sensor is used to estimate the user's orientation to determine in which direction the user is facing. Unfortunately, INSUS often faces the loss of direction problem during navigation. The loss of direction refers to localization errors during navigation due to small accumulated errors from the SLAM localization components. Since the localization is erroneous, the generated path from the user's current position to the target position becomes inaccurate. This results in false or absent path guidance visualization. To mitigate it, a user needs to periodically scan a QR code to reset the current position. Therefore, a number of QR code sheets need to be placed around the building, which adds extra load to the implementation.
In this paper, as an alternative user location reset method, we propose the use of natural signs that are already installed in buildings. These signs may indicate room numbers, room names, or floor numbers, which can uniquely identify the location. The proposed method uses the object detection technique to extract the sign from the captured image and the Optical Character Recognition (OCR) technique to recognize the text from the sign [9,10]. Specifically, YOLOv8 is adopted for object detection [11], and PaddleOCR is adopted for OCR [12]. The recognized text is compared with the registered text in the database using the Levenshtein distance to obtain the location information. Since the text recognized by OCR may contain errors, including wrong or missing characters, the Levenshtein distance between this text and each registered text in the database is calculated, and the one giving the shortest distance is selected as the matched one.
These algorithms are implemented in Python, while the development of INSUS is based on Unity. We integrate these two components with an application programming interface (API), which allows two different programming languages to work together using a standardized format (e.g., JSON). We employ a representational state transfer (REST) API to communicate between Unity and Python, as exemplified by several studies [13][14][15]. Python is responsible for the machine learning operations of YOLOv8, PaddleOCR, and the Levenshtein distance process. It receives and processes the input from Unity and returns the output back to Unity. We utilize the FastAPI library in Python to handle these interactions [16]. On the other hand, Unity is responsible for providing the input in the form of images and processing the output to update the user's current coordinates. Unity handles this interaction using the Unity Web Request module [17,18].
For evaluations of the proposal, first, the accuracy of sign detection by YOLOv8 is evaluated. Images containing signs in the building are collected as a dataset to generate the model. Then, the precision, recall, and mean average precision (mAP) values over different intersection over union (IoU) thresholds ranging from 0.5 to 0.95 are calculated [19]. Second, the accuracy of the text recognition by PaddleOCR is measured using the character error rate (CER) metric [20]. Finally, the performance is compared between the proposal and the previous method in #2 and #3 Engineering Buildings at Okayama University. The results confirm the effectiveness of the proposal in reducing the execution time required for the user location reset method.
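As an illustration of the IoU criterion behind the mAP values, the following minimal sketch computes the IoU of two axis-aligned boxes and lists the thresholds from 0.5 to 0.95 that a prediction satisfies. The `(x1, y1, x2, y2)` box layout, the 0.05 threshold step, and the function names are assumptions made for this example, not details taken from our evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width and height clamp to zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Assumed threshold grid for mAP@0.5:0.95: 0.50, 0.55, ..., 0.95.
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(predicted, ground_truth):
    """Thresholds at which the prediction counts as a true positive."""
    score = iou(predicted, ground_truth)
    return [t for t in THRESHOLDS if score >= t]
```

A predicted box that overlaps the labeled sign with an IoU of 0.8 would count as correct at the thresholds up to 0.8 and as a miss at the stricter ones, which is why mAP@0.5:0.95 is never higher than mAP@0.5.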
Our contributions are as follows: we trained a YOLOv8 model with our custom dataset to detect room numbers from sign images. We confirmed the validity of PaddleOCR to extract text with an average CER lower than 10% from small images. We also incorporated the Levenshtein distance to mitigate slight character recognition errors and achieve accurate database matching to acquire the correct room coordinates.
The rest of this paper is organized as follows. Section 2 discusses related works. Section 3 reviews our previous work on INSUS. Section 4 presents the user location reset method using object recognition. Section 5 evaluates the proposed system through experiments. Finally, Section 6 concludes this paper with future works.

Literature Review
In this section, we discuss works related to the proposed method.

YOLO Model-Based Detection Methods
First, we reviewed works on implementations of the You Only Look Once (YOLO) model in indoor navigation systems. Such systems integrate a camera to detect specific objects in the environment. Based on the detected objects and the defined parameters of the indoor navigation system, the user's position is determined or navigation errors are reduced.
In [21], Ahmad et al. presented a modified YOLOv1-based neural network for object detection to improve detection accuracy and speed. The proposed model enhances the original YOLOv1 by adjusting the loss function to a proportion style, adding a spatial pyramid pooling layer, and incorporating an inception model with a 1 × 1 convolution kernel to reduce weight parameters. Extensive experiments on the Pascal VOC 2007/2012 datasets demonstrated that the modified model achieved better performance than the original YOLOv1.
In [22], Sang et al. developed a fine-tuned model of YOLOv2 to extract vehicle-type information from images or videos. The model incorporates k-means++ clustering for vehicle bounding boxes, normalization for bounding box dimensions, and a multi-layer feature fusion strategy. The fine-tuned model achieved a mAP of 94.78% on the BIT-Vehicle validation dataset and demonstrated the ability to predict new datasets without encountering them in the training process. However, due to the low variety and amount of vehicle-type data, incorporating an additional dataset could further improve the model's accuracy and robustness.
In [23], Zhao proposed a new method for initializing the width and height of predicted bounding boxes to improve the YOLOv3 object detection model. The method uses Markov chains and intersection-over-union for faster convergence and more accurate initial cluster centers. On the MS COCO dataset, it achieves an average IoU of 60.44% (0.56% higher) and a running time 1/297 of the original, while on the PASCAL VOC dataset, it achieves an average IoU of 67.45% (0.13% higher) and a running time 1/81 of the original. The proposed method outperforms YOLOv3 in terms of recall, mAP, F1-score, and the detection of small objects.
In [24], Wang et al. proposed a YPD-SLAM system that provides real-time VSLAM in dynamic indoor environments. It integrates YOLO-FastestV2 target detection and Cylinder and Plane extraction (CAPE). YOLO-FastestV2 detects and isolates the target to enhance the feature point extraction. The deployed system only utilizes the CPU for cost-effective hardware implementation. It achieved robust, accurate, and real-time performance. However, tracking failures were detected when the user rotated too quickly.
In [25], Cong et al. proposed a Visual Simultaneous Localization and Mapping (VSLAM) algorithm that is based on ORB-SLAM3 for dynamic scenarios. The proposed algorithm is integrated with the YOLOv5 object detection model. It utilizes the depth information to categorize detected objects and eliminate dynamic feature points. The results indicate an improvement in accuracy compared to the original ORB-SLAM3. This highlights the effectiveness of the algorithm in dynamic scenarios. However, some feature points are removed in stationary scenarios, and missed object class detection is observed.
In [26], Gupta et al. proposed a transfer-learning-based model for real-time object detection, enhancing the YOLOv6 algorithm through pruning and fine-tuning to improve detection accuracy and inference speed. The model, which utilizes Google Text-to-Speech for audio feedback, was trained on the MS-COCO dataset and demonstrates significant improvements over various baseline models, achieving a 37.8% higher average precision with 1235 frames per second. Despite its improved performance, the model struggles with detecting objects against textured backgrounds.
In [27], Kucukayan proposed Indoor Human Detection (IHD) for drones using the YOLO-IHD model. The proposed system incorporates YOLOv7-tiny to detect small objects in the images. The results demonstrate an increase in performance, achieving a 42.51% improvement on the IHD dataset and a 33.05% improvement on the VisDrone dataset compared to the baseline model. However, low-light conditions and crowded areas might reduce the accuracy of the detection.
In [28], Lou et al. proposed a small-size object detection algorithm based on YOLOv8 designed to address the limitations of human observation in complex scenes. It ensures higher precision and consistent accuracy across various object sizes. The algorithm introduces three main innovations: a new downsampling method that preserves context features, an improved feature fusion network, and a novel network structure to enhance detection accuracy. Experimental results demonstrate that the proposed DC-YOLOv8 outperforms YOLOX, YOLOR, YOLOv3, scaled YOLOv5, YOLOv7-Tiny, and YOLOv8, with notable improvements in mAP, precision, and recall ratios on the VisDrone, TinyPerson, and PASCAL VOC2007 datasets.
The performance of YOLO models has consistently improved with each new version. Incremental advancements in the YOLO architecture have significantly enhanced detection accuracy, speed, and efficiency. This is particularly emphasized by YOLOv8's superior ability to detect smaller-sized objects accurately. Since our system's use case involves detecting room numbers from images where the room number is relatively small compared to the overall picture, the selection of YOLOv8 is well justified.

Optical Character Recognition Methods
Second, we reviewed works on OCR to extract text from images. This involves training deep learning models on a collection of text images with the corresponding reference text labels. Commonly, the performance of OCR models is related to the characteristics of the text images used for training [29]. Thus, it is important to select the appropriate OCR model for each use case.
In [30], Kamisetty et al. proposed an invoice document processing method to digitize physical documents. The method involves the use of computer vision and OCR techniques. The data extracted from the physical documents are structured in JSON and CSV formats. In the experiments, Tesseract OCR demonstrated superior performance compared to other OCR models. However, this approach is limited to invoice images, which leads to poor text recognition on images that differ from the training images.
In [31], Salehudin et al. evaluated EasyOCR's performance for extracting textual information within Latin characters under image degradation. This study aimed to highlight the capabilities and limitations of EasyOCR. Based on the results, EasyOCR excels at recognizing unique lowercase and uppercase characters including C, S, U, and Z. However, EasyOCR's character detection accuracy decreases, ranging from 30 to 40% when recognizing characters with fonts under size 18.
In [32], Qi proposed a real-time system for identifying spray mark characters on moving steel slabs in the manufacturing process. The system utilizes high-sensitivity cameras, optical filters, temperature control systems, and the PaddleOCR model. The research addresses challenges related to complex lighting conditions, high temperatures, and fast-moving steel slabs. The implemented system has been operational for a year and has achieved positive detection rates every week. The minimum weekly positive rate is reported to be 91.2%, while the minimum weekly detection rate is 98.7%. However, due to the limitations of the cameras, the conveyor needs to be slowed down in the manufacturing process so as not to blur the images and create recognition errors.

Indoor Positioning Methods of AR-Based Indoor Navigation Systems
In [33], Huang et al. developed an augmented-reality-based navigation system called ARBIN for indoor navigation. The localization method employed in their work uses Bluetooth RSSI values from Lbeacons placed inside the building. This approach was able to achieve 3-5 m localization accuracy and provide correct instructions to reach destinations. However, the need to use Lbeacons for user localization constrains the implementation, making it less scalable due to the requirements for additional Lbeacons and extensive RSSI measurement and mapping in new environments.
In [34], Yang proposed using AR markers in an AR-based smartphone indoor navigation system to help visually impaired people navigate indoor environments. The AR markers act as guides, requiring the system to scan them incrementally to ensure that the user reaches their destination accurately. The approach achieved results close to the true value during the evaluation. However, implementing AR markers for user localization hinders scalability and complicates the implementation process, as the markers need to be placed throughout the building.
In [35], Ng developed a mobile augmented reality system for indoor navigation by utilizing a smartphone's internal sensors. The system employs the IndoorAtlas SDK, which leverages the magnetic field captured by the smartphone sensors and the building's WiFi signal. The localization method achieved an accuracy of around 1.2 m, and the developed system received positive feedback from survey participants. However, utilizing the magnetic field requires mapping the building through fingerprinting, as it is necessary to determine the user's coordinates relative to their current location. When implemented in new buildings or locations, this limits and constrains the system's scalability.

Review of Indoor Navigation System Using Unity and a Smartphone (INSUS)
In this section, we review INSUS in our previous studies.

INSUS Overview
INSUS has been designed and implemented on Android and iOS platforms [36]. It shows navigation guides on the smartphone display. It integrates AR technology, the 3D virtual environment, the SLAM algorithm, and the path-planning algorithm in the Unity game engine environment. Figure 1 shows the INSUS overview. The system consists of the modules for Input, Unity game engine, the SEMAR server, and Output. Input comprises the QR code reader and the other input components necessary for navigation. Unity game engine provides services for real-time navigation processes through the SLAM algorithm, including path-planning and the estimation of user positions. The SEMAR server stores all the data related to navigation and user identification. Output displays the 3D arrow as a navigation guide for users.

Input
The application receives four inputs as the components for INSUS to determine the user's initial position, perform real-time localization, find the shortest paths, and visualize 3D arrows for navigation: QR codes, camera images, gyroscope readings, and the 3D environment. The QR codes determine the user's starting position and serve as calibration locations when loss of direction errors occur. Information stored inside the QR code includes the room name and its 3D coordinates (XYZ), aligned with the virtual 3D environment. The camera continuously captures images for real-time localization using the SLAM algorithm. The gyroscope measures the angle and speed at which the user rotates along an axis. It provides the user's pose, including roll, pitch, and yaw. Thus, the gyroscope allows for stabilizing real-time localization through the SLAM algorithm [37].
Three-dimensional environments provide a virtual representation of the real environment. Information in a 3D environment includes 1:1 scale models of rooms and objects aligned with real-world coordinates [38]. The combination of the Navigation Mesh (NavMesh) and the 3D environment enables accurate navigation and path-planning to reach the target position.

Unity Game Engine
The Unity game engine is a development platform widely used for creating immersive applications [39]. It specializes in real-time rendering and simulation, making it ideal for applications such as games, simulations, and AR experiences [40]. Unity's robust feature set includes support for multi-platform deployment, a powerful physics engine, and extensive libraries for graphics and animation [41].
In our system, Unity serves as the primary platform for integrating inputs, enabling navigational processes through the AR interface, performing path-planning in the 3D environment, visualizing the user interface, and managing external communication through the Unity Web Request module.
Unity utilizes AR Foundation as an AR software development kit (SDK). It utilizes continuous camera inputs and gyroscope sensors to perform real-time localization via the SLAM algorithm and renders 3D navigation guides that overlay the real environment through the AR interface. The SLAM implementation from AR Foundation is based on Oriented FAST and Rotated BRIEF (ORB)-SLAM [42], which works through feature extraction, feature tracking, pose estimation, and mapping [43]. Feature extraction and tracking involve identifying distinctive features in images captured by the camera and tracking them across multiple frames. This tracking provides information regarding the user's pose, rotation, and position.
In this study, SLAM's function for the mapping process was not applied because a 3D environment had already been prepared. The Unity game engine facilitates the creation of a NavMesh to perform path-planning in a predefined 3D environment. The NavMesh enables the A* algorithm to use the 3D environment as a map to find the shortest path to the target position [44].
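To illustrate the path-planning principle, the following is a minimal sketch of the A* algorithm on a small occupancy grid. It is not Unity's NavMesh implementation, which searches over mesh polygons internally; the grid representation, the 4-connected moves, the Manhattan heuristic, and all names are assumptions chosen to keep the example short.

```python
import heapq


def a_star(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) and 1 (blocked) cells."""
    def heuristic(cell):
        # Manhattan distance never overestimates with unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk the parent links back to the start.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry, a cheaper route was found later
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = cell
                    heapq.heappush(
                        open_heap,
                        (new_cost + heuristic((nr, nc)), new_cost, (nr, nc)),
                    )
    return None  # goal unreachable


# Hypothetical corridor layout: the middle row is walled except for one door.
GRID = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = a_star(GRID, (0, 0), (2, 0))
```

In this layout the only route passes through the single free cell of the middle row, so the returned path detours through it. An erroneous start cell, as caused by loss of direction, would yield a different and possibly invalid route, which is why the location reset matters for path guidance.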
The limitation of the current implementation is the loss of direction, which results from small accumulated errors in various components used for SLAM localization. These errors can arise from camera miscalibration, inaccuracies in feature tracking, path-planning algorithms, and NavMesh being misaligned from the real environment. As these small errors accumulate, they lead to significant localization errors, hindering the visualization process of the path from the user's current location to the destination. It is necessary to reset the location using a QR code to eliminate these accumulated errors. However, the current implementation of the user location reset method requires the placement of a lot of QR codes throughout the building, adding an extra burden during the implementation process. To address this, we propose using pre-installed signs in the building as an alternative to QR codes through an object detection function and a text extraction function.
The user interface displays several components to facilitate interaction and navigation, including the initial login, QR code scanning, destination input, and real-time navigation guides. The Unity Web Request module facilitates communication with the SEMAR server by handling HTTP requests and responses to ensure reliable interactions between the Unity game engine and the SEMAR server.

Output
The output from INSUS includes real-time navigational guidance displayed to the user via an AR interface. This guidance features 3D arrows that show the direction to follow and path overlays in the real environment. These visualizations enable effective indoor navigation. The 3D arrows appear after the floor has been detected by plane detection via AR Foundation. This ensures that the guidance is accurately aligned with the user's surroundings. The path overlay is calculated based on the shortest route determined by the A* algorithm.

SEMAR Server
The SEMAR server functions as a database provider that manages communication through the HTTP protocol [45]. It is responsible for storing and retrieving various types of user data, including the user identity for login and authentication processes, the user's initial position after scanning a QR code, the target position inputted by the user through the destination input interface, and the room label associated with the target position. The SEMAR server ensures reliable communication between INSUS and the server by handling HTTP requests and responses using REST API services. The REST API service handles GET requests when users log into the application and POST requests for transmitting user data to the database.

Proposal
In this section, we propose a user location reset method by object recognition.

System Overview
This section presents an overview of this approach that utilizes existing signs in the building. Figure 2 provides an overview of the integration functions for running this method in INSUS with the help of the SEMAR server. Since the main process of the proposed system is written in Python and located in the SEMAR server, we utilize an application programming interface (API) to integrate these two components. First, the camera captures an image containing signs as the input, which is called a sign image. The image transmission function sends it to the SEMAR server in base64 format via HTTP communication using the REST API service. Upon receiving the data, the object detection function identifies the sign in the image and saves the result as an isolated sign image. Then, the text extraction function recognizes and extracts text from the isolated sign image. The database matching function selects the room number in the Room Database that is most similar to the extracted text to determine the room coordinates. We employ the Levenshtein distance to calculate the similarity between the extracted text and the room number. The SEMAR server sends the room coordinates back to Unity via HTTP through the Unity Web Request module. Finally, Unity uses the new coordinates to update the user's location.
The connection between Unity and the SEMAR server through the REST API service is visualized in Figure 3. Since the REST API is based on the HTTP communication protocol, we employ the Unity Web Request module on the Unity side to send a base64-encoded sign image to the SEMAR server as JSON data. On the SEMAR server side, since the main process is written in Python, we employ the FastAPI library to receive the image and process it via YOLOv8, PaddleOCR, the regular expression library, and the Levenshtein distance to select the corresponding room coordinates in the database. Finally, the SEMAR server sends the room coordinates back to Unity as JSON data through the Unity Web Request module to update the user's current coordinates.
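The server-side flow described above can be sketched as a single request handler. This is a simplified illustration under stated assumptions rather than the deployed FastAPI code: the JSON keys (`image`, `status`, `room`, `x`, `y`, `z`) are hypothetical, and the three stages are passed in as plain callables standing in for YOLOv8, PaddleOCR with text cleaning, and the database matching.

```python
import base64
import json


def handle_reset_request(body, detect_sign, read_text, match_room):
    """Process one location reset request and return a JSON response string.

    detect_sign, read_text, and match_room are injected stand-ins for the
    detection, text extraction, and database matching stages (hypothetical
    signatures chosen for this sketch).
    """
    payload = json.loads(body)
    image_bytes = base64.b64decode(payload["image"])  # "image" key is assumed
    sign_crop = detect_sign(image_bytes)
    if sign_crop is None:
        # No sign found, so the client should ask the user to rescan.
        return json.dumps({"status": "no_sign"})
    room, coords = match_room(read_text(sign_crop))
    return json.dumps({
        "status": "ok",
        "room": room,
        "x": coords[0], "y": coords[1], "z": coords[2],
    })


# Usage with trivial stubs; room "D206" and its coordinates are made up.
request = json.dumps(
    {"image": base64.b64encode(b"fake-image-bytes").decode("ascii")}
)
response = handle_reset_request(
    request,
    detect_sign=lambda image: image,
    read_text=lambda crop: "D206",
    match_room=lambda text: (text, (1.0, 0.0, 2.5)),
)
```

Passing the stages as arguments keeps the handler testable without loading any model, and a web framework route would only need to forward the request body to it.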

Sign Image
In this study, we employ sign images as an alternative to QR codes for the user location reset method. A sign image is a photograph of an existing sign installed within the building, such as a room number, room name, floor number, or a sign for toilets or elevators. Our approach utilizes the room number to identify the user's location. Each room number provides a unique visual identifier for each room within the building.
By recognizing the sign image, INSUS can retrieve the text from the room number and determine the room's coordinates to update the user's location. This approach minimizes the need to place numerous QR codes during implementation. Figure 4 illustrates a sign image captured by a smartphone.

Image Transmission Function
The image transmission function sends the captured images to the SEMAR server in base64 format [45]. The proposed approach utilizes the sign image captured from the surrounding environment during navigation to reset the user's location when encountering loss of direction errors. The image is captured through the phone's camera when the user presses the rescan button. The image size is set to 480 × 480 pixels to keep the data small while retaining image quality. The image is then encoded into base64 format and sent to the SEMAR server through the Unity Web Request module. This module handles all external communication between INSUS and the SEMAR server via HTTP through a REST API service using a POST request. Since the JSON request body only carries text-based data, the base64 format encodes the binary data of the image into text so that it can be transmitted intact.
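The encoding step can be illustrated with the Python standard library. The sketch below wraps binary image data into a JSON body and recovers it on the receiving side; the `image` key and the function names are assumptions for illustration, and on the client the equivalent step is performed in Unity rather than in Python.

```python
import base64
import json


def build_payload(image_bytes):
    """Encode raw image bytes as base64 text inside a JSON body."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"image": encoded})  # "image" key is assumed


def read_payload(body):
    """Recover the raw image bytes from the JSON body."""
    return base64.b64decode(json.loads(body)["image"])


# Stand-in for a compressed image: 3072 arbitrary bytes.
raw = bytes(range(256)) * 12
body = build_payload(raw)
encoded_length = len(json.loads(body)["image"])
```

Base64 maps every 3 bytes to 4 characters, so the 3072 bytes above become 4096 characters. This overhead of about one third is one reason to keep the transmitted image small.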

Object Detection Function
The proposed user location reset approach employs a text recognition algorithm to retrieve text information from a sign image. However, applying this algorithm directly to the whole image would be inefficient. Since the algorithm scans all text present in the image, unwanted text is also extracted. As a result, the accuracy and the performance decrease. To address this challenge, we introduced the object detection function to identify the sign in the sign image and isolate it for the text extraction function. The object detection function recognizes labeled objects from the dataset using the YOLOv8 model to produce bounding boxes that outline the room number. Then, it extracts the pixels inside the bounding box, which results in an isolated sign image.
First, the YOLOv8 model [46] receives an image as input with a size of 480 × 480 pixels after it has been decoded from base64 format. Then, it convolves the image to automatically extract the features of the labeled objects. The model uses a sigmoid function to estimate whether the object is present in the input image, enclosing it with a bounding box and providing a confidence score for the prediction. Finally, through this implementation, the object detection function produces an isolated sign image with a high confidence score to be processed in the text extraction function.
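The isolation step amounts to selecting the most confident detection and slicing the pixel array inside its bounding box. The sketch below shows this on a plain nested list, assuming boxes in `(x1, y1, x2, y2)` pixel coordinates and detections given as `(box, confidence)` pairs; the 0.5 confidence threshold and the function names are illustrative choices, not values from our implementation.

```python
def best_detection(detections, min_confidence=0.5):
    """Return the box with the highest confidence, or None if none qualifies."""
    kept = [d for d in detections if d[1] >= min_confidence]
    if not kept:
        return None
    return max(kept, key=lambda d: d[1])[0]


def crop_box(image, box):
    """Return the pixels inside box = (x1, y1, x2, y2); image is a list of rows."""
    x1, y1, x2, y2 = box
    height, width = len(image), len(image[0])
    # Clamp to the image so a box touching the border stays in range.
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]


# A 4 x 4 toy image whose pixel value encodes its position (row * 4 + column).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
detections = [((0, 0, 2, 2), 0.41), ((1, 1, 3, 3), 0.93)]
isolated = crop_box(image, best_detection(detections))
```

With an array library the same crop is a single slice of the form `image[y1:y2, x1:x2]`; the nested-list version is used here only to keep the example free of dependencies.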

Text Extraction Function
The text extraction function pre-processes the isolated sign image before retrieving and cleaning the text from the image. OpenCV is used for the pre-processing step to standardize the image quality [47]. Then, PaddleOCR recognizes and extracts text from the pre-processed images. Finally, the extracted text is cleaned of lowercase letters and punctuation using the regular expression (re) library [48].
First, we employ image normalization and contrast enhancement through OpenCV to standardize the image quality, addressing the varied illumination levels in the surrounding environment. This includes noise reduction through blur functions and bilateral filters, and average brightness adjustment through histogram equalization.
Then, due to its high accuracy in various illumination conditions, as reported in [49], we utilize PaddleOCR to automatically recognize and extract text from the standardized isolated sign image. PaddleOCR detects text in the image, adjusts the orientation of the detected text to horizontal, and recognizes the text in the corrected orientation.
Finally, the system uses the re library to remove lowercase letters and punctuation that might appear in the PaddleOCR output. This step produces cleaned text that is ready to be compared with the room numbers in the room database by the database matching function.
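The cleaning step can be sketched with a single regular expression. The pattern below is an assumption made for illustration: it keeps only uppercase ASCII letters and digits, which removes lowercase letters and punctuation as described, and additionally drops whitespace. The example strings are hypothetical OCR outputs.

```python
import re

# Assumed pattern: anything that is not an uppercase letter or a digit.
_NOT_ROOM_CHARS = re.compile(r"[^A-Z0-9]")


def clean_text(ocr_text):
    """Strip lowercase letters, punctuation, and whitespace from OCR output."""
    return _NOT_ROOM_CHARS.sub("", ocr_text)


cleaned = clean_text("D-206.")
```

A stricter or looser pattern may be needed for buildings whose room labels contain other characters, so the character class should be adapted to the labels stored in the room database.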

Database Matching Function
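We implemented a database matching function to determine the room coordinates. It compares the cleaned text against the list of room numbers in the room database using the Levenshtein distance algorithm [50]. The Levenshtein distance measures the similarity between two strings of text based on the edit distance. Equation (1) formulates the calculation of the Levenshtein distance in its standard recursive form:

$$
\mathrm{lev}_{a,b}(s,t)=
\begin{cases}
|s| & \text{if } |t| = 0,\\
|t| & \text{if } |s| = 0,\\
\mathrm{lev}_{a,b}\big(\mathrm{tail}(s),\mathrm{tail}(t)\big) & \text{if } s[0] = t[0],\\
1+\min\big\{\mathrm{lev}_{a,b}\big(\mathrm{tail}(s),t\big),\ \mathrm{lev}_{a,b}\big(s,\mathrm{tail}(t)\big),\ \mathrm{lev}_{a,b}\big(\mathrm{tail}(s),\mathrm{tail}(t)\big)\big\} & \text{otherwise,}
\end{cases}
\tag{1}
$$

where $\mathrm{lev}_{a,b}(s,t)$ represents the Levenshtein distance between strings $s$ and $t$, $|s|$ and $|t|$ refer to the lengths of $s$ and $t$, and $\mathrm{tail}(x)$ denotes the string $x$ without its first character. We employed the RapidFuzz Python library to implement the Levenshtein distance. The smaller the Levenshtein distance, the more similar the strings are. The function selects the room record with the smallest Levenshtein distance. To conclude the system's workflow, the types of data processed in each function are visualized in Table 1.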
Table 1. Examples of data in each function for the proposed method. Columns: Source | Object Detection Function | Text Extraction Function | Database Matching Function.

In Table 1, the source represents the image captured by the phone camera. The object detection function visualizes the YOLOv8 output and the isolated sign image. Then, the text extraction function's results consist of the PaddleOCR output and the cleaned text. The output of PaddleOCR contains some incorrect characters due to recognition errors, which are caused by the visual similarity of characters in the isolated sign image to other letters or numbers. Finally, the database matching function selects the room number with the smallest Levenshtein distance based on the cleaned text, along with its room coordinates.
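The matching step can be sketched in plain Python as follows. This is an illustrative stand-in for the library-based implementation: the dynamic-programming routine computes the same unit-cost edit distance, and the room numbers and coordinates in `ROOMS` are hypothetical records, not entries from the actual room database.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete cs
                current[j - 1] + 1,            # insert ct
                previous[j - 1] + (cs != ct),  # substitute, or keep on match
            ))
        previous = current
    return previous[-1]


def match_room(cleaned_text, rooms):
    """Return (room_number, coordinates, distance) for the closest record."""
    best = min(rooms, key=lambda number: levenshtein(cleaned_text, number))
    return best, rooms[best], levenshtein(cleaned_text, best)


# Hypothetical room records: room number -> (x, y, z) in the 3D environment.
ROOMS = {
    "D201": (0.0, 0.0, 3.5),
    "D206": (12.0, 0.0, 3.5),
    "D301": (0.0, 4.0, 3.5),
}

# "O" misread for "0" is a typical OCR confusion.
room, coordinates, distance = match_room("D2O6", ROOMS)
```

Here the misread text is one substitution away from "D206" and further from the other records, so the correct room is still selected. When two records are equally close, `min` returns the first one encountered, so a practical system may prefer to reject the match and ask the user to rescan, for example when the distance exceeds a chosen limit.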

Evaluations
In this section, we evaluated the proposal by applying it in #2 and #3 Engineering Buildings in Okayama University, Japan.

Training Preparation and Dataset Augmentation
Here, we discuss the dataset preparation, training environment, and hyperparameters used to generate the YOLOv8 model for the object detection function.

First, we prepared a custom sign image dataset in the YOLO format. Each image is labeled with the class name and bounding boxes around the sign. We collected 213 images with separate labels from the #2 and #3 Engineering Buildings at Okayama University. Since training a deep learning model requires a large amount of data [52], we performed augmentation using the Albumentations Python package to generate a larger dataset [53]. This produced 2130 images from the original 213 images. The augmentation methods included lighting adjustments, image compression, and color shifts. Figure 5 compares an augmented image with the original. The image dimensions were standardized to 480 px × 480 px. We used 80% of the images for training, 10% for validation, and 10% for testing the model.

The training was conducted on a device running Ubuntu 20.04 as the Operating System (OS), equipped with an Intel® Xeon® Gold 5218 processor and an NVIDIA QUADRO 6000 with 24 GB of VRAM for accelerated computation. PyTorch version 1.12.1 was employed to train the YOLOv8 model. We used Python version 3.9 with CUDA version 11.3 to enable GPU acceleration. We trained the YOLOv8 model for 400 epochs. The specific details of the training environment are presented in Table 2.

Among the hyperparameters, the Stochastic Gradient Descent (SGD) optimizer provides balance during the training process of the YOLOv8 model. The initial and final learning rates are both set to 1 × 10⁻². The Weight Decay Coefficient of 0.937 helps prevent the model from overfitting during training. To make the trained model reproducible, we set the random seed value to 42. The details of the hyperparameter values are described in Table 3.
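The 80/10/10 split of the 2130 augmented images can be sketched as below. This is only an illustration under stated assumptions: the function `split_dataset`, the file-name pattern, and the reuse of seed 42 for shuffling are hypothetical, since the paper does not describe how the split was performed.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence[str],
                  train_pct: int = 80,
                  val_pct: int = 10,
                  seed: int = 42) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle deterministically, then cut into train/validation/test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_train = len(shuffled) * train_pct // 100  # integer arithmetic avoids float rounding
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    # Hypothetical file names standing in for the 2130 augmented sign images.
    images = [f"sign_{i:04d}.jpg" for i in range(2130)]
    train, val, test = split_dataset(images)
    print(len(train), len(val), len(test))
```

With 2130 images, these percentages yield 1704 training, 213 validation, and 213 test images.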

Performance Analysis of the Object Detection Function
This section explains the training results and the performance of the YOLOv8 model on the augmented sign image dataset in terms of box loss, class loss, precision, recall, and mAP.

Precision, Recall, and mAP Validation of the Object Detection Function
The YOLOv8 model is evaluated by its precision, recall, and mAP values, compared between the training and testing datasets. These parameters reflect the model's ability to accurately identify and classify classes [54]. Precision measures the proportion of correctly predicted data (true positives) among all data classified as positive. Recall measures the proportion of correctly predicted data (true positives) among all actual positive data in the dataset [55]. The mAP metric evaluates the accuracy and coverage of the object detection bounding boxes by using an Intersection over Union (IoU) threshold [56]. The IoU threshold represents the tolerance level for the overlap area between the detected bounding boxes and the ground truth bounding boxes. In this study, we employed IoU thresholds of 0.5 and 0.5 to 0.95.
In terms of precision and recall, the YOLOv8 model achieves a precision of 99.99% and a recall of 100%. This indicates that the model is robust and accurate in detecting the sign image from an input image, as illustrated in Figure 7a. The model achieved a final mAP of 0.995 at IoU 0.5 and a final mAP@0.5-0.95 value of 0.978. Based on [57], these mAP values demonstrate the consistent and reliable capability of the YOLOv8 model in detecting the trained object. The corresponding mAP results are shown in Figure 7b. A high mAP value indicates that the model could accurately and confidently detect objects with performance similar to the expected result.
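The quantities defined above can be made concrete with a short sketch. The function names, the corner-based box format (x1, y1, x2, y2), and the sample boxes are assumptions for illustration; the reported values in this paper come from the YOLOv8 validation output, not from this code.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A detection counts as correct when its IoU reaches the threshold."""
    return iou(pred, truth) >= threshold


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))     # overlap 1, union 7
    print(precision_recall(tp=8, fp=2, fn=0))  # hypothetical counts
```

mAP@0.5 averages precision over recall levels using a 0.5 IoU threshold, while mAP@0.5:0.95 further averages over thresholds from 0.5 to 0.95; that averaging is omitted here for brevity.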

Performance Analysis of Text Extraction and Database Matching Function
This section presents the performance measurement of the text extraction function in recognizing the room number and the Levenshtein distance of the database matching function. We compared the accuracy and execution time of PaddleOCR before and after incorporating YOLOv8 to obtain the room coordinates from the cleaned text.

Experimental Scenarios
We measured the performance of the text extraction function and the database matching function by implementing INSUS in the #2 and #3 Engineering Buildings at Okayama University. The experiment was conducted in two scenarios. The first scenario involved text extraction using only the PaddleOCR model. The second scenario combined the YOLOv8 model and the PaddleOCR model. For both scenarios, we used the same environment, and for each room on each floor, we measured at a distance of one meter from the door. This distance was chosen to ensure consistency, image clarity, and relevance to real-world usage. The specifications of the evaluation environment are shown in Table 4.
The brightness on each floor of the building is expressed as illuminance in lux. The illuminance value is measured using the light sensor on smartphones [58]. Based on [59], the average illuminance value of the implementation buildings indicates lighting conditions with adequate visibility. During the experiments, we used two devices with different operating systems. The specifications of the experimental devices are detailed in Table 5.

Accuracy of Text Extraction and Database Matching Function
In this evaluation, the accuracy of the cleaned text was calculated using the Character Error Rate (CER) [60,61]. It represents the percentage of characters incorrectly recognized by the OCR model relative to the total number of characters in the reference text. Lower values indicate better performance of the OCR model, while higher values indicate worse performance.
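A minimal sketch of the CER calculation is given below, assuming the common definition in which the number of character errors is the edit distance between the reference text and the OCR output. The function names and sample strings are hypothetical, and the paper's exact counting procedure may differ.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    row = list(range(len(hypothesis) + 1))
    for i, rc in enumerate(reference, start=1):
        diagonal, row[0] = row[0], i  # diagonal holds the value from the previous row
        for j, hc in enumerate(hypothesis, start=1):
            substitution = diagonal + (rc != hc)
            diagonal = row[j]
            row[j] = min(row[j] + 1, row[j - 1] + 1, substitution)
    return row[-1]


def cer_percent(reference: str, hypothesis: str) -> float:
    """Character errors as a percentage of the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One wrong character out of four in a hypothetical room number.
    print(cer_percent("D206", "D2O6"))
```

Under this definition, the CER can exceed 100% when the OCR output contains many spurious characters.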
The effectiveness of the user location reset method is determined by its ability to retrieve the room coordinates from the Room Database. Based on the experimental scenarios, we evaluated each room on each floor in both buildings. For each scenario, we took five images of each room. Then, the CER values were averaged and normalized to account for variations in the number of rooms. The results demonstrate the impact of incorporating the YOLOv8 model with PaddleOCR, as shown in Table 6.

The second measurement, the execution time of the user location reset method, evaluates the proposed system's impact on the time required to successfully reset the user location. It covers the time from executing the user location reset method until it produces cleaned text from the sign image and resets the user location using the room coordinates retrieved from the database matching function. The evaluation was conducted in the two scenarios described previously. The measurement begins with the user positioned 1 m from the door, scanning the sign image using INSUS. If the method fails to correctly reset the user's location, the user moves closer to the sign and rescans it to obtain more accurate cleaned text. This process is repeated until the correct text is retrieved and the user's location is accurately reset. The results of this comparison are shown in Table 8.

Because YOLOv8 automatically isolates the sign image, the combination of YOLOv8 and PaddleOCR achieved better performance: PaddleOCR recognizes and extracts only the room number from the isolated image rather than from the whole image. In contrast, the approach that uses only PaddleOCR requires more time to extract accurate information because it produces inaccurate cleaned text, so the user needs to move closer to the sign, which prolongs the execution time. The proposed approach detects the sign in the input image and isolates it automatically, resulting in a faster execution time.

Conclusions
This paper proposed a user location reset method using object recognition as an alternative approach in INSUS. The integration between Unity (INSUS) and Python (SEMAR server) is facilitated through HTTP communication using the REST API service. YOLOv8 is adopted to locate and isolate the sign image. PaddleOCR is used to extract the text information from the sign image. The database matching function with the Levenshtein distance is applied to find the matching room number in the database. The location identified by the database matching function is then used to update the user's current position within the building, completing the method's workflow.
For evaluations, we applied the proposal to two buildings at Okayama University, Japan. The results show that YOLOv8 achieved an mAP@0.5 of 0.995 and an mAP@0.5:0.95 of 0.978, and PaddleOCR could extract text in the sign image accurately, with an averaged CER lower than 10%. The database matching using the Levenshtein distance predicted the correct room number from the database, thereby accurately providing the coordinates to reset the user's location. The combination of YOLOv8 and PaddleOCR decreased the execution time by 6.71 s. The benefits of the proposed approach include improved user location reset accuracy after incorporating the object detection function, reduced execution time, and the elimination of the need to install and maintain QR codes. This makes the system more flexible and easier to implement in various environments. The results confirmed the effectiveness of the proposal.
In future work, we will increase the types of sign images to be recognized to improve the usability and accuracy of the proposal.

Figure 2. Overview of INSUS with object-detection-based user location reset method.

Figure 3. Connection between Unity (INSUS) and Python (SEMAR server) with HTTP communication using REST API service.

Figure 4. Example input image with sign.

Figure 5. (a) Example in original dataset; (b) example in the augmented dataset.

Figure 6. (a) Validation training from box loss; (b) validation training from class loss.

The Class Loss method measures the difference between the detected class and the ground truth class. A lower value of the measure indicates that the YOLOv8 model accurately predicts the class of the objects. Figure 6b shows the validation of the model in predicting the class of the objects. It achieved accurate classification results, with final Validation Class Loss and Train Class Loss values of 0.182 and 0.164, respectively. These results represent the ability of the trained model to detect the class in the datasets. The validation results indicate that the trained model successfully predicted the class of the object and avoided overfitting.

Figure 7. (a) Precision and recall during the training process; (b) mAP@0.5 and mAP@0.5-0.95 results during the training process.

Table 2. Device specification for model training.

Table 4. Specifications of evaluation environment.

Table 5. Specifications of evaluation devices.

Table 6. CER results for every floor level in both buildings with and without the incorporation of the YOLOv8 model.

Table 6 shows the results of combining the YOLOv8 model and PaddleOCR. The combination consistently achieved a CER below 10%, lower than that of PaddleOCR alone. However, this approach still produced errors in text extraction. To address this problem, we employed the Levenshtein distance algorithm in the database matching function. It compared the cleaned text with the list of room numbers in the Room Database and selected the room coordinates corresponding to the smallest Levenshtein distance. Sample results of the database matching function are displayed in Table 7.

Table 7. Sample results of Levenshtein distance from two implementation buildings.

Comparison of the Execution Time of the User Location Reset Method