Leveraging Graph Cut’s Energy Function for Context Aware Facial Recognition in Indoor Environments

Context-aware facial recognition regards the recognition of faces in association with their respective environments. This concept is useful for a domestic robot that interacts with humans while performing specific functions in indoor environments. Deep learning models have been relevant in solving facial and place recognition challenges; however, they require the procurement of training images for optimal performance. Pre-trained models have also been offered to reduce training time significantly. Even so, for classification tasks, custom data must still be acquired so that new learning models can be developed on top of these pre-trained models. This paper proposes a place recognition model inspired by the graph cut energy function, which was originally designed for image segmentation. Common objects in the considered environment are identified and then passed to a graph-cut-inspired model for indoor environment classification. Additionally, faces in the considered environment are extracted and recognised. Finally, the developed model can recognise a face together with its environment. The strength of the proposed model lies in its ability to classify indoor environments without the usual training process(es), in contrast to traditional deep learning models. The classification capability of the developed model was compared to state-of-the-art models and exhibited promising outcomes.


Introduction
Facial recognition (FR) systems have been used to resolve a host of challenges, including theft [1], access control [2], online examination management [3], gender recognition [4], facial expression recognition [5,6] and age estimation [7]. However, there is growing interest in context-aware FR systems, which offer the possibility of recognising faces as well as their immediate environments. With this approach to FR, a domestic robot, for example, can carry out security tasks or indoor services after detecting and processing the relevant information. To achieve such a system, a host of models, including Principal Component Analysis (PCA) [8][9][10], have been deployed. More recently, deep learning (DL) models have also been used [11]. These work by training on large datasets; training fine-tunes the model so that optimal classification is attained. This method has a downside: when new faces are acquired, the model must be retrained. Recent research has mitigated this challenge by developing an FR model that is trained on millions of faces [12]. For the classification task, vectors of two face images are passed into the model and compared for a match. A threshold value T informs the model's decision; a value greater than T suggests that a match does not exist, while a value less than or equal to T suggests that a match does exist. This model has the advantage that it does not require retraining on new faces.
Similarly, there has been growing research interest in place recognition, especially in localising domestic robots. The DL model plays a prominent role in this regard, although traditional models have also been useful. Within these traditional models, global and salient features of indoor environments have been used for indoor classification [13].
Unfortunately, one of the disadvantages of DL models is that they need a considerable number of images to sufficiently learn the unique features which distinguish one indoor environment from another. Many studies have therefore investigated the use of pre-trained models [14]: because these are already trained on thousands of images, only two or three custom convolution layers are needed to adapt them to a given classification problem, and only those layers are trained on custom datasets. In Liu et al. [15], a Convolutional Neural Network (CNN) feature extraction model is built on top of a pre-trained model. This model's output is a feature vector of the considered environments. For a test image, feature vectors of the considered environments are generated and compared; the best match becomes the recognised image. The downside of this model is that custom data still needs to be gathered to train the weights of the added layers [15]. For instance, when indoor environments are unique, acquiring datasets for them becomes a challenge. Fig. 1, for example, depicts a combination of a living room and a dining room. A training dataset for such a scenario is rare because popular place recognition datasets [16] have not considered this uniqueness, which complicates deployment in real-life scenarios. Fig. 2 depicts how the suggested model works. The names of everyday objects and concepts detected in the input image are passed to the model, which then makes a prediction regarding the environment. Furthermore, the face in the input image is extracted and recognised using the model proffered in Parkhi et al. [12]. Ultimately, a face together with its immediate environment is recognised.
For an indoor environment, this might be useful as a lightweight domestic robot can perform functions such as cooperating with identified and recognised faces in a defined environment. In the case where a face does not match a given (environment), an alarm or anomaly event can be raised. The contextual face recognition model can also be extended to other systems where safety is prioritised. For instance, road accidents caused by unlicensed drivers are one of the leading causes of death in America [17]. The model can thus shut down a train/bus/car when an unauthorised operator is observed.
The model put forward in this work takes advantage of a free, DL model for object and concept detection [18] alongside a modified graph cut. It is highly customisable, which means that everyday objects in a given environment can be gathered and used for place recognition without the need for training.
To the best of the researcher's knowledge, no work has been carried out concerning localised FR in an indoor environment. In Davis et al. [19], an FR model is offered whose accuracy is enhanced by incorporating environment metadata. The major disadvantage of this model is that when faces move around, the environment metadata may be unhelpful. Another indoor place recognition design has been put forward using the Bayesian model [20]. That model works by assigning probabilities to everyday objects found in indoor environments; these probabilities are then passed to a Bayesian model for indoor environment prediction. One downside of this model is that objects detected from a given scene rely on a probability value before place recognition is carried out; therefore, this study does not adopt that approach. Rather, a modified graph cut energy function is used for indoor environment recognition alongside an existing FR model. Section 2 of this work discusses everyday objects and concepts in the considered indoor environments and introduces the graph cut energy function as well as the suggested model. The proffered model is then evaluated in Section 3, whilst Sections 4 and 5 advance the discussion and conclusion of the study.

Theory
For context-aware FR, the names of everyday objects and concepts for the considered environments must be gathered. This research focuses on indoor environments in homes. Hence, five indoor environments have been considered; namely the bedroom, dining room, kitchen, living room and bathroom. Commonly associated objects and concepts for these five indoor environments are listed below.
(3) Bedroom: bed, pillow, …

It is imperative to mention that the objects or concepts attached to these indoor environments are customisable. They can be adjusted to meet specific environmental peculiarities.
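Such environment-to-object lists amount to a simple lookup structure. A minimal sketch follows; apart from "bed" and "pillow" (and kitchen objects named later in the paper, such as a pot, cooker and counter), the entries are illustrative assumptions that a deployer would replace with lists particular to their own environments:

```python
# Customisable mapping from indoor environments to everyday objects/concepts.
# Only a few entries come from the text; the rest are illustrative placeholders.
ENVIRONMENT_OBJECTS = {
    "bedroom": {"bed", "pillow", "wardrobe"},
    "dining room": {"dining table", "chair", "cutlery"},
    "kitchen": {"cooker", "pot", "counter", "sink"},
    "living room": {"sofa", "television", "coffee table"},
    "bathroom": {"bathtub", "toilet", "towel"},
}

def environments_containing(obj):
    """Return the environments whose object list mentions `obj`."""
    return [env for env, objs in ENVIRONMENT_OBJECTS.items() if obj in objs]
```

For example, `environments_containing("bed")` returns `["bedroom"]`; an object listed under no environment returns an empty list.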

Graph Cut Energy Function
Graph cut is a segmentation method that partitions an image into foreground and background; in other words, it binarises images [21]. Before segmentation, an image is transformed into a graph G. Image pixels become vertices or nodes (V), while edges (E) connect neighbouring pixels. Each node also connects to two terminal nodes, O and B, as depicted in Fig. 3. The figure shows a 2 × 3 image which has been transformed into a graph. Here, a, b, d, e, f, g are image pixels that have been transformed into nodes. One can observe that the nodes have two distinct colours, which depicts the inherent capability of graph cut segmentation as a binary classifier.
At the heart of this process is the graph cut energy function in Eq. (1), which is used to assign weights to the edges of the graph in Fig. 3. For example, edges attached to O and B are given weight values by the graph cut data term (the first term, starting from the left, in Eq. (1)). The data term determines the weight using the negative logarithm of the probability of a pixel I(i), given an observation F, being attached to node O or B. Weights between two neighbouring vertices are assigned by the second term, referred to as the smoothness term; here pixels or vertices I(i) and J(i) are neighbours, and σ is the pixel similarity variance [21]. The min-cut/max-flow algorithm is generally used [22] to optimally partition an image into foreground and background. The λ in Eq. (1) is a parameter that gives relative importance to the data term over the smoothness term.
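Eq. (1) itself did not survive in this text. From the surrounding description (a λ-weighted negative-log-probability data term and a Gaussian smoothness term over neighbouring pixels with variance σ), the standard interactive graph cut energy it refers to takes a form along these lines; this is a reconstruction, not the paper's exact typesetting:

```latex
% Hedged reconstruction of Eq. (1) from the surrounding description:
% lambda weights the data term; sigma is the pixel similarity variance.
E = \lambda \sum_{i \in V} -\ln \Pr\!\big(I(i) \mid F\big)
  \;+\; \sum_{(i,j) \in E} \exp\!\left(-\frac{\big(I(i) - J(i)\big)^{2}}{2\sigma^{2}}\right)
\tag{1}
```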

Modified Graph Cut Energy Function
One of the limitations of the graph cut segmentation model [23] is that it can only partition an image into foreground and background. This research therefore extends the capability of the graph cut energy function from bi-classification to multi-classification. Hence, the graph cut energy function is modified to accommodate multiple categorisations or classifications. As observed in Fig. 3, where a, b, d, e, f, g are pixel objects, these could also be transformed into objects detected in an indoor environment. However, instead of partitioning the objects into binary categories (foreground and background), these objects should be partitioned into five categories (kitchen, living room, dining room, bedroom and bathroom) based on the elicited objects. Fig. 4 depicts an envisioned graph cut segmentation for indoor classification. The coloured boxes, in this case, are everyday objects detected from an indoor environment, and they are segmented into the five identified environments.
To achieve the multiple classification goal, Eq. (1) is modified, as shown in Eq. (2). Eqs. (1) and (2) have common attributes. Firstly, they both have data and smoothness terms. However, there is an adjustment in the data term. The modified energy function takes the probability of an object I(i) belonging to a given domain F_m (m = 1, 2, …, N) among the considered domains (N = 5 for the five domains in the example used). In Eq. (2), n is the total number of objects elicited from a presented scene.
Furthermore, λ is applicable only when a given object is unique to the considered domain or environment. This is in line with the graph cut energy function offered in [21], where a particular pixel weight is given a high score based on an interactive pixel selection. For the smoothness term, when an object I(i) belongs to the domain in question, the value of I(i) is 255, which conveys cohesiveness; otherwise, I(i) becomes 0. E(F_m) gives the total number of points accumulated for the given domain F_m. In Eq. (2), λ is assigned 60 while σ is assigned 0.71.
Therefore, for the considered domains, there would be E(F_1), E(F_2), …, E(F_5). The energy with the highest value from Eq. (3) indicates the identified domain.

Figure 4: Graph cut seen as a multi-class classification
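As a rough operational sketch (not the authors' code), the scoring of Eqs. (2) and (3) can be imitated as follows, assuming binary membership probabilities, the 255/0 indicator values described above, and λ = 60, σ = 0.71 as in the text:

```python
import math

LAMBDA = 60    # relative importance of the data term (paper assigns 60)
SIGMA = 0.71   # similarity variance (paper assigns 0.71)

def domain_energy(detected, domain_objects, other_domains):
    """Accumulate a score E(F_m) for one candidate domain (hedged sketch)."""
    data = 0.0
    for obj in detected:
        p = 1.0 if obj in domain_objects else 0.0   # P(object | domain), binary here
        # lambda applies only when the object is unique to this domain
        unique = p == 1.0 and not any(obj in other for other in other_domains)
        data += LAMBDA * p if unique else p
    # smoothness: indicator value 255 for members, 0 otherwise; identical
    # values contribute exp(0) = 1, dissimilar values contribute ~0
    values = [255.0 if obj in domain_objects else 0.0 for obj in detected]
    smooth = sum(
        math.exp(-((values[i] - values[j]) ** 2) / (2 * SIGMA ** 2))
        for i in range(len(values)) for j in range(i + 1, len(values))
    )
    return data + smooth

def recognise_environment(detected, domains):
    """Pick the domain F_m with the highest energy E(F_m), per Eq. (3)."""
    return max(
        domains,
        key=lambda name: domain_energy(
            detected, domains[name],
            [objs for other, objs in domains.items() if other != name],
        ),
    )
```

Given the Fig. 5 detections (window, table, furniture, chair, counter) and per-environment object lists, `recognise_environment` would return the domain whose list best covers the detections, mirroring the paper's kitchen outcome.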

Facial Recognition
Ten frontal face images are acquired and passed into the deep face recognition model [12], which generates ten unique vectors V1, V2, …, V10 for these images. These vectors are then compared to the vector G that the model generates for a face extracted from the presented indoor environment. Eq. (4) is used to recognise the face G: the face match with the smallest cosine value becomes the recognised face. A threshold value, T, of 0.4 is set. This means that if G is extracted from a scene and does not match any stored face in the database, the system gives a "face not recognised" message.
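This matching rule can be sketched as follows, assuming Eq. (4) is the cosine distance 1 − cos(G, Vk), which fits the stated behaviour (smallest value wins; values above 0.4 indicate a foreign face). The gallery names are illustrative:

```python
import math

T = 0.4  # match threshold on the comparison value (the paper's setting)

def cosine_distance(a, b):
    """1 - cos(a, b): smaller means more similar (an assumed reading of Eq. (4))."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def recognise_face(g, gallery):
    """Compare probe vector g against the registered vectors V1..V10.

    Returns the best-matching name, or None ("face not recognised") when
    the smallest distance exceeds T.
    """
    name, dist = min(
        ((n, cosine_distance(g, v)) for n, v in gallery.items()),
        key=lambda t: t[1],
    )
    return name if dist <= T else None
```

A probe close to a stored vector returns that identity; a probe far from every stored vector falls above T and is reported as not recognised.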
Algorithm 1 details how the proposed model works.
Algorithm 1: Localised FR
1. Inputs: X and G (X encapsulates the objects I(i), …, I(n) elicited from an indoor scene, where n is the total number of objects and I represents everyday objects or concepts; G is the face vector extracted from the indoor environment).

2. Outputs: A and B (A is the recognised indoor environment and B is the recognised face in the indoor environment).

Evaluation
The developed model has been evaluated on 2543 images of the five indoor environments [16] discussed previously. The accuracy metric, given in Eq. (5), is used to evaluate its performance.
Accuracy (A) = (total number of correctly classified images) / (total number of images)   (5)
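Eq. (5) amounts to the following computation (a trivial sketch; the label names are illustrative):

```python
def accuracy(predictions, ground_truth):
    """Eq. (5): correctly classified images divided by total images."""
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)
```

For instance, `accuracy(["kitchen", "bathroom"], ["kitchen", "bedroom"])` evaluates to 0.5.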

Results
Tab. 1 shows the performance of the proffered model. The model outperformed the others in three categories out of a possible five. It did not perform adequately in the Dining room category because there is notable similarity between the objects of a living room and those of a dining room. However, when the indoor environment had somewhat unique objects, the developed model performed well (as seen in the Kitchen and Bathroom categories). The CNN feature classification model [15] outperforms the offered model in the Bathroom and Living room categories. The performance of this model may be attributed to the fact that its training and testing images may have been derived from the same source; hence, patterns of the indoor environment images could have been learned from the training dataset. This differs from the proposed model, for which training is not required.

Tab. 1 (partial): Classification accuracy per category

Category      Model                            Images   Accuracy (%)
              Indoor recognition method [13]   734      -
              Place recognition model [20]     734      55
              Place recognition model [15]     734      75
              Proposed model                   734      82.2
Living room   Indoor recognition method [13]   706      -
              Place recognition model [20]     706      89
              Place recognition model [15]     706      95
              Proposed model                   706      76.6

It is important to note that the performance of the proffered model hinges on the object detection algorithm. The adopted object detection model may not be robust in detecting everyday objects in the presented images. For example, consider the image in Fig. 5. The everyday objects or concepts that were detected are five in number (n = 5): a window, table, furniture, chair and counter. The object detection model omitted objects such as a pot and cooker. Regardless, based on the other identified objects, the proposed model can determine that the presented image is a kitchen. The data and smoothness terms for the considered indoor environments were determined as seen below. Drawing from the calculation, the domain with the highest value was the kitchen with 18.6.
Hence, the model classifies the input image as a kitchen. Furthermore, the face in Fig. 5 is extracted and sent to the face recognition model, which compares the extracted face G against the collection of registered faces; this returns a least value of 0.308 and the corresponding name. A value of more than 0.4 would have indicated a foreign face.

Discussion
In this research, the binary graph cut was modified into the required multi-class classifier. A localised FR system was developed which takes advantage of this multi-class classifier and an existing FR model. An advantage of the proposed model over others is that it can be adjusted to work in any indoor environment: the simple process involves gathering the everyday objects and concepts particular to the considered environment. This contrasts with conventional DL or pre-trained place recognition models, which require training.

Figure 5: Proposed model can recognise the face together with its environment (kitchen)

Conclusion
This paper developed a modified graph cut energy function for place classification. The model was combined with facial recognition to deliver a localised FR system. An experiment carried out on 2543 images showed encouraging prospects.
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.