A Function Area Division Approach for Autonomous Transportation System Based on Text Similarity

Along with emerging technologies and increasing demands, autonomation has become a significant trend in current transportation systems. Within this context, the autonomous transportation system (ATS) framework hinges on functions that serve as fundamental units to support its operation. Recognizing the divisions among these function areas can enhance our understanding of their meanings and interrelationships. This study introduces a method for dividing function areas within the ATS framework, grounded in text similarity, to mitigate reliance on subjective experience. Specifically, this method quantifies the similarity between functions based on their textual descriptions and applies hierarchical clustering to group them into distinct function areas. To validate the effectiveness of the proposed method, a case study analyzing a vehicle automatic driving scenario was conducted. The results demonstrate that our approach can efficiently divide function areas, producing clustering outcomes with high accuracy and purity relative to the reference classification. Consequently, this method has the potential to facilitate the formulation of function areas within ATS, thereby supporting the autonomous operation and construction of ATS. Moreover, its applicability extends beyond ATS, showing promise for other clustering problems that involve multiple texts, such as text classification.


Introduction
Along with emerging technologies and increasing demands, autonomation has become a significant trend in current transportation systems. Against this background, the concept of the autonomous transportation system (ATS) emerged [1], aiming to realize autonomous perception, autonomous learning, autonomous decision-making, and autonomous action for transportation systems.
The construction of the ATS depends on the guidance of an architecture framework, and function area division is an essential part of ATS framework research. The related concepts can be explained as follows. Services are the applications and values that the system can provide for users. For example, a transportation system can provide users with services such as "parking space management," "freight administration," and "vehicle emergency response." Functions are the processes and activities used to support services. For example, the realization of the "parking space management" service relies on functions such as "get personal driver request," "process vehicle location data," "determine dynamic parking lot state," and "output parking lot information to drivers." Function areas are sets of functions with common data processing characteristics and application scope. Function area division helps organize the functions of transportation systems and sort out intra-area correlation and interarea coordination. Furthermore, it helps determine the key modules of ATS development.
As the basis of ATS framework research, the traditional intelligent transportation system (ITS) frameworks have a history of more than 20 years and include the typical frameworks of the United States, the European Union, and China. Research on the ITS framework of the United States started first, and the framework has been continuously improved since 1993. The latest version, 9.0 [2], was released in 2020 to adapt to the transportation reform brought by automatic driving. The ITS framework of the European Union has been studied since the 1990s and was updated to version 4.1 in 2011 [3]. The ITS framework of China was studied in the early 21st century and has not been updated since the completion of version 2.0 in 2005 [4]. These three ITS frameworks have affected the development of the ITS frameworks of other countries and cities [5][6][7][8], and they have been evaluated by some researchers [9][10][11]. Table 1 lists the function areas of the ITS frameworks of the United States, the European Union, and China; the contents of the three are similar. However, the three ITS frameworks do not explain or justify the division logic of their function areas, which often relies on expert experience. Jiang et al. [12] tried to use rough sets to identify new function areas of the ITS framework, which was a preliminary study of function area division methods. Still, there has been no further or similar research or application.
Based on the traditional ITS frameworks, many explorations of new ITS frameworks and improved ITS subsystems have been carried out recently. In particular, with the rapid development of Internet of Vehicles technology, some related ITS frameworks have appeared, such as the vehicle-to-vehicle-to-infrastructure framework [13], the vehicular ad hoc network (VANET) communication architecture [14], and the VANET architecture assisted by unmanned aerial vehicles [15]. In addition, some studies were devoted to improving subsystems of the ITS frameworks, including public transportation systems [16,17], vehicle tracking systems [18,19], ITS security systems [20][21][22], ITS information systems [23,24], and ITS communication systems [25,26]. However, most current research on ITS frameworks incorporates only a limited set of emerging technologies or focuses only on ITS subsystems. There are few studies as in-depth and detailed as the three ITS frameworks mentioned previously.
Regarding function area division, research on new ITS frameworks has updated the content of the functions. Still, it mostly relies on subjective construction methods and has not yet formed a clear and complete methodology. As transportation systems become more complex and their functions more abundant, an adaptive function area division method is urgently needed to keep pace with the dynamic evolution of transportation systems. Each function has some short texts that embody its characteristics, and these texts can be used to cluster functions with commonalities into the same function area. Therefore, dividing function areas can be regarded as clustering the function texts, a task generally called text clustering. Text clustering groups similar texts within a set of texts [27], and many algorithms are available for it, such as hierarchical clustering [28], k-means clustering [29], eigenspace-based fuzzy c-means clustering [30], and deep embedding [31]. In most text clustering algorithms, computing text similarity is a necessary step. Text similarity measures the commonality between two texts and is often used in text clustering [32], text classification [33], information retrieval [34], and document matching [35]. In product-service systems and web service discovery, text similarity has been applied in service clustering research [36,37], which is similar to our research problem. Inspired by these works, text similarity is adopted here to establish the ATS function area division method.
In this paper, an ATS function area division method based on text similarity is proposed, transforming the function area division task into a short text clustering task. In this method, the function similarity is measured from texts, and the functions are grouped by hierarchical clustering. Therefore, the method can adaptively form function areas, helping reduce the dependence on subjective experience and improve the efficiency of function area division work.
The remainder of this paper is organized as follows. Section 2 describes the steps of the function area division method, including function text processing, function similarity calculating, function area dividing, and method performance evaluating. In Section 3, a case analysis for a vehicle automatic driving scenario is provided, and the method performance is evaluated. Finally, Section 4 concludes the work and discusses possible future improvements.

Methods of Function Area Division
The function area division clusters similar functions by considering various aspects of transportation system operation. As a result, the functions within the same function area have high similarity, and functions in different function areas have low similarity, as shown in Figure 1.
The technical route for function area division is shown in Figure 2.
The steps are as follows:
(1) Function text processing: The text of the function name is converted into a phrase set by phrase segmenting and stop word removal. The texts of three function attributes, "function provider," "process object," and "service object," are converted into word sets directly.
(2) Function similarity calculating: The function name similarity and function attribute similarity are calculated based on the Jaccard coefficient. Then, the comprehensive similarity matrix of functions is obtained as a weighted average of the two kinds of similarity.
(3) Function area dividing: The final function area result is obtained by hierarchical clustering and the silhouette coefficient. The function areas are named by analyzing the keywords of the function texts. In addition, the functions in every function area are reclassified with the "operation stage" attribute, and a three-level function list is finally formed.
(4) Method performance evaluating: A function set of a vehicle automatic driving scenario is constructed, and the function areas are divided manually as the reference classification. Based on the reference classification, the method performance is evaluated with accuracy and purity.

Function Text Processing.
The text formats and processing of the ATS functions are introduced in this subsection. In the ATS framework, a function is described by the function name and four attributes. The texts of the function name and of three function attributes ("function provider," "process object," and "service object") are processed for clustering functions. Note that the function attribute "operation stage" is not used for clustering functions.
Journal of Advanced Transportation

Function Name.
The function name summarizes the function content and is a short Chinese text composed of "verb + noun or noun phrase," such as "dispatch emergency vehicles." Extra information can be supplemented in parentheses, such as "provide traffic information query (traveler interface)." Some auxiliary words, such as empty words (words with grammatical rather than lexical meaning) and conjunctions, may exist in the function name text. Therefore, meaningful phrases must be extracted from the function name text for the subsequent similarity calculation. The processing of the function name text is divided into the following two steps:
(1) Phrase segmenting: This work uses jieba, a mainstream Chinese word segmentation tool, to segment the function name texts into independent phrases, such as "monitor/passenger/anomaly/behavior"; each phrase contains no more than three words.
(2) Stop word removal: In this work, stop words refer to meaningless symbols and redundant words. There are mainly two types. The first consists of empty words, conjunctions, or other isolated words left after segmentation, together with symbols such as parentheses. The second consists of the verbs at the beginning of the text, such as "collect" and "process," which provide little help in distinguishing the function areas.
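The stop word removal step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the name has already been segmented (for example by jieba) into a "/"-separated string, and the two word lists `LEADING_VERBS` and `STOP_TOKENS` are hypothetical placeholders, since the actual lists are not given in the text.

```python
# Hypothetical word lists; the lists actually used in the study are not published here.
LEADING_VERBS = {"collect", "process", "provide", "monitor"}
STOP_TOKENS = {"and", "of", "the", "(", ")"}


def name_to_phrase_set(segmented_name):
    """Convert a '/'-separated, already segmented function name into a phrase set."""
    phrases = [p.strip().lower() for p in segmented_name.split("/") if p.strip()]
    # Drop the verb at the beginning, which says little about the function area.
    if phrases and phrases[0] in LEADING_VERBS:
        phrases = phrases[1:]
    # Drop symbols and auxiliary words left over from segmentation.
    return {p for p in phrases if p not in STOP_TOKENS}


print(name_to_phrase_set("monitor/passenger/anomaly/behavior"))
```

Only a verb in the first position is removed, so the same word appearing later in the name is kept as a meaningful phrase.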

Function Attribute.
The function attributes describe the functional characteristics and embody the autonomous operation logic. Every function has four attributes:
(1) "Function provider": physical objects that provide the functions, including "user body," "system module," and "integration platform," such as vehicle-mounted equipment, infrastructures, and information platforms.
(2) "Process object": information objects used in the process of function realization, such as road network information and emergency events.
(3) "Service object": the "user body" that can directly use the functions, or the "system module" and "integration platform" that directly use the output results of the functions, such as travelers, vehicle-mounted equipment, and information platforms.
(4) "Operation stage": system operation stages that reflect the autonomous operation logic in the functions, including the four stages of "perception," "learning," "decision-making," and "action."
The function attribute values are the specific contents of the four function attributes, and the definitions of all function attribute values can be found on the ATS website [38]. This research only studies functions whose function attributes each have a single value.
The three function attributes "function provider," "process object," and "service object" are used for clustering functions. The texts of their attribute values are relatively simple, composed of short noun phrases without redundant components. Therefore, they can be directly converted into word sets for similarity calculation.

Function Similarity Calculating.
The similarities of function names and function attributes are calculated first. Then, the comprehensive similarities between functions are calculated from these two kinds of similarities.
The Jaccard coefficient can be used to calculate text similarity by measuring the degree of overlap between the phrases or words of two texts [37,39]. Thus, the similarities of function names and attributes are measured with the Jaccard coefficient.

Function Name Similarity.
The function name similarity is calculated from the phrase sets of two functions' names with the Jaccard coefficient:
$$r_{ij}^{(s)} = \frac{|W(s_i) \cap W(s_j)|}{|W(s_i) \cup W(s_j)|}, \tag{1}$$
where $r_{ij}^{(s)}$ is the similarity between function $i$ and function $j$ with respect to the function name, and $W(s_i)$ is the phrase set of the name of function $i$.
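As a small illustration of the Jaccard coefficient on phrase sets, the sketch below computes the overlap ratio of two sets. The function name `jaccard` and the convention of returning 0.0 when both sets are empty are assumptions made here for the example.

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets.

    Returning 0.0 when both sets are empty is a convention assumed for this sketch.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


names_i = {"passenger", "anomaly", "behavior"}
names_j = {"passenger", "behavior", "record"}
print(jaccard(names_i, names_j))  # two shared phrases out of four distinct ones
```

The same helper applies unchanged to the word sets of the function attributes.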

Function Attribute Similarity.
The function attribute similarity is calculated from the word sets of two functions' attributes with the Jaccard coefficient:
$$r_{ij}^{(x_n)} = \frac{|C(t_i^{(x_n)}) \cap C(t_j^{(x_n)})|}{|C(t_i^{(x_n)}) \cup C(t_j^{(x_n)})|}, \tag{2}$$
where $r_{ij}^{(x_n)}$ is the similarity between function $i$ and function $j$ with respect to function attribute $x_n$, and $C(t_i^{(x_n)})$ is the word set of function $i$ for function attribute $x_n$.

Comprehensive Similarity between Functions.
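The comprehensive similarity between two functions is obtained as a weighted average of the similarities of function names and function attributes:
$$r_{ij} = w_1 r_{ij}^{(s)} + w_2 r_{ij}^{(x_1)} + w_3 r_{ij}^{(x_2)} + w_4 r_{ij}^{(x_3)}, \tag{3}$$
where $r_{ij}$ is the comprehensive similarity between function $i$ and function $j$; $r_{ij}^{(x_1)}$, $r_{ij}^{(x_2)}$, and $r_{ij}^{(x_3)}$ are the similarities of the three function attributes "function provider," "process object," and "service object"; $w_k$ is the weight of the $k$th similarity; and $w_1 + w_2 + w_3 + w_4 = 1$.
The comprehensive similarity matrix of $m$ functions is
$$R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1m} \\ r_{21} & r_{22} & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & r_{mm} \end{pmatrix},$$
where $r_{ii} = 1$, $r_{ij} = r_{ji}$, and $0 \le r_{ij} \le 1$.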
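The weighted average and the resulting symmetric matrix can be sketched as below. The dictionary keys in `FIELDS` and the helper names are hypothetical; the default weights are the values reported later for the case study, ordered as function name, "function provider," "process object," and "service object."

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets (0.0 if both are empty, by assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical field names for the four text sets of each function.
FIELDS = ("name", "provider", "process", "service")
# Weights w1..w4 as reported for the case study; they must sum to 1.
WEIGHTS = (0.6, 0.05, 0.2, 0.15)


def similarity_matrix(functions, weights=WEIGHTS):
    """Build the symmetric comprehensive similarity matrix as a list of lists."""
    n = len(functions)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        R[i][i] = 1.0  # every function is fully similar to itself
        for j in range(i + 1, n):
            r = sum(w * jaccard(functions[i][f], functions[j][f])
                    for w, f in zip(weights, FIELDS))
            R[i][j] = R[j][i] = r
    return R


f0 = {"name": {"parking", "lot", "state"}, "provider": {"information", "platform"},
      "process": {"parking", "information"}, "service": {"traveler"}}
f1 = {"name": {"parking", "lot", "information"}, "provider": {"information", "platform"},
      "process": {"parking", "information"}, "service": {"traveler"}}
R = similarity_matrix([f0, f1])
print(R[0][1])
```

In this toy pair the attributes match exactly and the names share two of four phrases, so the score is about 0.6 × 0.5 + 0.05 + 0.2 + 0.15 = 0.7.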

Function Area Dividing.
All candidate clustering results are obtained by agglomerative hierarchical clustering, and the optimal clustering result, which is taken as the final function area result, is selected with the silhouette coefficient. Then, the function areas are named by analyzing the keywords of the function texts. Finally, the functions in every function area are reclassified by the "operation stage" attribute, and a three-level function list is formed.

Function Clustering.
Hierarchical clustering is a simple and effective unsupervised clustering method that only needs the distances between samples. Specifically, this research adopts an agglomerative hierarchical clustering algorithm to cluster functions into function areas. The basic idea is to regard each function as an initial function area and then repeatedly merge the two closest function areas until all functions are merged into one function area. The flow of function clustering with the agglomerative hierarchical clustering algorithm is shown in Figure 3.
In function clustering, the comprehensive similarity matrix is first transformed into the comprehensive distance matrix, which is used as the initial distance between function areas. Then, during iteration, the distance between function areas is updated using the average sample distance between function areas:
$$d_{\mathrm{avg}}(E_s, E_t) = \frac{1}{|E_s|\,|E_t|} \sum_{i \in E_s} \sum_{j \in E_t} \mathrm{dist}(i, j),$$
where $d_{\mathrm{avg}}(E_s, E_t)$ is the average sample distance between function area $E_s$ and function area $E_t$, and $\mathrm{dist}(i, j)$ is the distance between function $i$ and function $j$.
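A compact pure-Python sketch of the merging loop with average sample distance is given below. The text does not state how similarity is converted to distance; the conversion `1 - r` in `to_distance` is an assumption, and the function names are illustrative only.

```python
def to_distance(R):
    """Convert a similarity matrix to a distance matrix (assumed here: d = 1 - r)."""
    return [[1.0 - r for r in row] for row in R]


def average_linkage(D, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(D))]  # each function starts as its own area

    def d_avg(a, b):
        # Average distance over all cross-cluster pairs of functions.
        return sum(D[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = ((s, t) for s in range(len(clusters))
                 for t in range(s + 1, len(clusters)))
        s, t = min(pairs, key=lambda p: d_avg(clusters[p[0]], clusters[p[1]]))
        clusters[s] = clusters[s] + clusters[t]
        del clusters[t]  # t > s, so index s is unaffected
    return clusters


R = [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.15, 0.1],
     [0.1, 0.15, 1.0, 0.8],
     [0.2, 0.1, 0.8, 1.0]]
print(average_linkage(to_distance(R), 2))
```

Stopping the loop at each cluster count from 15 down to 2 yields the candidate divisions that are later compared. This naive version recomputes averages on every pass, which is acceptable for a few hundred functions but not for large sets.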

Optimal Division Result Determining.
Hierarchical clustering yields a clustering result for every possible number of clusters, whereas only the optimal division result is needed. Therefore, clustering performance should be evaluated to select the optimal division result as the final function area division result. Clustering performance is usually evaluated with external and internal indicators [40]. An external indicator compares the clustering result with a manual division result, whereas an internal indicator evaluates the clustering result directly. Compared with an external indicator, an internal indicator is more easily obtained and better suited to function area division work in different scenarios. Therefore, an internal indicator is selected to determine the optimal division result.
The silhouette coefficient [41] is a common internal indicator that considers both cluster cohesion and separation. Moreover, it does not require the coordinates of cluster centers, so it is suitable for evaluating clustering performance in our work, where only sample similarities are available. The value of the silhouette coefficient ranges from −1 to 1, and a high value indicates that the intra-area similarity is high and the interarea similarity is low, representing good clustering performance. Therefore, the clustering result with the maximum silhouette coefficient is the optimal division result.
Based on the function area division in the ITS frameworks, the maximum number of function areas is set to 15. In addition, a single function area that includes all functions is meaningless, so the minimum number of function areas is set to 2. The optimal division result is obtained by evaluating the silhouette coefficients for cluster numbers ranging from 2 to 15.
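The selection rule can be sketched from a distance matrix alone, as follows. The helper names are illustrative, scoring singleton clusters as 0 is a common convention assumed here, and `candidates` stands for the divisions produced by hierarchical clustering for each cluster number.

```python
def silhouette(D, clusters):
    """Mean silhouette coefficient computed from a distance matrix.

    Requires at least two clusters; singleton clusters score 0 (assumed convention).
    """
    scores = []
    for c in clusters:
        for i in c:
            if len(c) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to the other members of the same cluster (cohesion)
            a = sum(D[i][j] for j in c if j != i) / (len(c) - 1)
            # b: smallest mean distance to any other cluster (separation)
            b = min(sum(D[i][j] for j in o) / len(o) for o in clusters if o is not c)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def best_division(D, candidates):
    """Return the cluster number whose division has the highest silhouette coefficient."""
    return max(candidates, key=lambda k: silhouette(D, candidates[k]))


D = [[0.0, 0.1, 0.9, 0.8],
     [0.1, 0.0, 0.85, 0.9],
     [0.9, 0.85, 0.0, 0.2],
     [0.8, 0.9, 0.2, 0.0]]
candidates = {2: [[0, 1], [2, 3]], 3: [[0, 1], [2], [3]]}
print(best_division(D, candidates))
```

In the study the candidate cluster numbers run from 2 to 15; the toy example uses only two candidates to keep the arithmetic easy to follow.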

Function Area Naming.
After the optimal division result is obtained, the function areas are named manually, which is feasible because the number of function areas is no more than 15. For each function area, the three phrases with the highest frequency in the function names and the attribute value with the highest frequency in each function attribute are extracted as keywords to provide references for manual naming.
The keywords and candidate names of function areas are summarized from the function area research for the ITS frameworks, as shown in Table 2. Appropriate names can be selected directly according to the keywords.
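The keyword extraction that supports manual naming can be sketched with a frequency count. The record layout (a "name" phrase collection plus one single-valued string per attribute) and the key names are assumptions for illustration; the sample functions are invented and not taken from the case study.

```python
from collections import Counter


def area_keywords(functions, top_n=3):
    """Return the most frequent name phrases and the most frequent value of each attribute."""
    name_counts = Counter(p for f in functions for p in f["name"])
    keywords = {"name": [p for p, _ in name_counts.most_common(top_n)]}
    for attr in ("provider", "process", "service"):
        # Each attribute is assumed to hold a single value, as in the study.
        keywords[attr] = Counter(f[attr] for f in functions).most_common(1)[0][0]
    return keywords


area = [
    {"name": ["parking", "lot", "state"], "provider": "information platform",
     "process": "parking information", "service": "traveler"},
    {"name": ["parking", "lot", "information"], "provider": "information platform",
     "process": "parking information", "service": "traveler"},
    {"name": ["parking", "space"], "provider": "vehicle-mounted equipment",
     "process": "vehicle location", "service": "traveler"},
]
kw = area_keywords(area)
print(kw)
```

Phrases with equal counts are returned in first-seen order, so ties at the cutoff should be reviewed by hand before a name is chosen.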

Three-Level Function List Generating.
Each function area contains many functions, which makes it hard to display the functions clearly, so the functions in all function areas are further categorized.
In each function area, the "operation stage" function attribute is used to reclassify the functions into "perception function," "learning function," "decision-making function," and "action function." As a result, a three-level function list is generated, whose structure is shown in Figure 4.
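The regrouping into a three-level list (function area, operation stage, function) can be sketched as a nested dictionary. The input layout and the stage assigned to each sample function below are assumptions made only for this example.

```python
STAGES = ("perception", "learning", "decision-making", "action")


def three_level_list(areas):
    """Group the functions of each area by operation stage.

    areas maps an area name to a list of (function name, operation stage) pairs.
    """
    result = {}
    for area, functions in areas.items():
        by_stage = {stage: [] for stage in STAGES}
        for name, stage in functions:
            by_stage[stage].append(name)
        # Keep only the stages that actually occur in this area.
        result[area] = {s: names for s, names in by_stage.items() if names}
    return result


# Stage labels here are illustrative assumptions, not values from the ATS function set.
areas = {
    "manage traveler services": [
        ("get personal driver request", "perception"),
        ("output parking lot information to drivers", "action"),
        ("determine dynamic parking lot state", "decision-making"),
    ]
}
levels = three_level_list(areas)
print(levels)
```

Stages are emitted in the fixed order perception, learning, decision-making, action, regardless of the order of the input functions.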

Method Performance Evaluating.
The performance of the function area division method needs to be evaluated. Thus, we construct a vehicle automatic driving scenario and obtain a function set for this scenario. Then, the function areas are divided manually as the reference classification, and the method performance is evaluated by comparing the clustering result with the reference classification.
Two external indicators are used to assess the clustering performance against the reference classification: accuracy and purity [42].

Accuracy.
The accuracy is the ratio of functions divided correctly. It indicates the agreement between the clustering result and the reference classification, and a value close to 1 represents good clustering performance. The accuracy is calculated with
$$\mathrm{ACC} = \frac{|F_{\mathrm{succ}}|}{|F|},$$
where $\mathrm{ACC}$ is the accuracy, $|F_{\mathrm{succ}}|$ is the number of functions divided correctly, and $|F|$ is the total number of functions.

Purity.
Since the function distribution is not uniform in specific scenarios, accuracy alone cannot reflect the clustering performance well. Therefore, purity is introduced to help evaluate the clustering performance. It indicates the precision of clustering, and a value close to 1 represents good clustering performance. The purity of each function area is calculated with
$$\mathrm{PUR}(C_k) = \frac{\max_s |C_k^s|}{|C_k|},$$
where $\mathrm{PUR}(C_k)$ is the purity of the $k$th function area, $|C_k|$ is the number of functions in the $k$th function area, and $|C_k^s|$ is the number of functions in the $k$th function area that belong to the $s$th function area of the reference classification.
The purity of the clustering result is calculated as the weighted average of the purities of all function areas:
$$\mathrm{PUR} = \sum_k \frac{|C_k|}{|F|}\,\mathrm{PUR}(C_k),$$
where $\mathrm{PUR}$ is the purity of the clustering result, and $\mathrm{PUR}(C_k)$ is the purity of the $k$th function area.
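Both indicators can be sketched as below. The text does not say how a function is judged to be "divided correctly"; this sketch assumes the best one-to-one mapping between clusters and reference areas, found by brute force, which is only practical for a small number of areas. The purity weights are assumed to be the cluster sizes.

```python
from collections import Counter
from itertools import permutations


def purity(pred, ref):
    """Size-weighted average purity; equals the sum of majority counts over all functions."""
    members = {}
    for p, r in zip(pred, ref):
        members.setdefault(p, []).append(r)
    return sum(max(Counter(m).values()) for m in members.values()) / len(ref)


def accuracy(pred, ref):
    """Share of functions matched under the best one-to-one cluster-to-area mapping (assumed)."""
    p_labels, r_labels = sorted(set(pred)), sorted(set(ref))
    # Surplus clusters are mapped to None, i.e. to no reference area.
    targets = r_labels + [None] * max(0, len(p_labels) - len(r_labels))
    best = 0
    for perm in permutations(targets, len(p_labels)):
        mapping = dict(zip(p_labels, perm))
        best = max(best, sum(mapping[p] == r for p, r in zip(pred, ref)))
    return best / len(ref)


pred = [0, 0, 1, 1, 2, 2]
ref = ["a", "a", "a", "a", "b", "b"]
print(purity(pred, ref), accuracy(pred, ref))
```

In this toy case a reference area is split into two pure clusters, so purity is 1 while accuracy drops to 4/6. This shows why a division with more function areas than the reference can have higher purity but lower accuracy.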

Scenario Verification and Result Discussion
A typical autonomous transportation scenario is constructed to verify the performance of the function area division method. First, a function set supporting this scenario is constructed. Then, the function areas are obtained by using the method. Finally, the performance of the method is verified by comparing the clustering result with the reference classification through two external indicators: accuracy and purity.

Scenario Hypothesis and Function Set Construction.
Vehicle automatic driving is a typical autonomous transportation scenario. In this scenario, we consider an event in which users want to drive from their homes to their workplaces on city roads. Before the trip, the users need to obtain travel information and plan routes. During the trip, users need to drive their cars and have safety requirements. In addition, they need to park their cars and pay traffic costs when arriving at the destination. Finally, after the trip, the system needs to provide travel evaluation and feedback to improve travel quality. Completing this event requires the ATS framework to have the corresponding functions. As a result, a function set containing 174 ATS functions is constructed based on the autonomous operation logic (see Appendix I in the Supplementary Materials), and the detailed contents of each function can be found in [38]. With reference to the traditional ITS frameworks, the 174 ATS functions are manually divided into 9 function areas as the reference classification (see Appendix I in the Supplementary Materials). The function distribution of these function areas is shown in Table 3.

Function Area Division Result.
Based on the method of Section 2, the functions can be clustered into function areas. The weights in (3) are set to $w_1 = 0.6$, $w_2 = 0.05$, $w_3 = 0.2$, and $w_4 = 0.15$, which are the optimal values found by experiments.
The silhouette coefficients for cluster numbers ranging from 2 to 15 are calculated to determine the optimal cluster number, as shown in Figure 5. As the cluster number increases, the silhouette coefficient first increases and then decreases. It reaches a maximum of 0.1989 when the cluster number is nine. Thus, the optimal cluster number is nine, the same as the number of function areas in the reference classification.
When the cluster number is nine, the accuracy and purity are both 0.8966, indicating that the function area division method can work well for function clustering.
Then, the keywords of the function name and three function attributes are extracted according to word frequency and combined with the candidate names in Table 2 to name the function areas, as shown in Table 4. Finally, the three-level function list is obtained based on the function attribute "operation stage" (see Appendix I in the Supplementary Materials).

Analysis of Experiment Results.
The effect of different text similarity weights and the clustering performance of the optimal division result are analyzed below.

Effect Analysis of Text Similarity Weights.
In (3), four weights need to be set when calculating the comprehensive similarity of functions. To explore the influence of the weight values on clustering performance, we conduct clustering experiments in which the weight values are varied with a step size of 0.05. The weight values and clustering performance of the top ten experiments, ranked by accuracy, are shown in Table 5.
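One way to enumerate candidate weight settings with a step size of 0.05 is sketched below. Whether the experiments covered every such combination is not stated, so exhaustive enumeration (including zero weights) is an assumption, and the function name is illustrative.

```python
def weight_grid(step_count=20, n_weights=4):
    """Yield all weight tuples that are multiples of 1/step_count and sum to 1.

    step_count=20 corresponds to a step size of 0.05.
    """
    def compositions(total, parts):
        # All ways to split `total` steps among `parts` non-negative integers.
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    for combo in compositions(step_count, n_weights):
        yield tuple(c / step_count for c in combo)


grid = list(weight_grid())
print(len(grid))
```

Working in integer steps and dividing once at the end avoids accumulating floating-point error, so each tuple sums to 1 up to rounding. Each tuple would then be passed to the similarity, clustering, and evaluation steps in turn.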
In each of the first seven experiments, the accuracy and the purity take the same value, and most of these seven experiments have nine function areas, consistent with the reference classification. The last three experiments have more function areas than the reference classification, showing that the functions are divided more finely, which results in higher purity and lower accuracy. The clustering result with high accuracy should be preferred because it is closer to the reference classification.
The relative order of the four weights in Table 5 is $w_1 \ge w_3 > w_4 \ge w_2$; that is, the weights of the function name and "process object" are relatively large, while the weights of "function provider" and "service object" are relatively small. Specifically, the function name has the largest weight, with $w_1 \ge 0.3$. This indicates that the function name carries rich semantic information and helps function clustering the most. Among the three function attributes, "process object" is the most important, "function provider" is the least significant, and "service object" plays a complementary role.
In Table 5, some experiments have different weights but the same clustering performance. In particular, experiment 1 and experiment 2 obtain different numbers of function areas, but their accuracies and purities are the same. The function distributions of the two experiments are shown in Figure 6, and both clustering results are close to the reference classification. However, the result of experiment 2 lacks the "manage traffic facilities" function area, and the functions of this area are mistakenly assigned to the "manage traffic network" function area. In addition, some functions are incorrectly assigned to the "monitor and manage environment" function area in experiment 1, whereas these functions are assigned to the correct function areas in experiment 2. Although the two experiments err in different places, their numbers of correctly divided functions are the same, which gives them the same accuracy and purity. In this scenario, both "manage traffic facilities" and "monitor and manage environment" contain few functions. Because of the limited number of functions, an accurate performance comparison is difficult, and the two experiments can be considered to have similar clustering performance.
The clustering performance of the top ten experiments does not differ significantly; in particular, the indicator values of the first four experiments differ by only 0.58%. When priority is given to a number of function areas consistent with the reference classification, the clustering performance of experiment 1 is the best, followed by experiment 3 and experiment 4. Additionally, two of the weights in experiment 3 are 0, which simplifies the calculation with little loss of precision.
In most experiments, the "function provider" weight is the smallest and even close to 0. Thus, we set this weight to 0 to explore the influence of the weights of the other two function attributes on the clustering performance. Figure 7 shows how the accuracy and purity change with the weights $w_3$ and $w_4$. On the whole, the accuracy and purity increase as $w_3$ increases, and they remain at a relatively stable high value in the range $w_4 < 0.3$ and $w_3 \ge 0.2$. When $w_4 = 0$ and $w_3 = 0.2$, the purity reaches its highest value of 0.8908, which is the result of experiment 3 in Table 5. When $w_4 > w_3$, the accuracy and the purity both decrease drastically, although the purity is generally higher than the accuracy, indicating that each function area is still relatively pure. In short, "process object" is the most critical influencing factor among the function attributes and receives the largest attribute weight.

Clustering Performance Analysis of Optimal Division Result.
The accuracy and purity of the optimal division result (experiment 1) are both 0.8966, indicating that the function clustering precision is high and close to the reference classification.
For the optimal division result, the purity of each function area relative to the reference classification is given in Table 6. Most of the function areas are divided relatively correctly; especially, the purities of $C_1$ (provide [...] However, the clustering performance of $C_7$ (monitor and manage environment) is relatively poor. This is because the functions concerning the weather environment in this scenario are very few, and most of them serve travelers, so they are easily confused with the functions of $C_6$ (manage traveler services). If more functions concerning the weather environment were added, the distinction between these two function areas might improve.
In short, each function area is relatively independent, and the function similarity within the same function area is high, meeting the requirements of ATS function area division. In addition, the keywords in the function texts reflect the characteristics of the function area to some extent, and, combined with the candidate name list, the naming result can be consistent with the reference classification. Therefore, the function clustering and naming methods proposed in this research are helpful for practical function area division.

Conclusion
A division approach that can adaptively divide function areas is proposed based on text similarity for ATS framework research. To test it, the function set of a vehicle automatic driving scenario is constructed, and the functions are clustered into function areas using the method. The experimental results show that the proposed method can effectively divide function areas and that the clustering results are relatively accurate. In addition, the text keywords help to name function areas while reducing the dependence on subjective experience.
Even so, the ATS function area division research still has some limitations. First, the evaluation of the method performance depends on comparing the clustering result with the reference classification. Hence, the quality of the reference classification is vital, yet it has not been independently verified. Second, more realistic and complicated scenarios are necessary to verify the approach in the future. Last, the function areas are named manually here; automatically generating function area names is worth studying because it would make it easier to verify a large number of scenarios. In the future, more efficient evaluation methods, more extensive scenario verification, and more appropriate function area name generation will be studied further.
The function area division method helps reduce the dependence on subjective experience and increases the efficiency of function area division work. In addition, the method is suitable for other ATS scenarios and for clustering problems in other areas, such as text classification, provided that the clustering objects are described by multiple texts.

Data Availability
The data supporting the current study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.