Identifying Cross Section Technology Application through Chinese Patent Analysis

Abstract: Cross-domain technology application is the application of technology from one field to another to create a wide range of application opportunities. Successfully identifying emerging cross section technology applications in patent documents is vital to the competitive advantage of companies, and even nations. An automatic process is needed to save the precious resources of human experts and to exploit huge numbers of patent documents. Chinese patent documents are the source data of our experiment. In this study, an identification algorithm was developed on the basis of a cross-collection mixture model to identify cross section and emerging technology from patents written in Chinese. To verify the algorithm's effectiveness, documents in three transmission-related technology subclasses and one application technology category were collected from WEBPAT Taiwan. The former subclasses consist of H04B: Transmission; H04L: Transmission of digital information; and H04N: Image communication; the latter is G06Q: Patents for administration, management, commerce, operation, supervision, or prediction by using data processing systems or methods. Growth rate detection is the most popular approach to forecasting emerging technologies; our research defined the growth rate as the difference between the numbers of technology-containing documents published at different times. The emerging technology identified using the proposed method exhibited an average growth rate of 95.08%. By comparison, two benchmark methods identified emerging technology with average growth rates of 9.57% and 51.49%.


Introduction
Cross-domain technology application uses technology from one field in another, with impacts on human lives and commercial undertakings. Examples include the global positioning system (GPS), radio frequency identification (RFID), light-emitting diode (LED), and financial technology (fintech). The GPS project was launched by the U.S. Department of Defense in 1973 for military use, and became fully operational [...]
Since China's accession to the World Trade Organization (WTO) in 2001, the number of patents written in Chinese has increased exponentially [21]. Therefore, technology that can process text documents written in Chinese is essential. The problem in identifying technological terms in such documents is that Chinese characters have multiple meanings. To convey a specific meaning, characters are combined to form words. Analysis of word construction requires extensive linguistic knowledge, so word segmentation is generally conducted using tools developed by professional teams, such as the Chinese Knowledge Information Processing (CKIP) group [22]. However, because technology-related terms tend to be long, terminology identified using such tools tends to be too specific, reducing the possibility of identifying cross-class technology. For example, both "wireless device" and "wireless communication" are related to wireless technology, but they are identified as different words. As a result, a specific approach or human effort must be employed to identify the common technological term "wireless" in both words. This study develops a method to automatically process such terminology in order to identify cross-class technology applications.
In summary, successfully identifying emerging technological applications across classes of patent documents is crucial to the competitive advantage of corporations, and even countries. An automatic process that resolves the language issue is critical to reducing the need for human experts and taking advantage of the large numbers of available patent documents. However, automating this process faces several obstacles:
1. No automatic methodology has been proposed to identify IPC cross-class technology applications.
2. The methodology developed must be able to automatically identify emerging applications of technologies.
3. A novel method is necessary to accommodate the overly coarse results of current Chinese word segmentation technologies.
To resolve these issues, this study proposes a methodology based on common and specific theme analysis of patent documents. To avoid litigation, patent documents tend to use different words to describe similar technologies. A keyword-based approach is therefore inadequate. The cross-collection mixture model (CCMM) has been developed to identify common and specific themes [23][24][25][26][27]. Popular concepts among all documents are recorded as common themes, and subconcepts pertaining to particular collections of documents as specific themes. Because technologies developed within their source classes tend to be described with the original concept, they can be identified as common themes. In application classes, which adopt technology from other classes, the usage of these technologies tends to be annotated with a description of their original applications. As a result, these technological applications can be identified with specific themes.
To verify the effectiveness of the developed method, three subclasses belonging to class H04 were treated as source classes. These were H04B (transmission), H04L (transmission of digital information), and H04N (image communication). Subclass G06Q, belonging to another section, was selected as the application class. G06Q collects patents for administration, management, commerce, operation, supervision, and prediction using data processing systems or methods. The patents were collected from WEBPAT Taiwan [28].
A technology application map was developed to visualize the identified source and application technologies. In this study, three application technologies were identified as emerging technologies. Growth rate detection is the most popular approach to forecasting emerging technologies; our research defined the growth rate as the difference between the numbers of technology-containing documents published at different times. The average growth rate of these technologies was 95.08%, whereas those of the technologies identified using two benchmark methods [29,11] were 9.57% and 51.49%.
The remainder of this article is organized as follows. Section 2 introduces the IPC and reviews research on the identification of emerging technological terminologies, the identification of technological terms in Chinese patents, and cross-collection mixture (theme) models. The proposed model, research design, and methods are detailed in Section 3. The data acquisition process, experiments, visualization of cross-class technology, and identified emerging technologies are discussed in Section 4. Contributions, limitations of this study, and recommendations for future research are described in Section 5.

Introduction of IPC
The IPC, the world's most widely used patent classification system, was established in 1954 with the condition that it be updated every five years [20]. Each classification symbol has the form A01B 1/00. The first letter represents the "section". Combined with the following two-digit number, it represents the "class". The letter after the number completes the "subclass". The subclass is followed by a one-to-three-digit "group" number, an oblique stroke, and a number of at least two digits; together these represent a "main group" or "subgroup". A patent examiner assigns classification symbols to a patent application in accordance with classification rules [20]. The IPC was last revised in 2016 and consists of 8 sections, 120 classes, 628 subclasses, and 69,000 main items. Tab. 1 shows the main classifications [20].
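As an illustration of the symbol layout described above, the following minimal Python sketch splits a full symbol such as A01B 1/00 into its parts. The names `IpcSymbol` and `parse_ipc` are hypothetical helpers introduced here, and the regular expression only encodes the layout stated in this section (eight sections A to H, a two-digit class number, a subclass letter, a one-to-three-digit group number, and at least two digits after the stroke).

```python
import re
from typing import NamedTuple


class IpcSymbol(NamedTuple):
    section: str     # first letter, e.g. "A"
    ipc_class: str   # section plus two digits, e.g. "A01"
    subclass: str    # class plus one letter, e.g. "A01B"
    group: str       # one to three digits before the stroke, e.g. "1"
    subgroup: str    # at least two digits after the stroke, e.g. "00"


# Layout assumed from the description in the text; not an official validator.
_IPC_RE = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,3})/(\d{2,})$")


def parse_ipc(symbol: str) -> IpcSymbol:
    """Split a full IPC symbol into section, class, subclass, and group parts."""
    match = _IPC_RE.match(symbol.strip())
    if match is None:
        raise ValueError(f"not a full IPC symbol: {symbol!r}")
    section, digits, letter, group, subgroup = match.groups()
    return IpcSymbol(section, section + digits, section + digits + letter,
                     group, subgroup)


print(parse_ipc("A01B 1/00"))
```

Grouping patents by the `subclass` field would correspond to the granularity used in this study (H04B, H04L, H04N, and G06Q).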

Identification of Emerging Technology Terminologies
An emerging technology is a technology with high potential whose value has not yet been demonstrated or agreed upon by a community of users [9]. Rotolo et al. [19] identified five attributes that feature in the emergence of novel technologies: (i) radical novelty; (ii) relatively fast growth; (iii) coherence; (iv) prominent impact; and (v) uncertainty and ambiguity.
Most studies turn to text mining to identify the terminologies representing or symbolizing emerging technologies. This involves identifying n-gram words (an n-gram is a contiguous sequence of n items from a given sequence of text or speech) [29], keywords with high term frequency-inverse document frequency (tf-idf, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus) [26], and words frequently used in titles and abstracts [30]. Corrocher et al. [29] proposed a 3-gram method to analyze patent abstracts. This involves collecting 3-gram words from patent abstracts at two different intervals and calculating momentum based on their frequency differences. Words with high momentum are identified as emerging technology terminologies. Momentum has also been used to screen and validate potential emerging technology terminologies in other research [12,14,29,31,32]. Shibata et al. [11] proposed that words with the highest tf-idf values in emerging clusters constitute emerging technology terminologies. Ma and Porter [30] proposed the clustering of keywords to identify emerging topics.
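The 3-gram momentum idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: momentum is taken here as the plain difference in 3-gram frequency between a later and an earlier interval, tokens are split on whitespace, and the toy abstracts are invented. The function names are hypothetical, and the exact weighting used in [29] may differ.

```python
from collections import Counter


def word_ngrams(text, n=3):
    """Contiguous n-word sequences of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(abstracts, n=3):
    """Frequency of every n-gram over a list of abstracts."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(word_ngrams(abstract, n))
    return counts


def momentum(earlier, later, n=3):
    """Assumed momentum: later-interval frequency minus earlier-interval frequency."""
    c1, c2 = ngram_counts(earlier, n), ngram_counts(later, n)
    return {gram: c2[gram] - c1[gram] for gram in set(c1) | set(c2)}


# Invented toy abstracts for two intervals.
earlier = ["wireless tag reader device"]
later = ["wireless tag reader system", "a wireless tag reader"]
scores = momentum(earlier, later)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])
```

N-grams with the largest positive scores would be the candidates for emerging terminology under this simplified reading.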
This study uses momentum and frequency to identify terminologies from the set of words included in specific themes, which we explain below. Note that words in specific themes must be further processed because Chinese word segmentation systems often produce unsuitable word boundaries.

Identification of Technology Terms in Chinese Patents
Research has shown that even among native Chinese speakers, only approximately 75% agreement can be achieved with regard to correct segmentation, and the percentage of agreement decreases as the number of people involved increases [33]. A dictionary-based method enhanced with lexical rules is most commonly used for Chinese word segmentation [33]. A classic approach to applying lexical rules is the maximum (longest term) matching method [34,35], which is based on the assumption that the most meaningful words usually comprise the maximum concatenation of Chinese characters. As a result, the method favors the longest valid words and sometimes fails to tokenize the subwords hidden inside them. Because technology terminology is usually long, the terminology identified using such tools tends to be highly specific, reducing the possibility of identifying cross section technology terminology in Chinese. Hence, this study proposes a method to further segment words derived from Chinese word segmentation systems.
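The behavior described above can be seen in a minimal sketch of forward maximum matching. The function `max_match` and the five-entry toy dictionary are hypothetical; production segmenters such as CKIP use far larger lexicons and additional rules, so this only illustrates why a shared subword such as 無線 (wireless) stays hidden inside longer dictionary words.

```python
def max_match(text, dictionary, max_len=None):
    """Greedy forward maximum matching over a set of known words."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy lexicon: wireless, communication, device, wireless communication, wireless device.
lexicon = {"無線", "通訊", "裝置", "無線通訊", "無線裝置"}
print(max_match("無線通訊裝置", lexicon))  # longest entries win over "無線"
print(max_match("無線裝置", lexicon))
```

In both outputs the two-character term 無線 never appears as a token of its own, which is the situation the further segmentation step in this study is meant to address.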

Cross-Collection Mixture (Theme) Model
CCMM [27] has been applied to identify potentially relevant terms, and has been used in contextual text mining to identify topic evolution patterns [24] and summarize the history of a theme evolving on a news website. Mei et al. [25] applied this approach to mine spatiotemporal theme patterns on weblogs. They extracted common themes, generated a theme life cycle for each location, and created a theme snapshot for each time period. Mei et al. [36] claimed that this model simultaneously captured the mixture of topics and sentiments. Mei et al. [37] combined this approach with network analysis to summarize topics in text, map a topic onto a network, and discover topical communities within user networks.
Emerging technology application involves transferring popular technology from a source field to an application field to create new applications; therefore, the technology terminology identified in the source field should be associated with common themes, and that in the application field with specific themes related to a particular temporal interval to reflect the freshness of the emerging technology.

Research Methodology
In the proposed method, patents are collected from documents in IPC sections. At least one section should provide the source technology, and one should provide the application technology. The collected patent documents should have been published over at least two consecutive years. If the documents are written in Chinese, then a Chinese word segmentation system such as CKIP [22] is used. All segmented words and documents are analyzed using a CCMM. Representative words in a common theme are examined using n-gram methods to identify common technology, which is matched with terminology in specific themes related to the application technology. Subsequently, the common technology and identified cross section technology terminology are shown on a technology map. Terminology with high momentum, that is, terminology that appears in more years than a specified number of years (the threshold), is considered emerging. The proposed methodology is diagrammed in Fig. 1.

Definition of Cross-Collection Mixture (Theme) Model
A theme is a concept derived from a collection of documents and represented by a set of words [23][24][25][27]. Following the approach of Zhai et al. [17], three theme types were adopted in this research: common, specific, and background. A common theme is a general concept originating from a document collection; a specific theme is a subconcept derived from a subset of documents; and a background theme refers to a group of general words that are closely related to stop words, which are filtered out before or after the processing of natural language documents.
Words highly associated with a common theme are considered popular technologies in the IPC-classified section. Cross section analysis entails the identification of technology that is popular in its own (source) section and emerging in another section. A specific theme within a common theme represents a subconcept derived in a particular year from patent documents published in that year. Terminology for specific themes is considered to denote popular technology in a specific year. Such terminology potentially represents emerging technology in that year.
We define the following based on Zhai's definitions of themes [27].

Definition 1
a) A theme is a concept shared by a collection of documents. More than one document can share a theme, and a document can address several themes.
b) A set of document collections can address a set of common themes, denoted by $\Theta = \{\theta_1, \ldots, \theta_K\}$.
c) A background theme is a special theme $\theta_B \notin \Theta$ that includes popular stop words in the document collections.
d) Given a collection of documents published at time $t_1, \ldots, t_M$, each common theme $\theta_i \in \Theta$ has a specific theme denoted by $\theta_{ij}$, which represents the subconcept of $\theta_i$ derived from documents published at time $t_j$.
In a theme model, the main purpose of the background theme is to collect and remove words that appear too often as representative words. Words collected under the common theme are those with high probability in the document collections for the entire time frame. A specific theme collects only words with high probability in a certain time period, whose probabilities represent their intensities in a collection.
Figure 1: Process of proposed methodology
A document uses a sequence of words from a vocabulary set to describe concepts. Therefore, each document should include several themes. Each theme is associated with a set of words annotated with an intensity probability. Based on Zhai et al.'s research [36], we define the distributions of themes among documents and words among themes.

Definition 2
a) Given a document $d$ and the set of all addressed common themes [...]
b) Given a theme $\theta_i$, one of its specific themes $\theta_{ij}$, and the vocabulary of all possible words, [...] represent the distribution models of $w_v$ being generated with themes $\theta_i$ and $\theta_{ij}$, respectively. A model of words can be derived based on the distribution models of themes.

Definition 3
Given a document $d$ published at time $t_j$, the probability that a user reads a word $w \in d$ is [...], where $\lambda_B$ and $\lambda_S$ are the weights of the inclination of a word to the background and specific themes, respectively.
The values of $\lambda_B$ and $\lambda_S$ remain for the user to determine. The higher the value of $\lambda_B$, the more likely the words are to be discarded; the higher the value of $\lambda_S$, the more sensitive the model is to short-term words. The model is shown in Fig. 2.
Given $C_1, \ldots, C_M$ as the sets of documents issued at time $t_1, \ldots, t_M$, and $c(w, d)$ as the count of $w$ in $d$, an expectation maximization algorithm was designed to maximize the objective function [...].
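The word probability and the objective function are not reproduced in the text above, so the following Python sketch rests on an assumption: it uses the mixture form customary for cross-collection mixture models, in which a word is drawn from the background theme with weight λ_B and otherwise from a document-weighted blend of each common theme (weight 1 − λ_S) and its year-specific theme (weight λ_S). The objective is then taken to be the log-likelihood of all word counts. All function and variable names are hypothetical.

```python
import math


def word_prob(w, pi_d, p_bg, p_common, p_specific_j, lam_b, lam_s):
    """Assumed p(w | d, t_j): background mixed with common/specific themes.

    pi_d[i]         theme coverage of document d for common theme i
    p_common[i]     dict word -> p(w | theta_i)
    p_specific_j[i] dict word -> p(w | theta_ij) for the document's year j
    """
    themed = sum(
        pi_d[i] * ((1.0 - lam_s) * p_common[i].get(w, 0.0)
                   + lam_s * p_specific_j[i].get(w, 0.0))
        for i in range(len(pi_d))
    )
    return lam_b * p_bg.get(w, 0.0) + (1.0 - lam_b) * themed


def log_likelihood(collections, pi, p_bg, p_common, p_specific, lam_b, lam_s):
    """Sum of c(w, d) * log p(w | d, t_j) over years, documents, and words."""
    total = 0.0
    for j, docs in enumerate(collections):      # docs: doc id -> {word: count}
        for d, counts in docs.items():
            for w, c in counts.items():
                p = word_prob(w, pi[d], p_bg, p_common, p_specific[j],
                              lam_b, lam_s)
                total += c * math.log(p)
    return total


# Toy setting: one common theme, one year, two-word vocabulary.
p_bg = {"a": 0.5, "b": 0.5}
p_common = [{"a": 1.0}]
p_specific = [[{"b": 1.0}]]
pi = {"d1": [1.0]}
collections = [{"d1": {"a": 2, "b": 1}}]
print(log_likelihood(collections, pi, p_bg, p_common, p_specific, 0.5, 0.5))
```

Under this form, raising `lam_b` moves probability mass toward the background theme, and raising `lam_s` moves it toward the year-specific themes, which matches the roles the text gives to the two weights.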

Parameter Estimation
We discuss the tuning of parameters to maximize data distribution likelihood. Three parameters are fixed before the estimation: $\lambda_B$ and $\lambda_S$ are manually selected, and $p(w \mid \theta_B)$ is determined as [...]. An expectation maximization algorithm [38] is used to estimate the remaining parameters by maximizing the data distribution likelihood. The updating rules are as expressed in Eqs. (3)-(8).
Two hidden variables, $z_{d,w,t_j}$ and $z_{d,\theta_i,w,t_j}$, are introduced in the updating rules. $z_{d,w,t_j}$ represents the theme that $w$ addresses in document $d$ published at time $t_j$, and $z_{d,\theta_i,w,t_j}$ is a binary value denoting whether $w$ addresses the common theme, but not the specific theme, of $\theta_i$.
We obtained the representative words of common and specific themes defined in Definition 4 to identify popular and potential technology in a given source and application section, respectively.
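Because Eqs. (3)-(8) are not reproduced here, the sketch below only illustrates what the posteriors of the two hidden variables could look like. It assumes a mixture in which a word comes from the background theme with weight λ_B and otherwise from common theme θ_i with document coverage π_{d,i}, split between the common model (weight 1 − λ_S) and the specific model (weight λ_S). The function name and argument layout are hypothetical, and the actual updating rules may differ.

```python
def e_step(w, pi_d, p_bg, p_common, p_specific_j, lam_b, lam_s):
    """Posteriors for one word w of one document d published at time t_j.

    Returns (p_background, p_theme, p_common_given_theme), where
    p_theme[i] is the posterior that w addresses theme i given that it is
    not a background word, and p_common_given_theme[i] is the posterior
    that w came from the common rather than the specific model of theme i.
    """
    k = len(pi_d)
    mix = [(1.0 - lam_s) * p_common[i].get(w, 0.0)
           + lam_s * p_specific_j[i].get(w, 0.0) for i in range(k)]
    themed = sum(pi_d[i] * mix[i] for i in range(k))
    background = lam_b * p_bg.get(w, 0.0)
    denom = background + (1.0 - lam_b) * themed
    p_background = background / denom if denom > 0 else 0.0
    p_theme = [pi_d[i] * mix[i] / themed if themed > 0 else 0.0
               for i in range(k)]
    p_common_given_theme = [
        (1.0 - lam_s) * p_common[i].get(w, 0.0) / mix[i] if mix[i] > 0 else 0.0
        for i in range(k)
    ]
    return p_background, p_theme, p_common_given_theme


# Toy setting: one theme, word "a" only in the common model, "b" only in the specific one.
toy_bg = {"a": 0.5, "b": 0.5}
toy_common = [{"a": 1.0}]
toy_specific = [{"b": 1.0}]
print(e_step("a", [1.0], toy_bg, toy_common, toy_specific, 0.5, 0.5))
print(e_step("b", [1.0], toy_bg, toy_common, toy_specific, 0.5, 0.5))
```

In an M-step, these posteriors would weight the counts $c(w, d)$ when re-estimating the theme coverages and the common and specific word distributions.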

Definition 4
a) Given a common theme $\theta_i$, a set of representative words of the common theme is expressed as [...]
b) Given a specific theme $\theta_{ij}$, a set of representative words of the specific theme is expressed as [...], where $\varphi_c$ and $\varphi_s$ are user-provided popularity thresholds.
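The selection of representative words can be sketched as follows. The sketch assumes that a popularity threshold is read as the fraction of top-weighted words to keep, which is how the thresholds are used in the experiments reported later in this paper; the exact set expressions of Definition 4 may be stated differently. The function name and the toy probabilities are hypothetical.

```python
import math


def representative_words(word_probs, phi):
    """Keep the top `phi` fraction of words ranked by p(w | theme)."""
    if not 0.0 < phi <= 1.0:
        raise ValueError("phi must lie in (0, 1]")
    # Sort by descending probability; break ties alphabetically for stability.
    ranked = sorted(word_probs, key=lambda w: (-word_probs[w], w))
    keep = math.ceil(phi * len(ranked))
    return ranked[:keep]


# Invented toy distribution p(w | theme).
theme = {"wireless": 0.4, "system": 0.3, "method": 0.2, "device": 0.1}
print(representative_words(theme, 0.5))
```

The same helper would be applied once per common theme with the common-theme threshold and once per specific theme with the specific-theme threshold.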

Identifying Cross Section Technology Terminology
This study proposes to identify cross section technology terminology using representative words of common and specific themes according to the following observations.
1. Because a specific theme represents a subconcept that is popular during a particular time period, representative words that have high distribution values in a specific theme are candidates for emerging technology terminology.
2. Numerous cross section technology developments integrate popular technology in one section with technology in another section to improve products or services.
An n-gram method is applied to divide Chinese representative words into terms. Terms that appear in sufficient numbers of representative words are considered to denote popular technology. Popular technology terms are then used to identify cross section technology terminology.
Assuming that a word is a sequence of characters, a set of n-gram terms is defined as follows.

Definition 5
a) A term $t = \{t_1, \ldots, t_u\}$ is a subword of a word $w = \{s_1, \ldots, s_v\}$, represented as $t \sqsubseteq w$, if $t_1 = s_p$, $t_2 = s_{p+1}$, ..., and $t_u = s_{p+u-1}$.
b) Given a set of words $W$, a set of n-gram terms of $W$ is given by $W^n = \{w^n \mid \exists w \in W,\ w^n \sqsubseteq w,\ |w^n| = n\}$.
c) Given an n-gram term $w^n$ and a set of n-gram terms $W^n$, the support$(w^n, W^n)$ is the count of $w^n$ in $W^n$.
d) Given a common theme $\theta_i$ and its representative words $W_i$, the set of the top $h$ n-gram terms of the common theme $\theta_i$ is $T_i^n = \arg \mathrm{top}(h)\,\{\mathrm{support}(w_i^n, W_i^n)\}$.
Cross section technology terminology can be identified using the top (most popular) terms. Representative words of specific application section themes that include popular terms are identified as cross section technology terminology.
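Definition 5 can be illustrated with a short Python sketch. Here the support of a term is computed as the number of distinct representative words that contain it, which is one reading of "the count of the term"; the function names are hypothetical, and the four input words are cross section terms quoted elsewhere in this paper, used only as sample strings.

```python
from collections import Counter


def char_ngrams(word, n):
    """All contiguous n-character subwords of a word."""
    return {word[p:p + n] for p in range(len(word) - n + 1)}


def support(words, n):
    """For each n-gram term, the number of distinct words containing it."""
    counts = Counter()
    for word in set(words):
        counts.update(char_ngrams(word, n))
    return counts


def top_terms(words, n, h):
    """The h n-gram terms with the highest support (ties broken by term)."""
    ranked = sorted(support(words, n).items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:h]]


sample = ["無線標籤", "無線識別卡", "無線射頻", "數位權利"]
print(support(sample, 2)["無線"])
print(top_terms(sample, 2, 1))
```

With these sample words, the bigram 無線 (wireless) is shared by three words, so it surfaces as the most popular term even though no segmented word equals it.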

Definition 6
a) Given a specific theme $\theta_{ij}$, its representative words $W_{ij}$, and the popular n-gram terms $T_l^n$ of the common themes $l$ $(1 \le l \le K,\ l \ne i)$, the set of cross section technology terminology structures $F$ consists of structures $\langle pt, it, y \rangle$, where $pt$ is the popular term, $it$ is the cross section technology terminology, and $y$ is the year of patent publication: $F = \{\langle t^n, w_{ij}, j \rangle \mid w_{ij} \in W_{ij},\ t^n \in \bigcup_{1 \le l \le K,\ l \ne i} T_l^n,\ t^n \sqsubseteq w_{ij},\ 1 \le j \le M\}$.
b) Given a cross section technology terminology structure $f$, "." is a reference operator. Specifically, $f.pt$, $f.it$, and $f.y$ denote the popular term, cross section technology terminology, and year, respectively, of technology terminology structure $f$.
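A minimal sketch of how the structures of Definition 6 could be assembled is given below. It assumes that the popular terms of the source themes and the representative words of the application section's specific themes are already available; the class name `Structure`, the function name, the word 管理系統 (management system), and the year assignments are illustrative inventions, not data from this study.

```python
from typing import NamedTuple


class Structure(NamedTuple):
    pt: str  # popular term taken from a source-section common theme
    it: str  # cross section technology terminology (application-section word)
    y: int   # publication year of the specific theme


def cross_section_terms(popular_terms, specific_words_by_year):
    """Collect <pt, it, y> for every specific-theme word containing a popular term."""
    found = []
    for year in sorted(specific_words_by_year):
        for word in specific_words_by_year[year]:
            for term in popular_terms:
                if term in word:  # subword test between term and word
                    found.append(Structure(term, word, year))
    return found


popular = ["無線", "數位"]
by_year = {2006: ["無線標籤", "管理系統"], 2007: ["數位權利", "無線標籤"]}
for f in cross_section_terms(popular, by_year):
    print(f.pt, f.it, f.y)
```

The attribute access `f.pt`, `f.it`, and `f.y` mirrors the reference operator of Definition 6b.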

Identifying Cross Section Emerging Technology Application
Cross section technology terminology that appears in more than the specified number of consecutive years is considered emerging. Cross section emerging technology application is defined in Definition 7, and Fig. 3 shows the pseudo-code of an algorithm to identify cross section emerging technology applications.

Definition 7
a) Given a momentum threshold $\rho$ and a set of cross section technology terminology structures $F$, a set of cross section emerging technologies is given as [...]
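Since the set expression of Definition 7 and the pseudo-code of Fig. 3 are not reproduced here, the following sketch is only one possible reading. It assumes that the momentum of a terminology is the length of its longest run of consecutive publication years and that a terminology is emerging when this value exceeds the threshold ρ. The function names and the year assignments in the toy data are hypothetical.

```python
def longest_run(years):
    """Length of the longest run of consecutive years in `years`."""
    best = run = 0
    prev = None
    for y in sorted(set(years)):
        run = run + 1 if prev is not None and y == prev + 1 else 1
        best = max(best, run)
        prev = y
    return best


def emerging_terms(structures, rho):
    """Terminologies whose assumed momentum exceeds the threshold rho.

    `structures` is an iterable of (pt, it, y) tuples.
    """
    years_by_term = {}
    for _pt, it, y in structures:
        years_by_term.setdefault(it, []).append(y)
    return sorted(it for it, ys in years_by_term.items()
                  if longest_run(ys) > rho)


# Invented year assignments for two terminologies.
toy_structures = [
    ("無線", "無線標籤", 2006), ("無線", "無線標籤", 2007), ("無線", "無線標籤", 2008),
    ("數位", "數位資料庫", 2006), ("數位", "數位資料庫", 2008),
]
print(emerging_terms(toy_structures, 2))
```

If momentum were instead defined as the total number of years in which a terminology appears, only `longest_run` would need to be replaced by a count of distinct years.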

Data Description
Transmission and communication technologies critically affect daily life. Cell phones and other means of wireless communication have given rise to a generation with entirely new information technology consumption behaviors. Therefore, for this study, we chose the subclasses of transmission (H04B), transmission of digital information (H04L), and image communication (H04N) as source classes. To win or maintain competitive advantages, corporations have raced to adapt these technologies and create novel applications. Subclass G06Q includes patents for administration, management, commerce, operation, supervision, and predictions based on data processing systems or methods, and was therefore designated as the application class where the application of cross-class technology should be found.
In total, 1,562 abstracts of patent documents belonging to the four subclasses and published between 2006 and 2011 were collected from WEBPAT Taiwan [28]. Among them, 378, 253, 324, 275, 265, and 67 were published in 2006, 2007, 2008, 2009, 2010, and 2011, respectively. The number of cases published in 2011 is low because only a portion of that year's patents had been submitted when the data were being collected. Subclasses H04B, H04L, H04N, and G06Q had 339, 448, 396, and 379 cases, respectively.

Initial Settings of Theme and Word Distribution
In CCMM, the number of common themes, K, is the same as the number of patent classes investigated, and the number of specific themes, M, corresponds to the number of years from which the patents were collected. With four subclasses and six years of data, K = 4 and M = 6 in this experiment.
The theoretical value of $\pi^0_{d,i}$ should be between 1 ($d \in \theta_i$) and 0 ($d \notin \theta_i$). Because each document's class was clear in this study, the value should have been either 1 or 0. However, a value of 0 prevented the updating rules in the CCMM from working, so we set small nonzero values of $\pi^0_{d,i}$ when $d \notin \theta_i$: [...] where $\sum_{i=1}^{K} \pi^0_{d,i} = 1$. The initial value of the distribution model of each word $w$ expressing the common theme $\theta_i$ is [...]. The initial value of the distribution model of each word $w$ expressing the specific theme $\theta_{ij}$ is [...].

Setting the Parameters $\lambda_B$ and $\lambda_S$
$\lambda_S$ and $\lambda_B$ sort words into specific and background themes, respectively. Studies have set $\lambda_B$ to 0.95 [23,25,27]. The value of $\lambda_S$ affects the distribution of words among specific themes and common themes, in turn affecting the theme distribution among documents. Therefore, we chose the $\lambda_S$ value that ensured the greatest precision in assigning document themes. In this study, a document $d$ was considered to address theme $\theta_h$ if $\pi_{d,h}$ had the highest value among $\pi_{d,i}$, where $i = 1, \ldots, K$. Studies have reported that $\lambda_S$ ranges from 0.2 to 0.8 [23,25,27]. In this study, CCMM performed with the highest precision when $\lambda_S = 0.6$, as shown in Fig. 4. This value was chosen for the remainder of the experiment.
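The theme assignment rule used to select the specific-theme weight can be sketched as follows. The sketch assumes that "precision" means the share of documents whose highest-coverage theme equals their known IPC subclass; the function names and the toy coverage values are hypothetical. In the experiment, such a score would be computed once per candidate weight after fitting the model, and the weight with the highest score retained.

```python
def assigned_theme(pi_d):
    """Index of the theme with the highest coverage in one document."""
    return max(range(len(pi_d)), key=lambda i: pi_d[i])


def assignment_precision(pi, true_theme):
    """Share of documents whose assigned theme matches their known class."""
    hits = sum(1 for d, dist in pi.items()
               if assigned_theme(dist) == true_theme[d])
    return hits / len(pi)


# Invented theme coverages for three documents and three themes.
toy_pi = {"d1": [0.7, 0.2, 0.1], "d2": [0.1, 0.6, 0.3], "d3": [0.5, 0.3, 0.2]}
toy_truth = {"d1": 0, "d2": 1, "d3": 2}
print(assignment_precision(toy_pi, toy_truth))
```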
Tab. 2 shows the top five representative words of each common theme and their probabilities in the corresponding themes.

Evaluating the Quality of Discovered Words
We used the representative words of common themes to identify cross section technology terminology. Therefore, before identifying the terminology, the quality of the identified word distribution among the common themes was evaluated. Papadimitriou et al. [39] proposed an ε-separator index to evaluate the quality of common themes. In a successful model, each theme should be at least 95% distributed among words from the primary set, with the remaining 5% distributed among the other sets. The up-to-5% theme distribution outside the primary set is considered an ε-separator index. Tab. 3 shows an ε-separator index matrix for all theme combinations. All index values were lower than the 5% threshold. Therefore, the quality of topic distribution was acceptable.

Identifying Terminology Representing Cross Section Technology
The representative words in themes indicate significant terms in the corresponding section. In this experiment, $\varphi_c$ and $\varphi_s$ denote the percentage of words deemed to be representative words in common and specific themes, respectively. The weight associated with each word and theme denotes the capability of the word to represent the associated concept. Hence, the higher the weight, the more representative the word. This study aimed to discover as many popular technologies (identified by representative words) as possible. Therefore, the top 50% weighted words were chosen for each common and specific theme.
The n-gram approach was adopted to divide representative words into shorter Chinese words. Most Chinese words are one to four characters long [40], and 84.55% of the representative terms contained fewer than five Chinese characters.
Each term was counted to filter out meaningless or unpopular n-gram terms. Each term was associated with a count tracking the number of representative words including the term. We used a top-h methodology to select the h terms with the highest counts. Fig. 5 shows the total counts versus the value of h. The greatest decrease in average counts occurred when h increased from 3 to 4, so h was set at 3.
After the top popular terms were identified (Tab. 4), the cross section technology terminology could be identified. The representative words of specific themes in the application section that included the [...]

Constructing the Cross Section Technology Application Map
The map shows the development of cross section technology through the following features.
1. The source and application sections are identified at the top of the map.
2. The representative terms are shown on the left side of the map.
3. In the main part of the figure, the evolution of the cross section technology terminology is shown with the years marked at the top.
Fig. 6 shows the cross section technology map derived from this experiment. The only four popular terms identified were "wireless," "digital," "image," and "network." The cross section technology terminology related to "wireless" comprised "wireless tag" (無線標籤), "wireless ID card" (無線識別卡), and "RFID" (無線射頻). These are all wireless devices that transfer information, and are therefore related to the term "digital," which includes "digital rights" (數位權利) and "digital database" (數位資料庫). The term "image" (影像) includes "image line" (影像線), which is related to "network" (網路). "Network" includes "network version" (網路版型) and "primary network" (主要網路).

Identifying Cross Section Emerging Technology Terms
Tab. 6 shows eight cross section technology terminologies, of which "wireless tag," "RFID," and "digital rights" exhibited momentum values higher than 2. Depending on the momentum threshold, these three terms may represent emerging technologies [10,11,13]. To gather as many terminologies as possible, all three terms were retained as emerging cross section technology terms.
Two other methods have also been proposed to identify emerging terminologies. Corrocher et al. [29] proposed a 3-gram method to identify all 3-gram words with significant frequency in patent abstracts. Shibata et al. [11] used tf-idf to select terminologies with high values. In our control experiment with the tf-idf method, we selected the top five terms from each specific theme. In both of these methods, following the suggestions of Corrocher et al. [29], terms with higher than average growth rates were identified as emerging terminologies. The growth rate is defined as follows: for each term $w$, the size of the set of documents ($C_1$) containing $w$ and published at time $t_1$ is compared with the size of the set of documents ($C_2$) containing $w$ and published at time $t_2$.
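The comparison of the two document sets can be sketched in Python. The exact formula is not given above, so this sketch makes an explicit assumption: because the reported growth rates are percentages, the growth rate is computed as the relative change in the number of term-containing documents between the two periods. The function name and the toy documents are hypothetical.

```python
def growth_rate(docs_t1, docs_t2, term):
    """Assumed growth rate: relative change in the number of documents containing `term`."""
    n1 = sum(1 for doc in docs_t1 if term in doc)  # size of C_1
    n2 = sum(1 for doc in docs_t2 if term in doc)  # size of C_2
    if n1 == 0:
        raise ValueError("term does not occur in the earlier period")
    return (n2 - n1) / n1


# Invented toy documents for two periods.
period_1 = ["無線標籤讀取", "數位資料庫", "無線標籤系統"]
period_2 = ["無線標籤", "無線標籤管理", "無線標籤裝置", "影像線"]
print(f"{growth_rate(period_1, period_2, '無線標籤'):.0%}")
```

Averaging this value over all terms flagged as emerging by a method would give a single figure comparable across methods, under the stated assumption about the formula.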

Conclusion
This study developed a methodology to systematically identify cross section emerging technological applications between source sections and an application section. Concepts (themes) and representative words in each section were captured by methods revised from CCMM. Besides common themes, CCMM can also capture specific themes developed in each year. Representative words in common themes corresponding to source sections were the technologies that had been developed in that section and that could be adopted in the application section. The representative words of specific themes were the technologies being developed in the application section. The segmented representative words from the source sections were compared against the representative words in the application section to identify cross section technological terminologies. Those with high momentum values were further identified as emerging technologies. The proposed method also generated a technology map to illustrate the adoption of technological terminology in the application section.
To verify the effectiveness of the developed method, four subclasses of patent documents (IPC codes H04B, H04L, H04N, and G06Q) were collected from WEBPAT Taiwan [28]. Three transmission-related technology subclasses, H04B, H04L, and H04N, and one application technology subclass, G06Q, were treated as source and application sections, respectively. The average growth rates of the identified emerging technologies determined from the proposed method, 3-gram approach, and tf-idf methods were 95.08%, 9.57%, and 51.49%, respectively.
This study was explorative and had several limitations. First, the sections covered were limited. Future research should collect patent documents from more varied sections to verify the method's applicability. Second, CCMM is a traditional model, and other refined topic models may be applied to identify word and document distributions. Third, although designed for documents written in Chinese, the proposed model should not be restricted to a specific language.
Funding Statement: The authors received no specific funding for this study.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.