Text Mining Model for Virtual Community User Portrait Based on Social Network Analysis

Abstract: With the rapid development of virtual communities, more and more customers participate in product innovation and knowledge sharing through them. Research on virtual community members helps the community manage its members and further promotes community development and knowledge innovation. At present, the main difficulty in studying user portraits of community members lies in capturing user behavior data in the community. The community contains a large amount of structured and semi-structured data, which is crucial for the dimensions of a user portrait. This paper uses an association rule crawler algorithm to perform an association search of community user behavior data, and uses text mining, social network analysis (SNA) and clustering to portray users in the knowledge innovation community from the perspectives of professionalism, participation enthusiasm and network capability. The main result divides users into fanciers, participators, and tourists.


INTRODUCTION
With the advent of the experience economy (Pine, Gilmore 1998) [1], more and more users participate in production, innovation, sales and other processes. Due to the rapid development of the Internet and the accelerating diffusion of information, users can get more information and details about products and their features, which pushes enterprises to update their products continually [2]. However, in the face of such market changes, it is difficult for enterprises to grasp the potential needs of users, so involving users directly in the product innovation process has become the best solution. According to existing research, user participation in product innovation reduces the risk of enterprise product innovation, reduces users' price sensitivity, increases their tolerance of product defects, and further improves customer satisfaction with products.
Compared with traditional user participation in product innovation, new product development (NPD) based on the virtual community has several unique traits: 1) the number of users is large; 2) there is no limit on the time and place of participation; 3) the methods of participation are rich; 4) users have a positive attitude [3]. How to identify the roles of these users and find the users who can contribute most to product innovation has become a hot topic for researchers and enterprises. On the one hand, the existing theoretical basis for customer classification is derived from customers involved in innovation in physical (offline) settings, where the number of participants is limited and the means of participation are few; on the other hand, questionnaire surveys are generally used, and the data obtained in this way are subjective and limited in quantity.
Through web crawler technology, a large amount of data on the behavior of users participating in product innovation, such as online hours and the numbers of posts, responses and views, is collected. Text mining is used to extract professional words and emotional words from message text to identify the professionalism and emotion of the users. Cluster analysis and social network analysis are applied to identify the roles of these users and build the user portrait. This research method supplements the existing theory of customer classification and provides a theoretical basis and an operational method for the maintenance of customer relationships.

LITERATURE REVIEW
As more and more users participate in the process of product innovation, their willingness, innovation ability, effort and individual situation differ, so that some users fit into the process of product innovation and some do not [4]. How to classify these users has become a main concern of enterprises and scholars. Foreign scholars began research in this area earlier. Von Hippel (1986) distinguished leading users from common users. Two specific characteristics were used to describe leading users: 1) leading users face new products or new demands months or years earlier than general users; 2) leading users can reap benefits by solving their own needs [5]. In a study of the open source software community, Hemetsberger and Pieters (2001) divided the users in the community into three kinds according to their degree of contribution to the new product development process: major contributors, contributors and ordinary users [6]. In the virtual environment, the users involved in innovation are classified as users with high innovative ability, who provide the most valuable innovations; early users, who provide valuable test input; and heavy users, who provide information on defects in existing products to help improve the next generation of products [7]. Through statistics and observation of the number, content and quality of posts and the content of users' communication in an online basketball shoe community, Füller, Jawecki (2007) found that active users accounted for only 3% of the total users, of whom 20% are called driven users and the other 80% are stimulus-driven [8]. Perks, Gemser (2015) divided users into mainstream users and leading users, the latter having a wealth of domain knowledge and advanced market demand [9]. Gustafsson, Kristensson (2015) divided users only into leading users and ordinary users [10].
Domestic scholars have done related research as follows. According to the type of participation, users can be divided into the guiding type, the partner type, the proposing type and the reporting type [11]. Based on the participation degree, customer value and customer behavior in product innovation, users are divided into leading customers, growing customers, avoidant customers and lagging customers [12]. Sukhodolov et al. (2013) identified leading customers by collecting auto forum data and applying cluster analysis to features of leading users such as the degree to which their demand leads the market, professional knowledge level, label praise degree and network connection strength [13]. However, the defect of this method is that it classifies users by the quality of the content of a single topic; the amount of data is too small, and the credibility of the identification is low. Existing research also points out that users often find it difficult to participate directly in the research and development of a product's core technology; their participation is mainly reflected in demand expression, idea generation, design, information feedback and sharing [14].

THE BEHAVIOR OF USERS IN VIRTUAL COMMUNITY
In the virtual community, the behaviors through which customers participate in product innovation include browsing, posting themes, responding, replying to other users, and participating in official activities such as voting, testing and cooperative development [15]. Customer participation in product innovation in the network community takes the following forms:
(1) In the community, users post themes to propose product demands, suggestions or complaints. They also discuss with others and hope to capture the attention of enterprises.
(2) Users reply to others' themes related to product innovation and discuss with them.
(3) Users participate in activities related to product innovation, such as questionnaire surveys, product testing and collaborative development.
The classification of customer features is based on the quality of the content, the posting frequency and the professional level, and also includes the member's personal information. Füller, Jawecki (2007) [16] collected data from a basketball shoe forum, observed the behavior of community members, analyzed the content of the communication between members qualitatively, and discussed it with industry experts.
By combining the above theoretical analysis and taking the XIAOMI community as an example, original indexes and derived indexes for customer participation in product innovation are proposed. The data for the original indexes can be collected directly from the community; the data for the derived indexes are calculated from the original indexes. The indicators are shown in Tab. 2 and Tab. 3. To reduce the impact of unnecessary indexes on the classification results, the indexes used for cluster analysis are IT, BF, IA, RIT and IF.

DATA ACQUISITION BASED ON ASSOCIATION RULE CRAWLER ALGORITHM
There is a large amount of structured and semi-structured data in the customer innovation community. Traditional crawling algorithms mainly use the fuzzy C-means method, the PSO web crawling method, the hierarchical segmentation clustering crawling method, etc., and they collect structured data well [17]. However, they are less effective at acquiring and matching semi-structured data. In this paper, a community network member attribute association mining algorithm is adopted, and an adaptive learning method is used for data relevance search and clustering. Combined with the hierarchical segmentation method, community user behavior, community attributes and scene information are extracted.
Data crawling is performed on the basis of the information transfer model of the community network and the mining of user behavior characteristics in that network. The initial network topology is shown in Fig. 1. The fuzzy attribute clustering method is used to cluster the user behavior attributes of the community network, and the preference attribute value of the user behavior feature crawler is recorded, where out() is the link trajectory of the crawler from its starting point and u is the community attribute mixed-recommendation association information set, that is, the u-degree set. Using the fuzzy decision-making method, the time trajectory set from time $T_0$ to the crawler's end user position is calculated, and the eigenvalues for association rule mining of the user behavior attributes are obtained. The feature vectors of the cluster output are fused using the self-correlated feature template matching method. In the fitness function of the information fusion, (, ) indicates the number of → paths and ∈ (0, 1]. The fuzzy directivity constraint control method is used to cluster the community user behavior attributes to improve the community discovery ability. The main functions are listed in Tab. 4.
The behavior attribute data and text data of the customer innovation community obtained by the association rule crawler algorithm are shown in Tab. 5 and Tab. 6.

USER MESSAGE CONTENT TEXT ANALYSIS
This section applies text mining to member messages and analyzes the professionalism, enthusiasm and emotion of members' speech. The Jieba Chinese word segmentation tool is used to segment each user's message. Combined with an existing mobile phone professional vocabulary and an emotional word vocabulary, the professional and emotional vocabulary in the message is extracted, and the numbers of messages, professional words and emotional words are counted. Emotional vocabulary consists of positive vocabulary and negative vocabulary, and the enthusiasm of users is measured by the ratio of positive vocabulary to categorical vocabulary. The processing is shown in Fig. 2.
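The counting step described above can be sketched as follows. This is a minimal illustration and not the algorithm of Tab. 7: the three word sets are hypothetical stand-ins for the professional and emotional vocabularies, the tokenizer is a simple English word splitter used in place of a Chinese segmenter such as `jieba.lcut`, and the enthusiasm ratio is assumed to be positive words divided by all emotional words.

```python
import re
from collections import Counter

# Hypothetical lexicons for illustration only; the paper relies on existing
# mobile phone and emotional vocabularies that are not reproduced here.
PROFESSIONAL = {"dialing", "keypad", "firmware"}
POSITIVE = {"like", "hope", "great"}
NEGATIVE = {"slow", "crash", "bad"}


def tokenize(message):
    # Stand-in tokenizer: lowercase the text and keep alphabetic runs.
    # For Chinese messages a segmenter such as jieba.lcut(message) would
    # be used here instead.
    return re.findall(r"[a-z]+", message.lower())


def message_stats(message):
    """Count total, professional and emotional words in one message."""
    counts = Counter(tokenize(message))
    professional = sum(counts[w] for w in PROFESSIONAL)
    positive = sum(counts[w] for w in POSITIVE)
    negative = sum(counts[w] for w in NEGATIVE)
    emotional = positive + negative
    # Assumption: enthusiasm = positive words / all emotional words.
    enthusiasm = positive / emotional if emotional else 0.0
    return {
        "words": sum(counts.values()),
        "professional": professional,
        "positive": positive,
        "negative": negative,
        "enthusiasm": enthusiasm,
    }


stats = message_stats("I like dialing, but the keypad is slow")
```

Summing these per-message counts over all of a user's messages would give the per-user totals that the text describes.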

Two functional recommendations
Just arrived today, the first feeling first to talk about! Is dialing. I like it. Let me talk about two points first, the first hope to add a dial button switch like the 2.3 system! Just press the dial button and do not appear the number button.

I hope to join the smart stroke dialing
Recently updated 2.10.12 joined the new feature phonetic dialing. Plus a friend in the forum said this smart stroke dialing function! Reminds me that smart stroke dialing is actually a feature I have been using before, in the past with Nokia in the era of N72, there is an input method called the national pen! The biggest reason for like the national pen is that you can dial the smart pen directly on the numeric keypad.

The core text-mining algorithm is listed in Tab. 7.
The results of the professional word extraction and the emotional word extraction are shown in Tab. 8 and Tab. 9. The total word counts of users' messages are listed in Tab. 10.

DATA PREPROCESSING
The marine knowledge interest community was established for fans who are interested in marine knowledge to communicate with each other. They can hold discussions, share knowledge or post questions. It is one of the best virtual communities for involving users in innovation.
In the community, every member has a unique record with a nickname and a UID. The record holds the user's registration time, online hours and behavior data such as the numbers of themes and responses.
The data used in this paper were collected from an interest group of about 150 members. In-degree centrality is the number of comments received by each member of the network and indicates a user's popularity and prestige within the network. Out-degree centrality is the number of a node's outgoing relationships; it measures the number of comments each user has written on the ideas of other community members.
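In the following matrix, the rows represent the users who post themes, and the columns represent the users who reply to others' themes. $X_{ij}$ is the number of replies by the user in column $j$ to the themes of the user in row $i$. The relationship of the mutual responses between users is shown in Tab. 13.
With the customers divided into $k$ classes $G_1, G_2, \ldots, G_k$, where $X_i^{(t)}$ is the $i$-th customer in $G_t$, $n_t$ is the number of customers in $G_t$ and $\bar{X}^{(t)}$ is the center of gravity of $G_t$, the within-class sum of squared deviations over the $k$ classes is
$$S = \sum_{t=1}^{k} \sum_{i=1}^{n_t} \left(X_i^{(t)} - \bar{X}^{(t)}\right)^{\top} \left(X_i^{(t)} - \bar{X}^{(t)}\right).$$
The specific approach is to start with each of the $n$ samples as its own class and then reduce the number of classes by one at each step. Every merger increases $S$, so at each step the two classes whose merger increases $S$ the least are merged, until all samples belong to one class. From the clustering results, the user portraits can be divided into three categories, as in Fig. 3.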
(1) Fancier. For this type of user, the number of themes, the innovation proportion and the innovation frequency are clearly higher than for the other two types. This shows that these users have a strong willingness to participate in product innovation and act on it. The high innovation frequency means that most of their time in the community is spent on product innovation.
The high numbers of friends and responses show that these users have rich social relationships in the community and actively participate in topic discussions. Their average online time is much higher than that of the other two types, and they registered earlier. They are early users of the product and continue to follow subsequent products; 95% of them remained active through 2016.
The average numbers of views, replies and words per theme are higher for this type of user than for the other two types, especially the average number of words per theme. A detailed analysis of the samples also shows that these users describe the problems they find and the new demands they have more clearly. When questions are raised, they give their opinions or propose solutions. From the above data analysis, we define this type of user as the positive users.
(2) Participator. Most indexes are lower than those of the first type of users but higher than those of the third type. These users are willing to participate in product innovation and do so to a certain extent. They have weak social relationships with other members and are less involved in topic discussions. They are easily affected by internal and external factors, and their brand loyalty is low.
(3) Tourist. These users have a low willingness to participate in product innovation; they post few themes about innovation and rarely take part in related topic discussions. However, they may have joined the community early and be active in it. Most of the themes they post are not related to product innovation. Some of them care a great deal about social relations and participate enthusiastically in discussions that have nothing to do with product innovation.
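The agglomerative procedure described in this section, which repeatedly merges the two classes whose merger increases the within-class sum of squared deviations the least, can be sketched as follows. This is a minimal, unoptimized illustration and not the authors' implementation. It relies on the standard identity that merging classes a and b increases the sum by n_a n_b / (n_a + n_b) times the squared distance between their centers of gravity. The function names and the sample data are hypothetical.

```python
def centroid(points):
    """Center of gravity of a list of equal-length numeric vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def merge_cost(a, b):
    """Increase in the within-class sum of squares if classes a and b merge."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def ward_clusters(samples, k):
    """Start with one class per sample and merge greedily until k remain.

    Returns a list of classes, each a list of sample indices.
    """
    clusters = [[i] for i in range(len(samples))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a = [samples[m] for m in clusters[i]]
                b = [samples[m] for m in clusters[j]]
                cost = merge_cost(a, b)
                # Keep the pair with the smallest increase (first one on ties).
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical two-dimensional index vectors for five users.
samples = [[0, 0], [0, 1], [10, 10], [10, 11], [50, 50]]
groups = ward_clusters(samples, 3)
```

In practice the index vectors would usually be standardized before clustering so that no single index dominates the distance; whether the paper does so is not stated.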

Social Network Analysis
A social network is a collection of social actors and their relationships. Social network analysis is the quantitative study of the relationships between these actors. The relationship graph and the relational matrix are two common ways to describe social network relations. Based on customers' posting and replying behavior in the community, a relation matrix is established, and degree centrality is used to analyse the structure of the network. The degree centrality of a node is the number of nodes directly connected to it, which is an indicator of the actor's position in the network. It is used to measure the actor's ability to interact in the network.
In general, if the node's in-degree is large, it means that the node in the network has great influence and a more important position in the network; if the node's out-degree is large, it means that the node likes to communicate with other nodes in the network and is more active, belonging to the active actors in the network.
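As an illustration of these two measures, the sketch below computes in-degree and out-degree from a reply matrix laid out as in the data preprocessing section: entry `[i][j]` is the number of replies written by user `j` to the themes of user `i`. The function names and the sample matrix are hypothetical, and skipping the diagonal (users replying to their own themes) is an assumption that the paper does not state.

```python
def degree_centrality(reply_matrix):
    """Return (in_degree, out_degree) lists for a square reply matrix.

    reply_matrix[i][j] = number of replies by user j to user i's themes.
    In-degree of user i  = comments received = sum over row i.
    Out-degree of user j = comments written  = sum over column j.
    Diagonal entries (self-replies) are skipped; this is an assumption.
    """
    n = len(reply_matrix)
    in_degree = [
        sum(reply_matrix[i][j] for j in range(n) if j != i) for i in range(n)
    ]
    out_degree = [
        sum(reply_matrix[i][j] for i in range(n) if i != j) for j in range(n)
    ]
    return in_degree, out_degree


def isolated_users(in_degree, out_degree):
    """Indices of users who neither received nor wrote any comment."""
    return [u for u, (a, b) in enumerate(zip(in_degree, out_degree)) if a + b == 0]


# Hypothetical 4-user reply matrix.
X = [
    [0, 2, 1, 0],
    [0, 0, 0, 0],
    [3, 0, 0, 0],
    [0, 0, 0, 0],
]
in_deg, out_deg = degree_centrality(X)
isolated = isolated_users(in_deg, out_deg)
```

Here user 0 both receives and writes the most comments, while user 3 is an isolated point with no incoming or outgoing replies.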

CONCLUSION
Using data collected from a marine knowledge interest community, cluster analysis and SNA are used to identify the roles of users in product innovation in the community. Five kinds of users in the virtual community are classified: creative users, innovation users, discussion users, common users and guests. Different kinds of users differ in their behavior, their influence on others and their impact on product innovation, and they occupy different positions in the network.
The deficiency of this paper is that the research object is a partial network that does not cover the whole network of the community. Future work will address the impact these users have on product innovation and the relationship between their roles and their influence.

Figure 4
The relationship of users in community

Table 1
The classification of users

Table 2
The original index

Table 3
The derived index

Table 4
The main function of association rules

Table 5
The behavior attribute data

Table 7
The text mining algorithm

Table 8
The extraction of professional words

Table 9
The extraction of emotional words

Table 11
The data of the original index

Table 12
The data of the derived index

Table 13 The
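The customer sample is divided into $k$ classes, $G_1, G_2, \ldots, G_k$. $X_i^{(t)}$ represents the $i$-th customer in $G_t$, $n_t$ is the number of customers in $G_t$, and $\bar{X}^{(t)}$ is the center of gravity of $G_t$. The sum of squares of deviations of $G_t$ is
$$S_t = \sum_{i=1}^{n_t} \left(X_i^{(t)} - \bar{X}^{(t)}\right)^{\top} \left(X_i^{(t)} - \bar{X}^{(t)}\right).$$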

Table 14
The result of Cluster

Table 15
In-degree and out-degree of part of users in the social network

Table 16
The proportion of non-isolated points and isolated points

Table 17
The classification of users in product innovation community