ScholarLens: Extracting competences from research publications for the automatic generation of semantic user profiles

ABSTRACT

Motivation: Scientists increasingly rely on intelligent information systems to help them in their daily tasks, in particular for managing research objects, like publications or datasets. The relatively young research field of Semantic Publishing has been addressing the question of how scientific applications can be improved through semantically rich representations of research objects, in order to facilitate their discovery and re-use. To complement the efforts in this area, we propose an automatic workflow to construct semantic user profiles of scholars, so that scholarly applications, like digital libraries or data repositories, can better understand their users' interests, tasks, and competences, by incorporating these user profiles in their design. To make the user profiles sharable across applications, we propose to build them based on standard semantic web technologies, in particular the Resource Description Framework (RDF) for representing user profiles and Linked Open Data (LOD) sources for representing competence topics. To avoid the cold start problem, we suggest automatically populating these profiles by analyzing the publications (co-)authored by users, which we hypothesize reflect their research competences. Results: We developed a novel approach, ScholarLens, which can automatically generate semantic user profiles for authors of scholarly literature. For modeling the competences of scholarly users and groups, we surveyed a number of existing linked open data vocabularies. In accordance with the LOD best practices, we propose an RDF Schema (RDFS) based model for competence records that reuses existing vocabularies where appropriate.
To automate the creation of semantic user profiles, we developed a complete, automated workflow that can generate semantic user profiles by analyzing full-text research articles through various natural language processing (NLP) techniques. In our method, we start by processing a set of research articles for a given user, mining them for competence topics through NLP and LOD entity linking steps. We then populate a knowledge base with the users' profiles and competences. We implemented our approach as an open source library and evaluated our system through two user studies, resulting in a mean average precision (MAP) of up to 95%. As part of the evaluation, we also analyze the impact of restricting the analysis to the rhetorical zones of an article on the accuracy of the resulting profiles. Finally, we demonstrate how the generated semantic user profiles can be applied in a number of scholarly use cases.


INTRODUCTION
Researchers increasingly leverage intelligent information systems for managing their research objects, like datasets, publications, or projects. An ongoing challenge is the information overload scientists face when trying to identify relevant information, for example when using a web-based search engine: while it is easy to find numerous potentially relevant results, evaluating each of them is still performed manually and is thus very time-consuming.

We argue that smarter scholarly applications require not just a semantically rich representation of research objects, but also of their users: by understanding a scientist's interests, competences, projects, and tasks, intelligent systems can deliver improved results, e.g., by filtering and ranking results through personalization algorithms (Sieg et al., 2007).

So-called user profiles (Brusilovsky and Millán, 2007) have been adopted in domains like e-learning, recommender systems, and personalized news portals (we provide a brief background on user profiling in the 'Background' section). Increasingly, they also receive attention in scientific applications, such as expertise retrieval systems. Constructing such user models automatically is still a challenging task, and even though various approaches have already been proposed, a semantic solution based on Linked Open Data (LOD) (Heath and Bizer, 2011) principles is still missing.

We show that a semantically rich representation of users is crucial for enabling a number of advanced use cases in scholarly applications. One of our central points is that bootstrapping such a user profile is an infamous issue in recommendation approaches, known as the cold start problem, as asking users to manually create possibly hundreds of entries for their profile is not realistic in practice. Our goal is to be able to create an accurate profile of a scientist's competences, which we hypothesize can be automatically derived from their publications (Figure 1).
To evaluate our profile generation approach, we performed two user studies with ten and twenty-five scientists from various research groups across Europe and North America. The participants were provided with two different user profiles each, which were automatically generated based on their publications: one based on the articles' full texts, the second restricted to rhetorical entities (REs), like the claims and contributions in a paper (Sateli and Witte, 2015). In each study, we asked the participants to evaluate the generated top-N competence entries in their user profiles. The results, provided in the 'Evaluation' section, show that our approach can automatically generate user profiles with a precision of up to 95% (mean average precision for top-10 competences).

Finally, we illustrate in the 'Application Examples' section how semantic user profiles can be leveraged by scholarly information systems in a number of use cases, including a competence analysis for a user (e.g., for finding reviewers for a new paper) and re-ranking of article search results based on a user's profile.

Figure 1. This diagram shows a high-level overview of our approach to semantic user profiling: Users can bootstrap their profiles by providing a set of their (co-)authored publications. The extracted knowledge is then stored in a knowledge base that can be incorporated in various scholarly applications. Researchers can then obtain personalized services through applications leveraging the semantic user profiles.

BACKGROUND

In this section, we provide background information on user profiling, competence management and its applications. We also briefly introduce semantic publishing and its connections with natural language processing (NLP) techniques.

Manuscript to be reviewed Computer Science

A user profile is an instance of a user model that contains either a user's characteristics, such as knowledge about a topic, interests, and backgrounds, or focuses on the context of a user's work, e.g., location and time (Brusilovsky and Millán, 2007). Depending on the approach, profiles can be constructed from explicit or implicit user feedback (Gauch et al., 2007). Explicit user feedback actively requests interests from a user, whereas implicit user feedback derives preferences from the user's activities. Commonly used implicit profiling techniques observe the user's browsing behavior and extract preferences from web or query logs, analyze the browser history, and derive interest weights from the number of clicks or the time spent on a page. According to findings in (Gauch et al., 2007), there is no significant evidence that an explicit user feedback mechanism results in better personalized content than implicitly recorded user information. Therefore, personalized applications nowadays mainly employ implicit profiling techniques, since they are less intrusive from a user's perspective.

In the context of scholarly applications, user profiles have been used in ad-hoc approaches, such as the expertise retrieval system used at Tilburg University (UvT) (https://www.tilburguniversity.edu/), academic search engines like AMiner (https://aminer.org), or personalized paper recommendations in Google Scholar (https://scholar.google.com). The most dominant representation of user characteristics in this type of application is a weighted vector of keywords. This simple mathematical description permits classical information filtering algorithms, such as cosine similarity (Manning et al., 2008), to measure item-to-item, user-to-user, and item-to-user similarity.

We focus our review on two core aspects: Firstly, existing semantic vocabularies that describe scholars in academic institutions with their publications and competences, in order to establish semantic user profiles. And secondly, we examine existing approaches for automatic profile generation through NLP methods.
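As an illustration of the weighted keyword-vector representation mentioned above, the following sketch compares a user vector against a paper vector using cosine similarity; all keywords and weights are made up for this example:

```python
import math

def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two weighted keyword vectors,
    represented as {keyword: weight} dictionaries."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[k] * profile_b[k] for k in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical user profile and paper representation
user = {"ontology": 0.8, "nlp": 0.6, "rdf": 0.4}
paper = {"ontology": 0.5, "rdf": 0.7, "sparql": 0.3}
score = cosine_similarity(user, paper)  # ≈ 0.69
```

The same function serves item-to-item, user-to-user, and item-to-user comparisons, since all three are represented in the same vector space.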

In the area of user modeling, a multitude of semantic approaches have emerged in the last decade that go beyond representing users' interests with keywords, in favour of using concepts from domain ontologies, for example in a vector-based model (Sieg et al., 2007; Cantador and Castells, 2011). In addition to providing a common understanding of domain knowledge, using semantic technologies also fosters the evolution towards more generic user models. An important goal of generic user modeling is facilitating software development and promoting reusability (Kobsa, 2001).

Other ontologies aiming to unify user modeling vocabularies in semantic web applications are the Scrutable User Modeling Infrastructure (SUMI) (Kyriacou et al., 2009) and the ontology developed by Golemati et al. (2007). Besides general user information, such as contact, address, preferences, education, and profession, Golemati et al. (2007) also provide a vocabulary for a user's activities in a given timeline.

In contrast, SUMI (Kyriacou et al., 2009) models user interests from the profiling perspective, which can be either explicitly given by the user or implicitly recorded by the system. The user model in SUMI is divided into four categories. The first two categories contain the manually provided user information: (i) generic personal user data and (ii) interests that are only specific for a certain application, e.g., preferences that are only

One of the use cases we had in mind when designing our ScholarLens methodology was its application within sophisticated information filtering systems for scholars that consider a user's research background. Therefore, we explored the generic user models and scholarly ontologies reviewed above, in order to determine how well they can express the features of scientific user modeling. The outcome of our study is summarized in

[Table: feature coverage of the surveyed vocabularies, including AIISO, FOAF, GUMO, and IntelLEO, rated Low, Medium, High, or n/a per feature.]

Generic user modeling requires new methods for user profiling. Merely observing a user's browsing behavior is not enough for the various tasks a scholar is involved in.

More complex user information can be obtained from, e.g., context resources, such as the affiliations a scholar is associated with, but also from content sources, for instance,

In the UvT collection, the ground truth was explicitly given by the users, as they provide a description of their research areas together with keywords from a topic hierarchy.

Another notable example of an expertise retrieval system is AMiner (https://aminer.org).

Regarding the profiling method, they took into account the following filtering methods: CF-IDF, an adapted TF-IDF algorithm using concepts of ontologies instead of full-text terms,

In the last decade, using social media platforms for implicit user profile generation

Our semantic model describes the documents that we process, the generated annotations, and their inter-relationships. Figure 2 shows a minimal example of a semantic profile in the form of an RDF graph.

Below, we describe the components of our ScholarLens approach that satisfy these requirements. An overview of our system architecture is illustrated in Figure 3.
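For illustration, a minimal profile graph in the spirit of Figure 2 can be written down as a set of RDF triples. The ScholarLens property and resource names used here (hasCompetencyRecord, competenceFor, the example.org namespace) are hypothetical placeholders for the competence vocabulary, while foaf: is one of the actually reused LOD vocabularies; a real implementation would manage such triples with an RDF library such as rdflib:

```python
# Namespace prefixes; only FOAF and RDF are real, reused vocabularies.
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/scholarlens/"       # hypothetical competence vocabulary
DBR = "http://dbpedia.org/resource/"         # LOD source for competence topics

author = EX + "jane-doe"                     # made-up researcher
record = EX + "competence-record/1"
topic = DBR + "Text_mining"                  # topic grounded in a LOD URI

# A minimal semantic profile as (subject, predicate, object) triples
profile_graph = {
    (author, RDF_TYPE, FOAF + "Person"),
    (author, FOAF + "name", "Jane Doe"),
    (author, EX + "hasCompetencyRecord", record),
    (record, EX + "competenceFor", topic),
}
```

Grounding the competence topic in a DBpedia URI (rather than a plain string) is what makes the profile sharable and linkable across applications.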

Our goal is to automatically populate semantic user profiles by mining the users' publications for competence topics. Therefore, we identify the following requirements for our workflow:

Requirement 1: Access to Scholarly Articles' Full-Text. The workflow should be able to accept a set of documents written by an author as input, which may be in various publisher-dependent formatting styles. The documents must be machine-readable; that is, the workflow must have access to the textual content of the entire article.

The pre-processed text is subsequently passed on to the semantic processing phase for user competence detection.
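At a high level, the workflow these requirements lead to can be sketched as below. Every step function here is a trivial placeholder (a fixed topic list and made-up example.org URIs), not the actual NLP pre-processing or entity linking components:

```python
def preprocess(doc):
    # Placeholder for Requirement 1: a real pipeline would first convert
    # publisher-specific formats into machine-readable plain text.
    return doc.lower()

def detect_topics(text):
    # Placeholder NLP step: match against a tiny fixed topic list instead
    # of running named entity recognition on the full text.
    known = {"ontology", "linked data", "text mining"}
    return [t for t in known if t in text]

def link_to_lod(topics):
    # Placeholder entity linking step: map each topic to a made-up
    # LOD-style URI instead of querying a real linking service.
    return ["http://example.org/topic/" + t.replace(" ", "_") for t in topics]

def build_profile(documents):
    """Aggregate competence topics over all of a user's publications."""
    profile = {}
    for doc in documents:
        for uri in link_to_lod(detect_topics(preprocess(doc))):
            profile[uri] = profile.get(uri, 0) + 1
    return profile

papers = ["We present an Ontology for Text Mining ...",
          "Linked Data principles for text mining pipelines ..."]
profile = build_profile(papers)  # topic URI -> mention count
```

The resulting topic counts are one plausible source of competence weights; the actual system additionally exploits rhetorical entities and LOD grounding, as described in the following sections.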

Since it is not feasible to manually construct and maintain a knowledge base of all

document's content in various use cases, such as retrieving semantically related articles.

Additionally, we determined that storing the named entities within REs requires an order of magnitude fewer triples, compared to exporting the topics of the entire document.

To test whether the same assumption can be made about the authors' competence topics, we additionally export all rhetorical entities in a document and add an additional property to each topic (NE) that is mentioned within an RE. We will revisit this hypothesis in the 'Extended Experiments' section.

In this section, we describe how we realized the semantic user profiling of authors illustrated in the previous section.

We developed a text mining pipeline (Figure 4)

We performed two rounds of evaluations: In a first user study, described in detail in (Sateli et al., 2016) and summarized below, we tested an initial version of our profiling approach. To investigate the reasons for generated 'irrelevant' profile entries, we went back to a number of users from our study and performed a post-mortem error analysis.

Our findings are presented in the 'User Study: Error Analysis' section.

Based on the detected issues, we refined both our profile generation pipeline and the

in order to evaluate whether we found a pertinent competence.

To evaluate the effectiveness of our system, we utilize common retrieval evaluation metrics. Precision at a cut-off rank k (Precision@k) considers only the top-k results returned by the system and is defined as:

P@k = \frac{1}{k} \sum_{c=1}^{k} rel(c) \qquad (1)

where k denotes the rank of the competence that is considered and rel(c) marks the rating at position c, which is either 0 for irrelevant or 1 for relevant topics. While Precision@k is focused on the result for a certain rank of an individual user, the mean average precision (MAP) expresses the average of competence rankings over all users in one value. For all relevant competences c ∈ C per user u, we first compute the average precision (AP) of a user u as follows:

AP(u) = \frac{1}{|C_r|} \sum_{k=1}^{|C|} \frac{|C_{r,k}|}{k} \cdot rel(k) \qquad (2)

where rel(k) is 1 if the competence at rank k is relevant and 0 in the opposite case, C_{r,k} is the set of all relevant competences up to a certain cut-off rank k, and C_r is the set of all relevant competences of that user. Finally, for the set of users U, the MAP is defined as:

MAP = \frac{1}{|U|} \sum_{u \in U} AP(u) \qquad (3)

MAP indicates how precisely an algorithm or system ranks its top-k results, assuming that the entries listed on top are more relevant for the information seeker than the lower-ranked results. Hence, MAP is the mean of the average precisions at each cut-off rank and represents a measure for computing the quality of a system across several information needs; in our case, users with competences.

In contrast to MAP, which only considers binary ratings, the discounted cumulative gain (DCG) computes the ranking based on Likert scales (Likert, 1932). Given a list of competences, rel_c is the actual rating of each single competence c. For example, in our second user study we assign the competence types 0 (irrelevant), 1 (general), 2 (technical), and 3 (research), as defined below. Similar to precision, the DCG assumes that higher-ranked items are more relevant for the users than lower-ranked ones. In order to take this into account, a logarithmic decay function is applied to the competence ratings, known as the gains.

For a set of users U, let rel_c be the relevance score given to competence c ∈ C for user u ∈ U. Then, the DCG for every user, as defined in (Croft et al., 2009), is the sum over all |C| competence gains:

DCG_u = rel_1 + \sum_{c=2}^{|C|} \frac{rel_c}{\log_2 c} \qquad (4)

Table 3 shows the evaluation results of our first user study. In this study, a competence was considered as relevant when it had been assigned to one of the three levels of competence.
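Under the definitions above, the four metrics can be computed in a few lines of Python; the only assumption is that the relevance ratings are already ordered by rank:

```python
import math

def precision_at_k(rels, k):
    """Precision@k (Eq. 1): rels is a 0/1 relevance list in ranked order."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """Average precision (Eq. 2): Precision@k averaged over relevant ranks."""
    n_relevant = sum(rels)
    if n_relevant == 0:
        return 0.0
    return sum(precision_at_k(rels, k)
               for k, rel in enumerate(rels, start=1) if rel) / n_relevant

def mean_average_precision(rankings):
    """MAP (Eq. 3): mean of the per-user average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

def dcg(ratings):
    """DCG (Eq. 4) for a non-empty list of graded (Likert-style) ratings,
    applying the logarithmic decay from rank 2 onwards."""
    return ratings[0] + sum(rating / math.log2(rank)
                            for rank, rating in enumerate(ratings[1:], start=2))
```

For example, a ranking rated [1, 0, 1] yields P@1 = 1, P@3 = 2/3, and thus AP = 5/6.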

In order to refine our approach, we went back to the users from the first study to understand the root cause of the competences they marked as irrelevant. We asked a number

The results of this analysis are summarized in Table 4. As can be seen, the majority

With the lessons learned from our first experiment, we enhanced our competence topic detection pipeline to eliminate the error types identified in the previous section. In particular, to address Type 1 errors, we excluded entities with surface forms like "figure" or "table" from newly generated profiles, as these were consistently linked to irrelevant topics like "figure painting" or "ficus". To address Type 3 and Type 4 errors, we refined the task description shown to participants before they start their evaluation. Additionally,
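The Type 1 fix described above amounts to a simple surface-form stoplist applied before export; the stoplist contents and the entity record structure used here are illustrative, not the actual pipeline configuration:

```python
# Hypothetical stoplist: document-structure words that were consistently
# linked to irrelevant LOD topics in the error analysis.
SURFACE_FORM_STOPLIST = {"figure", "table", "fig", "section"}

def filter_competences(entities):
    """Drop candidate competence entities whose surface form is a
    document-structure word (Type 1 errors), e.g. 'figure' being
    linked to the unrelated topic 'Figure painting'."""
    return [e for e in entities
            if e["surface_form"].lower() not in SURFACE_FORM_STOPLIST]

entities = [
    {"surface_form": "ontology", "topic": "dbpedia:Ontology"},
    {"surface_form": "figure", "topic": "dbpedia:Figure_painting"},
]
kept = filter_competences(entities)  # only the 'ontology' entity remains
```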

For our revised experiment, we set up a web-based user profile evaluation system. In

Zones, with respect to the overall ratings across the four different competence levels.

The results for the precision metrics are displayed in Tables 6 and 7.

Since Precision@k and MAP are based on binary ratings (relevant/non-relevant), it has to be specified which competence levels to take into account. Therefore, we defined

single terms, for instance, "service" and "service provider". It finds successful matches for both terms and thus produces general topics.

We also evaluated the rated competences with respect to their ranking in the result

Semantic user profiles can be highly effective in the context of information retrieval systems. Here, we demonstrate how they can help to improve the relevance of search results. Our proposition is that papers that mention the competence topics of a user are more interesting for her and should thus be ranked higher in the results. Therefore, the diversity and frequency of topics within a paper should be used as ranking features.

shows the ranked results based on how many times the query topic was mentioned in a document. In contrast, the R6 and R8 profile-based columns show the ranked results using the number of common topics between the papers (full-text) and the researchers' profiles.

We presented semantic user profiles as the next important extension for semantic publishing.
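The profile-based re-ranking from the use case above can be sketched as follows; the paper IDs and topics are made up, and a production system would combine this topic-overlap score with the search engine's original relevance score rather than replacing it:

```python
def rerank(results, user_topics):
    """Re-rank retrieved papers by how many competence topics they share
    with the user's semantic profile. Each result is a (paper_id,
    paper_topics) pair; user_topics is the set of topic URIs from the
    user's profile. Python's sort is stable, so the engine's original
    order is preserved among papers with equal overlap."""
    return sorted(results,
                  key=lambda item: len(set(item[1]) & user_topics),
                  reverse=True)

user_topics = {"Ontology", "Text_mining"}
results = [("paper-A", ["Databases"]),
           ("paper-B", ["Ontology", "Text_mining"]),
           ("paper-C", ["Ontology"])]
ranked = rerank(results, user_topics)  # paper-B first, then paper-C, paper-A
```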