Ontology-based Integration of Web Navigation for Dynamic User Profiling

The development of technology for handling information at Big Data scale is a lively topic of current research. Improved techniques for knowledge discovery are crucial for the scientific and economic exploitation of large-scale raw data. In a research collaboration with an industrial partner, we explore the applicability of ontology-based knowledge extraction and representation to today's biggest source of large-scale data, the Web. The goal is to develop a profiling application based on the implicit information that every user leaves while navigating online, in order to identify and model preferences and interests in a detailed user profile. This includes the identification of current tendencies as well as the prediction of possible future interests, as far as they are deducible from the collected browsing information and the integrated expert domain knowledge. This article gives an overview of the current state of the research, the developments made and the insights gained.


Introduction
"Big Data" is one of the big buzzwords of our time, culminating in the creation of various congresses and conferences devoted solely to that topic in recent years (e.g. the IEEE Congress on Big Data, starting from 2011). The handling of immense amounts of data confronts scientists and analysts with a dilemma: on the one hand, sophisticated analysis techniques might bring the best results, but they usually come with a processing complexity and time that are not tolerable for most applications. On the other hand, methods known for their efficiency may fail to exploit the data sources in all their depth. Several research works have proposed distinct criteria to define the nature of "Big Data" (e.g. [1]). The definitions largely converge on the following five criteria:
• volume: massive amounts of data have to be treated,
• velocity: those data arrive at high speed,
• variety: data types and formats are heterogeneous,
• veracity: data are not always sound and have to be verified,
• value: they have an inherent value that has to be discovered by the application.
Applications acting in a Big Data context have to handle all five criteria efficiently, balancing analysis depth against processing time.
For that very reason, the application of semantic technology is often dismissed in a Big Data context. Semantic analysis seems too complex and too costly in an environment in which even very efficient techniques often fall short of the performance requirements. We want to make a case for ontology-based knowledge representation, even when handling vast amounts of data. By employing an ontology that has been customised for the application domain in every detail, the information is limited to those bits and bytes that are actually relevant. Furthermore, we avoid performance issues by decoupling costly analysis steps from the actual real-time user profiling process (see the System Architecture section for details).
We demonstrate this approach based on an application in digital advertising. Publishers nowadays have detailed information about their users' navigation behaviour: servers capture not only the web pages that were requested by a certain ID, but also the respective time stamps, device information etc. These elements allow insight into usage patterns, but also a deduction of the various contexts in which a user might be active (a distinction between the working environment and private surfing, for example). In the development of our system, we explore the integration of semantic technology into this process, with a close eye on keeping the system within the range of satisfactory performance.
DOI: 10.12948/issn14531305/19.1.2015.01

Related Work
Traditionally, profiling approaches (following the methodologies applied in document indexing) use a keyword-based representation to summarise source documents and user interests in an economical way. The limitations of this data structure raise problems when treating natural language. Synonyms, ambiguities and simple spelling errors cause the system to discover relations where there are none (e.g. when encountering homonyms) or to miss those that exist (in the case of synonyms or spelling errors). First attempts to alleviate this shortcoming explored the construction of semantic networks. Starting from a base of fixed key terms, semantic relations between terms were estimated based on co-occurrence in the document base, and new terms were added if a relationship was judged likely [2]. The main focus of the researchers was to tackle the problem of homonymy, which refers to terms in natural language that are written the same but denote different semantic concepts depending on their context of usage [3]. However, even the combination of elaborate algorithms based on semantic networks did not finally solve the issue [4], at least not without adopting external knowledge sources. For example, in [4], WordNet [5] has been used to create links to semantically defined entities.
As a result, more and more researchers explore the usefulness of ontologies for profiling purposes, e.g. [3], [6], [7]. The integration points vary from the usage of ontologies as a background knowledge repository to the usage of ontology-shaped profile constructions. Numerous approaches use structured Linked Open Data [8] as domain knowledge. Many of them refer to WordNet [9][10][11], DBPedia [12] or the Open Directory Project (ODP) [13]. Those resources differ in the degree of structure that they impose on the contained concepts. WordNet's relational structure, for example, was obtained by manually grouping terms into "synsets", sets of words that bear a synonymous relationship. The ODP is a community-driven project to classify web content into a given set of categories. Several works refer to the ODP as an ontology. However, even though the set of possible relationships extends beyond purely taxonomic ones, those are not widely used; in consequence, the ODP could, at most, be considered a light-weight ontology [14]. The above methods make use of structured knowledge resources, but are far from exhausting their potential.
In turn, two recent approaches make use of full-fledged ontologies for reference: [3] proposed an ontology-based profiling system that relies on the Yago ontology [15]. In [7], the researchers assume an unspecified domain ontology as the base for their abstract system. Finally, the chosen knowledge structure may be used in different ways. Early techniques aimed to maintain the keyword-based representation, but used the ontologies to specify and alter the keyword space. On the one hand, the analysis of the whole keyword set allowed some of the keywords to be disambiguated. On the other hand, "synsets" or similar relations can be used to extend the keyword space by synonymous or closely related terms [16]. Chua et al. [17] use them to group semantically close terms into clusters and thus reduce the feature space for greater efficiency.
More recent proposals go beyond these approaches by using an ontological structure for the profile itself. This enables them not only to extract additional concepts from the knowledge resource, but also to use the relationships to enhance the user's characterization. Calegari et al. [3] extract the parts of the Yago ontology that are related to the terms found in the user's documents; the choice of concepts is based on the Spreading Activation algorithm [18].
Among the approaches presented above, the most recent ones [3], [7] are closest to our vision. However, both rely on keyword-based representations to characterise the users' manifold resources, which are enriched with semantics only when they are included in the user profile.

System Design
The final goal of our research and development is a system that reconciles numerous, conflicting demands: the application requires reliable deductions at runtime that exploit the depth of the input data to the maximum. The careful design of the central data structure is therefore of great importance. The result of the design process, which has been accompanied by domain experts, is presented in the following sections. However, limiting the analysis focus to highly relevant concepts is not sufficient on its own to make the system comply with the standards of Big Data. Hence, we also adapted the system design to avoid performance losses due to costly working steps. These adaptations are described after the ontology, in the System Architecture section.

The Ontology
The design principles that guided the conception of the ontology are straightforward:
• include all concepts that are relevant to the profiling process;
• limit the conception to the essential;
• adapt to the application domain to minimise the complexity of the data structure;
• keep the design modular in the customised parts to allow transfer to other domains.
The result is a data structure that bears highly generic parts (e.g. a user identification string, certain widely used profile attributes), but also modules that are tailored to fit the needs of digital advertising (e.g. the topic categorization scheme). We take a closer look at the included entities in the following paragraphs. For a graphical overview, Fig. 1 shows a schema of the upper-level classes in the conceived ontology: in light grey, the entities that constitute the navigational history of the user; in green, the ones that capture the semantic information about each web resource; in yellow, the elements that constitute the final user profile. Please note that the figure is limited to the object properties connecting the concepts with each other. Each of the concepts has several data type properties attached, which describe it in more detail. These data type properties are mentioned in the later passages that take a closer look at the included concepts and their functions.

BID.
The "BID" stands for "browser identification" and is the central unit in the ontology. Included in every cookie, a BID identifies a user whenever he or she re-visits the website. Strictly speaking, however, it identifies a certain browser on a certain machine, hence the name "browser ID" and not "user ID". The BID is used to group the log entries that belong to a single ID and to assess the sequence of pages that is visited. In the ontology, the BID is the class that is connected (a) to the effectuated hits, (b) to the visited web pages, (c) to the classes defining the profile, and (d) to the segment classes that apply to the respective user. Listing 1 shows the OWL definition of the concept BID and its related data type properties, in Turtle syntax.
Listing 1. Definition and datatype properties for the class "BID" (Turtle)
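The grouping role of the BID can be illustrated with a minimal Python sketch. It assumes a hypothetical log format of (BID, time stamp, URL) tuples; the function name, the field layout and the sample values are illustrative assumptions and are not taken from the described system.

```python
from collections import defaultdict
from datetime import datetime

def group_hits_by_bid(log_entries):
    """Group (bid, timestamp, url) tuples by BID, ordered by time stamp."""
    by_bid = defaultdict(list)
    for bid, timestamp, url in log_entries:
        by_bid[bid].append((timestamp, url))
    # Sort each browser's hits chronologically to recover the page sequence.
    return {bid: sorted(hits) for bid, hits in by_bid.items()}

# Hypothetical sample log; BIDs and URLs are made up for illustration.
sample_log = [
    ("bid-1", datetime(2015, 1, 5, 10, 5), "http://example.org/b"),
    ("bid-2", datetime(2015, 1, 5, 10, 1), "http://example.org/a"),
    ("bid-1", datetime(2015, 1, 5, 10, 0), "http://example.org/a"),
]
pages_by_bid = group_hits_by_bid(sample_log)
```

In this sketch, each BID ends up with its own chronologically ordered list of page views, which is the input needed for the session and profile entities discussed later.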

Context Entities
Some of the entities within the ontology are defined by the commercial ecosystem. The company concludes a contract with online publishers to provide enhanced analysis of their usage logs. Thus, the content that has to be included in the analysis process is determined by the partner, the domains it owns and the web pages that are reachable below these domains. Each partner has a different amount of information pertaining to it, possibly extended by collaborations with other actors on the market. Hence, information about the partner and its possible coalitions is crucial to determine which facts to include in the analysis process. Logs are grouped by partner to avoid any information leakage. We thus included the following entities for the context:
Partner: The "Partner" is a company that has signed a contract for the treatment of its data. Each partner is identified by a WID, "web identification". All domains and web pages that belong to a partner have this ID attached. Listing 2 shows the OWL definition of this concept along with its data type properties (in Turtle syntax, shortened, with the prefix definitions as in Listing 1).
Listing 2. Definition and datatype properties for the class "Partner" (Turtle)
Domain: Referring to the official Domain Name System (DNS) [19], the domain in the project context means the string that results from the combination of second-level domain and top-level domain. All web pages and sub-domains subordinated to the domain are related to it. For example, the URL "http://lentreprise.lexpress.fr/index.html" refers to the domain "lexpress"; "lentreprise" is the respective sub-domain, and "index.html" identifies the specific page to display. The definition of this concept is given in Listing 3.
Listing 3. Definition and datatype properties for the class "Domain" (Turtle)
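The decomposition of a URL into domain, sub-domain and page can be sketched in a few lines of Python. The function name `split_url` is hypothetical, and the sketch makes the simplifying assumption that the top-level domain is a single label (as in ".fr"); suffixes such as "co.uk" would require a public-suffix list. Following the definition given above (second-level plus top-level domain), the sketch yields "lexpress.fr" for the example URL.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Split a URL into (domain, sub_domain, page).

    Assumption: the top-level domain is a single label such as "fr";
    multi-label suffixes like "co.uk" would need a public-suffix list.
    """
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    domain = ".".join(labels[-2:])       # second-level + top-level domain
    sub_domain = ".".join(labels[:-2])   # everything left of the domain
    page = parts.path.lstrip("/")        # path without the leading slash
    return domain, sub_domain, page

domain, sub_domain, page = split_url("http://lentreprise.lexpress.fr/index.html")
```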

Data Entities
The entities in this section stem from the internal data treatment of the collaboration partner. They provide additional contextual information about the web pages extracted from the navigation logs. However, they originate from basic analytic steps (such as computing the duration of a page view from the accompanying time stamps), as opposed to information deduced from enhanced statistics or the application of machine learning techniques.

Hit: The "Hit" comprises all information about a single user action. That is, whenever a page is requested from the server, this is logged as one hit. The class includes all information related to that entity: the time stamp, the requested URL etc. Thus, it encapsulates all information that can be related to a single event connected to a user. Via relations, the "Hit" connects the user with a set of web pages; several hits are grouped in a session (see Listing 4).
Listing 4. Definition and datatype properties for the class "Hit" (Turtle)
Session: A "Session" is a sequence of hits, grouped by the fact that the distance between the time stamp of one page view and that of the subsequent page view does not exceed thirty minutes. The class definition and related datatype properties in Turtle syntax can be found in Listing 5:
### http://www.checksem.fr/MindMinings/mm_base#Session
:Session rdf:type owl:Class ;
    rdfs:comment "Session: two hits of a BID belong to the same session if the difference of their time-stamps does not exceed 30 minutes."@en .
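The thirty-minute rule for sessions can be sketched as follows. This is a minimal Python illustration, not the implementation of the described system; the function name and the sample time stamps are assumptions, and the input is taken to be the time stamps of the hits of a single BID.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # threshold stated in the text

def split_into_sessions(timestamps, gap=SESSION_GAP):
    """Group the hit time stamps of one BID into sessions."""
    sessions = []
    for ts in sorted(timestamps):
        # A hit continues the current session if it follows the previous
        # hit within the gap; otherwise it opens a new session.
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

hits = [
    datetime(2015, 1, 5, 9, 0),
    datetime(2015, 1, 5, 9, 20),
    datetime(2015, 1, 5, 9, 50),   # exactly 30 minutes later: same session
    datetime(2015, 1, 5, 10, 21),  # 31 minutes later: new session
]
sessions = split_into_sessions(hits)
```

The comparison uses "less than or equal" because the definition speaks of a difference that "does not exceed" thirty minutes.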

Analysis Entities
The content analysis process demands the integration of the following classes (coloured green in Fig. 1):
Keyword: A "Keyword" is a term that describes one concept contained in a web page. The Keyword class is used to handle the disambiguation of such terms by means of URIs that link to external knowledge sources such as DBPedia and WordNet. The instances of the "Keyword" class are related to the web pages that they appear in and to the universes that they are semantically related to. Furthermore, both of these relations may be attributed a weight, measuring the degree of membership.

Listing 7. Definition and datatype properties for the class "Webpage" (Turtle)
Universe: The term "Universe" refers to a certain content category and the keywords that are related to it. Thus, every Universe carries the name of the category it depicts and bears close relations to the reference keywords that are associated with the respective content domain.
Listing 8. Class definition and data type properties for the class "Universe" (Turtle)

Profile Entities
The final goal of all computing efforts is to build a semantically enhanced profile representation for every considered user.
Profile: "Profile" is the main class containing the attributes that define the user profile (Listing 9 gives the complete definition of the concept). This comprises the elements stemming from the content analysis of the web pages, by linking the profile with the universes that were discovered therein, but also attributes that may be deduced from those content attributes. In consequence, the Profile class contains two sub-classes that group the elements into socio-demographic attributes (such as age, location etc.) and behavioural attributes (such as the browser used or the affinity to certain brands). We chose to divide each of those sub-classes into a number of tranches that reflect commercially interesting divisions of the attributes. For example, the value of the property "hasAge" is defined by linking a profile with one or more of the individuals "Age 15-24", "Age 25-34", "Age 35-49", "Age 50-64", "Age above65", "Age Child", "Age Pre-Teenager", or "Age Teenager".
Listing 9. Definition and data type properties for the class "Profile" (Turtle)
Segment: One of the key features of the ontology lies in its capability to automatically infer the attribution of an individual to a certain segment (Listing 10). Using the attribution of a user individual to some of the above-described tranches, more complex notions can be specified. The Segment class captures exactly those more complex profile entities that may be constructed using profile features ("a female person living in a household with children belongs to the segment 'mother'"), content features ("a person reading 90% of the time on pages that treat sports-related topics is a sports fan") or a combination of both. The individuals assigned to a class of type "Segment" are those that comply with the constraints or rules that were imposed to define the segment.
Listing 10. Definition and datatype properties for the class "Segment" (Turtle)
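The logic of the two example segments can be illustrated outside the ontology with a short Python sketch. In the described system the assignment is obtained by ontological inference; the predicates below only mimic the stated rules. All names (`SEGMENT_RULES`, `infer_segments`, the dictionary keys) are hypothetical, and the 90% threshold is read here as "at least 0.9 of the visited pages".

```python
def sports_share(page_universes):
    """Fraction of visited pages whose universe is 'sports'."""
    if not page_universes:
        return 0.0
    return sum(1 for u in page_universes if u == "sports") / len(page_universes)

# Hypothetical segment rules, each a predicate over a profile dictionary.
SEGMENT_RULES = {
    "mother": lambda p: p["gender"] == "female" and p["children_in_household"],
    "sports-fan": lambda p: sports_share(p["page_universes"]) >= 0.9,
}

def infer_segments(profile):
    """Return the names of all segments whose rule the profile satisfies."""
    return sorted(name for name, rule in SEGMENT_RULES.items() if rule(profile))

# Made-up profile: nine sports pages and one cooking page.
profile = {
    "gender": "female",
    "children_in_household": True,
    "page_universes": ["sports"] * 9 + ["cooking"],
}
segments = infer_segments(profile)
```

A segment that combines profile and content features would simply be a further rule that tests both kinds of attributes.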

Constructional Entities
In order to allow weighting relations between certain entities, an additional class was added to the ontology. The concept "Weight" may be used to specify a numerical value of membership for a certain property, for example the relation between a web page and a universe. A web page may cover a set of various topics, each of them to a certain degree. The class Weight enables us to model this fact within our ontology. The individuals of the class "Weight" (Listing 12) carry a data type property containing the numerical value that quantifies the weight of the relation (namely the "hasWeightValue" property). The relation between the web page and the universe, currently named "hasUniverse", is then specified as a composition of two other relations: "hasWebpageWeight", relating the web page to the weight concept, and "weightHasUniverse", which then concludes the relation to the respective universe (see Listing 11 and Fig. 2).
Listing 11. Usage of "Weight" in example object property (Turtle)
The same has been done for the relation between a keyword and a universe (quantifying how strongly a keyword is actually associated with a certain category) and between a profile and a universe (quantifying how much importance the universe in question has for the description of the profile). As such, the "Weight" concept allows us to put a measurement of importance on some of the relations within the ontology. In consequence, we are able not only to express binary relations ("mother AND some web pages that talk about sports" means "SportyMom"), but also to add a new level of expressiveness by allowing quantification: "to a certainty of 0.8 a mother AND more than 90% of pages treat topics related to sports" means "SportyMom". This enhanced expressiveness is applied to relationships wherever it seems useful. Apart from the above example, this includes, for instance, the relationship between a topic universe and a web page (quantifying the importance of a certain topic for the message of the resource's content) or the relationships among "Keywords" (quantifying the degree of semantic relatedness between them).
Listing 12. Definition and datatype properties for the class "Weight" (Turtle)
The above-described ontology is used in a completely automatic, web-based system for the treatment of web resources and user data. The web resources are monitored and qualified continuously, decoupled from the user profiling process. This procedure makes it possible to keep up with huge amounts of data and to have the content information necessary for the profile construction ready whenever a user surfs an indexed page. The final user profile is constructed using the information obtained from the indexation process, information that can be directly retrieved from the cookie (e.g. the user agent) and the knowledge that stems from the company's internal evaluation processes.
Fig. 2. Integration of the "Weight" concept in the MindMinings ontology
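The composition of the weighted relation described in this section can be mirrored in a small Python sketch. The attribute names follow the property names mentioned in the text ("hasWeightValue", "hasWebpageWeight", "weightHasUniverse"); the data class, the `min_weight` parameter and the sample page are assumptions made for illustration and do not reproduce the actual OWL modelling.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Weight:
    """Reified weight: carries the value and points to a universe."""
    has_weight_value: float      # mirrors the "hasWeightValue" property
    weight_has_universe: str     # mirrors the "weightHasUniverse" relation

# "hasWebpageWeight": web page -> Weight individuals (hypothetical data).
has_webpage_weight = {
    "http://example.org/match-report": [
        Weight(0.8, "sports"),
        Weight(0.2, "travel"),
    ],
}

def has_universe(page, min_weight=0.0):
    """Compose hasWebpageWeight and weightHasUniverse into hasUniverse,
    keeping only universes whose weight is above min_weight."""
    return {
        w.weight_has_universe: w.has_weight_value
        for w in has_webpage_weight.get(page, [])
        if w.has_weight_value > min_weight
    }

all_universes = has_universe("http://example.org/match-report")
main_universes = has_universe("http://example.org/match-report", min_weight=0.5)
```

Reifying the weight as its own object is what allows a numerical value to be attached to a relation that would otherwise be purely binary.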

System Architecture
The main focus of this article has been the description of the structure conceived for knowledge representation within the profiling system. For the sake of completeness, we give a short overview of the design of the surrounding profiling and analysis architecture. Enhanced syntactic and semantic analysis can be computationally expensive. For this reason, such analyses have so far been avoided when handling vast amounts of data. In an industrial context, the profiling application has to handle a multitude of data instances that may arrive at high speed. One of our collaboration partners puts the number of arriving user events at no less than 150 million per month for a single publisher site. Hence, in a simplistic calculation, the system has to treat about 60 events per second for each single client. (Of course, user activity is not evenly distributed throughout the day. Periods of higher activity, such as the early evening, account for a large share of user events.) Performing semantic analysis at runtime might cause delays that are not acceptable in the application context. Hence, we decoupled the semantic analyses from actual user activity. In doing so, we benefit from the practical setting in the industrial context: due to privacy concerns, every online publisher only has access to those parts of the navigation logs that are recorded on its own websites or those of collaborators. Even though this might involve a considerable amount of web resources, it is still a limited set of contents. These contents can be continuously monitored and analysed, and the relevant semantic information kept in the system. In consequence, the actual profiling task that is performed at runtime is reduced to connecting the already available semantic page information according to each user's individual behaviour, and to deducing inherent patterns.
Fig. 3 shows an overview of the high-level building blocks and workflows. On the left-hand side, the web pages ("WP") enter the asynchronous analysis process. The results of their semantic qualification are directly added to the ontology. On user activity, that information is related to a user ID, user agent and session information, and ontological inference is used to deduce the relevant customer segments. The semantic page information within the system is updated on a regular basis, based on the lifecycle of the indexed pages. To preserve a maintainable knowledge base, contents that are vital have to be identified; contents that are outdated or uninteresting for the user base have to be discarded. The concrete measurement for site vitality is still to be developed and tested. However, incoming and outgoing links (and the vitality of the pages they lead to), the age of the page and the reappearance of its core concepts in novel resources seem to be a good starting point for a pertinent, automatic measurement.
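The decoupling of page analysis from runtime profiling can be sketched as follows. This is a deliberately simplified Python illustration: plain dictionaries stand in for the ontology and the RDF store, segment inference is omitted, and the additive accumulation of universe weights is an assumption. All function names and sample values are hypothetical.

```python
from collections import defaultdict

# Filled by the asynchronous page analysis, independent of user activity.
page_index = {}

def index_page(url, universes):
    """Offline step: store the (costly) semantic qualification of a page."""
    page_index[url] = universes

# Updated at runtime; maps a BID to accumulated universe weights.
profiles = defaultdict(lambda: defaultdict(float))

def on_user_hit(bid, url):
    """Runtime step: only a lookup and an update, no semantic analysis."""
    for universe, weight in page_index.get(url, {}).items():
        profiles[bid][universe] += weight

index_page("http://example.org/match-report", {"sports": 0.8, "travel": 0.2})
on_user_hit("bid-1", "http://example.org/match-report")
on_user_hit("bid-1", "http://example.org/match-report")
on_user_hit("bid-1", "http://example.org/not-indexed")  # unknown page: ignored
```

The point of the design is visible in `on_user_hit`: the work done per user event is a dictionary lookup, while the expensive analysis happens once per page in `index_page`.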

Conclusion and Future Work
The core focus of this article is the presentation of a data structure that has been conceived to support user profiling in digital advertising. The ontology represents expert knowledge about the entities, their relationships and the surrounding information flows that are essential to the profiling process.
The ontology is integrated in a system that performs semantic analysis based on structured input files that contain the navigation history of a multitude of users. Critical design decisions have been taken in consideration of the specific profiling needs of digital advertising and the pragmatics of the final application context. The system will have to perform in a Big Data environment and has thus been subject to focussed adaptations with respect to the previously cited criteria of Big Data contexts:
Volume: In a realistic working environment, the system will have to cope with vast amounts of incoming data, stemming from a multitude of commercial partners. A single publisher site accounts for about 150 million user events per month on average, and all those events have to be processed and integrated. To respond to this issue, the underlying ontology has been designed to capture only the information relevant to the profiling task, discarding additional available information that has no value for the specific application domain.
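The order of magnitude behind the volume figure can be checked with a back-of-the-envelope calculation. The 30-day month and the peak scenario (half of a day's events within four hours) are assumptions introduced only for this sketch; the source states merely that activity is unevenly distributed.

```python
EVENTS_PER_MONTH = 150_000_000          # figure quoted in the text
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumption: a 30-day month

# Average load if events were spread evenly over the month.
average_rate = EVENTS_PER_MONTH / SECONDS_PER_MONTH

# Hypothetical peak scenario: half of a day's events fall into 4 hours.
events_per_day = EVENTS_PER_MONTH / 30
peak_rate = (events_per_day * 0.5) / (4 * 60 * 60)
```

The average comes out at roughly 58 events per second, in line with the "about 60" quoted for a single publisher site; under the assumed peak scenario the rate would be about three times as high.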
Velocity: Semantic analysis is often considered too costly to be applied in a Big Data environment. To keep it from delaying the process, the time-consuming analysis has been decoupled from the actual profiling at runtime. Web documents are monitored and analysed in the system when they appear online. When the actual user activity happens, only the respective relationships have to be added to the ontological structure. Moreover, the used RDF database [20] [...]
Variety: The system is designed to be highly modular to allow later extension by further information sources. One could imagine additional modules for the semantic annotation of images or videos that feed their information into the same central repository. However, we chose to focus on one main information source for the deployment of the initial system, to allow fast validation of the presented approach.
Veracity: Ontologies have their strengths in the validation of information and the discovery of contradictory statements. The reasoning functionality is specifically designed to test all possible deductions, to discover contradictions and to propose strategies for their resolution. Furthermore, the explicit interconnection with external knowledge repositories allows their contents to be leveraged for this purpose. Each concept within the ontology carries a unique identifier (a URI) that is mapped to its counterpart in an external knowledge resource by means of predicates such as "owl:sameAs". This makes it possible to evaluate the meaningfulness of the information obtained from the analysis of a document. Links and statements that seem too far off the information in the knowledge repository are discarded from the profile; additional relationships from the external repository are added as necessary.
Value: The data we are working on are the navigation logs of actual users, as they are already used to determine interests for content recommendation. Several publications outline methods for their statistical analysis and evaluate the gain in user involvement (e.g. [21]).
In the near future, we will engage in detailed performance testing, preferably in direct comparison with a similar profiling system in industrial usage. First tests have been carried out in a laboratory environment and turned out promising, with an average response time well below one second. These results, however, have to be verified in a more realistic setting. For the moment, the integration of external resources is limited to one, namely DBPedia. Currently, we are experimenting with alternative connectors to repositories such as Yago, Freebase and Wolfram Alpha [22]. These resources differ in their structuring of the knowledge, their query language, the frequency of updates and, importantly, response time. The goal will be to find a good balance between the extensiveness of the accessible knowledge and the time that has to be invested to extract it. Furthermore, when relying on diverse information sources for our analyses, we will have to decide how to balance their influences on the final relationship structure in the dynamic ontology and what theoretical framework to use for their aggregation.
Fig. 3. Overview of the workflow within the MindMinings profiling system ("WP" means Web page)
Ana ROXIN is an associate professor at the department of Informatics, Electronics and Mechanics (IEM) of the University of Burgundy. She is a member of the Checksem research team, part of the LE2I (Laboratory of Informatics, Image and Electronics) UMR CNRS 6306. She was involved in several national and European projects (e.g. EU FP7 ASSET, TELEFOT) addressing information recommendation to the user. Her main research interests concern Semantic Web technologies, knowledge engineering, data interoperability and user profiling in a Big Data context.
Christophe NICOLLE is a professor in the Computer Science department at the University of Burgundy. He received his Ph.D. in Computer Science in 1996. Since then, he has been a member of the LE2I laboratory (Electronics, Informatics and Image) at the University of Burgundy. His research interests include the interoperability of heterogeneous information systems and the optimization of processes and resources using semantics, combinatorics and logical rules. Since 2001, he has worked on the Active3D project, dedicated to the development of a semantic framework for the management of buildings during their life-cycle. In 2005, he participated in the creation of the Active3D company. The company develops a collaborative web platform for facility management. Currently, the Active3D platform manages more than 60 million square meters of buildings.