Data in Brief

This data article describes user-generated data of Funda.nl, the largest online housing market website of the Netherlands. The data contain the inflow and outflow of hits (mouse clicks, opening of webpages, etc.) at the municipality level. The municipality of the user defines the origin and the municipality of the property that is viewed defines the destination. The data capture real behavior of the platform users. The flow data are based on 1.1 billion hits made by the users of the website in the first six months of 2018. The underlying data are collected by Google Analytics, the web analytics tool of Google. Funda utilizes the data for platform stability, security, product development, etc. The proprietary data of Funda are used to generate the information flows between municipalities. In the full sample we have 148,216 information flows between municipalities in the Netherlands, of which 313 are zero flows. The data include subsamples for different types of platform users, as user search intentions range from serious to fully recreational. The data enable researchers to analyze housing search behavior from a novel perspective. The data are, for instance, relevant for housing market researchers, digital economists, and economic geographers.

• The data differentiate between serious searchers and recreational searchers, so that they can be compared and analyzed.
• Researchers can use the datasets to gain insights into housing search processes, both online and offline. The data are particularly suited for studies that try to explain bilateral flows between locations, for instance through so-called gravity models. Beyond housing market applications, the data are valuable for researchers interested in information flows in general.
• From a cross-sectional perspective, these datasets seem amongst the most extensive in existence for gravity model applications (for a thorough literature overview, see [4]). Missing values and zero flows are relatively rare as the flows are based on big data, i.e., over a billion user hits.
• The data demonstrate the research potential of novel data sources, such as user-generated platform data. What is more, the data demonstrate that web analytics of online platforms can be transformed into meaningful datasets for academic use.
• It is likely that housing search flow data can be used to predict future residential mobility. Therefore, these data seem valuable in, for instance, land-use and spatial planning.

1. Data Description


Information flows

The information flow datasets, or hit flow datasets, described in this data article are based on user-generated data of Funda.nl, the largest online housing market website of the Netherlands. The Dutch Association of Realtors (NVM) created Funda as an online real estate platform in 2001 [3]. Since then, Funda has become one of the most innovative digital housing platforms in the world. Archived versions of the Funda website from 2018, which thus correspond to our data, can be found on The Wayback Machine, which was founded by the Internet Archive. See, for instance, the archived version of January 2, 2018 or the archived English language counterpart of October 2, 2018.

The datasets consist of the information (and search) flows of the users of the Funda housing platform website in the first six months of 2018. The geolocation of the platform user determines the origin of the flow, while the location of the property that is viewed determines the destination of the flow. The 1.1 billion hits in total (i.e., the full sample dataset), made by users in the Netherlands on listed objects in the Netherlands, aggregate into 148,216 information flows between municipalities. Approximately 95 percent of the hits relate to properties for sale; the remaining 5 percent concern hits on rental properties. The flows are between 382 origin municipalities and 388 destination municipalities; no origin data are observed for six small municipalities (see below). The descriptives for the full sample dataset are found in column 1 of Table 1. The descriptives show that the 1.1 billion hits of the full sample aggregate into 147,903 non-zero flows and 313 zero flows.

Notes to Table 1: Account, users that have registered and created a user account; Mortgage, users that have done the online mortgage calculation; Telephone, users that have clicked the button for the real estate agent's telephone number; Email service, users that have signed up for an email service of new listings within their preferences; Message, users that have contacted the real estate agent through an online form; Viewing, users that scheduled a viewing with the online tool; Buyer, users that registered themselves as the buyer of a property. See also: Steegmans and de Bruin [3, p. 11].
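As a quick consistency check on the counts reported above, the number of origin-destination pairs implied by 382 origin and 388 destination municipalities equals the total number of flows, which in turn equals the sum of the non-zero and zero flows. This suggests that the full sample dataset contains every origin-destination pair, internal flows included:

```latex
382 \times 388 = 148{,}216 = 147{,}903 + 313
```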


Information flows split by user type

Information flow datasets are also available for different types of users. The users are classified based on their website behavior (i.e., their actions). We provide these data as the intentions for using the platform range from serious house searching to fully recreational search. Flow datasets are provided for users that have registered and created a user account, users that have done the online mortgage calculation at least once, users that have clicked the button for the real estate agent's telephone number, users that have signed up for an email service of new listings within their preferences, users that have contacted the real estate agent through an online form, users that scheduled a property viewing with the online tool, and users that registered themselves as the buyer of a property (although it is not registered which property that is) [3]. Columns 2 to 8 of Table 1 show that the number of zero flows increases as the number of hits in the subsamples becomes smaller. Table 1 also shows that the relative sizes of the subsamples, expressed as a percentage of the full sample, are similar for different measures. Details on these measures, i.e., hits, events and time spent online, are provided in the methodology section.


Information flows split by device and network domain

Separate flow datasets are also provided for devices and network domain. The devices are split into desktop (including laptops), tablet, and mobile [3]. Table 2 shows that, for the first six months of 2018, the users of the website most often use a computer (almost 50 percent), followed by mobile phones (about 30 percent), and tablets (over 20 percent). Columns 2, 3 and 4 show that the device splits have few additional zero flows. The network domain is split into 'set' (i.e., any value other than 'not set') and 'not set'. For 15 percent of the cases the network domain is not set; for the remaining 85 percent the network domain falls within one of the other categories identified by Google Analytics (generally, a specific Internet Service Provider). For the 'not set' observations the number of zero flows increases to 20,453, compared to 127,763 non-zero flows.


Data format and variables

The datasets are formatted as Comma-Separated Values (CSV; encoding ISO 8859-1/Latin-1). The data are also available as Stata data files (DTA files; encoding version 11/12); Stata statistical software is, for instance, commonly used in economics [5]. The Stata files also contain variable labels (see Table 4). The full sample information flow data and the flow data for each subsample are stored in separate datasets.

The list of datasets is provided in Table 3. The seven user types are each split into two datasets based on whether or not the specific condition is met. For instance, there are users with an account (condition is 'true') and users without an account (condition is 'false'). The same holds for the remaining user types. For these user types the condition should be met at least once to be true. For instance, the dataset 'Flows_2018H1_viewing_true.csv' includes all users that scheduled a viewing online, regardless of the number of viewings that they planned. 'Flows_2018H1_viewing_false.csv' includes the users that never used the tool to schedule a property viewing. The device used splits the full sample into three subsamples: desktop, tablet, and mobile.

The full sample and subsample datasets have the same structure and contain the same variables. They each contain the origin municipality, the destination municipality, and the hit flow between them. The municipality codes correspond with those of Statistics Netherlands [6,7] in order to facilitate interoperability. Apart from the hit flow, flows in terms of events and time spent are provided. The datasets include the (Euclidean) distance between the centroids of the municipalities, and whether flows are internal (within the same municipality), between neighboring municipalities, or between municipalities within the same province. Finally, the datasets include (endogenous) indicators of the size of the origin and destination municipalities: that is, the number of unique users in the origin and the number of objects (i.e., properties) in the destination. The list of variable names, variable types and variable labels can be found in Table 4.
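As a minimal sketch of how one of the CSV files might be read with the encoding noted above, the Python snippet below uses only the standard library. The helper names `read_flows` and `count_zero_flows` are illustrative, and the column name passed to `count_zero_flows` is a placeholder: the actual variable names are those listed in Table 4. A comma delimiter is assumed.

```python
import csv


def read_flows(path, encoding="latin-1"):
    """Read a flow CSV file into a list of dicts, one per origin-destination pair.

    The default encoding follows the ISO 8859-1/Latin-1 encoding stated for the CSV files.
    A comma delimiter is assumed (csv.DictReader default).
    """
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


def count_zero_flows(rows, flow_column):
    """Count the rows whose value in `flow_column` equals zero.

    `flow_column` is a hypothetical column name; consult Table 4 for the real one.
    """
    return sum(1 for row in rows if float(row[flow_column]) == 0.0)
```

A call such as `read_flows("Flows_2018H1_viewing_true.csv")` would then return the rows of the subsample of users that scheduled a viewing online, after which `count_zero_flows` can be applied to the relevant flow column.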


2. Experimental Design, Materials and Methods


Funda's Google Analytics data

Funda uses Google Analytics, Google's web analytics tool, to register website users' activity. The Google Analytics data are thus, as noted before, proprietary data of Funda. Funda monitors the analytics for purposes related to, for instance, platform stability, security, and product development [8]. Google Analytics registers the data at different observation levels: the hit level, the session level, and the user level. A hit is a website interaction where data is sent to Google Analytics. Hits are the lowest observation level. They are divided into events and pageviews. Events can include link clicks, form submissions and video plays, while a pageview occurs when a new page is loaded [3,9]. Hits by the same user are grouped together in a session. After thirty minutes of user inactivity a session is closed. Website activity that follows afterwards is registered in a new session. Session information includes, for example, the user's device (desktop, tablet or mobile) and the IP address-based geolocation of the user. Sessions, in turn, can be linked to individual users. The users are identified through cookies [8].
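The thirty-minute inactivity rule described above can be illustrated with a short Python sketch. This is a simplification and not Google Analytics' actual implementation: it considers only the inactivity timeout, treats a gap of more than thirty minutes as the start of a new session, and uses the hypothetical function name `split_into_sessions`.

```python
from datetime import datetime, timedelta

# Inactivity threshold after which a session is considered closed.
SESSION_TIMEOUT = timedelta(minutes=30)


def split_into_sessions(hit_times, timeout=SESSION_TIMEOUT):
    """Group the hit timestamps of a single user into sessions.

    A new session starts whenever the gap since the previous hit exceeds `timeout`.
    Returns a list of sessions, each a chronologically ordered list of timestamps.
    """
    sessions = []
    for stamp in sorted(hit_times):
        if sessions and stamp - sessions[-1][-1] <= timeout:
            sessions[-1].append(stamp)  # still within the current session
        else:
            sessions.append([stamp])  # first hit, or inactivity gap too long
    return sessions
```

For example, hits at 10:00, 10:10, 10:45 and 11:00 would be split into two sessions, because the 35-minute gap between 10:10 and 10:45 exceeds the timeout.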

The Google Analytics data consist of dimensions and metrics. Dimensions are attributes of the data; metrics are quantifiable measurements. The default dimensions are the same for all Google Analytics customers, including Funda. The above-mentioned geolocations and the device used are examples of default dimensions. Apart from that, custom dimensions can be added by the Google Analytics customer. For Funda the custom dimensions include information on the property that is viewed, such as the property's location and whether it is a rental or sale object. Metrics include, for instance, the number of hits within a session and the time spent on the platform.


The origin and destination of hits

The Google Analytics data described above are the basis of the information flow data described in this article. The basis of the flow data is the individual hit. We are interested in the origin of the hit, i.e., the (approximate) user location, and the destination of the hit, i.e., the location of the property that is viewed. The user's municipality is derived from the IP address-based geolocation, which is registered at the session level in the Google Analytics data.
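To make the origin-destination logic concrete, the following Python sketch shows how hit-level records could be aggregated into municipality-level flows once each user geolocation has been mapped to a municipality. It is an illustration under stated assumptions, not the procedure used to build the datasets: the function name `aggregate_flows`, the record layout, and the lookup table are all hypothetical.

```python
from collections import Counter


def aggregate_flows(hits, city_to_municipality):
    """Count hits per (origin municipality, destination municipality) pair.

    `hits` is an iterable of (user_city, property_municipality) tuples (assumed layout).
    `city_to_municipality` maps a user geolocation to its municipality (assumed lookup).
    Hits whose geolocation is not in the lookup are skipped.
    """
    flows = Counter()
    for user_city, destination in hits:
        origin = city_to_municipality.get(user_city)
        if origin is not None:
            flows[(origin, destination)] += 1
    return flows
```

With a lookup that maps two villages to the same municipality, hits from both villages on properties in another municipality would be added to a single flow between the two municipalities.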

Google Analytics provides 1181 distinct geolocations (villages, towns, and cities) for our 1.1 billion hits. Google Analytics refers to this geolocation as 'city' although it includes,

nt Kruis. These villages had 20
, 313, and 329 inhabitants on January 1, 2018, respectively [3,10]. The 1181 distinct Google Analytics geolocations aggregate into 382 (origin) municipalities. There are no Google Analytics geolocations within the borders of the six remaining municipalities: Ameland, De Marne, Haarlemmerliede en Spaarnwoude, Rozendaal, Schiermonnikoog, and Vlieland. The flows from these municipalities are thus missing; this does not mean, however, that these flows are equal to zero. Hits from these municipalities are likely incorrectly attributed to other locations instead.

The most important reason to treat the Google Analytics geolocations (i.e., the user locations) somewhat cautiously is that Funda makes use of Google Analytics' IP anonymization tool. Funda uses this tool in order to comply with the General Data Protection Regulation (GDPR), which protects the data and privacy of individuals in Europe. The anonymization tool replaces the last bits of an IP address with zeros before it is sent to the Google Analytics server [11]. For instance, using dot-decimal notation, the IP address 131.211.209.183 would become 131.211.209.0 before the geolocation is added. For a thorough report on the impact of the Google Analytics IP anonymization tool, see Clifton and Wan [12].
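The effect of the anonymization on an IPv4 address, as in the example above, can be reproduced with Python's standard `ipaddress` module. The sketch below assumes that exactly the last octet (the final 8 bits) is set to zero, which matches the example in the text; IPv6 addresses are not covered, and the function name `anonymize_ipv4` is illustrative.

```python
import ipaddress


def anonymize_ipv4(address):
    """Return the IPv4 address with its last octet (final 8 bits) set to zero."""
    # A /24 network keeps the first 24 bits; strict=False allows host bits to be set.
    network = ipaddress.ip_network(f"{address}/24", strict=False)
    return str(network.network_address)


print(anonymize_ipv4("131.211.209.183"))  # prints 131.211.209.0
```

Because all 256 addresses in the same /24 block map to the same anonymized address, the geolocation is determined for the block as a whole rather than for the individual address.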

Using the data of Clifton and Wan [12], Steegmans and de Bruin [3] estimate that at the 'city' level 66.40 percent of the cases are unaffected by the anonymization tool for users in the Netherlands over the time period covered by the flow data. At the municipality level the accuracy rate increases to 69.6 percent. As IP addresses are distributed in blocks to Internet Service Providers, accuracy rates differ amongst them. There remains, for instance, a 98.4 percent municipal accuracy rate after 'anonymization' for the Dutch observations where the network domain is 'not set' [3].


The role of users

Users play an important role in the construction of the flow data. First, their geolocation determines the origin of the flow. Second, user actions are used to generate the user type (subsample-based) datasets. For the latter, hits and sessions from repeat visits have to be assigned to the correct user. This requires a unique user identifier. The Google Analytics user identifier (fullVisitorId) and the Funda user identifier for logged-in users allow for this. Importantly, anonymity is guaranteed as the fullVisitorId cannot be matched to other sources; it only has a purpose within Funda's Google Analytics environment. Furthermore, no personally identifiable information (PII) is included; neither full nor anonymized IP addresses are stored in Google Analytics. Only the corresponding geolocation is registered. Apart from that, we do not have access to user accounts, which are likely to contain PII for at least part of the users. Similarly, houses and apartments have a unique Funda ID so that address information of the objects viewed does not have to be linked. For the registered buyers we only observe that an (unidentified) property is being registered. Users are free in their choice to register themselves as buyer or not.


Variable construction and measurement

Hits are the basis of the information flows described in this article. Nevertheless, two flow alternatives are available. The first are the flows in terms of events, a subcategory of the hits. Both hits and events have been described at the beginning of this section. The second alternative are the flows in terms of time spent viewing properties, in other words, the total time that users from a given municipality spent viewing properties from another municipality (or their own). This variable, however, contains a particularly large measurement error. The default Google Analytics 'time on site' variable provides information at the session level. It is impossible to derive the time flow data from this measure. After all, for meaningful flows to be constructed, the time spent must be attributable to individual properties, instead of to sessions in which a multitude of properties are viewed. Therefore, we must rely on the time stamps of individual hits that relate to a given property. In this approach, however, it is not possible to unequivocally determine when the view of a property ends. Furthermore, users switching between properties/webpages also distort the measurements.

The municipalities are classified by their Statistics Netherlands (CBS) identifier code, the 'GM code' [6]. The GM code is used because it facilitates matching the flow datasets to public data from Statistics Netherlands, most notably shape files (for the creation of maps) and municipality figures, such as population statistics and surface areas [7].

The distance variable is measured as the Euclidean distance between the centroids of a municipality pair. The centroids are RD (Rijksdriehoek) coordinates from the spatial reference system, industry standard EPSG:28992 [13], that is used by the Cadastre and government organisations in the Netherlands. As the x and y coordinates are expressed in meters, the distances follow directly from them. For internal flows, i.e., flows within a municipality, the distance between the centroids would be zero. As this underestimates the true (internal) distance, we use the Head and Mayer approximation [14] instead. The internal distance is therefore given by:

$$d_{ii} = \frac{2}{3}\sqrt{\frac{A}{\pi}}$$

where $d_{ii}$ denotes the internal distance of municipality $i$ and $A$ is the surface area of the municipality. The internal distance approximations are thus used when the origin and destination municipality are the same.
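The distance rules described above can be sketched in a few lines of Python. The sketch assumes centroid coordinates in meters (as in the RD system) and surface areas in square meters, and uses two-thirds of the radius of a circle with the municipality's area for internal distances, as in the Head and Mayer approximation. The function names and the dictionary-based inputs are illustrative assumptions, not the authors' code.

```python
import math


def euclidean_distance(x1, y1, x2, y2):
    """Straight-line distance between two points given in meters."""
    return math.hypot(x2 - x1, y2 - y1)


def internal_distance(area):
    """Internal distance approximation: (2/3) * sqrt(area / pi)."""
    return (2.0 / 3.0) * math.sqrt(area / math.pi)


def municipality_distance(origin, destination, centroids, areas):
    """Distance for a municipality pair.

    `centroids` maps a municipality code to (x, y) in meters and `areas` maps it to
    a surface area in square meters (both hypothetical inputs). Internal flows use
    the area-based approximation instead of the zero centroid distance.
    """
    if origin == destination:
        return internal_distance(areas[origin])
    (x1, y1), (x2, y2) = centroids[origin], centroids[destination]
    return euclidean_distance(x1, y1, x2, y2)
```

Handling the internal case in the same function keeps the distance variable strictly positive for every origin-destination pair, which matters when distances are log-transformed in gravity-type models.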


Ethics and Privacy Statement

Funda has to comply with the General Data Protection Regulation (GDPR) when collecting website analytics. According to Funda, Google Analytics is set up "as privacy friendly as p