Synonyms

Location-based recommendation; Positional or layout effect in recommender systems; Spatiotemporal collaborative filtering; Time-sensitive recommendation

Glossary

Recommender:

A system that recommends items (e.g., news articles, blog posts) to users

Response Rate:

The probability that a user would respond to (e.g., click, share) a recommended item

Feature:

Information (about a user, an item, and the context in which the item may be recommended to the user) that can be used to predict the response rate

Page:

A web page on which recommended items are placed

Context:

The situation (which includes time, geographical location, location of a web page, etc.) in which recommendations are made to a user

Graph:

A set of nodes connected by a set of edges

Definition

Social media sites (like twitter.com, digg.com, blogger.com) complement traditional media by incorporating content generated by regular people and allowing users to interact with content through sharing, commenting, voting, liking, and other actions. Since the number of content items is usually too large for a person to manually examine to find interesting ones, it is important for social media sites to recommend a small set of items that are worth looking at for each user. To satisfy each individual user, recommended items have to match the user's personal interests and be relevant to the user's current spatiotemporal context. For example, a content item about the user's hometown is usually a better choice than an item about an unknown foreign country, and a content item on a fresh trending topic is usually more interesting than an item on a stale topic.

Spatiotemporal personalized recommendation of social media content refers to techniques used to make personalized recommendation based on:

  • The geographical location of a user and an item (the location of an item can be the location that the item is about or the location of the author of the item)

  • The location of a user in the social space (e.g., the neighborhood of a user in a friendship graph)

  • The position of an item placed on a page and the layout of the page

  • Temporal evolution of user interests

  • Temporal behavior of the popularity of an item

  • Identification of trending topics

Introduction

Social media usually refers to a group of Internet-based applications that allow creation and exchange of user-generated content (Kaplan and Haenlein 2010). For example, weblog sites like blogger.com give regular people the ability to publish articles (called blogs) on the web, microblogging sites like twitter.com facilitate fast distribution of short messages on any topic posted by anyone, and social news sites like digg.com allow their users to vote news articles (and other web content) up or down in order to surface popular and interesting news stories based on the wisdom of the crowd (i.e., votes from users), to name just a few. Because of the success of such social media sites, almost all online media sites now provide their users with the ability to share and comment on content items (e.g., news articles, photos, songs, movies), regardless of whether the content items are generated by regular users. Since sharing and commenting are usually considered social activities, the distinction between social media and online media (which includes social media) blurs. Thus, in this article, we discuss recommendation methods suitable for any online media, with a special emphasis on spatial, temporal, and social characteristics of users and content items.

The large amount of content generated by social media makes it difficult for users to find personally relevant content. To alleviate such information overload, many social media sites recommend a small set of content items to each user based on what they know about the user and the items. We use the term “item” to refer to any candidate object to be recommended to users, which includes (but is not limited to):

  • Publisher-generated items like articles, songs, and movies, which are not generated by regular users, but are voted, shared, liked, or commented on by them

  • User-generated items like blogs, tweets (short messages posted on twitter.com), photos, videos, status updates, and comments on other items

Good recommendations help social media sites keep their users engaged and interested.

Key Points

When recommending items to users, it is important to consider whether an item is relevant to a user in the spatiotemporal context in which recommendations are to be made. A few key reasons are listed below. Notice that we take a broad view of the spatial aspect that includes locations in geographical space, social space, and positions on a web page:

  • Users are likely to be more interested in items about their current geographical location than items about a random location, which is especially true for mobile applications (see, e.g., Zheng et al. (2010)).

  • In some applications, users tend to have similar preferences to those who are close to them in the social space, which is especially true when closeness is defined based on a trust network (see, e.g., Jamali and Ester (2010)).

  • It is generally true that items placed at prominent positions (e.g., top) on a page generate more responses from users than the same items placed at non-prominent positions (see, e.g., Agarwal et al. (2009)).

  • Users change their interests in topics over time (see, e.g., Ahmed et al. (2011)).

  • Popularity of items also changes over time (see, e.g., Agarwal et al. (2009)).

Many methods have been developed to exploit these spatiotemporal characteristics to improve the performance of recommenders. A comprehensive review of these methods is beyond the scope of this article. Instead, after providing a brief historical background, we illustrate key ideas in spatiotemporal personalized recommendation through a generic supervised learning approach, which handles spatiotemporal characteristics by (1) defining features that capture those characteristics and (2) learning a function that predicts whether a user would respond to an item positively based on these features from a dataset that records users' past responses to items. This approach generally applies to recommendation of any kind of item.

Historical Background

There have been many approaches developed to make personalized recommendations. When the items to be recommended are text articles, which may be represented as bags of words, an early approach is to also represent a user as a bag of words. The user's bag of words can be constructed by including representative words from the articles that the user likes to read. Then, we can recommend to a user the articles whose bags of words are most similar to the user's bag of words through Salton's vector space model (Salton et al. 1975). For items that are not easily representable as bags of words, how other users respond to an item may provide a clue as to whether to recommend the item to a user who has not yet responded to it. Agrawal et al. (1993) proposed that, in a retail store setting, products can be recommended based on customers' co-buying behavior. For example, if the majority of customers who buy product A also buy product B, then we may recommend product B to a customer who only bought product A. This idea was then extended by incorporating a notion of similarity between users or items. For example, when we decide whether to recommend item B to user i, we look at whether users “similar” to user i respond to item B positively. Notice that Agrawal's method is based on the similarity definition that two customers are similar if they buy the same product. A different definition of similarity between users leads to a different method. Furthermore, we can also exploit similarity between items in a similar way: when deciding whether to recommend item B to user i, check whether user i liked items that are “similar” to B in the past. Here, similarity between two items can be defined by looking at whether most users responded to the two similarly. Adomavicius and Tuzhilin (2005) provided a good review of such methods.
This kind of method is generally referred to as collaborative filtering, because the recommendations that a user receives depend on other users' responses to candidate items: the process can be thought of as a collaboration among users to help one another find interesting items (although users may not be aware of the collaboration).

Conceptually, one can put users' past responses to items into a matrix. Since this matrix-oriented approach is popular in movie recommendation (Koren et al. 2009), we use it as an example in the following discussion. In a movie recommender system, users rate movies. Let y ij denote the rating that user i gives to movie j. For example, y ij may be a numeric value ranging from 1 to 5, representing 1 star to 5 stars. Let Y denote the m × n matrix whose (i, j) entry is y ij , where m is the number of users and n is the number of movies in the system. Notice that there are many entries with missing (i.e., unknown) values in matrix Y because most users only rate a small number of movies. For user i, if we can predict the missing values in the ith row of matrix Y accurately (where the entries with missing values correspond to movies that have not yet been rated by user i and are thus candidate items to be recommended to him/her), then we can recommend to user i the movies having the highest predicted rating values. One popular way of making such predictions is through matrix factorization: approximate matrix Y as the product UV′ of two low-rank matrices U of size m × r and V of size n × r, where V′ denotes the transpose of matrix V and the rank r of matrices U and V is much smaller than the numbers m and n of users and items, respectively. Let u i denote the ith row of matrix U, v j denote the jth row of matrix V, and Ω = {(i, j) : user i rated movie j} denote the set of observed entries in matrix Y. This approximation can then be mathematically formulated as the following optimization problem.

$${\rm{Find }}U{\rm{ and }}V{\rm{ that minimize}}\sum\limits_{(i,j) \in \Omega } ( y_{ij} - {\bf{u'}}_i {\bf{v}}_j )^2,$$
(1)

where \(u_i ^\prime {\rm{ }}v_j\) is the inner product of the two vectors u i and v j . Notice that \(u_i ^\prime {\rm{ }}v_j\) is the (i, j) entry of matrix (UV′) and is also the predicted value of y ij . Thus, the above optimization seeks to minimize the difference between matrix Y and matrix (UV′) over only the set Ω of observed entries of Y. The sum of squared differences is a common choice, while other choices are available for different problem settings. Recent studies, such as Koren et al. (2009), Agarwal and Chen (2009), and many others, suggest that matrix factorization usually provides recommendations superior to those of more traditional methods.
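The optimization in Eq. (1) can be solved, for instance, by stochastic gradient descent over the observed entries. Below is a minimal sketch on a toy dataset; the ratings, learning rate, and rank are illustrative, not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observed ratings: (user i, movie j, rating y_ij); all other entries of Y are missing.
observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
m, n, r = 3, 3, 2                        # users, movies, latent rank (r << m, n in practice)

U = 0.1 * rng.standard_normal((m, r))    # user factors; row i is u_i
V = 0.1 * rng.standard_normal((n, r))    # item factors; row j is v_j

lr = 0.01
for epoch in range(2000):
    for i, j, y in observed:
        err = y - U[i] @ V[j]            # residual on an observed entry only
        U[i], V[j] = U[i] + lr * err * V[j], V[j] + lr * err * U[i]

# Squared error on the observed set shrinks; missing entries are predicted by U @ V.T.
sse = sum((y - U[i] @ V[j]) ** 2 for i, j, y in observed)
print(sse)
```

In practice, a regularization term on U and V is usually added to the objective to avoid overfitting the sparse observed entries.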

A survey of a wide range of approaches to recommender systems can be found in Jannach et al. (2010) and Ricci et al. (2011). Here, we focus on how to make use of spatial, temporal, and social information to make good recommendations of social media content. In particular, we illustrate key ideas in spatiotemporal personalized recommendation through a general supervised learning (or statistical modeling) approach, which generally applies to recommendation of any kind of item.

Supervised Learning Approach

In general, a recommendation problem can be formulated as follows. A recommender is given:

  • A user, who is associated with a vector of user features, e.g., age, gender, and location

  • A context, which is associated with a vector of context features, e.g., day of week when the recommendation is to be made

  • A set of candidate items, each of which is associated with a vector of item features, e.g., topics and keywords

The goal of the recommender is to rank and pick the top few items from the set of candidate items that best “match” the user's interests and information need in the context. The supervised learning approach exploits the fact that, in many recommenders, a dataset of users' past responses (e.g., click, share) to items can be collected, and defines the degree to which an item matches a user as the response rate of the user to the item (e.g., the probability that the user would click the item if he/she sees it on a web page). Such predictions can be made by using a statistical (regression or machine learning) model, which “learns” the user and item behavior that allows accurate predictions from the dataset, where users' past responses in the dataset “supervise” the learning process by giving desired (e.g., click) and undesired (e.g., no click) examples. When such a model is available, recommendations for a user can be made by picking the top few items having the highest response rates among the set of candidate items. This supervised learning approach applies to recommendation of any kind of item, where spatiotemporal and other characteristics can be incorporated by defining features that capture those characteristics.
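Once a model that predicts response rates is available, the recommendation step itself reduces to top-k selection over the candidate set. A minimal sketch, using hypothetical predicted click probabilities:

```python
# Hypothetical predicted click probabilities for one user's candidate items.
predicted_rate = {"article_a": 0.12, "article_b": 0.31, "article_c": 0.05, "article_d": 0.27}

def recommend(rates, k):
    """Return the k candidate items with the highest predicted response rate."""
    return sorted(rates, key=rates.get, reverse=True)[:k]

print(recommend(predicted_rate, 2))  # -> ['article_b', 'article_d']
```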

To use this supervised learning approach, a developer of a recommender needs to make the following three decisions:

  • What response should the model try to predict?

  • What features should the model use to capture the characteristics of users, items, and the spatiotemporal context?

  • What class of model do we want to use?

After introducing a running example, we discuss how to choose the response, provide a number of useful features, and then introduce two commonly used classes of models, namely feature-based regression model and latent factor model. See Jannach et al. (2010) and Ricci et al. (2011) for other classes of models. See Hastie et al. (2009) for a general introduction to supervised learning.

Example Recommender

For concreteness, we use blog article recommendation as a running example. Consider that we want to develop a recommender for a blog service provider (e.g., blogger.com) that seeks to recommend to each user a set of interesting blog articles posted by other users. To make modeling more interesting, assume that a user can declare friendship with other users and such friendship connections between users are available to the recommender. In this example, the set of candidate items for each user consists of all of the articles posted within a 1-week time window (to ensure freshness) by any user of this service provider. Notice that the set of candidate items changes over time. For simplicity, we only need to recommend 10 articles for each user, once per day, and the recommended articles are displayed in a list on the sidebar of each user's homepage (they are only visible to the owner of the homepage, not the visitors of the homepage, since the recommendations are made to the owner).

Choice of Response

The choice of response depends on the objective that a recommender is developed for and availability of user feedback that the recommender receives. A common objective is to maximize clicks on recommended items because the fact that a user clicks an item indicates that the user is interested in knowing more about the item. Note that clicks are user feedback that can easily be made available to a recommender through logging whether each user clicks the recommended items. In this case, a natural choice of the response is whether a user would click an item if he/she sees the item being recommended. Here, the goal of learning is to predict the probability that a user would click an item based on a dataset that records what items each user clicked and what items each user did not click in the past.

Beyond clicks, a recommender may be developed for other objectives. For example, if the objective of recommendation is to encourage users to make comments on recommended items, then a natural choice of the response would be whether a user would comment on a recommended item or not. On some sites, users can explicitly rate items (e.g., using one star to five stars); then, a natural choice of the response would be the rating that a user would give to an item. For simplicity, we only consider methods that seek to achieve a single objective and model the response rate of a single type of choice (e.g., modeling either click rate or explicit star rating, but not both). See Agarwal et al. (2011a) for an example of multi-objective recommendation, and see Agarwal et al. (2011b) for an example of joint modeling of multiple types of responses.

Let y ijk denote the response that user i gives to item j in context k. For concreteness, assume that we choose to model whether the user would click the item.

Feature Engineering

Having good features is essential to an accurate model, but one usually does not get good features automatically. It requires domain knowledge, good intuition, and experience in the application to define good features. Here, for illustration purposes, we only show a number of example features that can potentially capture different kinds of spatiotemporal characteristics for our example recommender. Real-life recommenders usually need to use many more features than the ones shown here.

User Features

Let w i denote the vector of features of user i. For simplicity, we mostly consider binary features, meaning each element in the vector is either 0 or 1. Example features are as follows:

  • Gender: From the user's registration record when he/she signed up on the site, the recommender obtains the gender of each user. The numeric value of the feature is 1 if the user is male and 0 if the user is female.

  • Age: Also from the user's registration record, the recommender obtains the age of each user. For example, we can group age values into 10 age groups, which gives 10 age features. If the user's age is in an age group, the value of the feature corresponding to that age group is 1, and the remaining age groups get feature value 0.

  • City: From the IP address of a user, the recommender can guess the city that the user is in. Here, we use a set of features, one for each city, to represent the user's geographic location. For example, assume the user lives in New York City. Then, the value of the New York City feature is 1 and the values of the rest of the city features are all 0 for the user. It is common to only include cities that have at least n users, where n is a threshold that a developer of the recommender can choose to reduce the number of features.
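The binary encoding described above can be sketched as follows; the 10-year age buckets and the list of known cities are hypothetical choices for illustration.

```python
def user_features(gender, age, city, known_cities):
    """Binary feature vector w_i: gender flag, 10 age-group indicators, city indicators."""
    w = [1 if gender == "male" else 0]
    age_group = min(age // 10, 9)                       # hypothetical 10-year buckets, capped at 90+
    w += [1 if g == age_group else 0 for g in range(10)]
    w += [1 if c == city else 0 for c in known_cities]  # only cities above the threshold n
    return w

w_i = user_features("female", 34, "New York", ["New York", "London", "Tokyo"])
print(w_i)  # -> [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
```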

Item Features

Let x j denote the vector of features of item j. Example features are as follows:

  • Bag of words: It is common to represent the text content of an article as a bag of words, which corresponds to a set of features, one for each keyword. For simplicity, we only consider binary keyword features. The value of a keyword feature is 1 if the article contains the keyword and 0 if the article does not contain the keyword. Since the total number of words in all articles is usually too large, it is also common to reduce the space of all keywords to a relatively small number of important words, e.g., location names or other named entities.

  • Topics: Another way to reduce the space of words in articles is to group words into topics and then assign topics to articles based on the words in articles. This process can be automated through topic models like latent Dirichlet allocation (Blei et al. 2003). One output from such a model is a vector of topic membership for each article, where each element in the vector represents the probability that the article is about a particular topic.
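A minimal sketch of the binary bag-of-words encoding over a reduced keyword vocabulary; the vocabulary and article text here are illustrative.

```python
def item_features(text, vocabulary):
    """Binary bag-of-words vector x_j over a reduced keyword vocabulary."""
    words = set(text.lower().split())
    return [1 if keyword in words else 0 for keyword in vocabulary]

vocab = ["new", "york", "football", "recipe"]          # illustrative reduced vocabulary
x_j = item_features("A weekend guide to New York pizza", vocab)
print(x_j)  # -> [1, 1, 0, 0]
```

A real implementation would normalize tokens more carefully (e.g., strip punctuation, handle multiword entities like “new york”) before matching against the vocabulary.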

Context Features

Let z ijk denote the vector of features of the context in which user i is (to be) recommended with item j in context k (which include time and location). Example features are as follows:

  • Day of week: This is the day of week (weekday vs. weekend) when the recommendation is to be made. User behavior during the weekday can be quite different from that during the weekend. The value of this feature is 1 for weekday and 0 for weekend.

  • Article age: This is the age of an article (not to be confused with the age of a user), which is the number of days since the article was posted. We put it into the category of context features, instead of item features, because it depends on both the article and time, not on the article alone. For example, assume the article was posted 2 days ago; then, the value of the feature corresponding to 2 days ago is 1, and the other days get feature value 0. To model finer-grained temporal effects, one may choose a finer time resolution (e.g., hour instead of day).

  • Position on page: It is well known that the click rate of an item placed at the top of a list on a page is usually higher than that of the same item placed in the middle or at the bottom of the page. To capture this positional bias, we define a set of features, each of which corresponds to a position in the list. For example, assume the article is placed at the third position; then, the value of the feature corresponding to the third position is 1 and all other positions have feature value 0.

  • Friendship: This feature is 1 if user i is connected to the author of item j through a friendship connection and is 0 otherwise.

  • Same city: This feature is 1 if user i is in the same city as the author of item j and is 0 otherwise.
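The context features above can be assembled into a single binary vector z ijk. A hypothetical sketch, assuming the 7-day freshness window and the 10-slot list of the running example:

```python
def context_features(is_weekday, article_age_days, position, n_positions,
                     is_friend, same_city, max_age_days=7):
    """Binary context vector z_ijk: day-of-week flag, one-hot article age,
    one-hot list position, friendship and same-city indicators."""
    z = [1 if is_weekday else 0]
    z += [1 if d == article_age_days else 0 for d in range(max_age_days)]
    z += [1 if p == position else 0 for p in range(n_positions)]
    z += [1 if is_friend else 0, 1 if same_city else 0]
    return z

# Weekday, article posted 2 days ago, shown at list position 3 (0-based), author in same city.
z = context_features(True, 2, 3, 10, is_friend=False, same_city=True)
print(len(z))  # 1 + 7 + 10 + 2 = 20 features
```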

Note that the above features are only simple examples. The goal here is to provide concrete examples of features for illustration purposes, instead of suggesting good features for practical implementation.

Feature-Based Regression Model

After defining the response and features, we have a standard supervised learning problem. When the response is binary (e.g., either click or no click), we can use logistic regression. See Hastie et al. (2009) for an introduction to logistic regression. Let p ijk denote the probability that user i would respond to item j when he/she sees it in context k. There are many ways in which one can define a function that predicts p ijk based on features. A useful prediction function is as follows:

$$p_{ijk} = \sigma ({\bf{w'}}_i {\bf{Ax}}_j + \beta '{\bf{z}}_{ijk} ),$$
(2)

where \(\sigma (a) = {1 \over {1 + \exp ( - a)}}\) is the sigmoid function that transforms an unbounded value a into a number between 0 and 1 (since p ijk is a probability), A is a regression coefficient matrix, β is a regression coefficient vector, and \(w_i ^\prime\) and β′ are the row vectors obtained by transposing the column vectors w i and β, respectively. Given a dataset of users' past responses to items, where each record is of the form (y ijk , w i , x j , z ijk ), off-the-shelf logistic regression packages can be applied to learn the regression coefficients A and β.
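The prediction function in Eq. (2) can be sketched directly; the feature dimensions and coefficient values below are illustrative stand-ins, not learned from real data.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_rate(w_i, x_j, z_ijk, A, beta):
    """p_ijk = sigmoid(w_i' A x_j + beta' z_ijk), as in Eq. (2)."""
    return sigmoid(w_i @ A @ x_j + beta @ z_ijk)

rng = np.random.default_rng(1)
w_i  = rng.integers(0, 2, size=5).astype(float)   # binary user features
x_j  = rng.integers(0, 2, size=4).astype(float)   # binary item features
z    = rng.integers(0, 2, size=3).astype(float)   # binary context features
A    = 0.1 * rng.standard_normal((5, 4))          # stands in for learned coefficients
beta = 0.1 * rng.standard_normal(3)

p = predict_rate(w_i, x_j, z, A, beta)
print(0.0 < p < 1.0)  # the sigmoid guarantees a valid probability
```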

To better understand this model, we take a closer look at the prediction function. Let A mn denote the (m, n) entry of matrix A, w im denote the mth user feature in vector w i , and x jn denote the nth item feature in vector x j . By definition, we have

$${\bf{w'}}_i {\bf{Ax}}_j = \sum\limits_m {\sum\limits_n {A_{mn} } } w_{im} x_{jn}.$$
(3)

For example, assume w im is the feature that indicates whether user i lives in New York City and x jn is the feature that indicates whether article j contains the keyword “new york.” Then, the regression coefficient A mn would try to capture the propensity of users living in New York City to click an article that contains the keyword “new york,” after adjusting for all other factors. Now, assume that the mth and nth context features in z ijk indicate whether article j is posted 1 day ago and whether j is posted 5 days ago, respectively. Then, the difference between regression coefficients β m − β n would try to quantify how much the popularity of an article drops from day 1 to day 5, all other conditions being equal.

Latent Factor Model

Although feature-based regression models are useful for predicting users' response rates to items, they depend highly on the availability of predictive features, which usually requires a significant feature engineering effort with no guarantee of obtaining predictive features. Also, feature vectors may not be sufficient to capture the differences between users or items. For example, when two users have identical feature vectors, feature-based regression models would be unable to distinguish between the two. One way of addressing these issues is to add latent factors into the prediction function; i.e.,

$$p_{ijk} = \sigma ({\bf{w'}}_i {\bf{Ax}}_j + \beta '{\bf{z}}_{ijk} + {\bf{u'}}_i {\bf{v}}_j ),$$
(4)

where u i and v j are two r-dimensional vectors both to be learned from data like regression coefficients A and β, where r is much smaller than the number of users and the number of items. Recall that we have seen \(u_i ^\prime v_j\) in the matrix factorization method in the historical background section. The difference is that, instead of factorizing the response matrix, here we factorize the residual (i.e., prediction error) matrix of feature-based regression in order to capture the behavior of users and items that the features fail to capture.

Intuitively, one can think of u i and v j as “latent feature” vectors of user i and item j, respectively. We do not determine the values of these r latent features per user or item before learning the model. Instead, u i and v j are treated as variables that can be used to reduce the error of predicting the responses in the dataset used for learning. The inner product \(u_i ^\prime v_j\) then represents the affinity between user i and item j; the larger the inner product value, the higher the probability that user i would click item j. After the learning process, we simultaneously obtain the values of these latent features and also the regression coefficients A and β. See Agarwal et al. (2010) for an example of such a latent factor model.
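The prediction function in Eq. (4) can be sketched as follows; it also illustrates the point above that two users with identical feature vectors can still receive different predictions once their latent vectors differ. All values below are illustrative, not learned.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_rate(w_i, x_j, z_ijk, A, beta, u_i, v_j):
    """Eq. (4): feature-based regression plus a latent-factor correction u_i' v_j."""
    return sigmoid(w_i @ A @ x_j + beta @ z_ijk + u_i @ v_j)

rng = np.random.default_rng(2)
A    = 0.1 * rng.standard_normal((3, 3))
beta = 0.1 * rng.standard_normal(2)
w    = np.ones(3)                  # two users with IDENTICAL observed features
x    = np.ones(3)
z    = np.ones(2)

u1 = np.array([0.5, -0.2])         # latent vectors (r = 2); values illustrative
u2 = np.array([-0.3, 0.4])
v  = np.array([1.0, 1.0])

p1 = predict_rate(w, x, z, A, beta, u1, v)
p2 = predict_rate(w, x, z, A, beta, u2, v)
print(p1 != p2)  # identical features, different latent factors -> different rates
```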

Spatiotemporal contexts can also be incorporated into a latent factor model. For example, assume we want to model a temporal effect through latent factors. Let context index k represent the kth time period (e.g., day). One way of capturing user or item behavioral changes over time is through the following model:

$$p_{ijk} = \sigma ({\bf{w'}}_i {\bf{Ax}}_j + \beta '{\bf{z}}_{ijk} + \langle {\bf{u}}_i,{\bf{v}}_j,{\bf{t}}_k \rangle ),$$
(5)

where \(\left\langle {u_i,v_j,t_k } \right\rangle = \sum\nolimits_\ell {u_{i\ell } v_{j\ell } t_{k\ell } }\) is a form of tensor product of the three vectors u i , v j , and t k . Note that u iℓ denotes the ℓth element of vector u i , and so on. Similar to the previous model, u i , v j , and t k are all latent feature vectors, whose values are to be learned from data. Unlike the previous model, where the affinity \(u_i ^\prime v_j\) between user i and item j is fixed over time, the affinity \(\left\langle {u_i,v_j,t_k } \right\rangle\) is now a function of time period k, which means this model captures the changing behavior of user-item affinity. Specifically, in this model, the user and item latent feature vectors are fixed over time, but the affinity between the two is a weighted sum of the element-wise product of the two latent feature vectors u i and v j , where the weight vector t k changes over time. See Xiong et al. (2010) for an example of such a temporal latent factor model.
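The tensor-product affinity in Eq. (5) is just a weighted sum of the element-wise product of u i and v j. A minimal sketch with illustrative values, showing how the same user-item pair can have different affinities in different time periods:

```python
import numpy as np

def affinity(u_i, v_j, t_k):
    """<u_i, v_j, t_k> = sum over l of u_il * v_jl * t_kl, as in Eq. (5)."""
    return float(np.sum(u_i * v_j * t_k))

u = np.array([1.0, 2.0])        # user latent vector (fixed over time)
v = np.array([0.5, -1.0])       # item latent vector (fixed over time)
t_day1 = np.array([1.0, 1.0])   # time-period weight vectors (learned; values illustrative)
t_day2 = np.array([1.0, 0.0])

print(affinity(u, v, t_day1))   # -1.5: both latent dimensions contribute
print(affinity(u, v, t_day2))   # 0.5: the second dimension is switched off on day 2
```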

Summary

Personalized recommendation is an important mechanism for surfacing social media content. The spatiotemporal context in which a recommendation is made provides a key piece of information that helps a recommender to recommend the right item to the right user at the right time. While many methods have been proposed in the literature, the supervised approach is attractive because of its generality, where spatiotemporal characteristics can be incorporated as features or latent factors. In this article, we introduced a number of example features and two example models. In practice, many features need to be evaluated and a number of different models need to be tried, so that a good recommender can be built.

Future Directions

Personalized content recommendation is currently an active research area in data mining, information retrieval, and machine learning. A lot of progress has been made in this area, but challenges remain.

  • Improving response rate prediction accuracy: Although many models have been proposed to predict response rates and we have seen prediction accuracy improve over time, accurate prediction of the probability that a user would respond to an item is still a challenging problem, especially for users and items that the recommender knows little about. What are the spatial, temporal, social, and other kinds of features that can further improve accuracy? How can a recommender actively collect data to achieve better model learning and evaluation?

  • Multi-objective optimization: A recommender usually is designed to achieve multiple objectives. For example, many web sites put advertisements on article pages to generate revenue. In addition to recommending articles that users like to click, we may also want to recommend articles that can generate high advertising revenue. How can a recommender optimize multiple objectives in a principled way?

  • Multi-type response modeling: In social media, users respond to items in multiple ways, e.g., clicks, shares, tweets, emails, and likes. How can we jointly model such different types of user responses in order to find the items that a user truly wants to be recommended?

  • Whole-page optimization: On a web page, there can be multiple recommender modules. For example, one recommends news articles, another recommends updates from a user's friends, and yet another recommends online discussions the user may be interested in. How can we jointly optimize multiple recommender modules on a page to leverage the correlation among modules and to ensure consistency, diversity, and serendipity?

  • Collaborative content creation: Wikipedia demonstrated high-quality content creation through massive collaboration. However, in most recommender systems, items to be recommended are created by a single party (e.g., a publisher or a user). How can we synthesize items at the right level of granularity to recommend to users in a semiautomatic collaborative way?

Cross-References

Data Mining

Friends Recommendations in Dynamic Social Networks

Link Prediction

Matrix Decomposition

Mining Trends in the Blogosphere

Probabilistic Graphical Models

Recommender Systems: Models and Techniques

Recommender Systems Using Social Network Analysis: Challenges and Future Trends

Recommender Systems, Semantic-Based

Regression Analysis