Elsevier

Data & Knowledge Engineering

Volume 117, September 2018, Pages 252-263

What you use, not what you do: Automatic classification and similarity detection of recipes

https://doi.org/10.1016/j.datak.2018.04.004

Abstract

Social media data is notoriously noisy and unclean. Recipe collections and their manual categorization built by users are no exception. However, a consistent and transparent categorization is vital to users who search for a specific entry. Curators face a similar challenge when given a large collection of existing recipes: they first need to understand the data to be able to build a clean system of categories. This paper presents an empirical study using machine learning classifiers (logistic regression and decision trees) for the automatic classification of recipes on the German cooking website Chefkoch.de. The central question we aim to answer is: Which information is necessary to perform well at this task? In particular, we compare features extracted from the free-text instructions of the recipe to those taken from the list of ingredients. On a sample of 5000 recipes with 87 classes, our feature analysis shows that a combination of nouns from the textual description of the recipe with ingredient features performs best in the logistic regression model (48% F1). Nouns alone achieve 45% F1 and ingredients alone 46% F1. Other word classes, however, do not complement the information from nouns. Decision trees consistently underperform logistic regression but lead to an interpretable model. On a bigger training set of 50,000 instances, the best configuration improves to 57% F1, highlighting the importance of a sizeable data set. In addition, we report on the use of these feature vectors for similarity search and ranking of recipes and evaluate on the task of (near) duplicate detection. We show that our method can reduce manual curation effort, achieving precision@3 = 0.52.

Introduction

In 2012, 63.7% of Germans used the Internet as a source of inspiration for cooking [1]. One popular cooking website is Chefkoch.de, where every user can contribute to a shared database of recipes and discussions. The result of this social network approach is a large data set of diverse and potentially noisy information.

Commonly, a recipe consists of at least three major parts, exemplified in Fig. 1: the list of ingredients, whose entries consist of an ingredient type, an amount, and a unit; the cooking instructions, in which the steps for preparing the dish from the ingredients are described in natural language; and metadata, which supplies, for instance, information about the preparation time and difficulty. Each recipe is assigned to a number of categories, for instance of the subtypes regional (e.g., Germany, Malta, USA and Canada), seasonal (e.g., Christmas, spring, winter), or course (e.g., vegetables, pork, dessert) (see http://www.chefkoch.de/rezepte/kategorien/). When submitting new recipes, both users and curators may not understand the full range and structure of the category system. Thus, each new recipe may introduce additional noise into the database. Therefore, contributors as well as database curators would benefit from automatic support in choosing appropriate categories for a recipe.

Similarly, a contributor might not be able to find a specific recipe and therefore opt for adding it, though it might exist already. This introduces additional noise. A method for discovering potentially similar or even equivalent recipes can help in keeping the number of near duplicates low and, on the query side, help in finding variations of a recipe.

We address both tasks in this paper. For the categorization part, we estimate a statistical model of category assignments based on recipes in the Chefkoch.de database. This model will be beneficial for database completion, adjustment, and consolidation of existing recipes and will help users and curators by suggesting categories for a new recipe. Our main contributions are experiments to investigate the performance of the model: (1) We compare logistic regression and decision tree classification models, taking into account different types of information from the ingredient list and the textual description. In particular, we make use of ontological information to generalize over specific ingredients, and we investigate different subtypes of word classes. (2) Our evaluation of different feature sets shows that nouns are more important than verbs and that the order of ingredients in the list is only of limited importance for classification. (3) We provide a visualization of the recipes using dimensionality reduction to contribute to a better understanding of the data. This also highlights which subsets of categories are particularly challenging. We work with German data, which is characterized by rich morphology, e.g., regarding the variety of plural forms and compounds. However, we do not incorporate any specific handling of German.
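To make the two information sources concrete, the following minimal sketch combines ingredient features and noun features into one sparse feature vector. It rests on several assumptions that are not taken from the paper: features are binary indicators, the nouns have already been extracted from the instructions by some part-of-speech tagger (not shown), and the function name `build_features` as well as the `ing:` and `noun:` prefixes are hypothetical.

```python
def build_features(ingredients, nouns):
    """Combine ingredient and noun features into one sparse binary vector.

    Both inputs are lists of strings. Prefixes keep the two sources apart,
    so an ingredient "Mehl" and a noun "Mehl" remain distinct features.
    Binary weights and the prefixes are assumptions of this sketch.
    """
    features = {}
    for ingredient in ingredients:
        features["ing:" + ingredient.lower()] = 1.0
    for noun in nouns:
        features["noun:" + noun.lower()] = 1.0
    return features


# Invented toy recipe: two ingredients and two nouns from the instructions.
example = build_features(["Mehl", "Ei"], ["Teig", "Mehl"])
```

Dropping one of the two loops yields the ingredients-only or nouns-only configuration, which is how the feature sets can be compared against each other.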

To discover similar recipes, including potential duplicates as well as variations of a recipe, we propose to use the feature vectors built for the classification task in an unsupervised retrieval setting. We evaluate this method on (near) duplicate detection.
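The retrieval setting can be illustrated with a small sketch that ranks recipes by the similarity of their feature vectors and scores the ranking with precision@k. Cosine similarity is an assumption here, since this excerpt does not name the similarity measure, and precision@k is computed in its common form (relevant items among the top k, divided by k), which may differ in detail from the evaluation in the paper. All recipe ids and feature values are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(name, 0.0) for name, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_similar(query_id, vectors, k=3):
    """Return the ids of the k recipes most similar to the query recipe."""
    query = vectors[query_id]
    scored = [(cosine(query, vec), rid)
              for rid, vec in vectors.items() if rid != query_id]
    # Sort by descending similarity; break ties by id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [rid for _, rid in scored[:k]]


def precision_at_k(ranked, relevant, k=3):
    """Fraction of the top-k ranked ids that are marked as relevant."""
    return sum(1 for rid in ranked[:k] if rid in relevant) / k


# Invented toy vectors (feature name -> weight).
vectors = {
    "pancakes_a": {"ing:flour": 1.0, "ing:egg": 1.0, "ing:milk": 1.0},
    "pancakes_b": {"ing:flour": 1.0, "ing:egg": 1.0, "ing:milk": 1.0,
                   "ing:sugar": 1.0},
    "goulash": {"ing:beef": 1.0, "ing:onion": 1.0, "ing:paprika": 1.0},
    "omelette": {"ing:egg": 1.0, "ing:milk": 1.0, "ing:onion": 1.0},
}
ranking = rank_similar("pancakes_a", vectors, k=3)
score = precision_at_k(ranking, relevant={"pancakes_b"}, k=3)
```

In this toy example the near duplicate `pancakes_b` is ranked first, so a curator would only need to inspect the top of the list rather than the whole collection.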

Section snippets

Recipe as subject of research

Recipes have been the subject of several previous studies. We focus on text-oriented research here (as opposed to, for example, the classification of image data by Cadene [2]). Most related to this paper is prior work on recipe classification by Su et al. [3]. They analyzed the correlation between recipe cuisines and ingredients for recipes from Food.com (http://food.com). They trained support vector machines to predict a single cuisine, using ingredients as features. Overall, they achieved a

Models and feature sets

We frame the task of automatically categorizing a recipe as a multi-label classification problem. Each recipe is represented as a high-dimensional feature vector, which is the input for multiple binary prediction models, one for each category. The output of each model corresponds to an estimate of the probability of the recipe being associated with this category. Throughout this paper, we report our results with binary logistic regression models (Cox [34]). Furthermore, we also briefly discuss
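The multi-label framing, with one binary logistic regression model per category, can be sketched as follows. This is a minimal illustration under stated assumptions, not the configuration used in the paper: training uses plain batch gradient descent without regularization, the hyperparameters are arbitrary, and the feature columns, category names, and data are invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_binary(X, y, epochs=200, lr=0.5):
    """Fit one binary logistic regression by batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, target in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - target  # gradient of the log loss w.r.t. the logit
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def train_multilabel(X, label_sets, categories):
    """Train one independent binary model per category."""
    return {c: train_binary(X, [1 if c in labels else 0 for labels in label_sets])
            for c in categories}


def predict_proba(models, x):
    """Estimated probability of each category for one feature vector."""
    return {c: sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for c, (w, b) in models.items()}


# Invented toy data; columns: flour, sugar, beef, onion.
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
label_sets = [{"Baking", "Dessert"}, {"Baking"}, {"Main course"}, {"Main course"}]
models = train_multilabel(X, label_sets, ["Baking", "Dessert", "Main course"])
probs = predict_proba(models, [1, 1, 0, 0])
```

Because each category has its own model, a recipe can receive several categories at once, for example by keeping every category whose estimated probability exceeds a chosen threshold.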

Experimental setup for classification

We use a database dump of 263,854 Chefkoch.de recipes from June 2016. The minimum number of ingredients in any recipe is 1, the maximum is 61, and the average is 9.98. The overall number of unique ingredients is 3954. The number of recipes varies across categories (7825.3 on average, although the median is only 1592). The categories on Chefkoch.de are structured hierarchically with up to four levels. Within this hierarchy, recipes associated with any leaf node are assumed to also belong to all parent
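The hierarchy convention described above, where a recipe assigned to a category also counts as a member of the categories above it, can be sketched as a simple label expansion. The sketch assumes the hierarchy is available as a child-to-parent mapping; the function name `with_ancestors` and the category fragment shown are hypothetical and do not reproduce the actual Chefkoch.de category tree.

```python
def with_ancestors(labels, parent):
    """Expand a set of categories with all of their ancestors.

    `parent` maps each category to its parent, or to None for a root.
    """
    expanded = set()
    for label in labels:
        node = label
        # Walk upwards; stop early if this branch was already covered.
        while node is not None and node not in expanded:
            expanded.add(node)
            node = parent.get(node)
    return expanded


# Hypothetical fragment of a category hierarchy (child -> parent).
parent = {
    "Germany": "Europe",
    "Europe": "Regional",
    "Regional": None,
    "Cake": "Baking",
    "Baking": None,
}
expanded = with_ancestors({"Germany", "Cake"}, parent)
```

Applying this expansion before training means that the binary model for an inner category sees every recipe filed under any of its descendants as a positive example.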

Discussion, conclusion & future work

In this paper, we have shown that logistic regression can classify recipes on the Chefkoch.de database with up to 57% F1. The decision tree classifier underperforms but allows for gaining insight into the structure of recipes and the influence of ingredients on the category, as shown on the example of the category “Baking”. Feature analysis revealed that ingredients alone are nearly as good an indicator as the recipe description. Information from both sources complements each other.

We expected the

References (38)

  • T.F. Media. Wenn Sie kochen, woher beziehen Sie Anregungen für Ihre Gerichte? (2012)
  • R. Cadene. Deep Learning and Image Classification on a Medium Dataset of Cooking Recipes (2015)
  • H. Su et al. Automatic recipe cuisine classification by ingredients
  • J. Naik et al. Cuisine Classification and Recipe Generation (2015)
  • I. Hendrickx et al. Very quaffable and great fun: applying nlp to wine reviews
  • W. Min et al. Being a supercook: joint food attributes and multimodal content modeling for recipe retrieval and exploration. IEEE Trans. Multimed. (May 2017)
  • J. Oberländer et al. Distributional gastronomics
  • L. Wang et al. A personalized recipe database system with user-centered adaptation and tutoring support
  • H. Maeta et al. A framework for recipe text interpretation
  • L. Wang et al. Substructure similarity measurement in Chinese recipes
  • C. Kiddon et al. Mise en place: unsupervised interpretation of instructional recipes
  • S. Mori et al. Flow graph corpus from recipe texts
  • S. Mori et al. A machine learning approach to recipe text processing
  • Y. Yamakata et al. Feature extraction and summarization of recipes using flow graph
  • M. Wiegand et al. Data-driven knowledge extraction for the food domain
  • M. Wiegand et al. Web-based relation extraction for the food domain
  • M. Reiplinger et al. Relation extraction for the food domain without labeled training data–is distant supervision the best solution?
  • E. Donalies. Himmel und Erde – wie wir Gerichte benennen und warum wir das tun. Sprachreport (2017)
  • H. Xie et al. A hybrid semantic item model for recipe search by example
