Development of food fraud media monitoring system based on text mining



Introduction
It is accepted within the EU that food fraud covers cases where there is a violation of EU food law, which is committed intentionally to pursue an economic or financial gain through consumer deception. Food fraud in the food supply chain can arise as a result of misrepresentation associated with: product integrity (e.g. counterfeit product, expiration date), process integrity (e.g. diversion of products outside of intended markets), people integrity (e.g. the involvement of actors such as cyber criminals and hacktivists) and data integrity (e.g. improper, expired, fraudulent or missing common entry documents or health certificates) of information accompanying the food item throughout the supply chain (Manning & Soon, 2018; Manning, 2016).
Food fraud incidents have been reported in many countries with a direct link to public health problems and even deaths. In Norway in 2002-2004, liquor adulterated with methanol killed 9 people, and 51 people were admitted to hospital with methanol poisoning. The liquor responsible for these cases contained 20% methanol and 80% ethanol and probably came from the same producer in southern Europe (Hovda et al., 2005). In China in 2008, melamine was used to adulterate protein levels in locally produced infant milk powder. The adulteration resulted in illness in 294 000 individuals, in hospitalisation of 50 000 infants and in six deaths (Domingo, Tirelli, Nunes, Guerreiro, & Pinto, 2014; Liu, Liu, Zhang, & Gao, 2015; Qiao, Guo, & Klein, 2010; Xiu & Klein, 2010; Zhang & Xue, 2016). In Europe in 2013, several EU countries found traces of horsemeat in products fraudulently labelled as beef (O'Mahony, 2013; Stanciu, 2015). The Rapid Alert System for Food and Feed (RASFF) (RASFF, 2017) was used to exchange relevant information during this incident and to react quickly to protect consumers. Traceability checks began on the same day, and the European Commission (EC) started to implement an action plan to fight food fraud, to strengthen the European Union (EU) system and to restore consumer confidence (EU, 2013).
https://doi.org/10.1016/j.foodcont.2018.06.003 Received 29 March 2018; Accepted 2 June 2018
Recently, various developments have taken place in the field of food fraud systems to understand and document food fraud incidents. In the UK, the HorizonScan database was created by Fera. HorizonScan focuses on global food and feed integrity issues such as incidents related to adulteration, substitution and fraud, as well as microbial contaminants, allergens, pesticides and drug residues. In the EU, the Food Fraud Network (FFN) was established to handle requests for cross-border cooperation and to ensure the rapid exchange of information between national authorities and the Commission in cases of suspected fraudulent practices. In 2017, 597 cases were exchanged by the FFN (EC, 2018) using a dedicated IT system, the Administrative Assistance and Cooperation (AAC) system. In the USA, other databases have been created, such as the Economically Motivated Adulteration database (EMA) (EMA, 2017) and the USP food fraud database (USP, 2018). The EMA database contains food fraud incidents since 1980 and is housed at the National Center for Food Protection and Defence. The database provides information about the food product, fraud incident year, adulterant, type of fraud, health consequences, country of origin and how the incident was discovered (Marvin et al., 2016). The fraud type classifications used by these systems differ considerably from each other. RASFF includes both intentional and unintentional food fraud notifications, which are classified in six different fraud types (Table 1): improper, fraudulent, missing or absent health certificates; illegal importation; tampering; improper, expired, fraudulent or missing common entry documents or import declarations; expiration date; and mislabelling.
The EMA database proposes nine different types of food fraud (Table 1), of which the most important are: substitution, artificial enhancement, dilution, transhipment, counterfeit, and misbranding. The HorizonScan database mainly includes six types of food fraud (Table 1): adulteration/substitution, fraudulent health certificate/documentation, produced without inspection, unapproved premises, expiry date changes, and unauthorised/unsuitable transport.
None of the above-mentioned systems provides cases of food fraud reported in the media globally, and it is advocated that this information may be a useful additional intelligence source that will help authorities and industries to design their control strategies and increase the chance of detecting fraudulent activities. Therefore, the aim of this study was to develop this extra information source by exploiting a powerful infrastructure that is already in place (i.e. the MedISys portal of the European Media Monitor (EMM)). MedISys is a media monitoring system providing event-based surveillance to rapidly identify potential public health threats using information from media reports. MedISys is a text mining system that continuously monitors about 900 specialist medical sites plus all generic EMM news, i.e. over 20 000 RSS feeds and HTML pages from 7000 generic news portals and 20 commercial news wires in 70 languages altogether. This system is used by a number of health agencies, including ECDC, EFSA and WHO (Mantero, Cox, Linge, van der Goot, & Coulombier, 2010; Rortais, Belyaeva, Gemo, van der Goot, & Linge, 2010). However, MedISys does not collect media reports on food fraud. Therefore, we developed a filter in MedISys dedicated to food fraud (MedISys-FF). The developed MedISys-FF collects food fraud reports in the media worldwide in eight different languages and was tested for 16 months (September 2014 to December 2015). In this period all reports collected by MedISys-FF were retrieved and compared to food fraud reports published in three food fraud systems: RASFF, EMA, and HorizonScan. It was concluded that the newly developed system collects complementary information and therefore may be a useful additional intelligence source to combat food fraud.

Material and methods
In this section, we present the development steps of the MedISys-FF and introduce three food fraud systems (RASFF, EMA, and HorizonScan).

MedISys-FF tool
The construction of the food fraud filter consists of the following steps: (i) identification of food fraud keywords, (ii) validation of keywords by experts in food fraud, (iii) development of the MedISys food fraud filter, and (iv) evaluation and improvement of the filter (Fig. 1).

Step 1: Identification of food fraud keywords
A literature search was carried out to obtain an overview of the most recent trends in food fraud and to identify the list of food fraud keywords. Several digital libraries were used to search for scientific articles on food fraud, such as Scopus, Science Direct, Springer Link, and Google Scholar. In addition, food fraud databases were analysed to determine the keywords, namely the Rapid Alert System for Food and Feed (RASFF), the Economically Motivated Adulteration database (EMA) and HorizonScan.

Step 2: Validation of food fraud keywords by experts
Three senior experts from the Netherlands and the USA with long-term experience in this topic validated the identified keywords. The validation resulted in 59 English food fraud keywords, which are listed in Table 2.
All the keywords in Table 2 were translated into 8 different languages to collect articles in various languages: Arabic, Chinese, Dutch, French, German, Italian, Portuguese, and Spanish. Together with the English originals, this resulted in 531 food fraud keywords (59 keywords in 9 languages).

Step 3: The design of MedISys food fraud filter
A keyword-based method was used to create the food fraud filter, combining the defined list of keywords (Table 2) with a list of keywords that should be excluded, such as "artificial food", "how to make fake milk", "how to make fake alcohol", and "counterfeit medicine". This method was preferred because of its effectiveness and precision. If a keyword consisted of two or more words, the symbol "+" was used. For example, the keyword "food fraud" was converted to "food + fraud", which means that the words "food" and "fraud" should appear together and in the same order in the article.
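The matching rule described above can be illustrated with a minimal Python sketch. It is not the MedISys implementation, which is not described in detail here; the function names, the regular-expression approach and the rule that any excluded keyword rejects an article are assumptions made for illustration. Only "food + fraud" and the excluded phrases come from the text; "fake + milk" is a hypothetical keyword added to show the exclusion at work.

```python
import re


def compile_keyword(keyword):
    """Turn a filter keyword such as "food + fraud" into a regex.

    The "+" joins words that must appear together and in the same order,
    so "food + fraud" matches the phrase "food fraud" (case-insensitive).
    """
    words = [w.strip() for w in keyword.split("+") if w.strip()]
    phrase = r"\s+".join(re.escape(w) for w in words)
    return re.compile(r"\b" + phrase + r"\b", re.IGNORECASE)


def matches_filter(text, include, exclude):
    """Return True if text hits an include keyword and no exclude keyword.

    Assumption: a single excluded keyword is enough to reject the article.
    """
    if any(compile_keyword(k).search(text) for k in exclude):
        return False
    return any(compile_keyword(k).search(text) for k in include)


# Illustrative subsets only. "fake + milk" is hypothetical, not from Table 2.
INCLUDE = ["food + fraud", "fake + milk"]
EXCLUDE = ["how + to + make + fake + milk", "counterfeit + medicine"]

print(matches_filter("Police uncover food fraud ring", INCLUDE, EXCLUDE))
print(matches_filter("How to make fake milk at home", INCLUDE, EXCLUDE))
print(matches_filter("Fraud alleged at food bank", INCLUDE, EXCLUDE))
```

The third headline contains both "food" and "fraud" but not as an ordered phrase, so it is not matched. Word-boundary matching as written suits languages that separate words with spaces; other languages would need a different tokenisation.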

Step 4: Evaluation and improvement
The newly built filter with keywords related to food fraud was run for 2 weeks. All articles collected by the filter in this period were retrieved from the hyperlink provided by the system and assessed for relevance to 'food fraud'. An article was considered relevant if it described an event involving both food and fraud. The relevance was about 30%.
To improve the relevance of the filter, the set of keywords was adapted and tested for relevance after another week. The relevance had increased to 67%. This process was repeated twice until a stable level of 80% relevance was reached. Once this optimal performance was achieved, the keyword combinations were kept constant and the food fraud articles were collected from 15/09/2014 to 31/12/2015. The collected articles are presented on the MedISys filter as a hyperlink to the original publication, including a short summary (in the original language and translated into English) generated by the system itself. This link and summary stay on the filter as long as the original article is online. Once the article is removed by its publisher, the link and summary disappear. To be able to evaluate the articles found by the MedISys filter over a long time period, it is preferable to extract them from the filter and store them in a separate (not publicly available) database. To this end, the RSS feed functionality of MedISys was used to automatically extract the articles (hyperlink and summary) and store them in a local database using MongoDB and Elasticsearch (see Fig. 2).
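The extraction step can be sketched as follows, assuming the feed follows the standard RSS 2.0 layout (item, title, link, description, pubDate); the actual MedISys feed schema is not given in this paper. The function name and the sample feed are hypothetical, and the storage step (MongoDB and Elasticsearch) is deliberately left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_rss_items(rss_xml: str) -> List[Dict[str, str]]:
    """Extract one record per <item> of an RSS 2.0 feed, skipping repeated links."""
    root = ET.fromstring(rss_xml)
    seen_links = set()
    records = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link or link in seen_links:
            continue  # no usable hyperlink, or article already captured
        seen_links.add(link)
        records.append({
            "link": link,
            "title": (item.findtext("title") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return records


# Made-up feed content for illustration; not real MedISys output.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Olive oil fraud uncovered</title><link>https://example.org/a</link>
<description>Short summary A.</description><pubDate>Mon, 15 Sep 2014 08:00:00 GMT</pubDate></item>
<item><title>Olive oil fraud uncovered</title><link>https://example.org/a</link>
<description>Short summary A.</description><pubDate>Mon, 15 Sep 2014 08:10:00 GMT</pubDate></item>
<item><title>Mislabelled fish on sale</title><link>https://example.org/b</link>
<description>Short summary B.</description><pubDate>Tue, 16 Sep 2014 09:00:00 GMT</pubDate></item>
</channel></rss>"""

records = parse_rss_items(SAMPLE_FEED)
for record in records:
    print(record["link"], "|", record["title"])
# Each record could then be written to a local store; that step is omitted here.
```

Keying records on the hyperlink means an article that reappears in a later poll of the feed is stored only once, and the stored copy survives after the publisher removes the original.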

Food fraud reporting systems
Rapid Alert System for Food and Feed (RASFF)
The RASFF portal is a dynamic online alert tool that aims to rapidly share any information concerning health risks derived from food or feed (RASFF, 2017) between the European Commission (EC) and the control authorities of Member States (MS).
RASFF ensures that urgent notifications are sent, received and responded to collectively and efficiently. RASFF gives public access to summary information about the most recently transmitted RASFF notifications and also allows users to search for any notification issued in the past. This portal is compiled from several sources: i) official controls on the market, ii) border controls, and iii) consumer complaints. Notification details can include the notification type, the basis of notification, the product in question, the countries involved, the action taken, the distribution status, and the type of hazard identified. All fraud/adulteration notifications under the product category food reported in RASFF in the period 01/01/2000 to 31/12/2015 were extracted and analysed.

Economically Motivated Adulteration database (EMA)
The EMA database focuses on the intentional adulteration of food for economic gain or for food fraud. This database is compiled from literature and media searches of EMA incidents in food products since 1980. Sources include LexisNexis, PubMed, Google, FDA Consumer and FDA recall records, state reports, and the RASFF portal. Food fraud reports collected by EMA in the period 01/01/2000 to 12/2015 were analysed in this study.

HorizonScan
HorizonScan focuses on global food and feed integrity issues, and food fraud is one of the food safety hazards covered. Food fraud incidents recorded in the database are related to adulteration, substitution and fraud. HorizonScan tracks more than 536 commodities in 202 countries from official websites of more than 65 countries with more than 100 data sources that are scanned daily. Every day on average more than 30 new records are added to the database. All food fraud reports collected by HorizonScan in the period 01/01/2000 to 31/12/2015 were extracted and analysed.
Bouzembrak et al. Food Control 93 (2018) 283-296

Results

MedISys-FF
Food fraud publications (2014-2015)
In the period from 15/09/2014 to 31/12/2015, 1144 newspaper articles on food fraud were automatically collected by MedISys-FF. All articles were assessed manually: 851 articles were relevant and 293 were irrelevant and were removed. Of the relevant articles, 58% reported a food fraud incident (Fig. 3).
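As a consistency check on these counts, the relevance of the filter over the collection period can be recomputed from the figures above. The short Python sketch below does this; the function name is hypothetical, and the number of incident articles is only an approximation derived from the rounded 58% share, not a figure reported in the study.

```python
def relevance_rate(relevant, collected):
    """Share of collected articles that were judged relevant."""
    return relevant / collected


collected = 1144   # articles collected by MedISys-FF
relevant = 851     # judged relevant after manual assessment
irrelevant = 293   # removed as irrelevant

# The two groups should add up to the total collected.
assert relevant + irrelevant == collected

rate = relevance_rate(relevant, collected)
print(f"Relevance: {rate:.1%}")  # Relevance: 74.4%

# Approximate count derived from the rounded 58% share (not reported directly).
approx_incident_articles = round(0.58 * relevant)
print(f"Articles reporting an incident: about {approx_incident_articles}")
```

The resulting relevance of about 74% is close to the 75% accuracy quoted in the Conclusions.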

Product category and origin country
The relevant articles were further analysed for the following characteristics: (i) the food products reported as fraudulent, (ii) the type of fraud detected, and (iii) the country in which the media report appeared. The main topics in these media articles were food fraud in general, followed by fraud with meat and fraud with seafood (Table 3). Food fraud media articles were collected from 78 countries. The number of articles per country is shown in Fig. 4. The countries with the highest number of food fraud articles in the media in the period investigated are Egypt (94 articles), United States (85 articles), United Kingdom (84 articles) and Saudi Arabia (73 articles). However, we did not succeed in collecting food fraud publications from several countries in Latin America (e.g., Argentina, Chile, Uruguay, Peru, and Bolivia) or in Africa (e.g., Angola, Zimbabwe, Mozambique, and Congo-Kinshasa). This may be due to the lack of geographical locations in the articles, language barriers or an absence of media coverage of food fraud in the analysed period. The vast majority of the media reports (> 95%) dealt with food fraud conducted in the country of publication. Interesting differences between countries were observed. For example, in the US most publications deal with fraud related to seafood, whereas in the United Kingdom reported fraud is mainly related to alcohol and meat (Appendix Table A1).

Types of food fraud
The fraud types used by the various systems reporting food fraud vary (see Table 1), and clearly there is a need for harmonisation. For the purpose of this study, we used the classification of food fraud based on RASFF as defined by Bouzembrak and Marvin, which is shown in Table 4. However, not all food fraud reports collected by MedISys-FF could be classified in this way, and therefore we added three other categories: food fraud (in general, hence not specified), regulation, and disrespect of regulation. The distribution of the food fraud reports collected by MedISys-FF over these categories is shown in Table 4. Most of the reports were on food fraud in general (28%), followed by expired food (23%), mislabelling (13%) and regulation (13%).
The number of fraud notifications related to meat and meat products increased from 1 notification in 2000 to 59 notifications in 2013. This is partly explained by the horse meat scandal in Europe in 2013. Another example is the increase in fraud notifications for nut products and seeds from 1 notification in 2001 to 40 notifications in 2015 (Appendix Table A2).

Food fraud incidents (2000-2015)
There are no records of food fraud prior to 2006 in HorizonScan. In the period 2006-2015, the number of recorded food fraud incidents increased from 7 to a peak of 377 in 2015 (see Fig. 7).

Product categories and origin country
In the period 2000-2015, 1515 incidents were reported for 149 distinct food commodities. The main food categories and their frequencies are summarised in Table 9. The 10 commodities listed in Table 9 account for 55% of food fraud incidents. Each of the remaining commodities constitutes less than 2% of the total reported commodities subject to fraud.

Types of food fraud
Table 10 provides the fraud categories of the HorizonScan database, which are adulteration (60%), fraudulent health certificates (20%), produced without inspection (11%), unapproved premises (4%) and changing expiry date (4%).

In this section, the results of MedISys-FF are compared with three existing food fraud systems that collect and report food fraud issues: RASFF, EMA and HorizonScan (Table 11).
The origins of the reports presented in the four different systems range from media publications (MedISys-FF) to official notifications from authorities (RASFF), and this is also reflected in the commodity most frequently found, the type of fraud reported and the country of origin of the fraudulent product. Table 11 shows, for the same time span (September 2014-December 2015), the four most frequently reported commodities, types of fraud and countries of origin of the fraudulent product. Note that in the table these three characteristics are not necessarily linked. Interestingly, the two systems that collect food fraud reports from authorities (RASFF and HorizonScan) have different commodities among the four most frequently reported ones. In addition, the top four countries of origin of the fraudulent product differ between the two systems. Note that for this study the reports in HorizonScan that originate from RASFF could be identified and were removed. This was not possible for the EMA data, and therefore overlap due to duplication will exist between RASFF and EMA. MedISys-FF and EMA both have fish products in common with RASFF. Overlap between EMA and RASFF is expected since EMA also collects food fraud reports from RASFF. In the period analysed, RASFF reported 90 fraud cases with seafood, which accounts for almost 50% of the fraud cases in seafood in the EMA database. Such duplication does not exist between MedISys-FF and RASFF.

Discussions
In this study, a tool has been developed that specifically collects articles on food fraud from the media world-wide and the collected articles were compared with three existing systems that report on food fraud. The four systems analysed in this study showed little overlap regarding the fraud incidents they report.
Such differences may be due to the inconsistency between the systems (i.e. product categories, type of fraud, origin country), trade patterns between countries (e.g. EU countries are not importing milk from India or Brazil), the relatively short period used for this comparison, and a lack of information (e.g., EU cases of food fraud are missing in RASFF). Furthermore, they may be partly due to the origin of the data and the purpose of the databases. RASFF is a European portal that is maintained by the European Commission. The objective of RASFF is to provide food and feed control authorities within the EU, as well as Norway, Liechtenstein, Iceland and Switzerland, with an effective tool to exchange information about measures taken in response to serious risks detected in relation to food or feed. HorizonScan, on the other hand, collects reports from the websites of food safety authorities (worldwide), and authorities do not necessarily report on their websites the cases that are also disclosed in RASFF, since these often deal with non-compliance in documentation. MedISys-FF specifically collects reports on food fraud that are discussed in the media. The majority of these reports deal with food fraud in the country where the media report appears and generally do not involve discussions of fraudulent documentation, which is what RASFF mainly reports. Based on the results, RASFF seems to tackle most of the cases originating from outside the borders of the EU, mainly by preventing document fraud, such as fraudulent health certificates (HC), illegal importation or improper or missing documentation (Tähkäpää, Maijala, Korkeala, & Nevas, 2015). Furthermore, a bias is expected in the RASFF data since the testing is often risk-based and is performed by the border inspections on food products imported into the EU. The focus is on fraud that may impact human health and less on fraud that has only economic effects (cheating). However, the latter may receive much media attention, which is picked up by MedISys-FF.
As determined by the newly developed food fraud tool, seafood is the most often reported fraudulent product in the US (Appendix Table A1). Fraud in seafood in the US has been reported in the literature as well. Manning and Soon (2014) published a report on the types of food fraud incidents identified in US restaurants and retail outlets in 2012 and concluded that 58% of the sampled retail outlets (81 retail outlets) sold mislabelled fish, with small markets having a higher incidence of fraud (40%) than national chain grocery stores (12%). Furthermore, all sushi bars sampled (16 bars) sold mislabelled fish and 94% of the tuna tested was not tuna at all (Manning & Soon, 2014). In India in 2011, a survey of adulteration in liquid milk (1791 samples) found that 68% of the randomly collected samples tested were fraudulent milk. In urban areas, 69% of milk sold was tampered milk (detergent addition to milk (8%); skimmed milk powder (45%) and glucose addition to milk (27%)). In seven Indian states all samples taken were found to be impure. Interestingly, fraud in milk apparently continued to be an issue in India, since MedISys-FF detected a high number of media articles on milk fraud in India in the period analysed (2014-2015) (see Appendix Table A1). These two cases demonstrate that MedISys-FF can contribute to the identification of food fraud issues occurring in a specific country in the world.
Food fraud is reported in various systems, but these systems have, as discussed in this paper, different objectives and use food fraud classifications and product categories which are not harmonised. Therefore, a full picture of the extent of fraud (e.g. the actual level of food fraud) is difficult to obtain. This is further hampered by the often risk-based approach used by the authorities to select the samples to be tested. Nevertheless, trends in fraudulent products observed in these systems may be useful for the development of prediction systems, especially when fraudulent findings are linked to economic and/or human behaviour factors. By linking food fraud reports of RASFF and EMA to economic and other data using Bayesian Networks, Marvin et al. (2016) could predict the type of fraud with more than 80% accuracy, demonstrating the practical use of the current data.

Table 9
The main product categories reported in HorizonScan (2000-2015).
The newly developed MedISys-FF system collects information on food fraud cases mentioned in the media and therefore potentially provides additional information, since the other systems do not cover the media. Some media coverage of food fraud may be driven by publications of the authorities and is therefore picked up by systems that scan the authorities' websites, such as HorizonScan. However, media reports on food fraud that result from newspapers' own investigations will not be found by HorizonScan. Differences in the technology used in HorizonScan and MedISys-FF could also cause differences in the detection of reports between the two systems. A systematic comparison over a long period (2-3 years) could help to clarify this point.

Conclusions
Within MedISys, a food fraud tool has been constructed that collects food fraud media reports world-wide every 10 min, 24/7, and therefore provides an up-to-date overview of media articles on food fraud at a global level.
The accuracy of the newly developed tool MedISys-FF was high (75%). The most reported fraudulent commodities in the media were i) meat, ii) seafood, iii) milk and iv) alcohol. The analysis of these articles can facilitate the development of control measures to protect the food supply chain. Furthermore, the comparison performed in this study revealed a lack of consistency in the terminology between the data sources studied (MedISys-FF, RASFF, EMA and HorizonScan). For instance, there is no clear definition or presentation of food fraud types or product categories in EMA and RASFF. This type of information is essential to detect fraud, to improve control policies and programmes, and to evaluate any actions taken.
The results show that the media-based food fraud filter adds to the current systems in place and that it may be a useful input source for quality managers and food safety authorities to inform their control programmes.

Acknowledgement
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 613688 (FoodIntegrity project) and from the Dutch Ministry of Economic Affairs (KB-15). The authors would like to thank Mr Jordina Farrus Gubern (Fera Science Ltd, UK) for extracting the data for this study from HorizonScan.