Social sensing of flood impacts in India: A case study of Kerala 2018



Introduction
Flooding is a major global hazard. Floods are expected to increase in frequency and severity as a consequence of climate change, as well as urbanisation and land use change, especially in the developing world [1,2]. In this paper, we examine the Kerala flood event of 2018. The flood is generally agreed to have peaked on 16th August 2018, but heavy rains before this date caused floods and landslides in hilly areas, while floodwaters did not recede until many days later [3]. A major relief effort was launched. After the event, the Rebuild Kerala database was put together by the Government of Kerala to manage the distribution of financial aid related to property damage [46]. Reports were collected by citizen surveys, validated by local officials, and then either approved or rejected for aid by the state government. More details on these data sources, and how the data from each was processed, are given below.
The experiment performed in this study is a triangulation between the different data sources. While accurate "ground truth" information may be constructed for the extent of floodwaters [20], it is very difficult to collate accurate ground truth information for the social impacts of flooding. A disastrous flood event such as Kerala 2018 affected many people in many ways, with each individual having their own story. As such, the true social and economic impacts of the flood may never be known completely. Sources of economic information such as the financial losses and compensation, property damage, damage to agricultural land, and disruption to economic activities and livelihoods, are diverse and difficult to unify into a single metric. Measurement of social impacts beyond the tragic loss of life is also hard to perform; the level of distress, loss of health and wellbeing, and impacts on families and communities are almost impossible to quantify. Therefore our study does not seek to use the social sensing approach to accurately recover some known ground truth observation. Instead, it compares and contrasts the view of the flood impacts that are offered by several different sources: two social media platforms, a citizen relief app, and a governmental database. The rigorous analysis of Telegram in this study is novel, providing clear evidence that the platform can and should be used for social sensing. Additionally, social sensing studies are rarely validated to this depth. This systematic comparison of four very different data sources is a unique approach for investigating and validating flood impacts and social sensing itself. Each of the sources might be expected to reveal a part of the overall situation during the flood event. Their agreement in when and where flood impacts are observed, and the nature of the information they contain, will help to evaluate the potential for such data to be used in future flood events to support decision-making and planning.
The paper is structured as follows. Section 2 covers the Data Collection & Methodology by which information was obtained from each source. This is followed by Results in section 3, covering the findings from the study, before the Discussion in section 4 highlights some of the main implications and limitations of the work.

Data Collection & Methodology
Here we explore the potential for detecting the impacts of the Kerala floods using publicly accessible data from Telegram and Twitter. Data from both platforms is collected and filtered for relevance, then used to create time series plots and spatial maps of flood-related activity. These are compared to similar figures based on public submissions to the Kerala Rescue app, a citizen-generated web service set up during the flood event to allow affected people to submit requests for help [47]. A further comparison is made to damage data collected after the event by the government-sponsored Rebuild Kerala initiative [46]. While it is known that WhatsApp was widely used during the event to coordinate rescue and relief efforts [48], we are not able to study WhatsApp data due to privacy restrictions on the platform. As well as considering the ability of Telegram and Twitter data to accurately map the social impacts of the flood event, we also examine content from each platform as a source of data about what kinds of impact were experienced.

Telegram
Telegram is a social media messaging platform which, like many other messaging services, allows for end-to-end encrypted conversations and groups. The feature that differentiates Telegram from its competition is public group chats called supergroups. With up to 200,000 members [40], supergroups promote conversation within local communities [49]. For this work, conversations from the supergroup 'KeralaGram' were downloaded using the inbuilt 'Export Chat' function in the Telegram desktop application. Messages were received in JSON format, and shared media was received in its original format (i.e. MP3 for audio and JPEG for images). With around 15,000 members, 'KeralaGram' is the largest public community discussion group in Kerala. The data used within this study is publicly accessible, with all processing and analysis following the Telegram terms of service.
During the flood period, a rise in the number of posts sharing media was detected, with a 450% increase in shared images compared to the same date range in the previous month. These images consisted largely of flood photos, alongside screenshots of warnings and messages from other platforms. To utilise this increase in media, basic computer vision was applied: image color and contrast were automatically adjusted to optimise the extraction of English and Malayalam text using the OCR library Tesseract [50]. Next, the language detection Python library "langdetect" was applied to the messages [51]. While this package occasionally misclassifies text, its purpose here is to provide a rough overview of the languages present. With only 28% of the dataset detected as English, machine translation was needed. To translate the remaining text, the XLSX document translation feature of Google Translate was used. Bots are often found in social media data; however, due to the presence of admins and strict rules within this chat, the post-translation conversation was highly regulated and did not require bot filtering. Additionally, after location inference (see section 2.5), relevance filtering was not required, as the flood caused a large shift in messaging content and structure, with 98% of located messages being relevant (see Fig. 1). A post was only considered relevant if it directly discussed the 2018 Kerala flood, for instance requesting help or sharing information about the event. If there was any ambiguity, the entry was classified as irrelevant. A sample filtered Telegram message is shown in Table 1. The final dataset contained 13,614 messages and 1188 images containing text between 1st August and 23rd August.

Twitter
Twitter is a social media microblogging platform, where users produce character-restricted messages called tweets. Tweets are stored in JSON format and can be collected for free in real time or purchased retrospectively from Twitter's Historical PowerTrack service. For this work, the PowerTrack service was used to obtain all tweets identified by Twitter as originating from India (that is, having an 'IN' country code in their location metadata) between 1st August and 23rd August. This collection contained 1,363,659 tweets. All tweets in this collection contain either exact tweet coordinates from GPS-enabled devices or a user-defined tweet 'place' attribute, which Twitter uses to designate the country code. The following analysis and investigation is consistent with the Twitter terms of service.
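For illustration, the country code can be read directly from a tweet's JSON metadata; the sketch below follows the standard Twitter v1.1 tweet object, the sample tweet is fabricated, and the check is simplified in that it ignores the exact-coordinates path:

```python
import json

def is_from_india(tweet):
    """True if the tweet's user-selected 'place' attribute carries the
    'IN' country code (simplified: exact GPS coordinates are ignored)."""
    place = tweet.get("place") or {}
    return place.get("country_code") == "IN"

# Fabricated example showing the relevant slice of a v1.1 tweet object
sample = json.loads(
    '{"id": 1, "lang": "en", "text": "Rain all day",'
    ' "place": {"country_code": "IN", "full_name": "Kochi, India"}}'
)
```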
The first filtering step was extracting English-language tweets. While machine translation of tweets in other local languages is possible, it is slow to perform and therefore infeasible for the high volume of tweets in the dataset. However, the language attribute within the tweet JSON objects showed that English was the most common language in our dataset, accounting for 90% of tweets; therefore, simply extracting the English-tagged tweets was deemed sufficient. Location inference (see section 2.5) was used to locate and identify tweets that both concerned and originated from Kerala. This decreased the dataset to 24,414 tweets. Inspection of tweet volumes suggested that bot removal was not required, as no account had tweeted excessively (following the >1% rule successfully implemented by Arthur et al. [28]). The tweet collection was geographic rather than thematic, so to extract relevant tweets a list of keywords was produced, containing both common English-language flood terms and words frequently used in Kerala Rescue requests (Appendix A.1). If the tweet text did not overlap with this list (i.e. contained no words in the keyword list), the tweet was ignored. This resulted in 7097 flood-related tweets.
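The keyword relevance filter can be sketched as follows; the keyword set shown is a small illustrative subset, with the full list given in Appendix A.1:

```python
import re

# Illustrative subset of the flood keyword list (full list: Appendix A.1)
FLOOD_KEYWORDS = {
    "flood", "flooding", "rescue", "stranded", "relief",
    "landslide", "evacuate", "boats", "helicopter",
}

def is_flood_related(text):
    """Keep a tweet only if its text shares at least one word with the list."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & FLOOD_KEYWORDS)
```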
Next, manual inspection of the remaining tweets highlighted numerous duplicates, differing only in minor details such as the tagged user or a hashtag. The majority of these posts originated from automated and institutional Twitter accounts and did not include useful on-the-ground observations of flood impacts. To remove these, tweet texts were vectorised by word frequency using the scikit-learn package [52] in Python, and the cosine similarity between them was calculated. To decrease computation times, stop words and punctuation were removed from tweets before vectorisation. If any two vectors had a cosine similarity of 97% or above, they were deemed duplicates and only the tweet posted first was kept. This successfully removed obvious duplicates. However, large groups of messages that differed by only a few words remained. To remove these, if at least 10 vectors showed a cosine similarity above 90%, the corresponding messages were classified as duplicates and, once again, only the original tweet was kept in the dataset. While the 97% and 90% thresholds were heuristically chosen, manual inspection showed they were appropriate for removing unwanted messages. Removal of duplicates left 6936 tweets for further analysis. Manual inspection (using the same relevance criteria as section 2.1) of a randomly selected 20% sample of the remaining tweets showed 96% relevance, indicating that further relevance filtering was not required. A sample filtered tweet is shown in Table 1.
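The two-stage deduplication can be sketched as follows: a simplified, unoptimised version using scikit-learn, which assumes the input texts are already ordered by posting time so that the earliest post in each duplicate set is kept:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(texts, near_thresh=0.97, group_thresh=0.90, group_size=10):
    """Drop near-duplicate posts, keeping only the earliest in each set.
    `texts` is assumed to be ordered by posting time."""
    vectors = CountVectorizer(stop_words="english").fit_transform(texts)
    sim = cosine_similarity(vectors)
    removed = set()
    kept = []
    for i in range(len(texts)):
        if i in removed:
            continue
        kept.append(i)
        # pass 1: pairwise near-duplicates at the 97% threshold
        removed.update(j for j in range(i + 1, len(texts))
                       if sim[i, j] >= near_thresh)
        # pass 2: collapse large groups (>= 10) of highly similar posts
        group = [j for j in range(i + 1, len(texts))
                 if sim[i, j] >= group_thresh]
        if len(group) >= group_size:
            removed.update(group)
    return [texts[i] for i in kept]
```

A quadratic pairwise comparison such as this is feasible at the scale of tens of thousands of tweets, though approximate nearest-neighbour methods would be needed for much larger collections.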

Kerala Rescue
Kerala Rescue is an emergency request platform, allowing those impacted by the flood to submit a request for services including rescue, food, water, clothing, and medicine. It was created by a college student from Kerala during the early stages of the flood [47]. After 10 days, an additional 200 software developers were actively developing the site, with 50,000 registered Kerala volunteers responding to and validating the 45,000 requests [47]. The data was collected with permission from the developers in JSON format from a public Slack channel active during the floods. Since the data is potentially sensitive, data was anonymised prior to analysis and no identifiable results are displayed.
The Kerala Rescue database contains help requests formatted as records, each with 27 columns including location fields, rescue details, and the requester's contact number. The first processing step was the removal of duplicate entries. Due to the ambiguity between duplicate requests and follow-up requests, if two entries with the same contact number were posted within an hour of each other, and if either their location or requester attributes were duplicated, only the original entry was kept. This removed over 5000 entries, which upon manual inspection were correctly classified as duplicates. Location information was provided both as coordinate fields and as entries in a text field. To remove inaccurate location coordinates, if the 'coordinate_accuracy' attribute was over 1000 m, the corresponding coordinates were removed and other location data was favoured. The aforementioned language translation technique was applied to the text-form location field, before applying location inference to provide accurate coordinates for mapping (see below).
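The duplicate-removal rule can be sketched in pandas as follows; the column names ('timestamp', 'contact_number', 'location', 'requester') are illustrative stand-ins for the actual database fields:

```python
import pandas as pd

def drop_followup_duplicates(df):
    """Keep only the original entry when a later entry shares the same
    contact number within one hour AND duplicates the location or requester
    field. Column names are illustrative stand-ins for the real fields."""
    df = df.sort_values("timestamp").reset_index(drop=True)
    kept_idx = []
    for _, group in df.groupby("contact_number", sort=False):
        last_kept = None
        for idx, row in group.iterrows():
            if last_kept is not None:
                within_hour = (row["timestamp"] - last_kept["timestamp"]
                               <= pd.Timedelta(hours=1))
                same_field = (row["location"] == last_kept["location"]
                              or row["requester"] == last_kept["requester"])
                if within_hour and same_field:
                    continue  # follow-up duplicate: drop it
            kept_idx.append(idx)
            last_kept = row
    return df.loc[sorted(kept_idx)]
```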

Rebuild Kerala
The Rebuild Kerala Initiative (RKI) was set up by the Government of Kerala to assist in the rebuilding and recovery efforts after the flood event [46]. The RKI includes a diverse range of projects, including the development and promotion of the Rebuild Kerala Development Programme, the strategic roadmap for reconstruction across the state. Relevant to this study, RKI collected data by surveying affected citizens about property damage and made it publicly available as a database, with each report then verified by a local overseer. From this database, we extracted records for verified flood-damaged houses that were accepted for financial aid by the state government. This data is stored in a comma-separated format, with each row corresponding to one of the 1034 Kerala local administrative bodies (Municipalities, Corporations, and Gram Panchayats). Each record is the number of damaged properties approved for financial support within that local body. Manually assigning each location to a corresponding map polygon was the only preprocessing required. A few assumptions were necessary due to alternative spellings and duplicate place names in the data; however, these are unlikely to have an impact given the volume of the data. In the dataset, most local bodies had corresponding values in the 'Total', 'Verified', 'Approved' and 'Rejected' survey columns. For this work, the difference between the 'Verified' and 'Rejected' columns was taken as a conservative estimate of the true 'Approved' value, as many entries did not have a value for 'Approved'.

Table 1
Overview of Telegram and Twitter messages after filtering. Posts shown are fictitious but use similar language to the datasets.

Platform | Fictitious post using similar language to the dataset | Percentage of messages in English pre-translation
Telegram | "We have 50 students trapped in the Munnar engineering college. While we are safe, we need water and food urgently. Please call us on the number below." | 28%
Twitter | "@PMOIndia @CMOKerala, please help us to arrange rescue for over 5000 people stuck in Kuttanad, Alappuzha. We need rescue boats and a helicopter immediately." | 90%
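The conservative 'Approved' estimate described above can be sketched as follows (a minimal illustration; 'Approved', 'Verified' and 'Rejected' follow the RKI database column names):

```python
import pandas as pd

def conservative_approved(record):
    """Fall back to Verified minus Rejected when 'Approved' is missing
    (column names follow the RKI damage database)."""
    if pd.notna(record.get("Approved")):
        return record["Approved"]
    return record["Verified"] - record["Rejected"]
```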

Location inference
To find on-the-ground observations rather than long-distance "news" posts, the data was geolocated, keeping only the data located in Kerala. Location inference is an automated procedure that identifies place names in a post and then assigns a map polygon (shapefile) to each location. To perform this location inference, a package developed by Arthur et al. [28] was used, which is based on the work of Schulz et al. [53]. This involves cross-referencing words within a tweet/message/request against various gazetteers (GADM, DBpedia and Geonames [54][55][56]), before inferring the most likely location. Previous work has found this approach to provide highly accurate locations but, due to the lack of ground truth, this is difficult to quantify. Manual checks of outcomes were performed on an ad hoc basis to ensure validity throughout the process. Due to differences between the platforms, this final filtering step was applied as follows:

• Telegram: Location inference was applied to the text extracted from translated messages and images. If this approach was unsuccessful, and if a Kerala landline phone number was present in the message's 'phone' attribute, the location could be inferred using the area code as sourced from the Kerala Government website.
• Twitter: Despite the dataset being a geographic collection, manual inspection showed that the exact location discussed in the tweet text frequently did not correspond to the (often large and unspecific) bounding box within the 'place' field in the tweet metadata. While the cause of this is unknown, two potential explanations are incorrect entry of the 'place' attribute by the user, or tweets being posted on behalf of someone else, with the place tag reflecting the posting user's location rather than the location they are referring to. Therefore, location inference was applied to the tweet text, with a manual sample showing that this approach had a higher accuracy in determining the location referenced by the post than when the 'place' attribute was used. If the tweet had geotagged coordinates, and no location was inferred from the text, these coordinates were taken as the inferred location.
• Kerala Rescue: A similar approach to that used for Twitter was taken, with the user-entered 'location' field prioritised over the automated 'coordinate' field due to occasional coordinate inaccuracies. These inaccuracies arose primarily from requests made on behalf of those affected, with the coordinates corresponding to the individual sending the request and the 'location' field corresponding to the individual in need of help. If geolocating with this approach was unsuccessful, phone area codes were used as with Telegram posts.
• Rebuild Kerala: Location inference was not required, as the data was reliably located and verified.
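The Telegram fallback cascade (place names in the translated text first, then the landline area code) might be sketched as follows; both lookup callables and the 5-digit area-code prefix are hypothetical stand-ins for the gazetteer matcher and the Kerala area-code table:

```python
def infer_telegram_location(message, gazetteer_lookup, area_code_lookup):
    """Cascade for Telegram: place names in the translated text first, then
    a Kerala landline area code from the 'phone' attribute. Both lookup
    callables (and the prefix length) are hypothetical stand-ins."""
    for word in message.get("text", "").split():
        place = gazetteer_lookup(word.strip(".,!?"))
        if place is not None:
            return place
    phone = message.get("phone", "")
    if phone.startswith("0"):  # Indian landlines start with an STD area code
        return area_code_lookup(phone[:5])
    return None
```

In practice the gazetteer matcher of Arthur et al. [28] operates on multi-word place names and disambiguates between candidates; this single-word sketch only illustrates the fallback ordering.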
Within the administrative divisions of Kerala, there was ambiguity regarding duplicate place names, resulting in assumptions being made during location inference for the social platforms. For instance, there is a district, taluk (administrative subdivision) and city all called Thiruvananthapuram. As a general rule, unless the administrative level was stated, it was assumed that people were always being specific about their mentioned location (i.e. Thiruvananthapuram city in the previous example). The final filtered and geolocated data counts are shown in Table 2.

Characterising flood impacts
When a request was made through Kerala Rescue, the requester had the option to select 'True' or 'False' for several specific needs they may have had. The options were rescue, food, water, medicine, clothes, kitchen utensils and toiletries, each with a follow-up section for more information. For this investigation, the food and kitchen utensils columns were combined due to their similarity, as were the medicine and toiletries columns. To compare the specific needs of those using Kerala Rescue to the other platforms, all of the location-inferred Telegram messages and Twitter posts (tweets) were manually categorised. This categorisation removed messages that were not directly requesting help (i.e. messages providing information or offering help), before categorising the remaining data into five groups: rescue, food, water, medicine, and clothes. As Kerala Rescue users had the option to select multiple requests (e.g. food and medicine), the categorisation of tweets/telegrams was not limited to single classes. The data tended to be relatively easy to categorise, so a single human coder was employed to perform this task. To ensure validity, a sample of 50 labelled messages was independently categorised by 3 human coders, finding an almost perfect agreement (Fleiss's Kappa: 0.897, Krippendorff's Alpha: 0.865). This was seen as good evidence that the manual coding process was robust.

Results
Fig. 1 shows a time series for the volume of filtered and located messages from Telegram, Twitter, and Kerala Rescue, binned into 12-h intervals from 1st August until 23rd August. Kerala Rescue was not created until 12th August, so that platform only has data after that date. The data from Rebuild Kerala is not timestamped, so it is not plotted. There is a large difference in the volume of records from each platform, with Kerala Rescue having by far the highest number of relevant posts.
All time series follow a similar trend, with a sharp increase in messages/requests from 15th August, peaking around 18th August, before decreasing again. This is consistent with the general consensus that the rainfall peaked in severity around 16th August [3]. Note that the 12-h binning shows apparent decreases in activity during night time (the 00:00-12:00 bins), seen most sharply on 17th August.

Fig. 2 shows filtered and located posts from the four data sources as a choropleth map (heatmap) of Kerala. Data is grouped at both district and taluk level to capture administrative units relevant to the region. To reflect the uncertainty in location inference at the scale of taluks, a moving average filter was used to assign to each taluk the average value across itself and its neighbouring taluks within the same district. This was done for all four data sources. As there was only ambiguity between place names within individual districts, and not between multiple districts (as discussed in section 2.5), this filter was applied to each district separately. Data from the three timestamped platforms has been restricted to the main flood period between 12th and 23rd August. Since the counts between platforms vary considerably, values for each map were linearly scaled into the range [0,1] to create the color scale. Qualitatively, there are clear spatial similarities between the maps, with two main clusters of activity seen in the taluk-level comparison: the first around Kochi, and the second between Alappuzha and Kollam. There is visual agreement between these hotspots and the flooded areas identified using satellite imagery by Tiwari et al. [20]. The activity on Telegram and Twitter is more evenly distributed throughout Kerala, whereas on Kerala Rescue and Rebuild Kerala the activity hotspots are more tightly focused.

Fig. 3 quantitatively summarises the relationships between the variables plotted on the district and taluk maps in Fig. 2. Fig. 3 is a logarithmic scatter plot, with each subplot comparing the data from the corresponding column and row. Statistical analysis shows that all pairs of data sources have strong positive correlations at both taluk scale (Pearson's r > 0.69, p < 0.001) and district scale (Pearson's r > 0.76, p < 0.001). District-level correlations are stronger than taluk-level correlations. However, this was anticipated, as location inference is more reliable at the larger district scale, where small errors are less important and the data less noisy. To check whether these results are an artefactual consequence of population, with highly populated areas producing a stronger 'flood signal' due to larger user populations, we calculated correlations with population size. Appendix A.2 shows that observation counts for both taluks (Fig. 2) and districts (Fig. 3) are uncorrelated with population (p > 0.24).

Fig. 4 shows time series for help requests communicated on the timestamped platforms, alongside time-aggregated distributions of request types. Despite the distribution of needs differing between platforms, the timing of requests across platforms is similar. For instance, requests for rescue peak before requests for provisions on all platforms, with the exception of medicine requests through Twitter, which peak on the same day. From the bar charts, we can see that, excluding rescue requests, the distribution of request types on Twitter and Telegram is similar. On all platforms, clothing is the least requested and, excluding Telegram, rescue is the most requested. It is worth mentioning that both Telegram and Twitter contained two additional categories, namely requests for volunteers and links for financial aid. As these categories were both lower in volume than the others, and were not present in Kerala Rescue, they have been omitted from this analysis.
Rebuild Kerala has also been excluded, as it does not provide impact information beyond property damage. Table 3 shows a qualitative assessment of the strengths and weaknesses of each platform for measuring flood impacts. It should be noted that these platforms were not used independently of each other; for instance, messages from the same people were found on Telegram, Twitter and Kerala Rescue. Additionally, links to Kerala Rescue were widely distributed on Telegram and Twitter. Furthermore, these were not the only platforms used to report impacts of the Kerala 2018 floods, with others such as WhatsApp also being widely used.
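The within-district moving-average filter applied to the taluk-level maps can be sketched as follows; a minimal illustration in which the neighbour adjacency mapping is assumed to be precomputed from the taluk shapefiles:

```python
def smooth_counts(counts, neighbours):
    """Assign each taluk the mean of its own count and those of its
    neighbouring taluks within the same district. `neighbours` maps a
    taluk to its same-district neighbours (precomputed from shapefiles)."""
    smoothed = {}
    for taluk, value in counts.items():
        window = [value] + [counts[n] for n in neighbours.get(taluk, [])
                            if n in counts]
        smoothed[taluk] = sum(window) / len(window)
    return smoothed
```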

Discussion
In this paper, we have compared social media data from two platforms (Twitter and Telegram), as well as one citizen-produced web application (Kerala Rescue) and a governmental disaster relief survey (Rebuild Kerala). The aim was to map and characterise the social impacts of the major flood event in Kerala in 2018. Results show that there is good agreement between the outputs from social sensing using Telegram/Twitter, the semi-formal relief requests database of Kerala Rescue, and the formal property damage assessment performed by Rebuild Kerala. Data from Twitter and Telegram also showed similarity between the types of impact reported and the types of help request collated by Kerala Rescue. These findings build confidence that observation using unsolicited social media data can be an effective way to understand the effects of flooding. While Kerala Rescue was a platform rapidly created by volunteers specifically to coordinate requests for help during the flood, and might therefore be expected to perform well in this scenario, it is noteworthy that data derived from the general-use social media platforms Twitter and Telegram also show strong correlations with the structured observations provided by Rebuild Kerala. This study therefore adds to the evidence base (e.g. Arthur et al. [28,29]) demonstrating the potential for informal social media data to assist with flood response and impact assessment. Here the use of Telegram data offers additional novelty, since (to our knowledge) this platform has not previously been used for this purpose.
The Kerala Rescue dataset contains a large amount of highly relevant data about flood impacts, relative to the smaller (post-filtering) volumes and greater ambiguity of data available from Twitter and Telegram. However, Kerala Rescue was a one-off platform, created in the early days of the flood by volunteer software developers. This ad hoc and spontaneous aspect of Kerala Rescue may add to its authenticity as a source of on-the-ground observations, but also brings uncertainty, in that we do not know how widely it was adopted and by whom. By contrast, Twitter and Telegram both have established user populations in Kerala with an observable history of activity on those platforms. This means that while Kerala Rescue is very useful for situational awareness and community organisation during the event, it does not offer an effective "baseline" of activity against which observations can be situated. For example, it is hard to understand why more requests were made on Kerala Rescue in the Kollam region relative to activity on Twitter and Telegram. This could be due to Kerala Rescue genuinely capturing more impacts in that region, or could simply reflect greater adoption of the application in that area; without a historic baseline of platform usage in that area, it is hard to determine which. For citizen-generated apps such as Kerala Rescue there is no guarantee that the skilled volunteer workforce that created the app will be replicated in future floods or similar extreme weather events. However, Kerala Rescue shows that such applications can be effective both in coordinating relief efforts during the event and as a data source for post hoc analysis. While the social media platforms Twitter and Telegram generate fewer observations of flood impacts in the present case study, they offer a number of contrasting advantages. As referenced above, their established user populations, geographic coverage and generality permit the analysis of the 2018 Kerala flood event in a wider context.
This might allow the 2018 Kerala floods to be assessed against other flood or weather events at different times or places, or against other types of social disruption, with a reasonably consistent underlying study population. The use of a temporal baseline allows the severity of an event to be estimated based on the level of social media activity relative to the long-term expected level; cf. identification of storm events using a percentile threshold for social media activity [34]. The establishment of social media as a routine form of communication avoids any special effort to set up a platform when an event occurs since events are observed within the continual flow of communications. Another interesting feature of Twitter and Telegram is that they are relatively unstructured in the ways they support communication between users. They enable conversational interactions between pairs or groups of users, in which a greater depth of understanding can be achieved about an unfolding event. This more natural form of discussion may provide insights that more constrained platforms such as Kerala Rescue do not. The converse of this lack of structured communication is that a greater effort is required to derive robust observations. It is not feasible to manually analyse large datasets that may contain hundreds of thousands of messages, while natural language processing can be difficult to apply successfully for short-form messages such as social media posts.
In summary, all of the platforms studied here offer distinct opportunities but also unique drawbacks to resilience planners and flood responders. These are summarised in Table 3 above. Twitter and Telegram allow the potential for routine monitoring and can provide highly relevant impact information. Kerala Rescue was a crucial platform during the flood, generating flows of highly structured and highly relevant data. However, it was created out of necessity and the next flood, or floods in other areas, could easily use a different platform or coordinate through channels like Telegram or WhatsApp. Finally, Rebuild Kerala is probably the most reliable and systematic data collection, but is retrospective, much slower, and focuses on only a single impact type.
One important consideration for any systematic usage of social media or other citizen-generated data sources for flood monitoring and emergency response is that of sample bias. While Internet penetration and smartphone usage are rapidly increasing in Kerala, India and most other parts of the world, it is still more frequent amongst younger, more affluent and urbanised populations. Reliance on such data sources could result in an unintended bias towards their user populations and away from those without access to digital resources. Careful management of data and appropriate consideration of bias might help to alleviate some of the harmful effects of such "digital divides", but should be considered from the outset of any project to operationalise these methodologies.

Conclusion
This paper shows that social sensing via popular social media platforms like Twitter and Telegram, or bespoke platforms like Kerala Rescue, can be a useful way to gather validation data and study flood resilience in India. While social media (typically Twitter) has been studied previously to show that it can generate useful insights around natural disasters, here we have focused on the comparison of different platforms, including novel sources such as Telegram and Kerala Rescue, and validation against qualified governmental sources. We have shown that social media sources allow the rapid collection of data from people on the ground which is difficult to gather in other ways. A combination of channels could provide excellent coverage across a range of flood impacts: for example, routine monitoring of Twitter and Telegram for situational awareness and to pick up early impacts. Kerala Rescue or similar could then be used as a central place for the coordination of community action and relief. Post-disaster analysis of all data sources, including formal efforts like Rebuild Kerala, could then provide a detailed record of the range of impacts and enable the creation of better flood resilience planning and infrastructure. As the global penetration of the internet and social media increases, we believe this study's methodology is not geographically bound to Kerala and can be of real value in other flood-prone areas. Additionally, as the platforms are not specifically tailored to flood events, this study's approach could be applied to other extreme weather hazards.

Table 3
Qualitative comparison of data types.

Telegram (Social media: messaging)
Strengths:
• Data is available for both real-time and retrospective collection
• Conversations are often regional, i.e. from those directly impacted
• Low volume of unsolicited content/"fake news" in messages, as admins often regulate chats
Weaknesses:
• Group chats have a trade-off between volume and quality
• The conversational nature of messages makes it hard to extract robust measurements
• Users often delete their messages, leaving large gaps in conversations/data

Twitter (Social media: microblogging)
Strengths:
• Data can be collected for free in real time through the Twitter API
• Original posts are mostly independent, providing separate observations
• High volume of raw data
Weaknesses:
• Relevant data is mixed with larger volumes of irrelevant data, requiring multiple filtering steps
• Expensive to obtain tweets retrospectively without a research account
• The retweet function and trending topics can distort the signal, preventing balanced impact analysis

Appendix A.2
Correlation between district and taluk population and the flood signal from the four platforms (Figs. 2 and 3). The weak correlations are not statistically significant, showing that the level of observed flood-related activity is not explained by underlying variation in population density across taluks/districts.