NewsCompare -- a novel application for detecting news influence in a country

The concept of ``fake news'' has been referenced and thrown around in news reports so much in recent years that it has become a news topic in its own right. At its core, it poses a chilling question -- what do we do if our worldview is fundamentally wrong? Even if internally consistent, what if it does not match the real world? Are our beliefs justified, or could we become indoctrinated from living in a ``bubble''? If the latter is true, how could we even test the limits of said bubble from within its confines? We propose a new method to augment the process of identifying fake news, by speeding up and automating the more cumbersome and time-consuming tasks involved. Our application, NewsCompare, takes any list of target websites as input (news-related in our use case, but otherwise not restricted), visits them in parallel and retrieves any text content found within. Web pages are subsequently compared to each other, and similarities are tentatively pointed out. These results can be manually verified in order to determine which websites tend to draw inspiration from one another. The data gathered at every intermediate step can be queried and analyzed separately, and most notably we already use the set of hyperlinks to and from the various websites we encounter to paint a sort of ``map'' of that particular slice of the web. This map can then be cross-referenced to further strengthen the conclusion that a particular grouping of sites with strong links to each other, and posting similar content, are likely to share the same allegiance. We run our application on Romanian news websites and draw several interesting observations.


Introduction
Motivation The topic of fake news has been in the collective consciousness for some time now, due to its alleged impact on swaying public opinion on important issues, going so far as to potentially influence election results [3] in some cases. We find entire articles devoted to studying its impact [35], and methods of detection [10]. While some of the conclusions in these articles may be merely tentative, there are still some hard-to-dispute facts we can use as a basis. For instance, we know that more than two thirds of Americans report getting at least some of their news on social media according to a Pew Research study [15] from 2017. Worldwide, 48% of people surveyed reported believing a fake news story was real before finding out it was fake, according to an Ipsos report [32]. Interestingly, the same report finds that 63% of people are confident in their own ability to identify fake news, while only 41% are confident that the average person can do the same. Are people in general overly confident about themselves, or too cynical about others? Hard to say, but nevertheless an interesting idea to explore.
Enterprising research out there has already found insightful characteristics of fake news, with one paper going so far as to draw parallels between fake news and satire [28]. This could not be easily done without the appropriate technology to gather large quantities of data, and analyzing it in new and creative ways. Taken to its logical conclusion, such research could eventually lead to heuristic algorithms able to detect and filter out fake news, a monumental breakthrough in and of itself. While not being naive enough to ignore all the challenges (one could easily imagine the rise of an "arms race" between fake news manufacturers and detectors, akin to the current system of viruses and antiviruses), this is one idea that we have found immensely motivating in our quest to push the boundaries of what can currently be done.
Relatedly, examining how news sources disseminate their content, how this fits in to their respective ecosystem, and how they continuously adapt in order to keep up a working business model, are all intriguing subjects in their own right. We know from existing research that newspaper publishers are aggressively trying to expand into the digital realm, going as far as adopting a "digital first" approach, but the data shows they are still heavily reliant on print in terms of revenue [38]. Exclusively online news outlets on the other hand do not have the luxury of print to fall back on, so we expect them to make that much more of an effort in establishing a foothold in the online market to draw revenue from. This is actually supported by some of our findings, see Section 4.1 for a specific example.
Since the topic of fake news is a complex one, it can hardly be expected to be tackled end-to-end over the course of a single article. More research is always welcome, and our understanding of it can only deepen in proportion with the number of researchers shining a spotlight towards it. Of course, any new research should ideally be done in a non-partisan fashion so that new studies can present objective conclusions, which are less likely to be dismissed offhand (especially by laypeople) in an increasingly polarised world [22]. That being said, it may be hard to even know how to begin tackling this issue, considering the sheer amount of data out there that needs to be collected, stored and whittled down into manageable chunks, to fit the scope of various investigations. As such, we want to do our part in reducing this barrier to entry, to build upon the works of others and at the same time provide a stepping stone for other people coming up with innovative research ideas that would otherwise be difficult to implement on account of technical challenges.
Related work We find similar work already out there, albeit with a slightly different application and purpose. Of course, we are not the first to consider the potential of data analysis, or the usefulness of providing enthusiastic people who have investigative acumen with tools they could put to good use. Gray et al. [24] offer a particularly accessible guide aimed at journalists wishing to take charge and initiate their own data-heavy investigations. There are also repositories [13] dedicated to collecting large troves of documents and other data sets, opening them up to be analyzed by interested parties. What we try to offer is a slightly "meta" spin, by enabling investigations into the supposed investigative outlets themselves. Keeping tabs on the behaviour of entities tasked with shaping public opinion, either deliberately or unwittingly, should arguably rank fairly high as far as research topics go.
The issue of scraping social media data is explored in some detail by Marres and Weltevrede [37], who note that scraping is currently a prominent technique for the automated collection of online data, promising to offer new opportunities for digital social research. There is a fair amount of hype surrounding scraping as a herald of the coveted "revolution" in social research brought on by the advent of the Internet. What makes the technique special is allowing research to be done as an ongoing process, rather than a finished process. Of course, their application involved scraping just a handful of pages and charting very specific changes on said pages over time. Our application's current focus is a lot more generic, aiming to target a large number of distinct websites, and tries to avoid any kind of specialization that could prove restrictive for a general use case. Of course, future development can still be done to address various special cases with some minor tweaks.
The same article by Marres and Weltevrede [37] mentions a service used at the time, ScraperWiki [9], aiming to serve as a platform for developing and sharing scrapers. It has since been renamed to QuickCode, as it "isn't a wiki or just for scraping any more". ScraperWiki is mentioned a handful of times among the various works we have looked at in preparation for this article, but not so much since its rebranding as QuickCode. It is not entirely clear if the platform remains as accessible as it once was for the casual researcher at the time of writing. We could not find other similar platforms worth noting, therefore if web crawling/scraping research is indeed an underserved niche, our proposed solution should help plug that gap.
Other interesting research seeks to apply scraping to analysis with a more predictive application in mind. Lerman and Hogg [34] have tried to come up with a model that is able to predict future news popularity starting from a data set acquired by scraping entries on a popular social media platform. Their work is greatly helped by the particular structure of their chosen platform (i.e. digg.com), where it is possible to pick up on early user voting results on new entries, extrapolate from there and estimate future popularity based on them. This should be easy to replicate on other sites with similar voting systems (e.g. reddit.com), but a great deal more creativity is required to do something similar on a more generic set of websites. That is, unless we can distill our set of target websites to include only ones with a very well defined set of characteristics, or choose some other metric to apply statistical modeling to and derive predictive benefit from.
Yet another direction of research is sentiment analysis, as explored by Balahur and Steinberger [4] specifically for the use case of news articles. They employ the freely accessible Europe Media Monitor (EMM) family of applications [29], which at the time was retrieving between 80,000 and 100,000 articles per day in about 50 languages, scraping about 2,200 hand-selected online news sources and a few specialist websites (these numbers have increased in recent years). A fairly impressive data set, unless it happens that our target websites fall outside of these news sources, which is where our application fills in the gap by allowing any number of custom entries to scrape on a regular basis. We estimate that some fairly involved tweaks would be required to add a similar sort of functionality to the processing side of our application, but the website content as currently gathered by our scraper should already lend itself well to the task.
A more niche approach, coming from what looks like fledgling research from Vargiu and Urru [52], involves figuring out how to pick out the most relevant contextual ads, based on insight gleaned from scraping existing web pages. This does not necessarily apply solely to news sites, but it does give us an idea of at least one of the lucrative directions this kind of research can develop into. The amount of automation already out there in the advertising world should give us pause for thought, however. A solid business model right now could prove to be overinflated and unsustainable in the long term. According to a 2014 study by the Association of National Advertisers [39], bots now comprise an estimated 23% of all online video ad viewers, and 10% of all static display ads. Rushkoff presents an eloquent, yet grim (and possibly somewhat alarmist) view in his book [49] on the topic: Consider the irony: malware robots watch ads, monitored by automated tracking software that tailors each advertising message to suit the malbots' automated habits, in a human-free feedback loop of ever-narrowing "personalization". Nothing of value is created, but billions of dollars are made.
With that in mind, we should be far more interested in creating something of value, rather than chasing ephemeral gains.
Our results What we try to add to the existing body of work is effectively a new solution in the form of a fast, efficient, mostly automated application able to gather vast amounts of information about websites, in as generic a form as possible. Our aim is to have an information dump that is easy to compile, and that greatly simplifies the work of future researchers who need large sample sizes to interpret and derive conclusions from, according to various specific use cases. Some of these use case ideas have already been at least tentatively explored in articles mentioned in this introduction. We are confident that a good deal of research endeavors would have benefited from the kind of data dump we can now provide, and yet more research can benefit from it going forward.
We also put the application to the test on an individual use case to start with (i.e. Romanian news websites), to at least overcome the most glaringly obvious issues and challenges before releasing to the general public. A good deal of effort has been made in ensuring the application has more than just a niche appeal about it, and that it can be run reliably for long stretches without much manual interference. However, we also expect (and welcome) any constructive criticism and bug reports that get us closer to a flawless product. Despite not coming from a sociology background, we try our hand at interpreting the results we get from our use case, at least to the extent that we are aware of what characteristics to look for (see Section 4 for more details).
In the process of developing the app, some of the biggest hurdles that had to be handled were caused by the flaky and unpredictable nature of web content in general. By far, the element of human error involved in setting up websites seems to be the biggest source of issues with setting up this sort of automated solution. Simple typos can lead to cascading failures (sometimes in spectacular fashion) when improperly interpreted by our heuristic algorithms. These failures are typically only obvious when they get to the point where they manifest among a noticeable segment of our result set. As such, there is a wide range of special case handling baked into our application code. While probably not fully exhaustive, we can reasonably expect that scenarios that are yet to be discovered should not have a statistically significant impact on results.

Overview of the application
The back-end runs as an executable JAR file, so the machine running it needs to have a Java runtime installed (version 8 or later). We also need to set up a PostgreSQL database for it to use, which can be done by following the steps listed in the readme file on the GitHub project page [44]. Once started, it will automatically start crawling any sites listed in its database (this will be empty to start with). While crawling, it scans for new links to visit, and downloads the text content from every website visited to a local folder, where it will be indexed and processed to find similarities. More technically-inclined users will be able to connect to the back-end database directly to view real-time changes and make any low-level tweaks where it makes sense to do so.
The front-end is a JavaScript-based single-page app (SPA) serving a number of key functionalities:
• Listing all websites discovered by the web crawler
• Allowing specific websites to be toggled as special interest (Romanian news websites for our use case), causing them to be queried more often for snapshots (at minimum every 1 hour)
• Drawing a graph to visualize links to and from our special interest websites, allowing nodes to be added or removed in order to minimize clutter
• Listing instances where similar text content was detected on different websites
• Displaying various statistics about the web crawler's activity
Note that the front-end can only be accessed while the back-end is running. A fully featured implementation for our use case is available online at http://www.newscompare.tech [43] for demonstration purposes.

Technical details
All the code written for the application is freely available on GitHub [44,45] for anyone to examine or make use of, either as-is or by building upon it to suit some other purpose. In its current form, it should be accessible to most developers (particularly those coming from a Java background), by following the instructions listed in the "Readme" files. For reference, the desktop machine used for all development and testing work is running a 64-bit octa-core (16-thread) CPU with 64 GB of RAM, and an NVMe solid-state drive for storage. The application is not particularly memory-intensive, but due to its multithreaded nature it does benefit significantly from multiple cores and high speed CPUs. Depending on the number of websites targeted and the frequency of snapshots taken, available storage could start to become a concern. For just over 100 websites, each snapshot folder seems to add up to around 1 gigabyte, including website text content and generated indexes.
Note that the back-end project is the most important component, and will be referred to interchangeably as "NewsCompare" or "the application" throughout this article. It can be picked up from scratch, and used on its own just by tweaking configuration values and keeping an eye on logging output. The frontend is effectively just a convenient way of interacting with the data set and provides some visualization of the application results. A prebuilt version of the back-end component is available on GitHub [42], with all the default settings we used throughout our testing configured at compile time.
In Subsection 3.1 we try to give a comprehensive run through the complexities of developing a web crawling solution nearly from scratch, which may prove helpful to anyone interested in rolling their own implementation. Subsection 3.2 similarly deals with how we set up an inverted index [7] for our set of text documents in order to perform fast searches and comparisons between them. This whole section should be a good starting point for anyone trying to better understand our publicly available code, to either modify or improve it. We try to note various issues and improvements already tackled, and also lay out some potential quality-of-life improvements for the future.

Web crawling
According to the comprehensive primer by Olston and Najork [40], a web crawler (also known as a robot or a spider) is a system for the bulk downloading of web pages. Web crawlers are used for a variety of purposes, and the one we are most interested in here is an application of data mining, where we analyze web pages for statistical properties, and try to perform various data analytics. The web crawler starts off with a list of URLs to visit, otherwise known as seeds. This list can be quite small to begin with, as we expect it to grow exponentially. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, known as the URL frontier in some publications [36]. URLs from the frontier are recursively visited according to a set of policies. If the crawler is performing archiving of websites, as we do, it copies and saves the information contained within as it goes.
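The seed and frontier mechanics described above can be illustrated with a minimal sketch. It is not the application's actual code: the class name `FrontierSketch`, the `maxPages` limit and the in-memory map standing in for real page downloads are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative frontier loop; page fetching is replaced by a lookup in an in-memory map. */
final class FrontierSketch {

    /** Visits URLs breadth-first from the seeds and returns them in visiting order. */
    static Set<String> crawl(List<String> seeds, Map<String, List<String>> linksByPage, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(seeds);
        Set<String> visited = new LinkedHashSet<>();
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already visited through another link
            }
            // A real crawler would download the page here and extract its hyperlinks.
            for (String link : linksByPage.getOrDefault(url, Collections.<String>emptyList())) {
                if (!visited.contains(link)) {
                    frontier.add(link); // newly discovered URL joins the frontier
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("a.example", Arrays.asList("b.example", "c.example"));
        web.put("b.example", Arrays.asList("a.example", "d.example"));
        System.out.println(crawl(Arrays.asList("a.example"), web, 10));
    }
}
```

The `visited` set is what keeps the loop from revisiting pages, while the `maxPages` cap is one simple example of a crawl policy.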
In our particular case, we try to visit every hyperlink at least once, but place a much higher emphasis on a manually curated list of websites where we want frequent snapshots saved. How often we are able to take these snapshots largely depends on how quickly we can run through this list on every iteration. As long as we keep it relatively short, and only visit a small set of websites on every iteration, we can afford to schedule our crawler to run fairly often. This in turn allows taking frequent snapshots, which are useful for record-keeping or auditing purposes. The main graphs and reports that we generate should typically be based on the most recent snapshot, unless a specific comparison between snapshots is otherwise required.
One of the immediately useful side effects of web crawling is that we automatically get to compile a list of directional links between the websites we start off with, and the ones discovered along the way. This allows us to effectively map a limited section of the visible web, and visualize it as a directed graph, with the websites serving as nodes and links as directed edges. This can serve as a basic sanity check on whether our results look valid and useful, but can also lead to basic conclusions in their own right (provided we have some interest in web architecture to begin with). Having readily available access to basic graph details, like node degree and connectivity allows us to see how our results line up with existing research, and potentially put it to the test. See section 4 for some actual examples of insight derived from our particular set of target websites.
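As a small illustration of the graph details mentioned above, the following sketch derives the out-degree and in-degree of every website from a list of directed links. The class name `LinkGraphSketch` and the representation of a link as a two-element `[source, target]` array are assumptions for this example, not the application's data model.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative degree computation over directed links between websites. */
final class LinkGraphSketch {

    /** Returns, per site, an array holding {out-degree, in-degree}. Each link is a [source, target] pair. */
    static Map<String, int[]> degrees(List<String[]> links) {
        Map<String, int[]> result = new TreeMap<>();
        for (String[] link : links) {
            result.computeIfAbsent(link[0], site -> new int[2])[0]++; // outgoing link of the source
            result.computeIfAbsent(link[1], site -> new int[2])[1]++; // incoming link of the target
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> links = Arrays.asList(
                new String[] {"a.ro", "b.ro"},
                new String[] {"a.ro", "c.ro"},
                new String[] {"b.ro", "c.ro"});
        for (Map.Entry<String, int[]> entry : degrees(links).entrySet()) {
            System.out.println(entry.getKey() + " out=" + entry.getValue()[0] + " in=" + entry.getValue()[1]);
        }
    }
}
```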

Challenges
Successive requests to the same server can lead to getting blacklisted or banned if the time between requests is too short. Should this occur on some websites (and slip by unnoticed), it could potentially skew our result set. Olston and Najork put it very succinctly in their survey on the science and practice of web crawling [40]: Crawlers should be "good citizens" of the web, i.e., not impose too much of a burden on the web sites they crawl. In fact, without the right safety mechanisms a high-throughput crawler can inadvertently carry out a denial-of-service attack.
A naive implementation of a web crawler might overlook this (or a malicious actor could ignore it entirely), but it stands to reason that most servers will rightfully seek to defend themselves against perceived acts of aggression, at least to the extent of limiting damage and maintaining high uptime. Any behavior that would not realistically be carried out by a human could raise red flags, causing web servers to start dropping requests. Based on our own empirical observation, imposing a mandatory delay of 100-200ms between successive requests seems to yield good results.
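One possible way to enforce such a delay is to track, per host, the earliest moment at which the next request may be sent. The sketch below is only an illustration under that assumption; `PolitenessSketch` and its `reserve` method are hypothetical names, and the current time is passed in explicitly so the logic can be checked without sleeping.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative per-host politeness delay; times are plain millisecond values. */
final class PolitenessSketch {
    private final long minDelayMillis;
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessSketch(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /** Reserves the next request slot for the host and returns how long the caller should wait first. */
    synchronized long reserve(String host, long nowMillis) {
        long earliest = Math.max(nowMillis, nextAllowed.getOrDefault(host, nowMillis));
        nextAllowed.put(host, earliest + minDelayMillis);
        return earliest - nowMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        PolitenessSketch limiter = new PolitenessSketch(150);
        for (int i = 0; i < 3; i++) {
            long wait = limiter.reserve("example.ro", System.currentTimeMillis());
            Thread.sleep(wait); // a real crawler would send its request after this pause
            System.out.println("request " + i + " sent after waiting " + wait + " ms");
        }
    }
}
```

Requests to different hosts do not delay each other in this scheme, which matches the observation that only successive requests to the same server are problematic.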
URL normalization [6] is an important requirement, at least to the degree where we are satisfied that it covers our target websites properly. Any shortcomings in this area could lead to visiting semantically equivalent URLs, resulting in wasted effort and potential over-representation in our result set.
From the existing research, we adapt a few interesting ideas from a proposed algorithm [5] for a systematic and robust method of URL normalization; however, in the final implementation we rely largely on simple string manipulation using regular expressions. Some of the more notable steps we employ are:
• Converting the scheme and host to lower case
• Removing the default port (e.g. 80 for http)
Filtering out non-web content helps keep our data set smaller and more focused. We employ several basic methods here, based on string pattern matching in website URLs. The links we filter out here are not web pages, so we know from the start they are unlikely to provide useful information in keeping the crawling process going.
"Crawler traps" can be a significant time drain if unnoticed for extended periods. As Olston and Najork mention [40], there exist "websites that populate a large, possibly infinite URL space on that site with mechanically generated content". The example they give is that of a web-based calendaring tool, where each month has its own page and a hyperlink to the next and previous months.
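To make the two URL normalization steps listed above concrete, here is a minimal sketch. Note that it relies on `java.net.URI` for parsing rather than on the regular expressions used by the application, and that the class name `UrlNormalizerSketch` and the handling of port 443 for https are assumptions added for illustration.

```java
import java.net.URI;
import java.util.Locale;

/** Illustrative URL normalization: lower-case scheme and host, drop the default port. */
final class UrlNormalizerSketch {

    /** Path, query and fragment are kept as-is; user info is not handled in this sketch. */
    static String normalize(String url) {
        URI uri = URI.create(url.trim());
        if (uri.getScheme() == null || uri.getHost() == null) {
            return url.trim(); // relative or opaque URI: nothing to normalize here
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort(); // -1 when no port is given
        boolean isDefaultPort = (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);

        StringBuilder result = new StringBuilder(scheme).append("://").append(host);
        if (port != -1 && !isDefaultPort) {
            result.append(':').append(port);
        }
        if (uri.getRawPath() != null) {
            result.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            result.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            result.append('#').append(uri.getRawFragment());
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://Example.COM:80/News?id=1"));
    }
}
```

With this sketch, `HTTP://Example.COM:80/News?id=1` and `http://example.com/News?id=1` map to the same string, so the crawler would treat them as a single URL.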
For our given set of websites, the biggest danger we notice is that of websites linking to various external indexing services. For instance, a website could link to its own entry on archive.org, which sends our crawler down an unfeasibly long chain of links that do not really improve our result set if followed. Since we are not directly interested in travelling the entire breadth of other existing indexing services or aggregators, we have started to maintain a list of exclusions for the crawler to avoid. See Subsection 3.1.3 for an idea on improving this process.
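A link filter combining pattern matching on URLs with a list of excluded domains could look roughly like the sketch below. The file extensions and the contents of the exclusion list are hypothetical placeholders (only archive.org is taken from the example above); the patterns actually used by the application may differ.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Illustrative link filter: skips non-web content and excluded domains. */
final class LinkFilterSketch {
    // Hypothetical extension list; the application's real patterns are not given in the text.
    private static final Pattern NON_WEB =
            Pattern.compile(".*\\.(jpe?g|png|gif|pdf|zip|mp3|mp4)$", Pattern.CASE_INSENSITIVE);
    // Hypothetical exclusion list, seeded with the archive.org example.
    private static final Set<String> EXCLUDED_DOMAINS = new HashSet<>(Arrays.asList("archive.org"));

    static boolean shouldVisit(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return false; // malformed link, e.g. caused by a typo in the page markup
        }
        if (uri.getHost() == null) {
            return false;
        }
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        for (String excluded : EXCLUDED_DOMAINS) {
            if (host.equals(excluded) || host.endsWith("." + excluded)) {
                return false; // the domain itself or one of its subdomains
            }
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        return !NON_WEB.matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("https://example.ro/stiri/articol"));
        System.out.println(shouldVisit("https://example.ro/img/photo.JPG"));
        System.out.println(shouldVisit("https://web.archive.org/web/2020/"));
    }
}
```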
Thread-safe methods for reading and writing to data storage are required, since multiple crawler threads share a single, traditional SQL database. In this case, using a transaction isolation level of "repeatable read" in PostgreSQL [47] appears to be enough to ensure data integrity with only a moderate slowdown.

Optimization
Parallelization is something generally well-suited to web crawling activities. Intuitively, we should be able to fetch content for most websites independently of each other, so the work can be done on separate threads. Some experimentation may be required with the number of threads assigned to each individual task in order to get the best throughput.
Java's parallel streams [12] are a handy tool for quickly implementing parallel processing. Where other, more low-level, solutions would require us to handle the dividing of a problem into subproblems, then combining the results of the solutions ourselves, the Java runtime does this largely automatically. While it does not automatically guarantee operations perform faster (in some cases, quite the opposite, due to overhead), it makes it quite easy to make small code tweaks and find an optimal solution through trial and error. If we already follow the functional programming paradigm, it could mean something as simple as replacing calls to stream() with parallelStream(), configuring the thread pool size, and running performance tests. See Table 1.
When crawling deeper than surface level (i.e. more than just the website home page), we need to prevent our parallel tasks from inadvertently making too many simultaneous requests to the same server, as we mention in Subsection 3.1.1. Our approach here is to divide the entire set of target web pages we wish to crawl during a regular run (typically around 10,000-20,000 for our 334 target websites) into "slices", with each slice containing at most one page from the same parent domain. These slices can be visited entirely sequentially, which is nice and safe (but also slow), or we can try to find a way to have them run in a partially overlapping fashion, which would be faster but also place us at a slightly higher risk of getting blacklisted for excessive requests. In our particular case, we settle on configuring each thread to run at a slightly different delay (in 100 ms increments), which does not seem to impact the rate of successful requests, and the speed improvement is more than threefold for our use case.
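The slicing idea can be sketched as follows. This is an illustration rather than the application's code: it uses the URL host as a stand-in for the parent domain, visits slices strictly one after another (without the staggered per-thread delays described above), and the names `SliceSketch` and `slice` are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative slicing of pages so that no slice contains two pages of the same host. */
final class SliceSketch {

    static List<List<String>> slice(List<String> pageUrls) {
        Map<String, Integer> pagesSeenPerHost = new HashMap<>();
        List<List<String>> slices = new ArrayList<>();
        for (String url : pageUrls) {
            String host = URI.create(url).getHost();
            // The k-th page of a given host always lands in the k-th slice.
            int index = pagesSeenPerHost.merge(host, 1, Integer::sum) - 1;
            while (slices.size() <= index) {
                slices.add(new ArrayList<>());
            }
            slices.get(index).add(url);
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "https://a.ro/1", "https://a.ro/2", "https://b.ro/1", "https://a.ro/3");
        for (List<String> slice : slice(pages)) {
            // Pages in one slice belong to different hosts, so fetching them in parallel
            // never sends two simultaneous requests to the same server.
            slice.parallelStream().forEach(url -> System.out.println("fetch " + url));
        }
    }
}
```

Because parallel streams run on the common fork-join pool by default, the number of worker threads would still need to be tuned separately when experimenting with throughput.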

Separating tasks, i.e. having the raw data retrieval separate from the actual processing, should increase overall throughput. Since network speed and latency can vary wildly by server, hitting a particularly slow website might otherwise bottleneck the entire process.
Depending on the particular use case, we can wait until our targeted websites are fully retrieved before starting the processing, or we can run the two tasks roughly in parallel, with the processing lagging slightly behind. Some factors influencing this decision include whether we want to extract useful information from partial results, or if we think we could squeeze some extra performance and have the CPU cycles to spare for it (i.e. if fetching websites is not already keeping us at 100% load, or close to it).
Batch processing is one of the less obvious points, since in typical work loads with small sample sizes the performance impact is negligible. But something like building a large array of objects that are saved in a single call to the database, instead of saving each one individually gives us a massive speed advantage, particularly in the context of multiple threads seeking concurrent database access.
Timeout periods should be configured to a sensible value, to prevent having threads locked up in useless waiting periods for longer than is absolutely necessary. This value can be arrived at through trial and error, by keeping track of the number of successful server responses over multiple trial runs. We expect this number to increase along with the timeout period, but we should see the improvement rate drop off sharply after a certain point. It is precisely this point of diminishing returns that gives us the best trade-off between results and performance.
Table 3: Effects of timeout value on crawler speed and success rate
As we expect, the largest timeout values correlate with the largest number of successful requests, but the improvement rate is marginal at best. The difference in percentage points is insignificant enough that it may be explained away by random chance (or possibly, random background CPU usage on the test machine at the time). As such, we are comfortable in reducing the timeout to the bare minimum for our purposes.

Future improvements
Automating "crawler trap" detection is a good starting step for improving the robustness of the application, allowing it to run independently with greater confidence. Since there is no real way to predict how often this kind of issue can surface, the way we mitigate it currently is by keeping an eye on the application's log output on a regular basis. We need to manually add any newly discovered "trap" to our list of excluded domains, so this is somewhat time-consuming. As soon as we find the trade-off acceptable, we can look at implementing a heuristic algorithm limiting the crawler's traversal of a particular domain past a specific threshold, making the entire process more automated.
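One possible shape for such a heuristic is a simple per-domain page budget, sketched below. The class name `TrapGuardSketch` and the idea of a fixed threshold per run are assumptions for illustration; a suitable threshold would have to be found experimentally.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative crawler-trap guard: caps the number of pages visited per domain in one run. */
final class TrapGuardSketch {
    private final int maxPagesPerDomain;
    private final ConcurrentHashMap<String, AtomicInteger> visits = new ConcurrentHashMap<>();

    TrapGuardSketch(int maxPagesPerDomain) {
        this.maxPagesPerDomain = maxPagesPerDomain;
    }

    /** Counts an intended visit and returns false once the domain has used up its budget. */
    boolean tryVisit(String domain) {
        int count = visits.computeIfAbsent(domain, d -> new AtomicInteger()).incrementAndGet();
        return count <= maxPagesPerDomain;
    }

    public static void main(String[] args) {
        TrapGuardSketch guard = new TrapGuardSketch(2);
        for (int i = 1; i <= 3; i++) {
            System.out.println("visit " + i + " allowed: " + guard.tryVisit("calendar.example"));
        }
    }
}
```

The thread-safe map and counter mean the guard can be shared between crawler threads without extra locking.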
Smarter timeouts would improve general crawling speed, especially over long stretches of time, but the impact can vary between marginal and significant depending on the set of websites targeted. By keeping track of each website's recently failed requests, we can place servers that seem to be unreachable (temporarily or otherwise) on cooldown, querying them less often and reducing the amount of time wasted waiting on timeouts overall.
The cooldown value can be set to an arbitrary value to start with, but ideally should be arrived at after some experimentation. We do not want to unwittingly restrict certain websites from our data set too harshly and risk skewing our conclusions. However, since any websites affected by this optimization are unresponsive to begin with, the risk of this should be fairly low.
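A minimal sketch of such a cooldown mechanism is shown below, assuming a host is put on cooldown after a fixed number of consecutive failures. The class name `CooldownSketch`, the failure threshold and the explicit time parameter are all illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative cooldown tracker for hosts that repeatedly fail to respond. */
final class CooldownSketch {
    private final int failureThreshold;
    private final long cooldownMillis;
    private final Map<String, Integer> consecutiveFailures = new HashMap<>();
    private final Map<String, Long> blockedUntil = new HashMap<>();

    CooldownSketch(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    /** Records the outcome of a request; enough consecutive failures start a cooldown. */
    synchronized void recordResult(String host, boolean success, long nowMillis) {
        if (success) {
            consecutiveFailures.remove(host);
            blockedUntil.remove(host);
            return;
        }
        int failures = consecutiveFailures.merge(host, 1, Integer::sum);
        if (failures >= failureThreshold) {
            blockedUntil.put(host, nowMillis + cooldownMillis);
        }
    }

    /** Returns true if the host is not on cooldown at the given time. */
    synchronized boolean isAllowed(String host, long nowMillis) {
        Long until = blockedUntil.get(host);
        return until == null || nowMillis >= until;
    }

    public static void main(String[] args) {
        CooldownSketch cooldown = new CooldownSketch(2, 60_000);
        long now = System.currentTimeMillis();
        cooldown.recordResult("slow.example", false, now);
        cooldown.recordResult("slow.example", false, now);
        System.out.println("allowed right away: " + cooldown.isAllowed("slow.example", now));
    }
}
```

A single successful response clears the host's record, so a server that recovers is queried at the normal rate again.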
Database replication adds a fair degree of complexity to the entire architecture, but at the same time provides a small-to-moderate boost in performance, by separating the application's responsibilities among multiple databases hosted on potentially multiple servers (or virtual machines). We expose a good number of API endpoints, some for displaying various statistics on the crawler's progress and results, others to provide a better visualization on our data set (or sections thereof). The SQL queries involved in retrieving this data take up to several seconds to run in some cases, largely due to the volume of data involved, and this constitutes extra load on our current "single point of failure" database.
PostgreSQL provides a very powerful solution in this regard [48], allowing us to do near real-time streaming replication of data from our master database to a standby one. The former can keep handling all the "heavy lifting" required by the web crawler, while the latter is used as a read-only source for reporting purposes. We also get the added bonus that the standby database can be automatically promoted at any time to master status, should the original master suffer an unrecoverable failure, significantly improving uptime and reliability. We do not include an implementation of database replication in the current version of the app, as it would further complicate the setup process for anyone seeking to reproduce (or build upon) our findings. It is however worth mentioning, in the interest of laying out the various pros and cons for interested parties.

Text comparison
Once we get the process of acquiring a large data set of website content out of the way, the next step is to do more in-depth processing and extract more valuable information out of it. What we want to do is implement a kind of plagiarism detector to point out the more glaring similarities between articles on different websites. From the outset, it is clear this can turn into a time and resource-intensive task, and we need to be somewhat clever in order to avoid exponential complexity spiraling out of control and rendering the whole thing unfeasible.

Simple approaches
Without even delving into algorithms, it should make intuitive sense that most naive implementations would require too much processing to scale well, and that at least some minimal research is required to avoid wasting time reinventing the wheel. For instance, a brute-force method of comparing 100 pieces of text pairwise would require 4950 separate comparisons (n(n-1)/2 where n = 100). Any optimization we implement along the way to reduce the number of comparisons performed can have a significant impact on the overall time. The particular algorithm we implement for performing the comparison is also crucial, to the extent that we can find one to process chunks of website text at least as fast as they are coming in from the web crawler side.
Computing a kind of string similarity coefficient based on Levenshtein distance [8] (i.e. finding the smallest number of insertions, deletions, and substitutions required to change one string or tree into another) potentially gets us the results we are interested in, but is still very much a brute force approach. The most obvious shortcoming is that we are effectively doing the same work over and over by processing each string from scratch on every comparison. The first big improvement would be to introduce an initial, preparatory step of distilling strings into their base components for easier comparison later on.
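As a point of reference, a similarity coefficient of this kind can be sketched in Python as follows. This is a minimal illustration of the textbook dynamic-programming algorithm, not code taken from the application; in particular, normalizing the distance by the length of the longer string is an assumption, since the exact coefficient is not specified here.

```python
def levenshtein(a: str, b: str) -> int:
    """Smallest number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(round(similarity("kitten", "sitting"), 3))
```

Each call costs time proportional to the product of the two string lengths, and nothing computed for one pair can be reused for the next, which is exactly the repeated work described above.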

More advanced approaches
Donald Knuth gives a very well-written primer in his famous book The Art of Computer Programming [33] on how inverted indexes are used to set up fast searching through text strings. To put it succinctly, we set up our index by making a list of unique terms in each individual block of text, and keep track of where the term is located within the text. From here, we can boil down every word to its most basic form (e.g. plural to singular, conjugated verb to infinitive form etc.) to reduce the size of our list of terms while improving representation. Additionally, we can filter out so-called "stop words", which are the most frequent and almost useless words (e.g. "a", "I", "the" for English), further lowering the noise in our search results.
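The idea can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the data structure used by the application: the stop word list is a tiny hypothetical sample, and `normalize` only strips a trailing "s" as a crude stand-in for proper stemming.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop word list for English.
STOP_WORDS = {"a", "i", "the", "and", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(token):
    """Crude stand-in for stemming: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def build_inverted_index(documents):
    """Map each term to {doc_id: [positions of the term within that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, token in enumerate(tokenize(text)):
            if token in STOP_WORDS:
                continue
            index[normalize(token)][doc_id].append(position)
    return index


documents = {
    "a": "The minister signs the new laws",
    "b": "New law signed by a minister",
}
index = build_inverted_index(documents)
print(dict(index["law"]))
```

Looking up a term now takes a single dictionary access instead of a scan over every document, and the stored positions keep enough information to support phrase-style queries later on.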
Luckily, we are able to avoid much of the complexity of implementing our own inverted index solution by co-opting the open-source project Apache Lucene [19] into the application. It comes with a wide array of language analyzers (including Romanian), making it suitable both for our particular use case and improving the odds of our application becoming useful as a generic tool for future researchers. By making good use of Lucene's "more like this" functionality [21], we can avoid making an inordinate number of one-to-one comparisons between items in our data set. This largely mitigates one of the concerns stated earlier, and means the number of comparisons we do (as well as the time taken for each comparison) should scale roughly linearly with the size of the data set rather than quadratically.
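The general idea behind such a query can be sketched in Python as follows. This is only a conceptual approximation written for illustration: it does not use Lucene, it does not reproduce Lucene's actual API or scoring, and all function names and the term-weighting formula are assumptions. The sketch picks the most distinctive terms of one document and uses the index to fetch only those documents that share them.

```python
import math
from collections import Counter, defaultdict


def build_postings(docs):
    """Map each term to the set of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return postings


def interesting_terms(terms, postings, total_docs, max_terms=5):
    """Pick the terms that are frequent in this document but rare overall."""
    tf = Counter(terms)

    def weight(term):
        df = len(postings.get(term, ()))
        return tf[term] * math.log(1 + total_docs / (1 + df))

    # Sort by descending weight, breaking ties alphabetically for determinism.
    return sorted(tf, key=lambda t: (-weight(t), t))[:max_terms]


def more_like_this(doc_id, docs, postings, max_terms=5):
    """Rank other documents by how many of the selected terms they share."""
    scores = Counter()
    for term in interesting_terms(docs[doc_id], postings, len(docs), max_terms):
        for other in postings[term]:
            if other != doc_id:
                scores[other] += 1
    return scores.most_common()


docs = {
    "d1": ["minister", "resign", "scandal", "budget"],
    "d2": ["minister", "resign", "scandal", "protest"],
    "d3": ["football", "match", "goal", "score"],
    "d4": ["weather", "rain", "storm", "budget"],
}
postings = build_postings(docs)
result = more_like_this("d1", docs, postings)
print(result)
```

Only documents sharing at least one selected term are ever touched, so unrelated documents such as "d3" cost nothing, and each document triggers one query rather than one comparison per other document.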
At this point we are able to perform the indexing and comparison steps at a manageable pace, something on the order of minutes instead of days for around ten thousand text files. However, we still have significant noise in our result set, so we need to further refine our algorithms. To this end, Abid et al. [1] suggest that n-grams, i.e. sequences of n consecutive words, are a much better choice than single words for indexing and searching. Indeed, we observe a much tighter result set after switching to tri-grams, and the set itself is small enough to be assessed by a quick skim (no longer requiring us to scour through millions of resulting combinations).
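The effect of moving from single words to tri-grams can be illustrated with a short Python sketch. It is not how the application indexes text; the Jaccard overlap used to compare the two sets is an assumption chosen for simplicity, and the sample sentences are invented.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of sequences of n consecutive words in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


first = "the prime minister resigned on monday morning"
second = "the prime minister resigned late on monday"

unigram_score = jaccard(word_ngrams(first, 1), word_ngrams(second, 1))
trigram_score = jaccard(word_ngrams(first, 3), word_ngrams(second, 3))
print(unigram_score, trigram_score)
```

Because tri-grams also encode word order, two texts need to share actual phrasing, not just vocabulary, to score highly. This is consistent with the tighter result set observed above.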

Challenges
Most challenges in this area stem from the fact that we are attempting to adapt a number of rough, heuristic algorithms to make sense of fairly nuanced text generated by humans (i.e. an extremely limited application of natural language processing). We want it to be useful, so the signal-to-noise ratio needs to be high, but without excluding useful results and thereby reducing our overall accuracy. For instance, some of the conclusions coming from the app may be accurate (two pieces of text are very similar), but effectively useless at the same time (e.g. copied and pasted cookie policies, privacy policies, GDPR statements etc.). Conversely, two sources may be very similar content-wise, but the individual website's HTML structure could make it difficult to pick out particularly relevant blocks of text, causing the match to slip under our radar.
Website architectures can be quite varied, and we want to keep any assumptions about particular approaches in this field to a minimum, so that the application can be as generic as possible. In particular, subdomains can be somewhat tricky to deal with: we need to remember at all times to consolidate results belonging to the same top domain as a single source. After all, our stated purpose is to find similarities between wholly distinct websites, to point out the spread of content, and we do not concern ourselves with reused content between different sections of the same website. This consolidation step goes a long way towards improving our signal-to-noise ratio and making the more interesting results shine through.
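A minimal Python sketch of this consolidation step is given below. It is an illustration rather than the application's actual logic: the helper names are hypothetical, the URLs use the placeholder domains example.ro and example.com, and taking the last two labels of the host name is a simplifying assumption that breaks for multi-part suffixes such as co.uk, where a public suffix list would be needed.

```python
from urllib.parse import urlsplit


def top_domain(url):
    """Naive consolidation: keep only the last two labels of the host name."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def cross_site_pairs(pairs):
    """Drop matches whose two URLs belong to the same top domain."""
    return [(a, b) for a, b in pairs if top_domain(a) != top_domain(b)]


matches = [
    ("https://sport.example.ro/stiri/1", "https://www.example.ro/stiri/1"),
    ("https://sport.example.ro/stiri/1", "https://news.example.com/story/9"),
]
print(cross_site_pairs(matches))
```

Only the second match survives, since the first one links two sections of the same website.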
Relevant content is sometimes hard to discern from surrounding context. Looking at any given news website, there is a lot of content displayed on the page, but there is often surprisingly little space allotted to the actual content, i.e. the news article itself. The sidebars are typically reserved for internal/external links, advertising, and various widgets seemingly designed to provide some kind of use to the reader. We can discern that many design patterns favor drawing the user's attention, keeping them engaged and encouraging repeated visits, even when this might come into conflict with the main stated purpose of the site. While humans can quickly learn to intuitively pick up on useful content, automating this kind of processing into our algorithms can be quite tricky and time-consuming. In particular, the rise of interstitial advertising, and a general tendency to break up news into fragments and sprinkle vaguely related content between them, need to be accounted for. We will not go into whether or not an entire article is effectively an advertisement, as that falls outside the scope of the current research.
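One simple heuristic for separating article text from its surroundings is to ignore everything inside elements that rarely hold article content. The Python sketch below illustrates this with the standard library parser; it is not the approach used by the application, and the list of skipped tags is an assumption that will not suit every website, particularly those that build their layout from generic `div` elements.

```python
from html.parser import HTMLParser

# Assumed list of elements that usually hold navigation, ads or scripts.
SKIPPED_TAGS = {"script", "style", "nav", "aside", "header", "footer"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the skipped elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><body><nav>Home | Sport</nav>"
    "<article><h1>Title</h1><p>Body text.</p></article>"
    "<aside>Ad <a href='#'>Click</a></aside>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_main_text(page))
```

Interstitial blocks placed inside the article element itself would still get through, which is one reason why this kind of filtering remains a matter of per-site tuning.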

Optimization
Parallelization is already mentioned in 3.1.2 with regard to how it dramatically improves web crawling performance. The same rules apply here, even though we may not be able to find the same number of truly independent tasks that can be run in parallel. We hit a plateau of diminishing returns fairly quickly, but the performance gains are still worth pursuing as long as they are not too time-consuming to implement and do not significantly harm the readability or maintainability of the resulting code. We sit at a comfortable level of throughput right from the start, in no small part thanks to inherent optimizations present in the software library we employ [19].

Technologies used
We aim to avoid using any proprietary or license-based software, so that all of our code can remain public. We are grateful to the open source community for the multitude of varied and powerful tools at our disposal, and we can at least state that we do not feel hamstrung by our decision. An honorable mention should be made to Apache Nutch [20], a fully featured web crawling solution that could help future tech-minded people to co-opt web crawling into their projects. We do not make use of Nutch in our case, mainly because we wanted to have tighter control over the crawler's behavior, and were comfortable enough in rolling our own lower-level implementation.

Back-end
We use Spring Boot [50] to quickly and easily get a RESTful web service [41] up and running using Java, with minimal boilerplate and configuration out-of-the-box. It ties in well with PostgreSQL [25], which is used mostly for persistence, but also for storage to some degree. We need to save a limited amount of data from the websites explored by the web crawler to our database, some of which is used to inform future crawling iterations. Hibernate [26] makes it easier to perform the mapping between our Java classes and database tables, while Flyway [23] allows us to create our database structure in incremental migrations that can be easily replayed on a new machine when setting it up from scratch.
The crawler component uses jsoup [27] to create all of its network connections and also parse the resulting HTML pages using methods that allow for familiar CSS-like selectors. We also make local text dumps of the bulk of website contents, which are afterwards picked up by our implementation of Apache Lucene [19], creating indexes for quick text searches and comparisons.

Front-end
We use Knockout to build a simple yet dynamic JavaScript interface that pulls data from our application's endpoints and displays it in a more user-friendly fashion. The graph page uses an implementation of vis.js to help visualise website data as an interactive graph, again using data pulled from the back-end. Webpack is used to create a browser-friendly bundle of our own JavaScript source files, together with any node packages we use, as well as any other assets (e.g. CSS files).

Use case: an analysis on Romanian news websites
To test our application, we define a particular use case by restricting the web crawling and analysis to a limited geographical area. We have made this choice largely in the interest of a fast turnaround time, to be able to make quick, experimental changes to our algorithms and study their impact immediately. We avoid making hardcoded assumptions, so that any tools we use can be repurposed with a different scope in mind, large or small.
Romania is actually an interesting choice in this respect, boasting a number of surprising, confusing, or ultimately even paradoxical characteristics. We consider the country to be in a rather unique position with regard to the relationship of Romanians with their fulfillment of basic needs and wants, one of which being news and media consumption. For an unassuming mid-sized country on the geographical fringe of the European Union, it boasts the highest average peak internet speeds in the European region, and is ranked at number 10 worldwide, according to Akamai's 2016 report [2]. Coupled with the generally affordable access plans (both wired and wireless), it is no wonder that adoption rates are on a continuous upwards trend, reaching 81% in 2018 among the 16-74 year old population, according to Eurostat data [16]. While still slightly below EU average (89% in the same Eurostat data set), if we extrapolate from existing trends we could speculate that the adoption rate should reach the EU average in due course.
A recent report by Reuters Institute [31] states that 88% of Romanians get their news online, 82% from TV, 67% from social media and just 18% from print. We can therefore expect that online news websites hold significant sway in shaping public opinion, considering the significant section of the populace relying on them. A similar report from the year before [30] finds that "the Romanian news environment is defined by intense competition for television and online audiences, sustained by understaffed newsrooms that struggle for financial survival".

Graph analysis
We start our study by directing the application towards the top news websites by monthly popularity [51]. From there, we get a record of all links encountered, which can later be visualized as the edges of a directed graph. What we end up with is a fairly large grouping of websites, centered around a smaller core of websites that we are actively interested in (i.e. Romanian-language news websites). The grouping itself is interesting: if we lay out a visual representation of the graph, we can see that the links are not formed at random, and are more heavily weighted towards some websites. On one end, there are very few sites with a great number of links, and on the other end many sites with a very small number of links. A linear plot makes it harder to notice this fact, so we create a log-log plot to make it stand out more in our distribution. Our plots seem to line up with conclusions from existing research targeting internet topology [18], which claims that we should expect to see a surprisingly simple set of power laws that concisely describe skewed distributions of graph properties such as the node outdegree and indegree.
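For readers who wish to repeat this kind of check on their own edge list, the Python sketch below shows one way to derive a degree distribution and estimate the slope of its log-log plot. It is an illustration with hypothetical helper names, not the procedure used to produce our plots, and an ordinary least-squares fit on log-log axes is only a rough estimate of a power-law exponent.

```python
import math
from collections import Counter


def degree_counts(edges):
    """Return (indegree, outdegree) counters for a list of (source, target) links."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return in_degree, out_degree


def degree_distribution(degrees):
    """Map each degree k to the number of nodes having that degree.

    Nodes with degree zero never appear in the counters and are left out.
    """
    return Counter(degrees.values())


def loglog_slope(distribution):
    """Least-squares slope of log10(count) against log10(degree).

    Requires at least two distinct degrees, otherwise the variance is zero.
    """
    xs = [math.log10(k) for k in distribution]
    ys = [math.log10(distribution[k]) for k in distribution]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


edges = [("a", "b"), ("a", "c"), ("b", "c")]
in_degree, out_degree = degree_counts(edges)
print(dict(degree_distribution(in_degree)))

# Synthetic distribution following count = 1000 / k, i.e. an exponent of -1.
synthetic = {1: 1000, 10: 100, 100: 10, 1000: 1}
print(loglog_slope(synthetic))
```

A distribution that follows a power law appears as a roughly straight line on log-log axes, and the fitted slope gives a first approximation of its exponent.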
For instance, we can see a distinctly non-random pattern if we look at the entire set of Romanian-language websites (not just news) that our crawler has visited at least once. The plots below display data points for roughly 65,000 Romanian websites found in this manner. By restricting the graph to include only Romanian news sites (plus direct neighbors), we can still see a hint of the same pattern developing, but since the sample size is much smaller, outliers are more noticeable. In this graph we have 1404 nodes (of which 157 are news websites) and 2450 edges. The full data sets can be taken from CSV files stored on GitHub [46]. This file format can be plugged straight into graph visualization software such as Gephi [11], and potentially others with some tweaking. Since this graph of news websites effectively represents a social network, it exhibits all the standard properties of one, e.g. a power-law degree distribution, a high clustering coefficient, and a small diameter (relative to the number of nodes in the graph). We illustrate these properties below (the data points behind the plots are also available on GitHub [46]).
It feels relevant to note that the website corresponding to the highest degree node (by an overwhelming margin) belongs to hotnews.ro, an online-only news outlet. We could interpret this both as a clearly focused effort to increase their footprint in the only arena where they are competing, but potentially also as a sort of underdog mentality, trying to make a disproportionate effort to get to the top and hold their position. It is likely that this strategy is paying off, considering how they are now considered one of the largest Romanian news websites, pulling in around 250,000-300,000 unique users daily, more than 3 million monthly unique visitors, and around 30 million monthly page views, according to stats measured by the Romanian nonprofit organization BRAT (Romanian Joint Industry Committee for Print and Internet) [14].
Other online news outlets that started out as more traditional media companies, like television (e.g. stirileprotv.ro and antena3.ro), or print (e.g. libertatea.ro) appear to serve more as an extension of their main business, seeming to make little more than a token effort in establishing an online presence. The majority of outward links from these websites simply seem to promote other websites owned by the same parent company, while links from online-only outlets are a bit more varied and balanced.

Social analysis based on the data
On the front of content comparisons, we have some interesting results showing similarities between distinct news outlets. We can see many instances where near-identical articles are displayed on different websites, with these websites sharing the same media group parent company (this is to be expected). If we filter out these cases, we then see instances where the application pinpoints articles about the same event, or on the same topic, with a fair rate of accuracy. While it would not be enough to conclusively pinpoint plagiarism, it is certainly a potential step in that direction, if we follow up on these leads (manually, for the time being). Gathering this kind of historical data can also help paint a picture of what kind of articles each particular outlet is liable to pick up on, and of whether any consistent groupings of websites emerge from there. A list of similar articles we have found over the course of running the application can be found in CSV format on GitHub [46].
To address fake news specifically, a recent report from Facebook [17] announces they have undertaken efforts to remove what they call "Coordinated Inauthentic Behavior", i.e. pages that engage in manipulative behavior towards users on their platform. This is of particular interest to us, since some of these pages pose as Romanian news sources, which fits nicely into our use case. While the Facebook pages are no longer available, their associated websites are still alive and kicking: destanga.ro, perele.ro, antifakenews.ro, momentulzero.ro. These websites have not been discovered organically by our web crawler, despite the crawler having seen around 96,000 distinct URLs thus far, which would indicate that there are no links pointing to them at all in our entire data set. Out of curiosity, we add them to our list of target websites to see what we can learn from them, if anything. What we find is that they are largely isolated nodes in our graph, having very few distinct outward links, all of them pointing only towards Google, Facebook, or WordPress.
Our text comparison component was able to find only a few matches involving these four websites over several runs, all of them between momentulzero.ro and the ironically titled antifakenews.ro. This is a tentative indicator that at least some of these sites (labelled as misleading and manipulative by Facebook) are either coming from the same source or have the same goal in mind. Taking just a cursory glance at some of the articles served, we can see that they are quite short (around 500-1000 characters on average), and have no citations of any kind, even when purporting to use a direct quote from a particular person or institution (confirmed by our web crawler being unable to find any hyperlinks outside of social media). These are all good heuristic indicators that seem to support Facebook's conclusions in this particular case, and potentially lead us to other examples in need of a closer look.

Conclusions and future work
Throughout our inquiry, we manage to delve deep into the innards of our target websites, and glean some fairly intimate knowledge regarding their architecture and contents. Some of our expectations get challenged along the way, we arrive at some surprising conclusions, and oftentimes the issues and challenges we come across are particularly frustrating to get through, but still yield satisfying results. While we cannot expect groundbreaking results at every turn, or a "smoking gun" behind every corner, we trust that given enough time our application can do great things in the right hands. The amount of time saved by automating away cumbersome tasks empowers us to look at an increasingly larger picture, at a fine resolution. Sifting through this picture to find occasional nuggets of meaning can become a rewarding task in and of itself.
The current list of features and functionality included in the application is representative of the ideas we came up with, both on our own, and by studying existing research, all while timeboxing the implementation time to prevent "feature creep" so that the project does not drag on for many months or even years. We would be overjoyed to receive any kind of feedback from the community about our offering, and work with interested users to develop new features. We expect most of the future work involved will be around adding new statistics, reports and visualizations to the front-end, making it more friendly to people coming from a non-tech background. Barring some unforeseen revolutionary idea, the resulting data set gathered by the back-end component should be generic enough to be molded to match most reporting needs.
As mentioned by Marres and Weltevrede [37], "it would be a mistake to approach scrapers as if they were stable, stand-alone machines: scrapers come in and fall out of use; they work, and then they no longer work". We can certainly note that the stability of a particular piece of software is correlated with the amount of time spent bug-fixing, debugging and generally testing through use. To that end, making the app available to the public as a generic tool is probably the best way to find and fix the more glaring issues and omissions. After some growing pains, we expect to emerge with increasingly robust and battle-hardened versions of code, though some maintenance is likely to be required on a semi-regular basis, in case entirely breaking changes start to become widely adopted by target websites. To give a technical example, we can expect something like newly issued SSL certificates by certain certificate authorities to give us trouble if we are still using a particularly old version of the Java runtime that is unable to recognize them.