aFlux: Graphical flow-based data analytics

aFlux is a graphical flow-based programming tool designed to support the modelling of data analytics applications. It supports high-level programming of Big Data applications with early-stage flow validation and automatic code generation for frameworks like Spark, Flink, Pig and Hive. The graphical programming concepts used in aFlux constitute the first approach towards supporting high-level Big Data application development by making it independent of the target Big Data frameworks. Programming at this higher level of abstraction lowers the complexity of developing Big Data applications and the learning curve that comes with it.


Introduction
Data analytics has gained prominence in recent years. Nevertheless, developing Big Data applications is not a trivial task. Writing Big Data applications for frameworks like Spark [1], Flink [2], Pig [3] and Hive [4] requires interaction with several libraries and APIs, and working with different data abstractions. Furthermore, developers might need to include additional libraries within an application to ensure its successful execution on Big Data clusters. This makes the process cumbersome, particularly when applications must be prototyped quickly for exploratory data research. The learning curve is therefore steep, and a considerable amount of expertise is required to use Big Data analytics. Additionally, there is no support for end-user programming with Big Data, i.e. programming at a higher abstraction level.
We believe that one promising solution is to enable domain experts, who are not necessarily programmers, to develop Big Data applications by providing them with domain-specific graphical tools based on flow-based programming, with support for early-stage validation so that the flow always yields a compilable and runnable Big Data program. The approach involves:
1. Analysing target Big Data frameworks like Spark and Flink and extracting the data abstractions and APIs that are compatible with the flow-based programming paradigm, i.e. APIs that require user-defined data transformation functions, or code snippets written during flow creation to interact with target framework internals, are not supported.
2. Representing the selected APIs operating on the compatible data abstractions as modular, composable components. These components are independent of the execution semantics of flow-based programming tools and bundle a set of APIs invoked in a specific order to perform a specific data analytics operation.

Software prototype
aFlux [7][8][9][10] is a graphical flow-based programming tool (mashup tool) based on the actor model [11] to support the design of data analytics applications with concurrent execution semantics, thereby overcoming the prevalent architectural limitations of state-of-the-art mashup tools [12][13][14]. A flow-based programming model with concurrent execution semantics is suitable for modelling a wide range of Big Data applications currently used in Data Science. Big Data jobs usually take a long time to finish. Without concurrent execution semantics, the other components in a flow would have to wait for such a job to complete before they could execute, which leads to inefficient application designs.
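The benefit of concurrent execution semantics can be illustrated with a minimal Java sketch. It is not taken from the aFlux code base: the names `ActorSketch`, `Actor`, `tell` and `runDemo` are hypothetical, and each actor is simply approximated by a single-threaded executor acting as its mailbox. A long-running component is started first, yet a short component still completes before it, because neither blocks the other.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Illustrative sketch only; ActorSketch and Actor are hypothetical names, not aFlux classes. */
class ActorSketch {

    /** A component modelled as an actor: messages queue in a mailbox and are handled one at a time. */
    static final class Actor {
        private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
        private final Consumer<String> behaviour;

        Actor(Consumer<String> behaviour) {
            this.behaviour = behaviour;
        }

        /** Asynchronous send: the caller returns immediately while the actor works on its own thread. */
        void tell(String message) {
            mailbox.execute(() -> behaviour.accept(message));
        }

        void stop() {
            mailbox.shutdown();
        }
    }

    /** Starts a long-running and a short component; returns the order in which they completed. */
    static List<String> runDemo() {
        List<String> completed = Collections.synchronizedList(new ArrayList<String>());
        CountDownLatch fastFinished = new CountDownLatch(1);
        CountDownLatch allFinished = new CountDownLatch(2);

        // Stand-in for a long-running Big Data job: it stays busy until the short component
        // is done, so the short component must make progress while this one is still running.
        Actor slowJob = new Actor(msg -> {
            try {
                fastFinished.await(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            completed.add("slow:" + msg);
            allFinished.countDown();
        });
        Actor fastJob = new Actor(msg -> {
            completed.add("fast:" + msg);
            fastFinished.countDown();
            allFinished.countDown();
        });

        slowJob.tell("batch-analysis"); // sent first, finishes last
        fastJob.tell("status-update");
        try {
            allFinished.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        slowJob.stop();
        fastJob.stop();
        return new ArrayList<>(completed);
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // [fast:status-update, slow:batch-analysis]
    }
}
```

A production actor framework adds supervision, typed messages and distribution; the sketch only shows the property the text relies on, namely that a slow component does not hold up the rest of the flow.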

Conceptual approach
aFlux implements the modular, composable components selected from Big Data frameworks like Pig, Hive, Spark and Flink as actors and enables high-level Big Data programming with flow validation and automatic code generation. The implemented components are available on the front-end as graphical components, which the user drags and wires together to create an application flow. The flow begins with components that read datasets, continues with a series of data-transformation components and ends with a data-output component. Every component has a set of properties which the user can configure on the front-end. The flow is parsed for correctness and internally represented as a directed acyclic graph (DAG). From the DAG, the native Big Data application is generated using an API-based code generation technique [15]. The conceptual approach for flow creation and automatic code generation, including some examples of supported components for flow creation, is illustrated in Fig. 2 [9].

aFlux consists of a web application and an execution environment developed in Java and the Spring Framework (https://spring.io/). The web application is composed of two main entities, the front-end and the back-end, based on a REST API. The front-end of aFlux (Fig. 3) provides a GUI for the creation of flow-based applications, while the back-end parses the user flow to generate native application code. The application can be executed in its internal execution environment or sent to an external [...] [16].

Table 1. Comparison of aFlux with the state-of-the-art, following [6].

The application shows a console-like output in the footer, and the details of a selected component are shown in the right-hand side panel. The 'Application Header & Menu Bar' contains controls for managing an application, such as starting the execution, stopping the execution and saving the application. Using the aFlux front-end, a user can create a flow by wiring several components together. Fig. 4 illustrates a Pig graphical flow created on the front-end, the code generated from it and its output after execution. An illustrative example of Flink flow creation and code generation is provided in a video demonstration.
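The path from a wired flow to generated code can be sketched in a few lines of Java. This is a simplified illustration under stated assumptions, not the aFlux implementation or the technique of [15]: the names `FlowSketch`, `Component`, `Flow`, `validate` and `generate` are hypothetical, validation is reduced to a topological sort (Kahn's algorithm) plus two structural rules, every non-source component is assumed to have exactly one input, and code generation is plain template substitution producing Pig Latin statements.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical, not aFlux classes. */
class FlowSketch {

    enum Kind { SOURCE, TRANSFORM, SINK }

    /** A component with an id, a role in the flow and a statement template (%1$s = own id, %2$s = input id). */
    static final class Component {
        final String id;
        final Kind kind;
        final String template;
        Component(String id, Kind kind, String template) {
            this.id = id; this.kind = kind; this.template = template;
        }
    }

    /** A user flow: components plus the wires drawn between them. */
    static final class Flow {
        final Map<String, Component> components = new LinkedHashMap<>();
        final Map<String, List<String>> wires = new LinkedHashMap<>();
        Flow add(String id, Kind kind, String template) {
            components.put(id, new Component(id, kind, template));
            wires.put(id, new ArrayList<String>());
            return this;
        }

        Flow wire(String from, String to) {
            if (!components.containsKey(from) || !components.containsKey(to)) throw new IllegalArgumentException("unknown component");
            wires.get(from).add(to);
            return this;
        }
    }

    /** Validates the flow with Kahn's algorithm and returns the components in execution order. */
    static List<Component> validate(Flow flow) {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String id : flow.components.keySet()) inDegree.put(id, 0);
        for (List<String> targets : flow.wires.values())
            for (String t : targets) inDegree.put(t, inDegree.get(t) + 1);
        Deque<String> ready = new ArrayDeque<>();
        for (Component c : flow.components.values()) {
            if (inDegree.get(c.id) == 0) {
                if (c.kind != Kind.SOURCE) throw new IllegalStateException(c.id + " has no input");
                ready.add(c.id);
            }
        }
        List<Component> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            Component c = flow.components.get(ready.remove());
            order.add(c);
            List<String> targets = flow.wires.get(c.id);
            if (targets.isEmpty() && c.kind != Kind.SINK) throw new IllegalStateException(c.id + " has no output");
            for (String t : targets) {
                int remaining = inDegree.get(t) - 1;
                inDegree.put(t, remaining);
                if (remaining == 0) ready.add(t);
            }
        }
        // Components on a cycle never reach in-degree zero, so they are missing from the order.
        if (order.size() != flow.components.size()) throw new IllegalStateException("flow contains a cycle");
        return order;
    }

    /** Emits one statement per component; assumes every non-source component has a single input. */
    static String generate(Flow flow) {
        Map<String, String> inputOf = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> wire : flow.wires.entrySet())
            for (String t : wire.getValue()) inputOf.put(t, wire.getKey());
        StringBuilder script = new StringBuilder();
        for (Component c : validate(flow)) {
            script.append(String.format(c.template, c.id, inputOf.get(c.id))).append('\n');
        }
        return script.toString();
    }

    /** A three-component example flow: read a dataset, filter it, store the result. */
    static Flow demoFlow() {
        return new Flow()
            .add("readings", Kind.SOURCE,
                 "%1$s = LOAD 'readings.csv' USING PigStorage(',') AS (city:chararray, temp:int);")
            .add("hot", Kind.TRANSFORM, "%1$s = FILTER %2$s BY temp > 30;")
            .add("out", Kind.SINK, "STORE %2$s INTO 'hot_cities';")
            .wire("readings", "hot").wire("hot", "out");
    }

    public static void main(String[] args) {
        System.out.print(generate(demoFlow()));
    }
}
```

For the example flow, `main` prints three Pig Latin statements in dependency order: a `LOAD` into `readings`, a `FILTER` producing `hot`, and a `STORE` of `hot`. Because `generate` calls `validate` first, a flow with a cycle or a dangling component is rejected before any code is emitted, which mirrors the early-stage validation idea described above. The file name `readings.csv`, the schema and the threshold are made-up example values.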

Installation
aFlux uses Maven [17] for project management and can be installed from the Git [18] repository (the code metadata table lists all requisite information). It can be compiled on any operating system, including Windows, macOS and Linux, provided a Java 8 development environment is present. Any Java integrated development environment (IDE) like Eclipse [19] or IntelliJ [20] can be used for importing and compiling, which generates a Web Application Resource or Web application ARchive (WAR) file. The WAR file is deployed inside Apache Tomcat, a web server which supports the execution of Java code. On startup, the application requires a connection to a local MongoDB [21] instance in order to save the program flows created and the application configuration settings. All front-end graphical components are separate Maven projects within the same Git repository. They are compiled separately and loaded from the GUI of aFlux after it is up and running. The web application can be accessed from any standard web browser, including Safari, Firefox, Opera and Google Chrome.
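Since startup depends on a reachable local MongoDB instance, a quick pre-flight check before deploying the WAR file can save a failed start. The following is a minimal sketch and not part of aFlux: `PreflightSketch` and `isListening` are hypothetical names, and the check assumes MongoDB listens on its default port 27017, which a customised installation may have changed. It only verifies that something accepts TCP connections on that port, not that it is actually MongoDB.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Illustrative sketch only; PreflightSketch is a hypothetical helper, not part of aFlux. */
class PreflightSketch {

    /** Returns true if a TCP connection to host:port can be opened within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 27017 is MongoDB's default port; a customised installation may use another one.
        boolean mongoUp = isListening("localhost", 27017, 1000);
        System.out.println(mongoUp ? "MongoDB port reachable" : "MongoDB port not reachable");
    }
}
```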

Comparison with the state-of-the-art
We compare aFlux with the existing solutions against three parameters: (i) target-framework support: whether the tool supports multiple target frameworks, (ii) extensibility: whether the approach can be, or has been, extended to other Big Data frameworks, and (iii) code generation: whether the user flow results in final executable code or runs in the tool's internal environment. The results are summarized in Table 1.

Overview of impact
aFlux impacts end-users as well as researchers in the following ways: (i) it supports graphical flow-based application development, thereby enabling non-experts to quickly prototype Big Data applications [8], and (ii) the graphical programming concepts used in aFlux constitute the first approach to support high-level Big Data application development by making it independent of the target Big Data frameworks, which is a significant improvement over the existing state-of-the-art. It has been used to support frameworks like Spark [7], Flink [9], Pig and Hive, which demonstrates (a) the extensibility of the approach and (b) the generalizability of the code generation technique. The high-level graphical programming approach hides the complexities of Big Data application development from end-users, thereby lowering the associated learning curve and enabling less-skilled Big Data programmers to adopt Big Data analytics. The main research contributions have been published in three peer-reviewed publications [7][8][9]. Recently, several commercial solutions have appeared that aim to enable less-skilled Big Data programmers to quickly prototype Big Data applications using Spark and Flink via flow-based programming. Examples include Stratio Sparta 2.0 [27] and StreamAnalytix [28]. The research contributions and results from the open-sourced aFlux will lay the foundation for further research in this area and will significantly help less-skilled Big Data users in an academic environment adopt Big Data analytics.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.