MEASURING THE PERFORMANCE OF BIG DATA ANALYTICS PROCESS

Big data analytics (BDA), viewed from a process perspective, offers major benefits: better outcomes, more satisfied customers, and evidence-based practices. The aim of BDA is to examine and analyze raw data and to extract actionable insights from it. BDA involves data, the tools for processing and analyzing it, and the process through which the data is handled and managed. The BDA process is an end-to-end process consisting of the following phases: data acquisition, data preparation (integration and preprocessing), data analysis, visualization, and interpretation. The performance of big data analytics depends not only on the quality of the input data but also on the performance of the process that the data goes through from acquisition to visualization and interpretation. Measuring process performance makes it possible to identify problems and launch corrective actions before these problems worsen. The aim of this paper is to present an evaluation of BDA process performance. To that end, the study identifies the measures, metrics, and indicators for each phase of the BDA process. A subject-matter expert review and a pilot study were conducted, and their results are reported in this paper.


INTRODUCTION
Much has been said about the promises and potential benefits of big data. However, many challenges still surround big data, such as data challenges (relating to its defining characteristics such as volume, variety, and velocity), management challenges (such as privacy, security, governance, and ethical considerations), and process challenges (concerning how to capture, integrate, transform, and analyze data, and how to convey the results) [1].
As data goes through the BDA process, several issues are encountered. Lack of data provenance is one of them [1]. Having information about the origins of data and carrying this information throughout the process is very useful, because processing errors, inconsistencies, and missing information can then be traced back and fixed accordingly.
These problems, if not addressed, can make the subsequent analytics phases useless, and can restrict the speed at which data is captured and stored and the ability to extract meaningful information from it. Heterogeneity, lack of structure, error-handling, privacy, timeliness, provenance, and visualization are also issues that can arise throughout the BDA process, from data acquisition to interpretation [2].
Data quality is a major focus of the still-emerging literature on big data analytics. However, the quality of big data analytics depends not only on the data but also on the process by which the data is collected and the way it is processed [3]. Insights that reveal existing quality issues are therefore important. This signifies the need for measuring the performance of big data analytics, which has been minimally discussed. This means not only the performance of a particular tool such as Apache Hadoop or a NoSQL database [4], [5], [6], but also the performance of the entire BDA process. Measuring process performance, as shown in other established process literature (such as that on business processes), has the benefit of identifying problems and launching corrective actions before these problems worsen [7]. A problem can be a deterioration in performance, a failure to meet a requirement, or an opportunity for improvement.
Performance is not absolute; it is the result of driving factors. Therefore, there is a need to measure both the results and the drivers of those results. Some performance frameworks already employ this type of performance measurement, which is the precondition for analyzing and improving processes [8].
This paper is organized as follows. Section II covers related work on the BDA process, existing performance models and frameworks, and performance measures. Section III describes the methodology of the pilot study. Section IV presents the initial model of performance measurement for the BDA process. Section V shows the results and discussion of the pilot study, and Section VI presents the revised and proposed model of this study. Finally, the last section summarizes this work.

RELATED WORK

Background
Measuring the performance of the BDA process brings benefits such as better outcomes and satisfied customers. Performance measurement is the process of quantifying the efficiency and effectiveness of an action or a process. Its importance is captured by the old sayings "If you can't measure it, you can't manage it" and "If you can't measure it, you can't improve it". Identifying problems and performance gaps and proposing solutions accordingly is clearly preferable to acting without problem-related information and knowledge.
In the context of big data, the unprecedented growth of data size, coupled with speed requirements, imposes the need for performance measurement. Examples include data processing time, data transmission time (over the network), the performance required when presenting results to users [9], and data acquisition time. This shows that the BDA process requires performance measurement, whether of internal performance (time, cost, resource utilization), usually referred to as efficiency, or of external performance, which is related to effectiveness. This explanation agrees with the definition of process performance, which involves process efficiency (i.e., productivity) and the effectiveness and compliance of the process [10].
Another point that deserves clarification is what is captured in the performance measurement process. What is captured is metadata, or "data about data", about the performance of the BDA process, in order to uncover performance deficiencies and weaknesses and to discover opportunities for further improvement. A motivation is that such a performance focus may lower the failure rate of big data application projects, which is reportedly 50% higher than that of IT projects [11].
It is a common tradition for IS research to be based on an existing IS theory or to develop a new one, and research in BDA should respect this tradition. However, arguments have arisen about how existing IS theories fit into BDA research, which may be either theory-driven or process-driven [12]. It was argued that neither of these two approaches is sufficient without the other. Therefore, lightweight theory has been suggested throughout the stages of the BDA process, at least to address the epistemological challenges surrounding BDA research.
Turning to performance measurement, a prominent IS theory is DeLone and McLean's IS Success Model [61], [62], because the keyword "success" roughly approximates the meaning of "performance". Nevertheless, the applicability of a theory is determined not only by the semantic closeness of words but also by the relevance of its constructs and contexts. Integrating the DeLone and McLean model with the Structure-Process-Outcome (SPO) framework of [13] is relevant here, since process mediates between the factors contributing to performance and the resulting outcomes [14].
Generally, there are two approaches to process performance measurement. One is the clean-sheet approach, in which the performance measurement of a process is started from scratch. The other takes existing processes as the starting point. Adapting existing approaches is more convenient, whereas the clean-sheet approach is riskier but more rewarding [15].

Big Data Analytics Performance Measurement
Analyzing the performance of big data applications and identifying the factors that influence their quality has been noted as one of the challenges in big data settings [16]. Existing performance evaluation studies focus on comparing big data frameworks such as Hadoop, Spark, and Flink [17], [18]. This involves running experiments in which selected datasets are processed by the selected big data frameworks, after which the corresponding performance metrics are extracted. According to Veiga et al. [18], end users rarely benefit from the findings of these studies. Two points should be taken into consideration. First, performance evaluation of such big data frameworks fails to consider that the field of big data is still emerging and that tools in use today may not be relevant tomorrow. Second, such performance evaluation of BDA systems does not consider the users' role in performance measurement. In fact, the success of a system is measured not only by the performance of its features but also by the satisfaction of its users (e.g., users' level of satisfaction with reports, web sites, and support services) [19].
This gives rise to the need to discover measures that give users a stake in performance measurement. It also highlights the need for a holistic solution that applies to BDA systems in general rather than to specific, replaceable big data frameworks such as those mentioned above. Therefore, using existing performance measures and metrics, this study investigates how to bridge this gap.

The BDA Process
The success of big data depends on factors including people, technology, and process. Unlike traditional projects, big data requires a different process approach, one that pertains to the exploration and exploitation of data [20]. The need for a process methodology has been a prominent topic of discussion. Discussions center on whether to define a new process, which is said to be possible, or to use existing process methodologies where applicable, such as Knowledge Discovery in Databases (KDD) and the Cross-Industry Standard Process for Data Mining (CRISP-DM). The traditional ETL (extract, transform, and load) process is another example of a preexisting analytics process. The ETL process is criticized for being batch-oriented [21], a characteristic that complicates its applicability to fast analytics in the era of big data. Elsewhere, the applicability of agile methods was noted. Agile methods were originally created for software development and have been suggested for data analysis (e.g., big data analysis) as better process guidance [22]. Agile methods have been perceived as suitable for the business intelligence lifecycle but not for the fast analytics lifecycle of the big data era, unless short-cycle agile approaches are employed. The short-cycle agile approach is concerned with faster and more flexible sprints [23].
The agile approach holds that collaboration and interaction among team members matter more than processes and tools. Coordinating the overall effort of big data work nevertheless seems to require a process, albeit from a different perspective. Therefore, some researchers have stressed the need for a big data team process methodology [24].
However, it should also be noted that a big data analytics process (which is data-intensive and recurrently executed) may differ from a big data project (for example, a big data project lifecycle), which is a temporary endeavor whose success depends on maximum utilization of resources within a predefined timeframe. Moreover, the differentiating trait of the BDA process is that it is a step-by-step process of understanding and doing BDA, which may not necessarily pertain to a specific process methodology. This also gives rise to the need for process standardization and integration [3], that is, the ability to integrate processes and to standardize tasks and data results, and consequently to achieve more benefits, including minimizing the cost and effort of using big data.
Debates on the BDA process can possibly be traced to the infancy of big data in general, and the future is promising as increasing attention is directed to big data from both industry and academia.
Having a process is one thing; having a standardized, well-defined process is another. Clarifying this point is important because researchers have frequently mentioned the notion of a BDA process in the literature. For example, Ur Rehman et al. [25] presented a big data analytics process that consists of six phases, namely data collection, data preparation, modelling, evaluation, deployment, and monitoring. According to them, big data analytics processes differ in terms of descriptive, prescriptive, and predictive analytic models. The said process was aimed at the creation of learning models through predictive analytics. Erl et al. [26] elaborated a BDA lifecycle consisting of nine stages, starting with business case evaluation and data identification, followed by data acquisition, and then proceeding to several stages that can be considered data preparation. After preparation, their process contains three further stages, namely data analysis, data visualization, and utilization of the results. Elsewhere, the overall process of extracting value from big data is divided into five stages, classified under two main sub-processes, data management and analytics [27]. Jagadish et al. [28] provided a process comprising acquisition, information extraction and cleaning, data integration, modeling, analysis, interpretation, and deployment.
BDA process phases reported in the literature:
Data collection and registration, data filter/enrich/classification, data analytics/modeling, and data redelivery/visualization, in a big data lifecycle that manages data from a given data source to the consumer data analytics application.
[25] Data collection, data preparation, modeling, and evaluation, among other phases of the big data analytics process.
[27] Acquisition and recording, extraction/cleaning, integration/annotation, aggregation/representation, modeling and analysis, interpretation.
[59] Data collection, data analysis, data visualization.
[58] Discussed data acquisition, information extraction, data analysis, and the need for interpretation.
[57] Mentioned some useful terms in an overview of the analytics workflow for big data, including analysis, visualization, interpretation, and others.
[56] Acquisition, extraction, integration, analysis, interpretation.
[53] Generation, acquisition, storage, analytics.
This research relies on a big data analytics process with the following phases (refer to Figure 1): data acquisition, data preparation, data analysis, visualization, and interpretation. The phases of the BDA process are described below.
Acquisition: Data acquisition includes selecting sources and collecting data from diverse sources such as online activities (tweets, retweets, web crawlers, customers' reviews, clickstreams, and sensors), log files, and data warehouses. The data is captured in a timely manner and sent to the next phase for further calibration.
Preparation: The preparation phase involves activities such as data integration, pre-processing, and cleansing. Data is checked for errors, outliers, and missing or noisy values. Then the data is unified, dimensions are reduced, and features are extracted.
Analysis and Modeling: This phase applies analytical and statistical tools and methods to the data to extract actionable information and business insights, thereby producing different types of analytical models: descriptive, prescriptive, or predictive.
Visualization: In this phase, the results of the analysis and modeling phase are presented to users in a meaningful and understandable way, in tabular form, graphical form, or both. The visualization of results should be considered from the users' perspective, as their satisfaction is a key success factor for any information system, and analytics systems are no exception.
Interpretation: Results are interpreted and exploited by the users in their own context, such as operational optimization, enhanced decision making, or even the creation of new business models.
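To make the end-to-end view concrete, the five phases can be sketched as a chain of functions whose individual run times are recorded. This is only an illustrative sketch: the phase functions, the sample values, and the `run_pipeline` helper are hypothetical stand-ins, not part of the study.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-ins for the five BDA phases; real phases would call
# the actual acquisition, preparation, analysis and reporting tools.
def acquire(_: Any) -> List[Any]:
    return [" 3", "5 ", None, "8", "x"]

def prepare(raw: List[Any]) -> List[int]:
    # Drop missing and non-numeric entries, then convert to integers.
    return [int(v) for v in raw if v is not None and v.strip().isdigit()]

def analyze(values: List[int]) -> Dict[str, float]:
    return {"count": len(values), "mean": sum(values) / len(values)}

def visualize(summary: Dict[str, float]) -> str:
    # Tabular text output, one line per statistic.
    return "\n".join(f"{name}: {value:.2f}" for name, value in summary.items())

def interpret(report: str) -> str:
    return f"{len(report.splitlines())} result lines ready for decision making"

PHASES: List[Tuple[str, Callable[[Any], Any]]] = [
    ("acquisition", acquire),
    ("preparation", prepare),
    ("analysis", analyze),
    ("visualization", visualize),
    ("interpretation", interpret),
]

def run_pipeline(phases: List[Tuple[str, Callable[[Any], Any]]]) -> Tuple[Any, Dict[str, float]]:
    """Run the phases in order and record the elapsed seconds of each one."""
    data: Any = None
    timings: Dict[str, float] = {}
    for name, phase in phases:
        start = time.perf_counter()
        data = phase(data)
        timings[name] = time.perf_counter() - start
    return data, timings

outcome, phase_seconds = run_pipeline(PHASES)
for name, seconds in phase_seconds.items():
    print(f"{name:<15}{seconds:.6f} s")
print(outcome)
```

Recording one timing per phase, rather than a single total, is what allows a slow phase to be located and corrected.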

Existing Performance Models and Frameworks
A BDA system involves two important components: data and a process. The data is handled through this process from the acquisition phase to the interpretation phase. Several studies highlight these two components. The hybrid model presented by Serhani et al. [30] considers both data and process, but from a quality perspective. Three process quality measures were used to evaluate the preprocessing stage and the processing and analysis stage: accuracy, throughput, and response time. Other data quality measures were also mentioned in their study. A performance analysis model for big data applications was presented by Villalpando et al. [16]. Their study examined the performance of big data applications using ISO 25010 software quality concepts, namely performance efficiency and reliability. The set of process performance dimensions known as the Devil's Quadrangle is another framework, used mainly in process redesign. It brings together four process performance dimensions, namely time, quality, cost, and flexibility [8], [15], and thus combines a financial measure (cost) with non-financial measures such as time, quality, and flexibility.
Looking further into the performance measurement literature reveals more process performance measurement frameworks and models, mainly from business process and manufacturing perspectives. The TOPP system framework is one of them. The TOPP system uses efficiency, effectiveness, and changeability (or ability to change, as some call it) as its three performance measures [31]. Interestingly, the TOPP system is comparable with the Devil's Quadrangle in that cost and time can be categorized under efficiency as a measurable concept. Effectiveness in the TOPP system is defined as customer satisfaction. The quality dimension in the Devil's Quadrangle, by comparison, is aimed at the satisfaction of both customers and process participants (the staff). The similarity between changeability and flexibility is also clear. This mapping can be made even though the two frameworks vary in their application. TOPP uses a questionnaire to evaluate performance, not only for a process but also for the entire enterprise in manufacturing areas. A summary of existing frameworks and models is provided in Table 2.

Performance Measures
Having considered the BDA process as a measurable entity, the next step is to identify the performance measures that apply to it. "A performance measure is defined as a metric used to quantify the efficiency and effectiveness of an action" [32]. Time-related measures such as cycle time, response time, latency, and speed are the dominant process performance measures. Capacity, throughput, and resource utilization were also observed in the literature. The time behavior and resource utilization measures are classified under efficiency. Output-related measures such as users' satisfaction are classified under effectiveness.
Other contributing factors include technology, competence, compliance, and staff working conditions. The performance measures are described below.
Efficiency Measures: Efficiency, as a measurable concept, is an internal process performance measure that shows how well the process transforms inputs into outputs. It includes "resource optimization (mainly cost and time) along with maximum waste reduction" [33]. Table 3 provides a group of process performance measures classified as efficiency measures.
Table 3. Efficiency measures
Measure | Description | Sources
Cost | The expense of the whole process. It is a measure related to evaluating the financial resources applied to the activities of the process (software process). | [44], [10]
Capacity | The maximum number of simultaneous connections and/or processes. | [30], [16], [35]
Response time | The total time to complete the processing of each record; the time interval between when a BDA task is submitted for processing and when it is started; the time needed to complete a user request. | [30], [16], [42], [9], [35], [43]
Throughput | The number of records or requests whose processing is completed over a period of time. | [30], [43]
Resource utilization | How resources such as processing power, storage, people, and money are utilized. | [18], [16], [42]
Timeliness | Relates to situations where the results should be received immediately by the users. Timeliness is also measured in data acquisition or data collection. | [30], [43]
Effectiveness Measures: Effectiveness is an external process performance measure that shows the extent to which the process meets the needs of various stakeholders [33]. Process effectiveness measures the degree to which the preferred performance of the process, such as certain outcomes or results, is achieved [34]. Effectiveness therefore emphasizes whether the process is achieving sufficient output [8]. From an information systems perspective, it means the impact of the information provided on assisting users to perform their work [35].
Flexibility Measures: Flexibility is "the ability to react to changes". Flexibility can broadly be divided into run-time and build-time flexibility [15], [36]. It also covers customization and modifiability in order to meet future changes [34]. Adaptability, as a synonym for flexibility, is the degree to which an analytics system can be adjusted to satisfy various needs in changing situations [37]. Flexibility in the BDA process, as can be seen in Table 4, is represented by the ability to handle increasing volumes of data, the ability to adjust to new needs and circumstances, and users' ability to view results according to their preference. Users can choose how to visualize the information (visualization phase), either in graphical or tabular form, on a computer screen or on hand-held devices.
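As an illustration of how the time-related efficiency measures described above could be computed from task logs, the following minimal sketch derives start delay, completion time, throughput, and utilization from a few task records. The `TaskRecord` fields and the sample values are hypothetical, and the utilization figure assumes a single worker that handles one task at a time.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TaskRecord:
    """One BDA task; times are in seconds from an arbitrary origin (hypothetical log format)."""
    submitted: float
    started: float
    finished: float
    records: int  # number of records processed by the task

def start_delay(task: TaskRecord) -> float:
    # Interval between submitting the task and the start of processing.
    return task.started - task.submitted

def completion_time(task: TaskRecord) -> float:
    # Time needed to complete the request, measured from submission.
    return task.finished - task.submitted

def observation_window(tasks: List[TaskRecord]) -> float:
    return max(t.finished for t in tasks) - min(t.submitted for t in tasks)

def throughput(tasks: List[TaskRecord]) -> float:
    # Records completed per second over the observed window.
    return sum(t.records for t in tasks) / observation_window(tasks)

def utilization(tasks: List[TaskRecord]) -> float:
    # Share of the window spent processing; assumes one worker and no overlapping tasks.
    busy = sum(t.finished - t.started for t in tasks)
    return busy / observation_window(tasks)

tasks = [
    TaskRecord(submitted=0.0, started=1.0, finished=4.0, records=300),
    TaskRecord(submitted=2.0, started=4.0, finished=6.0, records=200),
]
print([start_delay(t) for t in tasks])
print([completion_time(t) for t in tasks])
print(round(throughput(tasks), 2), round(utilization(tasks), 2))
```

Keeping the submission, start, and finish timestamps separate makes it possible to report each of the response-time definitions listed in Table 3 from the same log.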

Performance Contributing Factors
Processes such as the BDA process do not function in isolation. Instead, their performance depends on several factors, including people with specific skills, the policies and procedures that govern them, the technology that enables them, and an enabling work environment. These factors are explained in Table 5 [49].

THE RESEARCH METHODOLOGY
The methodology followed in this paper consists of three steps: reviewing the related work, proposing the initial model, and conducting a pilot study (expert review and survey).
The existing and related literature was reviewed to understand the background of the study, highlighting its significance and its theoretical foundations. The focus was then placed on discerning the BDA process, examining process methodologies, relating them to the nature of BDA, and ascertaining what constitutes the BDA process in terms of its phases and their descriptions. Afterwards, the literature review investigated existing performance measurement frameworks and models and presented those related to the topic addressed in this paper. Finally, as the result of the literature review, a categorization of performance measures and their definitions was presented. The review concluded with an illustration that outlines the identified performance measures and their corresponding metrics and indicators. Next, an initial model was proposed based on the related work.
A questionnaire was prepared and distributed to the experts and to potential respondents involved in BDA. The questionnaire was sent to four experts in the area of BDA in order to conduct a content validity test. Two experts had industry experience; the others had academic backgrounds related to BDA. One of the experts was female and the rest were male. A content validity test is a subjective assessment of the suitability of the research instrument by subject-matter experts. It ensures that all relevant content is included and irrelevant content is excluded. The experts were asked to rate the relevance of the survey items related to the initial model, which was also presented to them. The survey consisted of 49 items, excluding demographic questions. The experts' feedback and comments were recorded in a table, their responses were analyzed, and survey items were added, deleted, and modified accordingly, which finally reduced the survey to 41 items.
For the reliability test, this research employs composite reliability (CR) because it is an estimate of a construct's internal consistency and takes into account that indicators have different loadings [38]. The following is the formula used for CR:
\[ CR = \frac{\left(\sum_i \lambda_i\right)^2}{\left(\sum_i \lambda_i\right)^2 + \sum_i \delta_i} \]
where \lambda_i is the factor loading of indicator i and \delta_i is its measurement error (error variance). The values for CR are interpreted as follows. Values above 0.70 are desirable for exploratory research. Values above 0.80 or 0.90 are desirable in more advanced stages of research. Values below 0.60 indicate a lack of reliability (Nunnally and Bernstein 1994, cited in [38]).
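The CR calculation can be sketched in a few lines, assuming the standard form CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of measurement errors). The sketch further assumes standardized loadings, so that each measurement error is taken as 1 minus the squared loading; the example loadings are made up for illustration and are not the study's data, and the "borderline" label for the 0.60 to 0.70 band is a naming choice of this sketch.

```python
from typing import Sequence

def composite_reliability(loadings: Sequence[float]) -> float:
    """CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances).

    Assumption: loadings are standardized, so each error variance is 1 - loading^2.
    """
    squared_sum = sum(loadings) ** 2
    error_variance = sum(1 - loading ** 2 for loading in loadings)
    return squared_sum / (squared_sum + error_variance)

def interpret_cr(cr: float) -> str:
    # Bands follow the thresholds quoted in the text; the 0.60-0.70 band
    # is not classified there and is labelled "borderline" here.
    if cr < 0.60:
        return "lack of reliability"
    if cr <= 0.70:
        return "borderline"
    if cr <= 0.80:
        return "desirable for exploratory research"
    return "desirable for more advanced research"

example_loadings = [0.7, 0.8, 0.9]  # hypothetical indicator loadings
cr_value = composite_reliability(example_loadings)
print(round(cr_value, 3), interpret_cr(cr_value))
```

Because the numerator squares the sum of the loadings, indicators with higher loadings contribute more to CR, which is the property that motivates its use when loadings differ.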
A survey for the pilot study was conducted with BDA practitioners. Data was collected through an online Google Form sent by email and through printed copies of the survey delivered physically to the respondents. A total of 22 fully completed surveys were returned. All responses were combined in Google Forms and then exported to an Excel file. Afterwards, the necessary changes were made, for example replacing every "Strongly Agree" with "5" (because Google Forms stores the labels of the responses rather than the corresponding numbers). Finally, the Excel file was loaded into the SmartPLS software, and the reliability test of the constructs was performed, as explained in the results section below.
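The recoding step described above can be sketched as a small script. Only the mapping of "Strongly Agree" to 5 is stated in the text; the other four labels, the column names `EFF1` and `EFF2`, and the sample rows are assumptions made for illustration, and CSV text stands in for the exported spreadsheet.

```python
import csv
import io
from typing import Dict, List, Union

# Assumed 5-point scale; only "Strongly Agree" -> 5 is given in the text.
LIKERT_CODES: Dict[str, int] = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}

def encode_responses(csv_text: str) -> List[Dict[str, Union[int, str]]]:
    """Replace response labels with their numeric codes, leaving other values unchanged."""
    reader = csv.DictReader(io.StringIO(csv_text))
    encoded = []
    for row in reader:
        # strip() tolerates stray spaces around a label in the exported file.
        encoded.append({column: LIKERT_CODES.get(value.strip(), value)
                        for column, value in row.items()})
    return encoded

# Hypothetical export with two survey items.
sample_export = "EFF1,EFF2\nStrongly Agree,Agree\nNeutral, Strongly Agree\n"
coded_rows = encode_responses(sample_export)
print(coded_rows)
```

Doing the replacement with an explicit lookup table keeps the coding rule visible and repeatable when a larger sample is collected.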

THE PERFORMANCE MEASURES FOR BDA PROCESS
The performance measures denote the basic components that participate in the model development. Every measure contains specific elements called metrics. In addition, indicators provide more details about how the performance measurement applies to the selected domain. The details of the BDA process performance measures, metrics, and indicators are shown in Figure 2.

Content validity Test
As mentioned in the methodology section, an expert review was conducted with four experts. Their feedback is summarized in Table 6.
Availability, suitability, volatility, and maturity were rated relevant. One of the 5 original items was excluded.
Competence: There were 6 items for competence, and all of them were seen as necessary and relevant (refer to Figure 3).
Work Conditions: 5 items were presented to the experts; 3 were endorsed and 2 were excluded.
Compliance: The construct had 5 items. Based on the experts' judgment, all of them remained relevant.

The reliability test of the model constructs
Reliability means that the measurements are free from error and therefore yield consistent results. After running the pilot study data in SmartPLS 3.0, the CR test was found to yield acceptable results. The generated values range from 0.681 to 0.927. Table 5 shows the CR test of the pilot study for this research. The reliability test yielded satisfactory scores based on the suggested threshold. One exception is the value of 0.681 for flexibility, which is less than 0.70. However, there are arguments that this range is still acceptable [39]. The value for competence (0.927) is seemingly high but also within an acceptable range. Very high scores above 0.950 are said to be more suspect than those in the middle alpha ranges, as they may involve common method bias [40].

PERFORMANCE MEASUREMENT MODEL FOR BDA PROCESS
The initial model represented in Figure 3 consists of five variables selected from the seven factors mentioned earlier in this paper. The selection is based on the experts' comments, the survey results, and the rationale of how different factors can join together and interact meaningfully. The selected factors are technology, competence, work conditions, efficiency, and effectiveness. The essence of these factors lies in their relationship, which is represented in the following sequence. First, put in place the competence required to perform BDA tasks and activities. Then equip the staff, given their competence, with the tools required to perform the job. Make sure that the staff are happy and motivated. Then assess performance using efficiency measures, which focus on how the BDA process is functioning internally. Finally, evaluate users' satisfaction based on effectiveness measures.
Another perspective on the model components is that they are divided into two parts: global and local measures. The global measures are technology, competence, and work conditions. The global measures, or factors in statistical terms, have holistic effects on all activities in the BDA process. Consider, for example, the need for technological tools in data acquisition, data preparation, data analysis, visualization, and interpretation. Competence and good working conditions are likewise indispensable for every small piece of work performed along the chain of big data analytics. The local measures are efficiency and effectiveness. These measures are not as holistic as the global measures; rather, they are measures for individual BDA process phases.
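The split between global and local measures can be pictured as a simple data structure in which the global factor scores are shared by every phase while efficiency and effectiveness are recorded per phase. This is a hypothetical sketch: the `build_scorecard` helper and all scores (on an assumed 1 to 5 scale) are illustrative and do not come from the pilot study.

```python
from typing import Dict

PHASES = ["acquisition", "preparation", "analysis", "visualization", "interpretation"]
GLOBAL_FACTORS = ["technology", "competence", "work conditions"]
LOCAL_MEASURES = ["efficiency", "effectiveness"]

def build_scorecard(global_scores: Dict[str, float],
                    local_scores: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Attach the shared global scores to every phase, then add that phase's local scores."""
    scorecard: Dict[str, Dict[str, float]] = {}
    for phase in PHASES:
        entry = {factor: global_scores[factor] for factor in GLOBAL_FACTORS}
        for measure in LOCAL_MEASURES:
            if measure in local_scores.get(phase, {}):
                entry[measure] = local_scores[phase][measure]
        scorecard[phase] = entry
    return scorecard

# Hypothetical scores on a 1-5 scale.
global_scores = {"technology": 4.0, "competence": 3.5, "work conditions": 4.2}
local_scores = {
    "acquisition": {"efficiency": 3.8},
    "visualization": {"efficiency": 4.1, "effectiveness": 3.9},
}
scorecard = build_scorecard(global_scores, local_scores)
for phase, scores in scorecard.items():
    print(phase, scores)
```

In this representation a change to a global factor is visible in every phase at once, whereas a local measure only affects the phase it was recorded for.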

CONCLUSION AND FUTURE WORK
Research on big data has many directions, including big data analytics (BDA), big data infrastructure, and transformation and impact [41]. BDA, in turn, involves the data, the tools and techniques for data processing and analytics, and, most importantly, the process that connects everything together.
There are different process methodologies for different artifacts of data science, and big data, as an advance in data science, has its own processes and ways of organizing things. Nevertheless, the step-by-step process of performing big data analytics and the process of managing and coordinating a big data project are two different concepts. The first is a durable, repeatedly executed BDA process for extracting knowledge and insights. The second is a temporary endeavor for coordinating teams, time, and resources to address an issue of concern.
Because the BDA process, as mentioned above, produces the knowledge and insights that businesses need, efforts to improve and optimize this process have a sound justification. Performance measurement is therefore an important topic to embark on. The review of existing performance models, frameworks, and other performance literature provided a number of performance measures and performance contributing factors. The identified performance measures and contributing factors lead to the understanding that the BDA process requires a specific set of skills to perform BDA-related activities, technology that enhances BDA process execution, a supportive work environment, and ways of measuring performance to spot deficiencies and exploit opportunities.
The reliability test showed that the performance measures and factors have sound reliability scores. Future work will focus on evaluating the proposed model with a larger sample of the BDA community.
Theory-driven or process-driven prediction? Epistemological challenges of big data analytics. Journal of Big Data, 4(1), 19.