RiskTree: Decision Trees for Asset and Process Risk Assessment Quantification in Big Data Platforms

Big data is characterized by its voluminous scale, varied formats, and swift processing velocity. These intrinsic characteristics undermine the efficacy of conventional data security techniques and data management standards, thereby compromising the security of big data. As a consequence, big data is susceptible to security incidents, including unauthorized data access, data manipulation, and data compromise throughout the transmission, storage, and processing stages. Conventional information system security risk assessment methodologies are constrained by human resources and computational techniques, rendering them unsuitable for direct application to big data platforms. Consequently, there is an urgent need to develop a risk assessment framework tailored specifically to big data environments, capable of quantifying potential risks and losses. In response to this need, we devise an automated risk assessment theory that combines the unique characteristics of big data with traditional quantitative methods, introducing a risk metric system suited to the big data context. Utilizing the risk-related data generated during operations on the big data platform, we train a decision tree model to derive the weight of each risk indicator. These weights are then employed to compute a weighted sum of the operational risk indicators, thereby achieving a quantitative evaluation of the platform's risk profile. To substantiate the proposed framework, experiments were conducted on a simulated big data platform. The experimental outcomes demonstrate that, compared with existing quantitative risk assessment methodologies, our approach enables an automatic, objective, and efficient assessment and quantification of the risks associated with tangible assets and data processing operations within the big data platform.


Introduction
The rapid advancement of information technology in recent years has facilitated the widespread integration of big data across various sectors of society. Big data technologies have shown immense potential in diverse applications, such as commodity recommendation systems and decision analysis. However, the adoption of these technologies also brings along security concerns that have gained prominence. Many big data applications have adopted open-source platforms and technologies, which were initially designed for use within secure and trusted internal networks. The focus during subsequent developments of big data software has predominantly been on performance, leading to inadequate considerations for overall security planning. Additionally, the proficiency in technology and management among enterprises offering big data-related services varies significantly. These factors, compounded, have resulted in a surge of security incidents related to big data in recent years. Notably, major internet companies like Google and Facebook have encountered high-profile instances of user data breaches, highlighting the critical importance of implementing effective security measures to address the risks associated with big data platforms.
With the open nature of Internet technology, many companies rely on different open-source projects to construct diverse types of big data platforms serving various functions. The provision of big data services built upon complex, open distributed computing and storage architectures presents substantial challenges to conventional authentication, access control, and security auditing mechanisms [1]. Traditional data protection methodologies often prove inadequate in addressing the escalating security demands associated with ever-increasing volumes of data [2]. To address this diverse landscape, Wu et al. [3] utilize the Delphi method to develop a comprehensive security evaluation index system specifically designed for safeguarding enterprise big data. In a similar line of research, Zhu et al. [4] introduce information entropy to determine the weight of each risk index and incorporate the fuzzy comprehensive evaluation method to quantify the privacy risks associated with social networks, enabling the assessment and prediction of privacy risks. Furthermore, various nations have put forth legislative measures and regulatory frameworks with the aim of safeguarding the integrity and confidentiality of user data. For instance, Europe has implemented the General Data Protection Regulation (GDPR), the United States has enacted the Sarbanes-Oxley Act, and China has implemented the Cybersecurity Law of the PRC and the Personal Information Protection Law of the PRC. This highlights the urgent need for enhancing the existing big data security standards. Consequently, in such an environment, the task of quantitatively assessing the risks associated with assets and data processing operations on big data platforms has become a critical issue that requires urgent resolution. Establishing a rational and effective quantitative risk assessment mechanism will enable enterprises and governmental departments to measure the potential risks of big data platforms more precisely. This, in turn, will facilitate the development of targeted risk mitigation strategies, thereby reducing the risks associated with big data platforms to an acceptable level. This research investigates the creation of risk appraisal theories, methodologies, and technologies explicitly suited to the big data scenario to resolve the noted concerns. It scrutinizes potential dangers to big data infrastructures originating from diverse facets and offers a comprehensive exploration of the risk quantification and appraisal procedures. The contributions of our study are as follows: 1. Capitalizing on prevalent risk evaluation theories and merging them with the properties of big data, we put forth a quantitative risk assessment theory tailored for big data. Subsequently, an assessment paradigm for the big data platform's assets and data processing procedures is developed. By adopting this paradigm, enterprises have the potential to enhance the efficacy of risk detection mechanisms within big data platforms. 2. Proceeding from the presented theoretical structure, we devise an assessment index structure designated for the assets of the big data platform and the data processing procedures within it. These index structures offer an explicit and actionable course of action for the risk assessment of big data frameworks. 3. In order to affirm the validity of our theoretical proposition, we establish a simulated big data platform.
This platform is employed to verify the proposed scheme. Experimental outcomes attest that, owing to superior machine learning prowess, our recommended approach is more objective, offering significantly decreased labor and temporal expenditures. Furthermore, our approach is highly versatile across changing scenarios and facilitates continuous iterations, updates, and refinements.
The structural composition of this document is outlined in the following manner: Section II introduces background and prior works pertinent to risk quantification. Section III elucidates the risk quantification assessment framework and the associated risk indexes tailored for big data platforms. An examination comparing the efficacy of our proposed framework is presented in Section IV. We conclude with summarizing insights for prospective research directions in Section V.

The term big data first appeared in the 1980s, coined by renowned futurist Alvin Toffler in his book "The Third Wave," where he hailed big data as the most splendid symphony of the third wave. This concept was initially aimed at describing the rapid growth of massive amounts of data in various industries and the challenges and opportunities these data presented. However, as technology has evolved and been applied, the notion of big data discussed in contemporary circles has transcended Toffler's original scope. The McKinsey Global Institute, in its report "Big Data: The Next Frontier for Innovation, Competition, and Productivity," states that "big data" refers to datasets whose size exceeds the capacity of typical database software tools to capture, store, manage, and analyze [5]. In today's information era, big data refers not only to the sheer volume of data but also includes the velocity at which data is generated, the diversity of data types, and the challenge of extracting valuable information from it. These characteristics are known as the 4Vs of big data [6][7][8][9].
1. Volume refers to the massive amounts of data that are measured in petabytes or even exabytes in a big data environment. In contrast, traditional data often occupies a much smaller space, typically measured in megabytes or gigabytes. 2. Velocity signifies the rapid speed of data generation and processing. Many big data scenarios require real-time or near-real-time data processing and analysis, while traditional data is produced at a comparatively slower rate and is often processed periodically or in batches. 3. Variety alludes to the wide range of data types, including structured data (e.g., tables in relational databases), unstructured data (e.g., text, images, audio, and video), and semi-structured data (e.g., XML and JSON). On the other hand, traditional data is mostly structured data found in relational databases. 4. Value indicates that big data has a low value density, meaning that significant data mining and analytical efforts are required to extract valuable information. In contrast, traditional data typically has a higher value density, and its contents, often pre-processed and cleaned, are more readily utilized for analysis and application.
With the widespread adoption of computer technology, the Internet, the Internet of Things (IoT), social media, and mobile devices, the velocity and volume of data generation are experiencing exponential growth. Against this backdrop, big data technologies have become a pivotal factor in driving societal progress and economic growth. By leveraging big data analytics and mining, businesses and government entities can uncover potential business value, enhance decision-making efficiency, optimize resource allocation, and strengthen their competitive edge.
In big data platforms, data assets refer to data legally owned or controlled by organizations (e.g., government agencies, enterprises, and institutions) that are recorded electronically or otherwise. This data can come in various forms, such as text, images, voice, video, web pages, databases, and sensory signals, and can be structured or unstructured. These entities can be measured or traded and can directly or indirectly generate economic and social benefits. However, not all data within an organization constitute data assets: data assets are data that can create value for an organization, and their formation requires proactive management and effective control [10]. By comparison, big data assets refer to the large, complex, and diverse data sets that organizations in a big data environment legally own or control. Big data assets encompass not only structured data (e.g., tabular data in databases) but also unstructured data (e.g., text, images, voice, video, web pages, etc.), as well as semi-structured data (e.g., XML and JSON formats).
The data processing procedures in a big data platform can be divided into six stages: collection, transmission, computation, storage, application, and destruction [11], as shown in the figure. In the data collection stage, the platform collects user data and internet data through tracking programs embedded in external terminal devices and web crawlers. It also collects relevant business data through internal databases, file systems, and logs. The collected raw data undergoes preprocessing operations such as cleaning, integration, and transformation. During the data transmission stage, the collected data is aggregated into the platform's storage system through protocols like SSL, HTTPS, and FTP, and transmitted through the internet and internal transmission links. In the data storage stage, the platform employs distributed storage techniques to shard and store a large amount of structured and unstructured data across multiple devices [12]. Using protocols like ZAB, Paxos, or Raft, a primary node is selected from the cluster to coordinate data backup, consistency checks, and data migration tasks among the storage devices, ensuring data availability, integrity, and load balancing. Another device is designated as the backup primary node, responsible for recording storage logs for data recovery and platform security audits. The backup primary node is set up to improve system availability: in case the primary node becomes unavailable, the backup primary node can quickly transition to the primary role and take over its responsibilities, preventing system crashes due to primary node failures. The remaining storage devices function as secondary nodes and perform various data storage tasks under the management of the primary node [13]. In the data computation stage, the platform utilizes distributed computing techniques. Similar to distributed storage, it selects primary nodes, backup primary nodes, and secondary nodes to handle computation tasks. Platform users submit computation programs through clients, and the processing system retrieves data from the storage system in chunks for computation. The primary node divides complex computation tasks into multiple subtasks and assigns them to several secondary nodes. The data processing is divided into three stages: Map, Shuffle, and Reduce [14]. In the data application stage, the platform shares or exchanges data with external users, programs, and systems. Additionally, it provides services for data collection, storage, analysis, visualization, and more. In the data destruction stage, the platform follows relevant national regulations and designates authorized personnel to irreversibly destroy the data stored in storage media and backups to prevent data leakage.
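To make the Map, Shuffle, and Reduce stages referenced above concrete, the following minimal Python sketch runs the same pattern locally on a toy word-count task. It is purely illustrative: the task, function names, and single-process execution are our assumptions and do not represent the platform's actual distributed computation code.

```python
# A minimal, self-contained sketch of the Map -> Shuffle -> Reduce pattern,
# run locally instead of on an actual Hadoop/Spark cluster.
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs for every input record.
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle_phase(pairs):
    # Shuffle: group all values that share the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the grouped values per key.
    return {key: sum(values) for key, values in grouped.items()}

if __name__ == "__main__":
    records = ["user login success", "user login failure", "user logout"]
    print(reduce_phase(shuffle_phase(map_phase(records))))
    # {'user': 3, 'login': 2, 'success': 1, 'failure': 1, 'logout': 1}
```

In a real deployment, the primary node would assign the map and reduce subtasks to secondary nodes, and the shuffle would move intermediate results across the network.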
In computer-related domains, risk assessment is a process that identifies potential vulnerabilities or threats in computer systems or networks before or during the occurrence of risk incidents. It uses qualitative or quantitative methods to evaluate security and provides a scientific basis for implementing security measures to mitigate risks. Risk assessment decomposes the overall risk of a system into various components, including hardware, software, data, and business aspects. In quantitative risk assessment, mathematical and statistical methods are further employed to measure the probability and impact of risks. In today's highly digitized business environment, risk assessment has become an important method or tool to help organizations ensure the security of their technological infrastructure. Through systematic security risk assessment, businesses can gain a better understanding of the security status of their information technology infrastructure and implement appropriate security measures to minimize potential risk incidents and losses. Various risk assessment methods have been proposed in practical research, such as the Delphi method, event tree analysis, the analytic hierarchy process, the fuzzy comprehensive evaluation method, the entropy weight coefficient method, CVSS-based quantitative assessment, the AHP-BPGA model, Markov models, enhanced fault tree analysis, and more. These methods provide frameworks and techniques for assessing and quantifying risks in computer systems and networks.

Related Work
In the realm of information system risk assessment, a risk evaluation procedure consists of identifying system vulnerabilities, potential threats, and the consequent losses stemming from these vulnerabilities and threats. Current research in the field can be divided into three distinct streams: the risk assessment index system, the risk assessment model, and the vulnerability scoring system.
Within the risk assessment index system stream, existing scholarly work concentrates on delineating risk indicators and building a robust risk index framework by analyzing the threats facing the targeted information system. For instance, Peng et al. [15] proposed a data dissemination process model based on risk factors. They selected risk indicators for the data dissemination process and used the Delphi method to screen and define these indicators for big data transmission operations. They then utilized the Analytic Hierarchy Process (AHP) to derive the importance of each indicator. Similarly, in reference [16], a qualitative analysis of privacy risk factors in social networks under the context of big data was conducted. The Delphi method was employed to construct an evaluation system, and the weight of the indicators was calculated using information entropy measurements. The fuzzy comprehensive evaluation method was applied to quantitatively assess and predict the privacy risks of social networks. Likewise, Zhao et al. [17] adopted a fuzzy assessment approach for gauging the likelihood and ramifications of risk events, thereafter utilizing an entropy weight coefficient method to appraise the contribution of every risk determinant towards the holistic risk assessment. Besides, Lu et al. [18] embarked on a quantitative risk exploration for industrial control systems, refining the AHP by introducing the fuzzy AHP to mitigate the challenges pertaining to judgment matrix consistency. Quantitative risk index systems furnish an intuitive view of the potential hazards associated with the system in question, thereby facilitating an all-encompassing scrutiny of risk elements. Nevertheless, the majority of the prevailing methodologies are underpinned by subjective data sources, such as surveys or expert analyses. This reliance potentially introduces a degree of subjectivity into the quantification process, which may, in turn, impinge upon the precision and trustworthiness of the outcomes.
As for the risk assessment model, current research undertakes the extraction of risk factors from information systems, constructs models, and transforms risks into model variables for analysis in order to achieve risk quantification. In reference [19], the fault tree analysis method was employed to calculate the risk values associated with information system program interchange, remote attacks, and risks in the network. Subsequently, based on the risk assessment factors and model design, a risk assessment system was developed and implemented in a network environment. For instance, Zhang et al. [20] presented a quantitative risk appraisal approach centered on host system security via vulnerability scanning. Their methodology involves constructing a vulnerability association graph of the host system to facilitate quantitative risk evaluation. Similarly, Xie et al. [21] proposed an attack tree model that scrutinizes the threat vectors of each leaf in the attack tree, enabling the computation of the threat vector of the complete attack path to derive the risk value. In another line of research, Zhang et al. [22] proposed a fuzzy radial basis function neural network model to numerically process network security risk factors and derive risk levels. Nan et al. [23] proposed a security risk analysis model that constructs a Bayesian network to calculate the likelihood and severity of security incidents. They then utilize an ant colony optimization algorithm to compute the risk propagation path. Developing further, Li et al. [24] proposed an Analytic Hierarchy Process-Genetic Algorithm Back Propagation model that sets the structure of the BP neural network according to the provided risk index system structure and adjusts the parameters of the neural network with a genetic algorithm. Dacier et al. [25] established an IT system vulnerability privilege graph model, converting it into a Markov chain to assess risks quantitatively across diverse attack scenarios. Patel et al. [26] proposed an enhanced fault tree method to quantitatively assess the impacts of vulnerabilities and threats on information systems.
While modeling the target system can yield authentic and objective risk data, the establishment and analysis of quantitative models may become proportionally complex as the system's size and complexity increase. The prevailing vulnerability scoring systems lay down preset risk indicators and evaluation scales for vulnerabilities, deploying these scoring methodologies to quantify the threat level posed by vulnerabilities. In the current scientific and operational landscape, widely endorsed vulnerability scoring models include the Common Vulnerability Scoring System (CVSS) [27], Threat Assessment and Remediation Analysis (TARA) [28], and the Common Weakness Scoring System (CWSS) [29]. CVSS divides vulnerability risk indicators into three categories: base, temporal, and environmental. These categories are used to quantitatively assess and summarize the intrinsic characteristics of vulnerabilities, the characteristics that change over time, and the characteristics displayed in the user environment, respectively. The TARA assessment model focuses on threat assessment for selected network assets and analyzes the mitigation measures for network risks. The CWSS assessment system categorizes weakness indicators into three groups: base finding, attack surface, and environmental. As the information about weaknesses becomes more refined, the weakness scoring becomes more accurate. The vulnerability scoring model proposes a systematic evaluation index and calculation formula for system vulnerabilities. It enables comprehensive and accurate analysis of vulnerabilities and plays a crucial role in measuring and numerically expressing the severity of specific system vulnerabilities. However, the threat sources for information systems are not limited to vulnerabilities alone, and the quantification of risk for the entire information system should not be confined to vulnerability scoring.
In summary, the mainstream thrust of risk assessment research is devoted to selecting risk indicators and assembling a risk index system, grounded in the examination of threats confronting the target information system. The techniques hitherto employed for risk quantification depend substantially on subjective data sources, a factor that potentially introduces biases and impinges upon the accuracy and trustworthiness of the resultant assessments. Additionally, the process of system modeling, which can be integral to risk analysis, poses significant difficulties, especially as the scope and complexity of the system grow. These circumstances underscore the necessity of developing an automated and standardized quantitative risk assessment framework that caters to the demands of quantitative risk assessment for big data platforms. To enhance the compliance and reliability of the risk quantification process on big data platforms, we referenced the national standard GB/T 35274-2017, "Information Security Technology - Security Capability Requirements for Big Data Services," to help enterprises better identify and manage security risks at different stages of data processing on big data platforms. Based on our understanding of this standard, we divided the risk assessment of big data platforms into the risk quantification of data assets and of data processing processes. In the quantification of data asset risks, we primarily consider potential risks related to data and system assets and to organizational and personnel management. For the risk quantification of data processing processes, we referred to the standard's classification of the data lifecycle and conducted threat modeling for each data processing stage to analyze the potential risks within it. If one wishes to design a risk assessment process based on other big data security standards, the risk assessment process for big data platforms must be redesigned according to those standards' requirements for the security capabilities of big data platforms, as well as their different divisions of big data platform assets and data processing processes.

Workflow
The proposed procedure for quantitative risk assessment applied to big data platforms follows a structured approach, as outlined in Figure 1. This approach can be synthesized into the following key steps: 1. Preparation: First, an enterprise must undertake a detailed analysis of its big data platform in order to clarify the target and scope of the assessment. 2. Risk Identification: As a cornerstone of the risk assessment procedure, this step encompasses crucial elements such as: (a) Data Assets Identification: This involves spotting crucial data assets within the confines of the big data platform and determining their exclusive attributes, including their value, origin, storage location, access rights, sensitivity, correlation, and others, thereby making them subject to risk analysis.
(b) Data Processing Procedures Modeling: Data processing procedures within the big data platform are dissected into six distinct stages: data acquisition, transmission, storage, computation, exchange, and destruction. This process involves crafting separate data processing models corresponding to each stage and linking them to other variables such as hardware, data processing technology, personnel, etc. 3. Risk Quantitative Analysis: At this stage, it is essential to establish an overall view of the quantitative likelihood of risk events occurring, the impacts they may cause, and the organization's tolerance towards these risks. This will facilitate the development of a risk indicator system. 4. Risk Assessment: Upon the establishment of the risk index system and the calibration of the weights for the risk indices, an effective and systematic evaluation of the risks pertinent to the big data platform can be initiated. This evaluative process leverages the risk index system to quantify potential vulnerabilities and threats that the platform might encounter. The subsequent phase involves computing the aggregate risk value of the big data platform, which is achieved by assigning the predefined weights to their respective risk indices. These weighted indices combine to yield an overall numerical risk value, which reflects the level of risk of the big data platform.

Preparation
First, an enterprise conducts an extensive analysis of the data assets and data processing procedures of its big data platform to stipulate the aims and scope of the assessment. The evaluation's primary objectives include gaining a deep understanding of the potential risks that accompany big data platforms, ascertaining whether suitable risk management strategies are in place, and verifying adherence to pertinent laws and regulations. The analysis covers the big data assets within the enterprise, the different stages of data processing, services offered by external data vendors, and data that has been shared with third-party entities. Each of these facets falls within the scope of this comprehensive risk analysis. In addition, the enterprise also needs to select specific quantitative risk assessment methodologies and tools. Such selection paves the way for effective risk identification, risk analysis, and evaluation.

Data Assets Identification
User profiling, also known as user tags or user profiles, is a multidimensional description of each user that includes their behavior, interests, preferences, and other characteristics. It is constructed through the collection and analysis of user data. User profiles play a significant role in e-commerce big data platforms as they help businesses gain a deeper understanding of their customers, enabling them to provide more accurate and personalized services and products [30]. User profiling is a quintessential service incorporated within big data platforms and is divided into four principal stages: data acquisition, data storage and computation, model training and forecasting, and the graphical representation of user profiles [31]. In this section, the user profiling service is scrutinized, exemplifying the multifaceted nature of the service and highlighting the critical involvement of personnel at each juncture of the service.
During the data acquisition phase, the responsibility centers on the real-time collection, consolidation, and distribution of colossal volumes of log data. This stage leverages big data assets such as data procurement servers and tools exemplified by Flume and Kafka, which are instrumental in the seamless acquisition of data. Transitioning to the data storage and processing phase, this step employs data consolidation servers in tandem with storage tools like HDFS, which are tasked with the stewardship and handling of substantial big data sets. Databases armed with management systems such as MySQL or MongoDB are also pivotal in this phase, offering persistent data storage, aptitude for complex queries, and the overarching governance of the dense data amassment. The utility of Docker comes into play for packaging and deploying applications, thus streamlining the computation and scrutiny of big data within this sphere. The computational analysis servers emerge as key contributors by provisioning the requisite computational resources indispensable for executing model training, prognostication, and various analytical functions. Supplementary big data assets, encompassing load-balancing servers, operating systems, personal computers, network switches, routers, and an array of interfaces, interweave into the architecture, enhancing both the efficiency and constancy of the user profiling service. In the computation phase, which is primarily dedicated to model training and prediction, essential big data assets include the Spark MLlib library designed for machine learning, computational analysis servers that perform the heavy lifting for calculations, data storage servers that house the processed information, and database servers that facilitate structured data management. These components work in unison to fulfill the model training and prediction requirements, as well as to manage the computational demands of the system. For the visualization of user profiles, an array of big data assets is deployed. Web servers provide the necessary infrastructure for hosting web applications, while web service tools like Nginx ensure the efficient delivery of web content. Database servers running systems such as MySQL and MongoDB, coupled with data visualization tools like Tableau, serve the dual purpose of managing the data and presenting it in a user-friendly and interpretable format, enabling interactive user experiences. Safeguarding the confidentiality, integrity, and availability of user profiling services is achieved through the integration of security-centric big data assets. This suite includes web application firewalls to defend against online threats, bastion hosts to secure remote access, traditional firewalls to block unauthorized network traffic, and systems for user authentication and access control to ensure that only authorized personnel can interact with sensitive data. Additionally, configurations of log system platforms enhance the transparency and traceability of operations. From a personnel perspective, big data assets are not limited to technological resources. The operation and maintenance team, business technology specialists, security management personnel, and outsourced service providers are integral to the ecosystem. Their expertise, vigilance, and skills are indispensable for the smooth and secure functioning of the user profiling services. The operations team within the big data platform is responsible for maintaining and managing the daily operation of the platform, including hardware and software maintenance, security
vulnerability patches, system monitoring, and troubleshooting. Their work directly impacts the stability and availability of the big data platform system. The business and technical experts are tasked with understanding business requirements and designing and developing appropriate data analysis and processing workflows to ensure that the analytical results meet business needs. They provide accurate and effective data for big data business decision-making, directly impacting the accuracy and effectiveness of data processing and analysis. Security management personnel are responsible for formulating and implementing security strategies for the big data platform, monitoring security events, and identifying and addressing security threats. Their role is crucial in protecting big data assets from malicious attacks and data breaches. Outsourced service providers are tasked with some of the operations, data processing, or security management work within the big data platform. In some cases, they are directly involved in the platform's operation, and the quality of their work and the level of their security measures directly affect the platform's business security.
In conclusion, this paper aligns with the asset risk assessment standards for big data platforms, referencing documents such as GB/T 37550-2019. It sifts through the aforementioned big data assets related to user profiling services to forge a quantitative assessment index system. This system consists of five primary indicators and twenty-eight secondary indicators, as depicted in Figure 2, enabling a nuanced evaluation of the asset risks specific to e-commerce big data platforms.

Data Processing Procedures Modeling
As delineated in Section 3.3, the data processing procedures within a big data platform are stratified into various stages, including data acquisition, data storage, data computation, data transmission, data sharing, and the deletion or destruction of data; the threat model of these data processing procedures is shown in Figure 3.
Within the data processing workflow, vulnerabilities may arise during the data acquisition phase of big data platforms. Collection devices are frequently limited in their processing capabilities, constraining them to rudimentary preprocessing of raw data and preventing comprehensive data validation and filtering. This limitation significantly increases the likelihood that sophisticated malicious data will evade the initial defense mechanisms undetected, infiltrating the platform and compromising data integrity and availability. Therefore, based on the analysis above, data collection security is considered as one of the risk quantification indicators. Under this indicator, there are two sub-indicators: inadequate data preprocessing and security risks related to data collection devices.
During the data transmission phase, big data platforms are susceptible to a multitude of security risks. To elaborate, instances may arise where big data platforms employ less secure transfer and authentication protocols, particularly for services that are deemed non-critical. This choice may reflect a trade-off between security stringency and system performance. Likewise, data may be selectively encrypted, or the integrity of data transmission may be verified only to a limited extent, aiming to minimize overhead and improve performance metrics. Nonetheless, such selective measures might leave the door ajar for security breaches. In the broader scheme of things, the complexity and variability of the transmission environment itself can spawn its own set of complications. These could potentially disturb the stability and reliability of data transfer links, thus impeding the regular flow of data within the platform. Therefore, based on the analysis provided, data transmission security is considered as one of the risk quantification indicators. This indicator includes four sub-indicators: data collection transmission risk, storage data transmission risk, exchange and sharing data transmission risk, and business data transmission risk. Additionally, abnormal detection of devices and transmission links is also included as a risk quantification indicator, under which there are five sub-indicators: lack of abnormal detection mechanisms for data collection, communication, storage, processing, and business devices.
During the process of data sharing and exchange in a big data platform, data flows between the platform's databases, file systems, and external users, programs, and systems. The data passes through various transmission links where the transmission and authentication protocols may have vulnerabilities. To prevent privacy breaches, data needs to undergo processes such as anonymization and desensitization before being published. However, data anonymization can have an impact on data availability. Additionally, operations such as encryption and integrity checks on large amounts of data can be costly. Therefore, the platform may selectively encrypt and perform integrity checks on data after considering the trade-off between performance and security costs. The platform's auditing mechanism may have deficiencies, as it may only record the types, times, and contents of operations without recording the identities of the operators. Therefore, based on the analysis provided, data exchange and sharing management is considered as one of the risk quantification indicators. This indicator includes four sub-indicators: data not anonymized, improper setting of data exchange or sharing scope, lack of pre-release data review, and non-standard data import and export processes.

Big data platforms, reliant on their data storage systems, confront myriad security challenges that jeopardize data integrity, availability, and confidentiality. These vulnerabilities stem from (i) mechanical and hardware failures within storage nodes, risking significant data loss and operational continuity; (ii) authentication vulnerabilities during the integration of new nodes, exposing the system to potential malicious compromise; (iii) the need for robust data quality monitoring to prevent incorrect data storage, which can skew data analysis and decision-making; and (iv) the critical necessity for data protection measures, including encryption and user isolation, to prevent unauthorized access. Further exacerbating security concerns are (v) the exposure of user credentials, (vi) weaknesses in client authentication mechanisms, (vii) the importance of comprehensive training for authorization managers to handle access controls correctly, and (viii) the potential for disruption due to flawed election protocols for determining the storage system's main node. These areas outline a comprehensive landscape of security challenges within big data storage systems, necessitating vigilant and proactive measures to maintain data integrity, security, and operational efficacy. Therefore, based on the analysis provided, data storage security is considered as one of the risk quantification indicators. This indicator includes five sub-indicators: data not encrypted, data not classified into different levels, lack of data isolation, inadequate data backup, and outdated storage components. Additionally, data quality monitoring, data availability, and storage media management are also included as risk quantification indicators.
Addressing these security risks involves implementing comprehensive risk management strategies, including regular system and hardware maintenance, strengthening authentication processes, enhancing data protection measures, and ensuring thorough training for all personnel involved in managing the storage system. Additionally, continuous monitoring and regular security assessments can help identify and mitigate vulnerabilities, ensuring the resilience of the data storage system within big data platforms.
Figure 3 also reveals the complex security challenges within big data computing systems, attributing the risks to operational inefficiencies, compromised access policies, hardware malfunctions, and flawed system architectures. The absence of stringent quality checks for input data not only compromises computational integrity by introducing inaccuracies or malicious content but also propagates misinformation. Furthermore, inadequate assessment of processing tasks and outcomes can inadvertently expose confidential data, threatening data confidentiality. Vulnerabilities manifest from poor management of user and administrator credentials, allowing unauthorized access that jeopardizes sensitive data. Compromised client identity authentication processes further exacerbate these security loopholes, undermining both data integrity and confidentiality. Additionally, management staff's decision-making errors, stemming from insufficient or subpar training, can culminate in security breaches and system misconfigurations. The resilience of computational nodes and storage entities is tested through prolonged periods of intense computation, persistent data transactions, and operations, leading to elevated failure rates that impinge on the system's reliability and availability. Therefore, based on the analysis provided, data calculation security is considered as one of the risk quantification indicators. This indicator includes three sub-indicators: undetected input data, lack of review of computing tasks and output results, and outdated processing components. Additionally, user, program, and device authentication and access control are also included as risk quantification indicators.

Addressing these security risks requires a comprehensive approach that encompasses both technical and administrative strategies. Implementing rigorous input validation mechanisms and enhancing scrutiny of processing tasks can mitigate risks associated with operational processes. Strengthening access control measures, including robust password management practices and secure client authentication mechanisms, can help protect against unauthorized access. Furthermore, investing in training programs for management staff can reduce human error and improve decision-making processes. Ensuring regular maintenance and adopting fault tolerance strategies can alleviate hardware reliability issues, thereby enhancing the resilience of the data calculation system. Therefore, based on the analysis provided, management system, system maintenance, device and transmission link management, and log auditing are considered as risk quantification indicators.
Within the realm of big data platforms, the data application and dissemination process is fraught with several security vulnerabilities. A paramount concern is the lack of secure transmission or authentication protocols, which opens the possibility of unauthorized access and potential data breaches. Equally critical is the protection of privacy and integrity for publicly released data; insufficient measures in this regard could result in the unintended exposure of sensitive details and the distortion of data quality. Additionally, vulnerabilities in the auditing mechanisms, such as incomplete recording of operational metadata, undermine traceability and accountability, thereby heightening security risks as prohibited activities may remain undetected and unaddressed. These systemic shortcomings underscore the need for a comprehensive enhancement of data transmission security, privacy protection, and audit trail completeness to fortify the defense against the multifaceted threats inherent in big data platforms. Therefore, based on the analysis provided, protocol security is considered as one of the risk quantification indicators. This indicator includes three sub-indicators: authentication and access control protocol security risks, encryption protocol security risks, and node election protocol security risks. Additionally, log security is also considered as a risk quantification indicator, which includes three sub-indicators: unencrypted logs, inadequate log backups, and logs that are not recoverable.
During the data destruction process, misguided data destruction or insufficient management could result in the unwanted leakage of confidential data. If the destruction is incorrectly performed, residual data can be recoverable, posing a significant security risk. In addition to technical errors, inadequate training or management of the personnel involved in data destruction can also lead to data spillage. Upon examining these diverse threat vectors and vulnerabilities, the security of the big data platform is subdivided into two secondary risk indices, technical security and management security, covering requirements such as secure and responsible data disposal and an operational environment that fosters security as every user's responsibility.
By adopting the risk index system, organizations can better understand, manage, and mitigate the potential security dangers their big data platforms may face. This granular breakdown of risks also aids in designing tailored risk mitigation strategies, emphasizing the discrete needs of both the technical and management domains.

Risk Quantitative Analysis
An expert scoring system offers a methodology that draws upon the expertise of professionals with deep knowledge in the fields of data security, risk analysis, and big data technology to assess the severity and likelihood of potential security threats. In the current research environment, there are certain limitations on the transparency of expert risk assessment results. To address this gap, this study conducted in-depth research on the correlation between various indicators in real scenarios and their relationship with business contexts. Based on this research, a series of logical relationships was derived to simulate the generation of expert review data, providing effective data support for this study. To generate simulated expert review data, the data flow in the big data platform is divided into six stages: data collection, transmission, storage, processing, application, and disposal. The designed simulation program employs a multi-threaded concurrent approach to simulate the processing and forwarding of data by the collection, transmission, storage, processing, and business servers in the platform. The collection server periodically collects internal data from logs and database systems within the platform, and also collects external data from platform sources using methods such as web crawling and front-end tracking. The data is then dumped into the storage system through the transmission link. The data in the storage system is read and written by the processing system, enabling data processing or disposal. The business system provides various services to the external platform. Additionally, all servers are capable of adjusting their system security configurations based on the current data processing situation. In the designed program, each server's behavior is simulated by a dedicated thread. When a thread obtains the thread lock, it sequentially simulates the server's actions, such as processing or forwarding data and adjusting system security configurations based on ongoing tasks. After completing the task, the server releases the thread lock. If the server is unavailable, the thread releases the lock and enters a blocking state until the server becomes available. In the simulation of platform data flow, multiple threads are used to separately simulate the behavior of the collection, transmission, storage, processing, and business servers. The program terminates when the simulation limit is reached.
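The following simplified Python sketch illustrates the multi-threaded simulation just described: one thread per server role, a shared lock serializing access to the simulated pipeline, and risk-related events recorded for later analysis. The role names, step limit, availability probability, and event fields are illustrative assumptions; the actual simulation program is more elaborate (for example, a thread blocks until an unavailable server recovers instead of simply skipping a step).

```python
# A minimal sketch of the thread-per-server simulation (illustrative only).
import random
import threading

SIMULATION_STEPS = 100
pipeline_lock = threading.Lock()
event_log = []  # risk-related records produced during the run

def run_server(role, steps=SIMULATION_STEPS):
    for step in range(steps):
        with pipeline_lock:                      # acquire the shared thread lock
            if random.random() < 0.05:           # occasional server unavailability
                continue                         # lock is released, try next step
            # Simulate processing/forwarding data and adjusting the security
            # configuration for the ongoing task.
            event_log.append({
                "role": role,
                "step": step,
                "security_config_adjusted": random.random() < 0.2,
            })

roles = ["collection", "transmission", "storage", "processing", "business"]
threads = [threading.Thread(target=run_server, args=(r,)) for r in roles]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"simulation finished with {len(event_log)} recorded events")
```

The recorded events stand in for the risk-related operational data that, after preprocessing, feeds the weight-learning step described later.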
Moreover, we employed correlation analysis to explore the interrelations among risk indicators, gaining insight into their collective impact on the security posture of big data platforms. This analysis aids in identifying potential compounded risks resulting from the interplay of multiple vulnerabilities, for example, how the absence of secure transmission protocols magnifies threats when coupled with weak audit mechanisms (a minimal sketch of this analysis is given after the list below). The analysis underscores the intricate interconnections between various components of a big data platform's security posture. These relationships highlight the cascading effect that a vulnerability in one area can have on multiple facets of the platform's overall security. Below is a brief discussion of these findings and potential resolutions: 1. Strengthening Equipment Access Control Policies: Implementing robust access control policies is critical to prevent unauthorized access. This includes using multi-factor authentication, strict role-based access controls, and regular audits of access logs. 3. Designing Secure Network Topologies: A secure network topology should minimize vulnerabilities by including firewalls, intrusion detection systems, and segmentation to protect critical areas of the network. This reduces the attack surface accessible to potential attackers. 4. Regularly Updating Operating Systems: The security of operating systems is significantly enhanced by timely patch installations. Automating patch management processes ensures that systems are protected against known vulnerabilities. 5. Safeguarding Backup and Recovery Strategies: The security of backup and recovery strategies should be ensured through encrypted backups, secure storage solutions, and regular testing of recovery procedures to confirm their effectiveness. 6. Integrating Platform Security Measures: Platform security should be a holistic effort that combines secure interfaces, component security through patches, and secure configurations into a robust defense against threats. 7. Employing Strong Encryption Techniques: Data encryption is vital for the security of platform interfaces and the safeguarding of data in transit and at rest. Employing strong, industry-standard encryption algorithms can prevent data interception and unauthorized access. 8. Configuring Cloud Servers and Services Securely: Cloud server configurations should support secure load balancing and include continuous security monitoring to detect and respond to threats promptly. 9. Verifying Cloud Providers' Security Certifications: Choosing cloud providers with high levels of security certification is crucial. These certifications are indicative of the provider's commitment to security best practices and management capabilities. 10. Enhancing Staff Technical Capabilities through Training: Conducting regular security training and technical ability audits helps identify areas where staff may require further education, ensuring that all team members are equipped to maintain the platform's security. 11. Improving Personnel Management Systems: Implementing comprehensive personnel management systems can help in correctly managing employee rights and preventing abuse. This includes regular reviews of user privileges and ensuring that the principle of least privilege is followed.
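As a minimal sketch of the correlation analysis referenced above, the snippet below computes pairwise rank correlations between indicator scores and flags strongly correlated pairs (e.g., insecure transmission protocols co-occurring with weak auditing). The file name, column layout, and threshold are illustrative assumptions, not the study's actual dataset.

```python
# Correlation analysis over risk-indicator scores (illustrative sketch).
import pandas as pd

df = pd.read_csv("risk_indicator_scores.csv")   # one column per risk indicator
corr = df.corr(method="spearman")               # rank correlation tolerates
                                                # monotone nonlinear relations

threshold = 0.7                                  # flag strongly related pairs
cols = corr.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if abs(corr.iloc[i, j]) >= threshold:
            print(f"{cols[i]} <-> {cols[j]}: {corr.iloc[i, j]:.2f}")
```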
In the process of risk assessment, determining the weights is a crucial task, which is often treated as a feature selection or weight learning problem in machine learning. The calculation of weights directly affects the derivation of risk values and the final results of the risk assessment. Therefore, the goal of this study on quantifying the risks in big data platforms is to analyze the simulated expert review data in depth, uncover the hidden correlations, and subsequently determine the weight of each indicator. In our research, we determined the weights of the different indicators by analyzing the correlations within the simulated expert review data. The credibility of the risk values and of the final quantitative risk assessment conclusions relies heavily on the rationality of the assigned indicator weights. However, due to the characteristics of big data itself and the continuous development of computer technology, traditional risk quantification methods are not suitable for the risk quantification assessment of today's big data platforms. Subjective weighting methods and the Delphi method are influenced by human factors, time-consuming, and may lack objectivity or be biased in assessing the importance of each risk indicator on big data platforms because of individuals' expertise in specific areas, leading to an inaccurate distribution of indicator weights. The Analytic Hierarchy Process requires the construction of a large number of hierarchical structures in the complex risk quantification scenarios of big data platforms, needs pairwise comparisons of indicators in the weight calculation process, and must repeat the weight calculation whenever an indicator changes, which may lead to consistency issues that affect the reliability of the results. The entropy method imposes strict data requirements, needing a relatively uniform data distribution; otherwise, the results may be distorted, and it cannot handle nonlinear relationships. Principal Component Analysis can only handle linear relationships, cannot capture complex nonlinear relationships, and may lose part of the information during dimension reduction. Furthermore, the traditional risk quantification methods mentioned above are inefficient in handling large-scale data, have high computational complexity, and are not suitable for the risk quantification calculation process in big data platforms with massive data volumes.
Traditional machine learning techniques like linear regression and Lasso regression are commonly used for learning weights; however, they are not ideal for this particular problem. Linear regression assumes a linear relationship in the data, whereas in risk assessment the relationships among the various indicators are often nonlinear. Lasso regression can perform feature selection, but it relies primarily on the L1 regularization term and works best when there is pronounced collinearity between features, and strong collinearity may not exist among the risk indicators in our case. As a result, we opted to employ the random forest algorithm for the computation. Random forests offer several advantages for our purpose: they can assess feature importance, effectively handle a large number of features without overfitting, and capture nonlinear relationships when determining the weight assigned to each indicator. By leveraging these capabilities, we can derive meaningful and accurate weights, enhancing the robustness and reliability of our risk assessment methodology.
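The following hedged Python sketch shows how indicator weights can be derived from random-forest feature importances, as described above. The dataset file, column names, train/test split, and hyper-parameters are illustrative assumptions rather than the exact experimental configuration.

```python
# Deriving indicator weights from random-forest feature importances (sketch).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv("risk_dataset.csv")          # indicator scores + risk level
X = data.drop(columns=["risk_level"])           # quantified risk indicators
y = data["risk_level"]                          # overall risk level label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=210, random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Feature importances are normalized to sum to one, so they can serve
# directly as the indicator weight vector in the weighted-sum risk value.
weights = pd.Series(model.feature_importances_, index=X.columns)
print(weights.sort_values(ascending=False).head(10))
```

Because scikit-learn normalizes the feature importances to sum to one, they can be used directly as the weight vector V in the weighted-sum risk calculation presented in the next section.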

Risk Assessment
To calculate the risk value of the big data platform, we define two matrices, X and V. The matrix X represents the quantified risk values of each risk index, obtained from the preprocessed data set. It can be represented as X = [x_1, x_2, ..., x_n], where x_i denotes the quantified value of the i-th risk indicator. The matrix V contains the weights assigned to each risk indicator, which are learned from the random forest model. It can be represented as V = [v_1, v_2, ..., v_n], where v_i denotes the weight of the i-th risk indicator.
To obtain the quantified risk value of each risk indicator, we multiply each element of X by the corresponding element of V. This can be expressed as R = [r_1, r_2, ..., r_n], where r_i = x_i × v_i represents the quantified risk value of the i-th risk indicator. Finally, the quantified risk values of all risk indicators are summed to obtain the final total risk value, as shown in Eq. 1:

R_{total} = \sum_{i=1}^{n} r_i = \sum_{i=1}^{n} x_i v_i    (1)

This calculation process allows us to accurately quantify and evaluate the risk of the big data platform based on the individual risk indicators and their respective weights.
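As a small worked example of Eq. 1, the snippet below computes the total risk value as the weighted sum of indicator scores; the three indicator scores and weights are illustrative numbers only.

```python
# Worked example of Eq. 1: total risk = weighted sum of indicator scores.
import numpy as np

x = np.array([70, 30, 50])       # quantified scores of three indicators
v = np.array([0.5, 0.2, 0.3])    # learned weights for the same indicators

r = x * v                        # element-wise: per-indicator risk values
total_risk = r.sum()             # Eq. 1: R_total = sum_i x_i * v_i
print(r, total_risk)             # [35.  6. 15.]  56.0
```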

Setup
To assess the methodology introduced in this paper, we built a simulated cluster using Hadoop and Spark. The cluster comprised three nodes, each running CentOS 7. We configured the network information for these nodes and set up password-free logins between them. To generate the required experimental data, we deployed a user-portrait function for e-commerce big data on the cluster and let it run autonomously, which allowed us to amass a significant volume of operational data.
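The sketch below illustrates, in highly simplified form, the kind of periodically running Spark job used to generate operational data; the schema, field values, and output path are hypothetical placeholders and do not reproduce the actual user-portrait implementation.

```python
import random
import time
from pyspark.sql import SparkSession

# Illustrative stand-in for the autonomous user-portrait job on the cluster.
spark = SparkSession.builder.appName("user-portrait-sim").getOrCreate()

def generate_batch(n=1000):
    # Fabricate a batch of e-commerce behaviour records (placeholder schema).
    return [(i, random.choice(["view", "cart", "purchase"]), random.random())
            for i in range(n)]

# In practice the job runs continuously; a bounded loop is used here for brevity.
for _ in range(10):
    df = spark.createDataFrame(generate_batch(), ["user_id", "action", "score"])
    # Append each batch so that risk-related operational data accumulates.
    df.write.mode("append").parquet("hdfs:///portrait/events")
    time.sleep(60)
```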

Weights Calculation
In the data preprocessing stage, the risk levels from very low to very high were quantified as the numerical values 10, 30, 50, 70, and 90, preserving the relative ordering of the levels while encoding them efficiently. According to GB/T 31509-2015, the overall risk level is determined by the percentage of risk items at each level. As shown in Table 2, the overall level is high if the proportion of very-high-risk items is at least 10% or the proportion of high-risk items is at least 30%; it is medium if the proportion of medium-risk items is at least 30%; in all other cases, the overall level is low.
This research set up two separate experiments. The first was designed to calculate the weights of the risk indicators associated with big data platform assets, and the second to determine the weights of the risk indicators related to the data processing procedures. All experiments were performed on an Intel i5-8265U @ 1.6 GHz.
In the first experiment, a grid search was employed to find the optimal number of trees, with accuracy on the test set and time cost as the primary selection criteria. The search chose n_estimators = 210, meaning the random forest ensemble consisted of 210 trees; this value was selected as the best balance between performance and efficiency. The random forest was then used to determine the weight of each risk indicator, which reflects the indicator's importance in predicting the risk level. Table 3 presents the ten indicators with the highest weights. We observe that an indicator's weight grows with the number of risk indicators correlated with it: more correlated items imply that the indicator directly or indirectly affects a larger number of other metrics, so its importance increases accordingly. In this first experiment we used only the random forest algorithm, since it offers good performance and stability, handles large-scale data and high-dimensional features, and yields accurate and reliable weight estimates; restricting the experiment to a single algorithm also simplified the procedure while still validating the effectiveness of random forests for risk quantification on big data platforms.
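A sketch of the grid search over the number of trees is given below; the synthetic dataset, candidate values, and tie-breaking rule are illustrative assumptions, not the exact grid that led to the choice of 210 trees.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed risk dataset (28 indicator scores,
# 5 risk levels); replace with the real platform data in practice.
X, y = make_classification(n_samples=5000, n_features=28, n_informative=10,
                           n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

best = None
for n_estimators in range(50, 301, 20):
    start = time.time()
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    clf.fit(X_train, y_train)
    elapsed = time.time() - start
    acc = accuracy_score(y_test, clf.predict(X_test))
    # Prefer higher test accuracy; break ties with lower training time.
    if best is None or (acc, -elapsed) > (best[1], -best[2]):
        best = (n_estimators, acc, elapsed)

print(f"chosen n_estimators: {best[0]} (accuracy={best[1]:.3f}, time={best[2]:.1f}s)")
```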
In the second experiment, we collected and preprocessed the data generated or output during the execution of the simulation program. The collected data was analyzed to obtain scores for the various risk quantification indicators. The total loss of platform assets, the risk level, and the indicator scores within the same time period were then combined into a single risk record, and multiple such records formed the risk dataset used to calculate the indicator weights. This preprocessed, platform-specific risk dataset was used to train decision trees, random forests, and gradient boosting decision trees, and the performance of the three machine learning models was comprehensively compared. The weights of the risk quantification indicators were ultimately determined from the model with the best training results. After comparing the learning performance of the models, the random forest yielded the most favorable results; the indicator weights obtained from it are listed in Table 4. In addition, the accuracy of the three models on the training and validation sets, together with the training time, is reported in Table 5. Building on the first experiment's validation of random forests for weight calculation, we deliberately ran decision tree, random forest, and gradient boosting regression tree algorithms side by side in this experiment, since each has its own characteristics and applicable scenarios; comparing their results allows us to assess the applicability and performance of different algorithms for risk quantification on big data platforms and strengthens the reliability and scientific validity of the experimental results.
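The model comparison can be sketched as follows; the synthetic data stands in for the collected platform dataset, and the R² score is used here as a stand-in for the accuracy metric reported in Table 5.

```python
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the data-processing risk dataset (9 technical and
# 8 management indicators as in Table 4); the real experiment uses collected data.
X, y = make_regression(n_samples=5000, n_features=17, noise=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

models = {
    "decision_tree": DecisionTreeRegressor(random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
    "gbdt": GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    elapsed = time.time() - start
    # Training-set score, validation-set score, and training time,
    # mirroring the comparison reported in Table 5.
    print(name, model.score(X_train, y_train), model.score(X_val, y_val), elapsed)

# The weights in Table 4 would then be read from the best model's
# feature_importances_ attribute.
```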
The method proposed in this paper adopts a hybrid approach that combines empirical judgment with data-driven models. By exploiting the optimization capabilities of machine learning algorithms, we aim to develop a more objective and robust risk assessment index system. The experimental analysis conducted in this study shows that the proposed system provides a more comprehensive and accurate depiction of the risks faced by the big data platform, and that combining empirical judgment with data-driven models yields an improved understanding of the risk landscape.
Integrating empirical judgment ensures that human expertise and domain knowledge are brought into the risk assessment process, capturing nuances and factors that may not be explicitly represented in the available data. Combined with the data-driven models, this enhances the objectivity and depth of the assessment. The experimental results demonstrate the effectiveness and validity of the proposed system and its ability to address the complexities of risk assessment on a big data platform: the quantitative system that merges empirical judgment with data-driven models provides a more comprehensive, accurate, and objective representation of the platform's risks.

Risk Values Calculation
To integrate the asset quantification index system of the big data platform with the risks associated with its data processing procedures, we generated a dataset of 40,000 instances of simulated expert review data, capturing the risks of both the platform's assets and its data processing procedures. The simulated data was then evaluated using Eq. 1.
Eq. 1 assigns a risk value to each instance, which is then mapped to a risk level of low, medium, or high, giving a clear picture of the overall risk level distribution for the platform's assets and data processing procedures. The mapping rules and the resulting distribution of overall risk levels are presented in Table 6, which also reports the proportion of instances falling into each level.
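For illustration, the mapping from total risk values to risk levels can be expressed directly from the asset thresholds listed in Table 6; the data processing procedures use the analogous thresholds 32.3718 and 39.1054.

```python
def asset_risk_level(risk_value: float) -> str:
    # Thresholds taken from Table 6 for big data platform assets.
    if risk_value < 35.5236:
        return "low"
    elif risk_value < 43.4740:
        return "medium"
    return "high"

# Example: classify a few made-up total risk values.
for value in (30.0, 40.0, 50.0):
    print(value, asset_risk_level(value))
```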

Comparative Analysis
The machine learning approach presented in this study offers several significant advantages for risk assessment. Because it is data-driven, it ensures objectivity and minimizes the biases that can arise from subjective human interpretation; the automated processing and analysis of data removes the need for extensive manual analysis by experts, reducing operational costs and saving valuable human resources. Machine learning models are also well suited to large-scale data, making them highly adaptable to the quantitative risk assessment requirements of a variety of big data scenarios. A further advantage is that the models can be continuously retrained and optimized as the data is expanded and updated, allowing them to respond to emerging risks and changing circumstances. In contrast, traditional methods such as the analytic hierarchy process, the Delphi method, and fuzzy comprehensive evaluation often require time-consuming re-analysis and recomputation when the dataset evolves. Table 7 summarizes these advantages, highlighting the efficiency, flexibility, and sustainability of the proposed approach and positioning it as a superior solution for quantitative risk assessment in big data environments.

Conclusion
The objective of this paper is to investigate and analyze the risks encountered by big data platforms.
The study focuses on identifying potential vulnerabilities and threats in both platform assets and data processing procedures. To this end, the paper proposes a quantitative assessment theory and index system specifically designed for big data platforms, with the goal of providing a set of reasonable, scientifically grounded reference indicators for quantitative risk assessment in the big data context. To validate the effectiveness and accuracy of the proposed approach, we constructed a simulated big data platform on which the theory and index system were tested and verified. The results of the simulation provided valuable insights and demonstrated the advantages of the proposed scheme over existing approaches, including enhanced objectivity, reduced costs, improved generality, and broader applicability. By addressing the limitations of existing schemes and incorporating quantitative assessment methods, the proposed scheme offers a more comprehensive and robust framework for assessing risks in big data platforms. In summary, this paper contributes a quantitative assessment theory and index system for risk assessment in big data platforms, validated through the construction of a simulated platform and showing clear advantages over existing approaches in objectivity, cost-effectiveness, generality, and applicability.



Figure 1. Risk assessment process of the big data platform.



Figure 2. Risk quantitative evaluation index system of user portrait business assets.


Figure 3. Risks in data processing procedures.


Figure 4. Risk quantitative evaluation index system of data processing procedures of the user portrait business.

Technical security is further dissected into nine tertiary indices, illuminating key technical controls that should be in place, such as secure connection protocols, rigorous authentication methods, effective data privacy and integrity measures, and comprehensive audit mechanisms. Management security comprises eight tertiary indices, highlighting areas like thorough personnel training and access controls.

2. Ensuring Proper Device Configuration: Devices should be correctly configured to optimize performance and reduce vulnerabilities. Regular configuration reviews and adherence to best practices in device management can mitigate these risks.





Hui Li received the B.Sc. degree from Fudan University in 1990 and the M.A.Sc. and Ph.D. degrees from Xidian University in 1993 and 1998. Since June 2005, he has been a professor in the School of Cyber Engineering, Xidian University, Xi'an, Shaanxi, China. His research interests are in the areas of cryptography, wireless network security, information theory, and network coding. He is a co-author of two books. He served as technical committee co-chair of ISPEC 2009 and IAS 2009.

Table 1. Format of expert review data.

Table 2. The relationships between the overall risk level and the proportion of risk items.

Table 3. Weights of the risk indexes of big data platform assets. Note: R1 denotes Personnel privilege abuse, R2 Personnel technical competency audits, R3 Personnel security training, R4 Personnel management system, R5 Cloud service provider security authentication, R6 Cloud infrastructure access control policies, R7 Cloud server configuration, R8 Cloud service security monitoring, R9 Load balancing server operation, R10 Platform interface security, R11 Platform component data encryption, R12 Platform component access control policy, R13 Platform component security patches, R14 Platform component configuration, R15 System backup and recovery policies, R16 Application service configuration, R17 Application service access control policies, R18 Database configuration, R19 Database access control policy, R20 Operating system configuration, R21 Operating system security patches, R22 Operating system access control policy, R23 Network topology design, R24 Network security configuration, R25 Spare device availability, R26 Physical device operational status, R27 Physical device configuration, R28 Physical device access control policy.

Table 4. Weights of the risk indexes of data processing procedures. Note 1: R1 denotes User authentication and access control, R2 Program and device authentication and access control, R3 Data collection security, R4 Data transmission security, R5 Data storage security, R6 Data calculation security, R7 Device and transmission link abnormality detection, R8 Protocol security, R9 Log security. Note 2: T1 denotes Management system, T2 Data quality monitoring, T3 Data availability, T4 Data exchange and sharing management, T5 Storage media management, T6 System maintenance, T7 Device and transmission link management, T8 Log audit.

Table 5. Training results of the three machine learning algorithms.

Table 6. Distribution and threshold values of the total risk level of the big data platform. Note 1: If the risk value of the big data platform assets is lower than 35.5236, it is classified as low risk; if it is not less than 35.5236 and less than 43.4740, it is medium risk; if it is not less than 43.4740, it is high risk. Note 2: If the risk value of the data processing procedures is lower than 32.3718, it is classified as low risk; if it is not less than 32.3718 and less than 39.1054, it is medium risk; if it is not less than 39.1054, it is high risk.

Table 7. Comparative analysis of different approaches. Note 1: P1 denotes objectivity, P2 low cost, P3 the ability to process large amounts of data, P4 applicability to a variety of scenarios, P5 ease of updating and iteration. Note 2: • denotes TRUE, ◦ denotes FALSE.