An Optimised Method for Fetching and Transforming Survey Data based on SQL and R Programming Language

The development of information systems in recent years has contributed to various methods of gathering information to evaluate IS performance. The most common approach used to collect information is called the survey system. This method, however, suffers one major drawback. The decision makers consume considerable time to transform data from survey sheets to analytical programs. As such, this paper proposes a method called ‘survey algorithm based on R programming language’ or SABR, for data transformation from the survey sheets inside R environments by treating the arrangement of data as a relational format. R and Relational data format provide excellent opportunity to manage and analyse the accumulated data. Moreover, a survey system based on structured query language and R programming language is designed to optimize methods to manage survey systems by applying large features offered via combining multi data science languages. The experiments verified enhancements of flexibility, technical tools, and data visualization features employed to process the collected data from different aspects; therefore, the proposed approach demonstrates a simple case study to enhance the evaluation requirements of the proposed technique. Finally, the estimated results of this research can be used to improve the methods of information management on different aspects such as survey systems and other data models that hold the relational and non-relational models using SABR. This method demonstrated improved accuracy of data collected, reduced data processing time and arranged data to the willing model.


Introduction:
Most of the decision must be made in real time to ensure its effectiveness (1,2). For that purpose, the process of collecting data from different sources and transforming them into the format and formula desired by the decision makers become a challenging tasks. In this context, the problem can be defined as an assumption of a group of N people who answered the questionnaire related to the specific study. Each survey sheet consists of two parts: the first table [Info matrix = (i,j)] that holds a general information of the participant with unlimited attributes and rows, and the second table [SQ matrix = (i,j)] which holds questionnaire questions and answers fields with unlimited attributes and rows.
The main problem in this situation is how to collect those data using a high accuracy method in a short time while providing a wide platform for thorough data analysis in the r programming environment.
In this paper, the problems of transferring data from the survey sheets to the desired data repositories are dissolved by the proposed approach. The proposed approach for collecting and analysing the survey sheets creates a new method for decision makers through its features. The main advantage of this approach is that the collected data can be formatted according to the requirements of the decision makers as well as other analytical techniques. In other words, it supports the automatic formats of the data sets according to the end users' criteria.
R is an open source community that gives more flexibility and extensibility with thousands of packages. It is considered the most common and powerful language to handle statistical data and provides a great environment to deal with big data analytics. Furthermore, R gives the opportunity to run structured query language (SQL) which is considered the most powerful language to treat relational data (3,4). SQL plays a vital role in managing relational data because of its higher performance and support of unlimited functions to deal with relational aspects. SQL supports many applications in many fields like web applications and information discovery applications (5,6).The combination of the features of R programming and SQL opens up new horizons in the field of combining and analysing data. This environment provides unlimited flexibility and tools to treat the data according to the desired criteria.
In this paper, the proposed combination of R and SQL is applied in the field of information systems by using the Likert-Scale that is a standard approach to collect and analyze data. Thus, the main contribution of this research is on reducing the complexities and the difficulties of collecting and organizing a group of survey sheets in the desired location and in a desired format and styles. Such complexity on relational and non-relational data that come from different sources are taxing for the researchers, therefore, our survey algorithm based on R programming language (SABR) model for the purpose of dissolving the problems of data transformation from various survey sheets (sources) to the desired repository, is a novel contribution.

Related Work:
Recently, the decision makers started evaluating the performances of the IS quickly, then use the survey systems to daily increase the clear views of the current works. There exist multiple methods to evaluate the dependency of the IS situations, such as adopting one or more than one survey per days for a specific case. This huge data that flow from various sources needs an integrated environment to be analysed and obtained the desired results. For this reason, the issues are addressed for transforming data coming out from various survey sheets for collecting them in one repository, as well as changing the format of relational and non-relational models due to the cumulative data on both sides that need decision makers and technical tools. (7) proposes QSL for administering the questionnaire and managing data in the different types of E-questionnaire. Besides, (8) presents a privacy-preserving survey system to collect and analyse the data based on secure developed protocols. (9) tried to solve the problem of survey systems in a small company. They proposed methods of analysis to study the relations between the items. Further, (10) presents a survey system based on a website application (I-System) and analysing data collection by using well-known data mining algorithms. Finally, (11) demonstrates an approach to integrate and capture the information that flows between two tables in a survey system. As a result, it can be concluded that all the presented referenced works did not demonstrate a distinctive way with the dynamic survey sheets that contain a table in the form of restricted direction models. In addition, there exists no proposed method that enhances the ata modes to relational or non-relational and vice versa.

Preliminary:
To describe the proposed work, this section presents some terms related to the research approach.
1. Survey Sheet: refers to the group of data that is gathered by individual questions using the electronic paper format. In this study the survey sheets consist of two main tables, Information  table and SQ table. A. Information Table: The information table holds the general information about the people that answer the survey questions and consists the following attributes as shown in Figure 1 Table: The survey questions table hold all the question details and consists of the following attributes as shown in Figure 2: 1. ID: The foreign key that is linked directly to the ID attribute in the Information table after processing by the R programming language to construct a one-to-many data relationship.
2. Section: If the survey system used consists of more than one parts, the part number can be mentioned in this field. 3. QN: The question number attribute refers to the sequence of the question in the survey sheet. 4. Question: It holds the survey question text. 5. Strongly Agree, Agree, Average, Disagree and Strongly Disagree columns refer to answers indicating the levels of evaluating the asked questions from one to five degrees.

SABR: Is the short form of Survey Algorithm
Based on R Programming Language. SABR is a proposed algorithm written under the environment of R programming language supported by the features of both data science languages, R and SQL.

Methodology:
The method is described in accordance to steps carried out in the study namely data collection, data processing, and algorithm development. Data Collection: In order to test and run the proposed work the real data is firstly gathered. We administered questionnaires consisting of questions as shown in Fig. 2. The questions targeted at measuring the information system quality (12) at AL-Kitab University. There are in total 28 survey sheets collected from the respondent. General View of the Data Processing Steps: At this stage, there are four main steps that lead towards generating the desired results as shown in Figure 3. After collecting the distributed survey sheets and arranging them, the first step is to determine the location of source files (survey sheets) that are already stored in the computer storage. The second step is uploading the survey sheets into the R programming environment. The third step refers to the procedure that transforms the data from the survey sheets to the desired tables and format (relational and non-relational) as shown in Figure 4. The fourth and last step is analyzing and printing the cumulative data stored in the Information and SQ tables according to the predetermined goals.

Algorithm (SABR) development:
The overall work of the proposed approach is as shown in Figure 5. It can be enumerated in the following steps: 1. All the survey sheets collected are arranged from (1 to n) in one folder.

A new information table is created as a
repository to store the data of all the information tables from sheet 1 to sheet n. 3. A new SQ table is created as a repository to store the data of all the SQ tables from sheet 1 to sheet n. 4. The numbers and the path of the survey sheets' location are assigned to the algorithm. 5. The data is fetched from the sources tables to both repositories information and SQ tables as shown below: Finally, when all the data are successfully uploaded from the survey sheets to the information and SQ tables, the one-to-many data relationship between them can be done using ID attributes in both tables.

Figure 5. Flowchart of the proposed algorithm.
Based on the SABR procedures, Table 1 and  After transforming and storing the data in the final station, the designed system automatically runs a package of queries to draw the data shape. The main purpose of this step is to provide a clear picture of the community study because the category of the study has a direct effect on the decision-making process and plays a significant role on the shape of the strategic planning. The Figure 6 shows the relationship between the evaluation range and the marital status and it clearly presents the relation. Furthermore, many graphical results are generated in this section by both R and SQL, but only a few is shown.

Figure 7. A third view of the data.
The data visualization algorithm is directly linked with the output of the queries. It works automatically according to the input and output and auto adjusts the data sources according to the end objectives. Furthermore, the algorithm includes the Plotly tool that is considered a great package for high data visualization. The data are treated from the relational, non-relational and statistical side and finally visualized and reported according to the final decision as shown in Figure 7.

Simulation Result:
In this section, the realized results are obtained by applying packages of sequence queries written by R and SQL on the generated tables for advance analysing, each query presents a specific situation. Thus, this study proposes a group of queries to provide a basic analysis and a general view of the accumulated data. The proposed work provides more flexibility to the dynamic arrangement of the dataset according to the decision maker's requirements as shown in queries below. Query1. Assume the decision maker wants to make a relation between the following criteria Based on the above requirements, the proposed system engine processes the dataset and produces the results as shown in Table 2 to fulfil the requirements of the decision maker. Based on the result above, the query 1 output achieves a (formula-1) to measure the accuracy of the proposed algorithm in the transformation of the data from the survey sheets to the desired tables (Information and SQ tables) as shown below So (N = T) the formula_1 is True. Query2. Assume the decision maker wants to draw a path to track the unarranged answer sheet by using the ID attribute, the results are shown in Table 3. Table 3. Results based on query two (Query2). Total   101  Male  0  2  3  0  0  5  102  Male  1  2  2  0  0  5  103  Female  0  0  3  2  0  5  104  Female  0  0  1  3  1  5  105  Male  2  2  1  0  0  5  106  Male  1  2  0  0  2  5  107  Male  5  0  0  0  0  5  108  Female  1  1  1  1  1  5  109  Female  5  0  0  0  0  5  110  Male  0  3  1  0  1  5  111  Male  0  0  0  2  3  5  112  Female  0  3  If m value is not equal to zero, the proposed system automatically returns the ID and tracks to the problem path; based on that a clear picture can be provided to the decision maker about the desired study.

ID Gender Strongly Agree Agree Average Disagree Strongly Disagree
Query3. This query summarizes the basic statistical methods that can be applied in survey fields like the measure of central tendency and the Measure of Dispersion. For example, if a decision maker wants to calculate the frequency, mean and standard deviation for each survey question based on the evaluation degrees from five to one, therefore the results will be as shown in Table 4 and Figure 8.  The data visualization algorithm is directly linked with the output of the queries. It works automatically according to the input and output and auto adjusts the data sources according to the end objectives. Furthermore, the algorithm includes the Plotly tool that is considered a great package for high data visualization. The data are treated from the relational and statistical side and finally visualized and reported according to the final decision.
The proposed work tackles the requirements and generates the results in both text and graphics formats to build a great environment for understanding and analysing the results. Based on the above results, the Query3 output achieved a formula (Formula-3).
The (Formula_3) denoted as (N, M1, M1 To extend the statistical processes on the desired study, the system proposes to analyze the correlation between the study elements as shown in Table 5 and provides vast opportunities for the decision maker to choose the desired correlation according to his requirements and purpose as shown in Table 6.  The biggest challenge in this research is that the statistical methods in the questionnaire systems cannot be executed using the normal formulas because each value has a specific weight and the weight occurs in each calculation. For example, if the decision maker wants to calculate the mean for the first question in Table 4  Based on the above discussion, each statistical method is calculated separately and made suitable for the data analysis. The proposed system utilizing SABR uses R and SQL under the R environment to simulate data like a professional program.

Conclusion:
The approach used in this research is based on (SABR) algorithm that studies the technique of data transformation from the main sources to the desired locations based on the R environment and through the combination of two great data science languages, R programming language, and SQL. In a nutshell, the proposed technique allows the application of various statistical methods on the collected data. In this study, the problems of transforming data are resolved using (SABR) algorithm as much as possible. However, further research using the proposed technique needs to be carried out to prove the usefulness of its features and benefits in different fields. Furthermore, an example of a future work related to this study could be the way of combining the questionnaire systems and data mining techniques for the purpose of deep learning and data discovery. The other possible study is how to provide a suitable environment to run The DeLone and McLean Model of Information Systems Success based on R environment supported by SQL. Finally, it would be beneficial to investigate the proposed system in multidimensional and multi-valued databases.