The Excel Data Mining Add-in. Applications in Audit and Financial Reports

Effective decision making based on business data usually requires good management of multiple data formats, as well as processing speed, flexibility, portability, automation, suggestive presentation and ease of use. This paper offers theoretical arguments and practical examples in favor of using the Excel Data Mining Add-in for the aforementioned reasons. Most of the examples include figures linked to video scenarios created by the authors, part of an interactive online playlist of eighteen tutorials. Together they help clarify most of the requirements that must be fulfilled in order to obtain valid examples and useful results.


Introduction
This paper starts from the techniques used by most Data Mining tools when dealing with large volumes of data from databases and presents the advantages of using spreadsheets as client applications (msdn.microsoft.com/.../dn282385.aspx). Spreadsheets are familiar to end users and offer an interface that integrates programming or scripting languages for office applications, such as VBA (Visual Basic for Applications) for Microsoft Excel (msdn.microsoft.com/.../ee814737.aspx) and Google Apps Script for Google Sheets (developers.google.com/.../sheets), together with many functions and advanced facilities for processing, analysis, representation and simulation, all based on principles of interactivity and dynamics with great impact on users' ability to perceive, interpret, understand and manage complex information in different cases.
The concept of Data Mining essentially means the supervised identification of undiscovered patterns and hidden relationships in huge data sets (searchsqlserver.techtarget.com). Inmon, a well-known authority in data warehousing (computerweekly.com), gave one of the most concise definitions of Data Mining (Inmon and Linstedt, 2014), namely the analysis of large quantities of data to find patterns such as groups of records, unusual records and dependencies. Data Mining initiatives usually come from marketing and retail sales departments and are suited to organizations having very large databases. The concept is closely related to that of data-oriented Decision Support Systems (DSS) and Business Intelligence (BI), especially BI for strategic purposes (dssresources.com/…id=174), which requires huge amounts of data (bi-insider.com). Although the BI term is known as a set of concepts and methods for improving decision making that emerged in the 1990s (Howard Dresner of the Gartner Group, dssresources.com/…/dsshistory.html), the specialty literature contains approaches from about 15 years earlier (Cleland and King, 1975; Pearce, 1976) with clear references to BI, business planners and managers, and decision making. As concluded by Dan Power (dssresources.com/…id=199), Data Mining tools include case-based reasoning, data visualization (mostly graphs, trees and clusters), fuzzy queries and analyses, genetic algorithms, and neural networks.
For a few years now we have been witnessing implementations of this concept and of the related models not only in applications dedicated to database and data warehouse management systems, but also in modules of the spreadsheet applications that work with them, as suggested by this paper's title. This seems natural considering that such dedicated products allow the construction of Data Mining structures and models starting from a single table (usually an aggregation of many others from a database). The applicability of the theoretical and practical elements of this article in auditing, especially performance auditing, and in financial reports is justified by the specific need to valorize existing data structures (often records in tables and tables in databases) and to obtain, rapidly and at minimum cost, reports able to present clear information on causality related to effectiveness (actual or estimated results compared with those proposed) and efficiency (consumed resources compared with achieved or estimated results).
The concrete examples in this article support certain conclusions drawn from the literature review, namely: the utility of approaching audit engagements with data mining techniques (Vintilescu Belciug et al., 2010) as a complement to traditional methods of risk analysis and on-site intervention; the existence of possible areas of integration between data mining and audit processes, grouped by stages (Sirikulvadhana, 2002) such as planning, execution, documentation and completion, or by specific examples (Wang and Yang, 2009), e.g. neural networks for risk assessment, finding errors and fraud, determining the going concern of a company, evaluating financial distress and making bankruptcy predictions, and decision trees for analysis of bankruptcy, bank failure and credit risk; and the advance of the latest software tools implementing data mining algorithms, which until recently many users considered not very friendly (Chersan et al., 2013) and requiring rather advanced technical skills. The paper also aims to eliminate some confusion about using time stamp values (e.g. calendar dates, parts of them, or replacement values) when operating on time series containing business data (e.g. the sales amount recognized as a factor of direct influence on the level of certain financial indicators such as operating income).

The research methodology
The source data for the examples presented in this paper come from two Microsoft sample databases. The first set of examples was created starting from an Access database file called foodmart (sites.google.com/.../supp4excel2datamining), originally available on the installation CD of a previous version of Microsoft (MS) SQL Server. The second one is a MS SQL Server sample database called "AdventureWorksDW2012_Data.mdf", already installed and prepared for use inside the Windows 8.1 32-bit virtual machine (y2u.be/Xs2SWtBqdzI) we used for this article. This machine benefited from the Microsoft Imagine (formerly DreamSpark) educational software license for all applications installed inside it and was optimized for Oracle VirtualBox. In fact, SQL Server 2012 (or 2008) is a prerequisite for the installation of the Excel Data Mining add-in, as detailed in the second video tutorial of the playlist mentioned below. Although these data serve for building the examples and the related video support materials (the tutorial playlist created by the authors and available at goo.gl/JDDtFp), they have only a guiding purpose in this research of a highly applicative nature, any similarity to reality being merely coincidental.

From intuitive patterns to deep analysis starting from simple data sources such as tables
The first example we chose was meant to classify by generating a decision tree in which the estimated variable was a categorical one with two possible values (house_owner: Yes or No, i.e. Y or N), depending on other fields (see figure 1) containing information about customers (an export to Excel from the customer table in the foodmart database).
Source: the video tutorial created by the authors, y2u.be/Nx9xqCX1DjY
As seen above (left side of fig. 2), the branches that indicate a higher probability for Yes (houseowner = Y) are darker, while the rest are colored in a lighter shade. We can also observe that houseowner as a variable depends essentially on marital_status (right side of fig. 2, the slide bar set on Strongest Links) and then on yearly_income (the slide bar set on All Links). This can also be deduced directly from the decision tree, in which the node closest to the root expresses a test (inf.ucv.ro) corresponding to the marital_status attribute. When clicking on marital_status='M' (a terminal node) we obtained a probability of more than 74% in all ten tests performed with the same configuration (input columns, column to analyze, algorithm, percentage of data for testing).
Figure 3. Examples of discriminative analysis after applying logistic regression (profs.info.uaic.ro) under the same conditions as above, specifying the two major-impact input variables already identified and some of their values. Source: the video tutorial created by the authors, y2u.be/-6jzQuyTjlo
In the previous images (fig. 3) we tried to show how we predicted the probabilities that customers fall into the two categories of the binary response (onlinecourses.science.psu.edu), house owner or not, depending on some explanatory variables and their values. We performed the discriminative analysis partially captured above (fig. 3) starting from another algorithm, namely Logistic Regression, implemented by Microsoft as a variation of the Neural Network algorithm (msdn.microsoft.com/.../ms174828.aspx) that is easier to train.
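The Excel add-in drives all these steps through its ribbon dialogs, but the models themselves are hosted by Analysis Services and can be expressed in DMX, the SQL extension discussed later in this paper. The following is only a minimal sketch of a comparable decision tree model: the model name, the chosen input columns (taken from the foodmart customer export) and the sample profile values are our illustrative assumptions, not the exact statements generated by the add-in.

CREATE MINING MODEL [CustomerHouseOwnerTree]
(
    [customer_id]          LONG KEY,
    [marital_status]       TEXT DISCRETE,
    [yearly_income]        TEXT DISCRETE,
    [gender]               TEXT DISCRETE,
    [num_children_at_home] LONG CONTINUOUS,
    [houseowner]           TEXT DISCRETE PREDICT   -- the column to analyze
)
USING Microsoft_Decision_Trees;
-- (the model must first be trained, e.g. with INSERT INTO ... OPENQUERY, before querying it)

-- a singleton prediction for one hypothetical customer profile
SELECT Predict([houseowner])            AS PredictedOwner,
       PredictProbability([houseowner]) AS Probability
FROM [CustomerHouseOwnerTree]
NATURAL PREDICTION JOIN
(SELECT 'M' AS [marital_status], '$30K - $50K' AS [yearly_income]) AS t;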

Cumulating historical data and using descriptive fields from many tables of a database
The dynamic and interactive reports responding to many information needs that we are so familiar with, as well as the older static ones (snapshots of information at precise moments, generating more questions than answers; Rasmussen et al., 2002), may use both current and historical data. The first category is represented by data from Transaction Processing Systems (TPS) and commonly refers to the current year, while the second essentially means data covering a larger time span. The proportion between those two categories essentially depends on decisional needs (at the operational, tactical or strategic level). To minimize data redundancy and dependency, or because of storage space and write speed needs (deshpande.mit.edu), the schema of a traditional relational data source is usually designed as many tables obtained by applying the principles of normalization (w3schools.in). Moreover, for further performance reasons (read and, respectively, write speed), historical data must be separated from current data. Both categories essentially include records from transaction tables (e.g. expenses, sales, exams, etc.), the difference being made by the value of the time stamp.
That explains why the tables loaded only with historical data are renamed with a time indication, archived and separated from the rest of the transactional system in order to improve its operational (current) performance. When large amounts of historical data are needed for analyses based on ad-hoc queries, the system must do the opposite and aggregate into a single table (the source for a fact table in a data warehouse) all the records from the historical archives of the transaction tables of the same type as the resulting one. In most cases, that brings the advantage of an increased potential to identify patterns. But it also comes with difficulties related to putting data together in a common and consistent format, especially when the applications and the structure of the data source have also changed over time. Figure 5 presents an example of inventory data gathering (with applications including freight audit) in two major steps corresponding to two SQL queries in Microsoft Access: the first based on cumulating (UNION clause) the records from two transaction tables of the same type, corresponding to just two years (1997 and 1998), and adding a necessary id column (inventory_id, with values generated automatically, of AutoNumber type) in the resulting persistent table (INTO clause); the second based on temporarily retrieving the values of descriptive fields from all the tables related or suitable for a relation (fig. 5, INNER JOIN clause) with the one resulting from the first query, namely inventory_fact. In this case the resulting tabular data, a second set of just 11,352 records, do not get into a persistent table of the database (a kind of de-normalization, searchoracle.techtarget.com, otherwise needed to save time at the expense of storage space) and serve only for export to Excel immediately after running the query itself.
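For clarity, a simplified sketch of the two steps is given here in Access-style SQL, with foodmart-like table and column names that are our assumptions (the exact statements are those shown in fig. 5 and in the corresponding video tutorial). The cumulation is written as a SELECT INTO followed by an INSERT INTO, which is equivalent to the UNION-based query of the first step and avoids dialect issues with a UNION inside a derived table.

-- Step 1: cumulate the two yearly transaction tables into one persistent table
SELECT * INTO inventory_fact FROM inventory_fact_1997;
INSERT INTO inventory_fact SELECT * FROM inventory_fact_1998;
-- add the automatically generated key column mentioned above
ALTER TABLE inventory_fact ADD COLUMN inventory_id COUNTER;

-- Step 2: the temporary (non-persistent) de-normalizing query that retrieves the
-- descriptive fields from the related tables, run only to export its result to Excel
SELECT i.inventory_id, t.the_date, p.product_name, w.warehouse_name, s.store_name,
       i.units_ordered, i.units_shipped, i.warehouse_cost
FROM (((inventory_fact AS i
    INNER JOIN product AS p ON i.product_id = p.product_id)
    INNER JOIN warehouse AS w ON i.warehouse_id = w.warehouse_id)
    INNER JOIN store AS s ON i.store_id = s.store_id)
    INNER JOIN time_by_day AS t ON i.time_id = t.time_id;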

Association rules for identifying behavioral patterns
In the theory and practice of data warehouses and multidimensional modeling, the example below recalls the "snowflake" schema, meaning that the source for a dimension (a perspective of analysis based on descriptive columns organized in hierarchies) is represented not by a single table but by several related ones (in one-to-many relations: e.g. product category, product subcategory and product, fig. 7), able to support the analysis with more than one descriptive field per dimension. In order to be able to apply the association rules algorithm in the case below, we also needed repetitive values of the SalesOrderNumber field associated with different product categories, subcategories and names. The main reason for gathering the descriptive data residing in multiple tables of the database in the example above (fig. 7) is to determine association rules of the type "If I buy product X, I will buy product Y too" in the purchasing behavior (FactInternetSales source table) and the most important dependencies (fig. 8), as sketched below.
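A sketch of that gathering query, written in T-SQL with AdventureWorksDW-style table and column names (our reading of fig. 7, not a verbatim copy of it), could look as follows; products without a subcategory are simply dropped by the inner joins.

SELECT f.SalesOrderNumber,
       c.EnglishProductCategoryName    AS Category,
       s.EnglishProductSubcategoryName AS Subcategory,
       p.EnglishProductName            AS Product
FROM FactInternetSales AS f
    INNER JOIN DimProduct AS p            ON f.ProductKey = p.ProductKey
    INNER JOIN DimProductSubcategory AS s ON p.ProductSubcategoryKey = s.ProductSubcategoryKey
    INNER JOIN DimProductCategory AS c    ON s.ProductCategoryKey = c.ProductCategoryKey
ORDER BY f.SalesOrderNumber;   -- repeated SalesOrderNumber values group the items of one order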
From the results in figs. 7 and 8 we can understand why applications of the algorithms able to identify association rules can contribute to audit and to fraud detection and prevention. For example, if the set of inputs had attributes such as claim identifier, insurance type, name of the insurance product, name of the insured person, insurer, name of the examiner agent and solution (total loss, partial loss or rejection), and the algorithm identified "IF Casco insurance, insured person X and examiner agent Y THEN total loss" as an association with high probability and importance, it would not necessarily mean a fraud alarm, but it would at least be worth the effort of further investigation.
Source: the video tutorial created by the authors, y2u.be/3_8E01hnSD0
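Assuming such a claims model were built with the Microsoft Association Rules algorithm and persisted on the server (the model name and the probability threshold below are purely hypothetical), the discovered rules, together with their probability and importance, could be listed with a DMX content query of roughly this form:

SELECT NODE_DESCRIPTION  AS RuleDescription,
       NODE_PROBABILITY  AS Probability,
       MSOLAP_NODE_SCORE AS Importance
FROM [ClaimsAssociationModel].CONTENT
WHERE NODE_TYPE = 8            -- node type 8 = association rule
  AND NODE_PROBABILITY > 0.7;  -- keep only the high probability rules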

Forecasting starting from aggregated historical data
For more historical data than in the previous example (fig. 5) we created a special forecasting scenario closer to reality. We started from scratch with a new example involving data on 36 months across four calendar years, this time using a simple SQL query on a single table, but with ORDER BY and GROUP BY clauses for sorted results and for aggregations, meaning the computation of aggregated values such as sums, averages, total counts, counts for a specified condition and so on. In our case these were averages for every month of a year, combined into a single numerical field derived by concatenating, from left to right, the larger to the smaller time unit, i.e. year then month (fig. 9, just like in Microsoft's sample data provided when installing the Data Mining add-in); a sketch of such a query is given below.
Source: the video tutorials created by the authors, y2u.be/RjTwGROD0TI and y2u.be/qHJ3Zm3JBT4
After the steps described above (figure 9) and several other processing operations (fig. 10) we obtain a data set suitable for forecasting, implemented by using the Microsoft Time Series algorithm as a combination of the ARIMA (Auto-Regressive Integrated Moving Average, optimized for improving accuracy in long-term predictions) and ARTXP (Auto-Regression Trees with Cross-Prediction, optimized for predicting the next likely value in a time series; msdn.microsoft.com/.../bb677216.aspx) algorithms.
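The monthly aggregation step described above (fig. 9) can be sketched as follows; the table and column names (sales, sales_date, amount) are hypothetical placeholders for the actual source used in the tutorials.

SELECT YEAR(sales_date) * 100 + MONTH(sales_date) AS YearMonth,  -- e.g. 199701 for January 1997
       AVG(amount) AS AvgMonthlyAmount
FROM sales
GROUP BY YEAR(sales_date) * 100 + MONTH(sales_date)
ORDER BY YearMonth;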

Support for querying persistent Data Mining models
First of all, persistent in this context refers to a model defined in such a way that it is deployed and stored on the server (SQL Server Analysis Services, a module other than the Database Engine) and remains available for querying (fig. 11). The Data Mining add-in available in Excel offers many advantages over the direct use of SQL Server Analysis Services. Among others, one can mention the speed of use of Excel's tabular environment and formula language; the possibility of many exports and imports to and from spreadsheets starting from different database formats, and of indirectly involving multiple source tables by using the Structured Query Language (SQL); and the possibility of exploiting the resulting structures and models directly (the "copy to Excel" option), by using queries (the DMX extension of SQL, figs. 12 and 13) or programmatically (fig. 12). The last two options are conditioned by activating persistency when defining the models (the "use temporary model" option unchecked, fig. 13, vs. figs. 1 and 2). Combining this with the ability to programmatically determine behavioral patterns and generate predicted values, starting from high-performance and easy-to-use tools such as this Data Mining add-in available for Office 2010, 2013 and 2016, promises much in terms of productivity. All these advances were defined after many years of using dedicated and now well-known technologies (e.g. SQL Server, tested by the authors since early 2000).
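As an illustration only (the model name and the predictable column below are hypothetical, following the forecasting example above), a DMX query against a persisted time series model, of the kind referred to in figs. 12 and 13, could ask for the next six monthly values. Such a statement can also be executed programmatically, for instance over an ADO or ADOMD connection to the Analysis Services instance.

SELECT FLATTENED
       PredictTimeSeries([AvgMonthlyAmount], 6) AS Forecast
FROM [MonthlySalesForecast];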
When it comes to spreadsheet products (dssresources.com/.../sshistory.html) such as VisiCalc, Lotus 1-2-3, Microsoft Excel, Microsoft Works Spreadsheet, Sun Open Office Spreadsheets, Polaris Office Sheet and Google Sheets, the average experience of final users reaches up to decades. Furthermore, the ease of using these applications even just as interface instruments that connect to data from databases and data warehouses and display it was an objective reason to continue testing the Data Mining component, which led to this article. By using a way of reporting that identifies itself with a sequence of steps borrowing their names from the eighteen support tutorials, as well as some techniques previously defined, namely E2P4CAFR, ACCORD / CADRE (Homocianu and Airinei, September 2014) and S-DOT (Homocianu and Airinei, August 2014), one can reach, in stages but with a minimal number of steps, representations that are dynamic, interactive, suggestive, based on causality and rooted both in the current reality and in the history defined by the data stored in the organization's data sources.

Conclusions
We can conclude that the possibilities of the Excel Data Mining add-in exceed the expectations of a business analyst. It offers the advantage of integrating the identified classification patterns, association rules and predictions with support for connectivity to various data formats, data validation, advanced graphical representations, geographical referencing, automatic conditional formatting and key performance indicators (KPI), pivot and Power Pivot tables and charts, automatic solving of optimization problems (Solver) and the DAX (Data Analysis eXpressions) language together with the traditional formula language. This increases the chances of defining dashboards based on simulations, analyses and Data Mining models that are truly useful for audit staff interested in performance monitoring.
We hope we have identified many real motivations for choosing this Microsoft add-in for the Office suite as a near real-time Data Mining tool, beyond the many other recommendations available in the specialized literature and practice.