Challenges and Opportunities with Big Data

- Data is exploding at a rapid rate. Big data is the term for data sets so large and complicated that it becomes difficult to process using traditional data management tools or processing applications. Heterogeneity, scale, timeliness, complexity, and privacy problems with Big Data impede progress at all phases of the pipeline that can create value from data. The problems start right away during data acquisition, when the data tsunami requires us to make decisions, currently in an ad hoc manner, about what data to keep and what to discard, and how to store what we keep reliably with the right metadata. Much data today is not natively in structured format, tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display, but not for semantic content and search. Transforming such content into a structured format for later analysis is a major challenge. The value of data explodes when it can be linked with other data, thus data integration is a major creator of value. Since most data is directly generated in digital format today, we have the opportunity and the challenge both to influence the creation to facilitate later linkage and to automatically link previously created data. Data analysis, organization, retrieval, and modeling are other foundational challenges. Data analysis is a clear bottleneck in many applications, both due to lack of scalability of the underlying algorithms and due to the complexity of the data that needs to be analyzed. Finally, presentation of the results and its interpretation by non-technical domain experts is crucial to extracting actionable knowledge.


INTRODUCTION
Big data is the term for data sets so large and complicated that it becomes difficult to process using traditional data management tools or processing applications.Big data refers to huge data sets characterized by larger volumes (by orders of magnitude) and greater variety and complexity, generated at a higher velocity than your organization has faced before.These are the important three key characteristics of big data.Unstructured data is heterogeneous and variable in nature and comes in many formats, including text, document, image, video, and more.Unstructured data is growing faster than structured data.In addition to big data challenges induced by traditional data generation, consumption, and analytics at a much larger scale, newly emerged characteristics of big data has shown important trends on mobility of data, faster data access and consumption, as well as ecosystem capabilities Big data analytics is a technology-enabled strategy for gaining richer, deeper, and more accurate insights into customers, partners, and the business-and ultimately gaining competitive advantage.By processing a steady stream of real-time data, organizations can make timesensitive decisions faster than ever before, monitor emerging trends, course-correct rapidly, and jump on new business opportunities.While the potential benefits of Big Data are real and significant, and some initial successes have already been achieved, there remain many technical challenges that must be addressed to fully realize this potential.The sheer size of the data, of course, is a major challenge, and is the one that is most easily recognized.However, there are others.Industry analysis companies like to point out that there are challenges not just in Volume, but also in Variety and Velocity [Gar2011], and that companies should not focus on just the first of these.By Variety, they usually mean heterogeneity of data types, representation, and semantic interpretation.By Velocity, they mean both the rate at which data arrive and the time in which it must be acted upon.While these three are important, this short list fails to include additional important requirements such as privacy and usability.The analysis of Big Data involves multiple distinct phases as shown in the figure below, each of which introduces challenges.
Many people unfortunately focus just on the analysis/modeling phase: while that phase is crucial, it is of little use without the other phases of the data analysis pipeline.Even in the analysis phase, which has received much attention, there are poorly understood complexities in the context of multi-tenanted clusters where several users' programs run concurrently.Many significant challenges extend beyond the analysis phase.For example, Big Data has to be managed in context, which may be noisy, heterogeneous and not include an upfront model.Doing so raises the need to track provenance and to handle uncertainty and error: topics that are crucial to success, and yet rarely mentioned in the same breath as Big Data.Similarly, the questions to the data analysis pipeline will typically not all are laid out in advance.

Data Acquisition and Recording:
Big Data does not arise out of a space: it is recorded from some data generating source.Scientific experiments and simulations can easily produce petabytes of data today.Much of this data is of no interest, and it can be filtered and compressed by orders of magnitude.One challenge is to define these filters in such a way that they do not discard useful information.The next big challenge is to automatically generate the right metadata to describe what data is recorded and how it is recorded and measured.Metadata acquisition systems can minimize the human burden in recording metadata.Another important issue here is data provenance.

Information Extraction and Cleaning
Frequently, the information collected will not be in a format ready for analysis.It may be in various formats such as texts, images, videos.So data has to be extracted in various among these various formats and suitable data has to be picked for our use.Existing work on data cleaning assumes wellrecognized constraints on valid data or well-understood error models; for many emerging Big Data domains these do not exist.

Data Integration, Aggregation, and Representation
Given the heterogeneity of the flood of data, it is not enough merely to record it and throw it into a repository.If we just have a bunch of data sets in a repository, it is unlikely anyone will ever be able to find, let alone reuse, any of this data.With adequate metadata, there is some hope, but even so, challenges will remain due to differences in experimental details and in data record structure.Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data.For effective large-scale analysis all of this has to happen in a completely automated manner.This requires differences in data structure and semantics to be expressed in forms that are computer understandable, and then "robotically" resolvable.
There is a strong body of work in data integration that can provide some of the answers.However, considerable additional work is required to achieve automated error-free difference resolution.
Even for simpler analyses that depend on only one data set, there remains an important question of suitable database design.Usually, there will be many alternative ways in which to store the same information.Certain designs will have advantages over others for certain purposes, and possibly drawbacks for other purposes.Witness, for instance, the tremendous variety in the structure of bioinformatics databases with information regarding substantially similar entities, such as genes.Database design is today an art, and is carefully executed in the enterprise context by highly-paid professionals.We must enable other professionals, such as domain scientists, to create effective database designs, either through devising tools to assist them in the design process or through forgoing the design process completely and developing techniques so that databases can be used effectively in the absence of intelligent database design.

Query Processing, Data Modeling, and Analysis:
Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis on small samples.Big Data is often noisy, dynamic, heterogeneous, inter-related and untrustworthy.Nevertheless, even noisy Big Data could be more valuable than tiny samples because general statistics obtained from frequent patterns and correlation analysis usually overpower individual fluctuations and often disclose more reliable hidden patterns and knowledge.Further, interconnected Big Data forms large heterogeneous information networks, with which information redundancy can be explored to compensate for missing data, to crosscheck conflicting cases, to validate trustworthy relationships, to disclose inherent clusters, and to uncover hidden relationships and models.
Mining requires integrated, cleaned, trustworthy, and efficiently accessible data, declarative query and mining interfaces, scalable mining algorithms, and big-data computing environments.At the same time, data mining itself can also be used to help improve the quality and trustworthiness of the data, understand its semantics, and provide intelligent querying functions.The value of Big Data analysis in health care, to take just one example application domain, can only be realized if it can be applied robustly under these difficult conditions.On the flip side, knowledge developed from data can help in correcting errors and removing ambiguity.
Big Data is also enabling the next generation of interactive data analysis with real-time answers.In the future, queries towards Big Data will be automatically generated for content creation on websites, to populate hot-lists or recommendations, and to provide an ad hoc analysis of the value of a data set to decide whether to store or to discard it.Scaling complex query processing techniques to terabytes while enabling interactive response times is a major open research problem today.A problem with current Big Data analysis is the lack of coordination between database systems, which host the data and provide SQL querying, with analytics packages that perform various forms of non-SQL processing, such as data mining and statistical analyses.Today's analysts are impeded by a tedious process of exporting data from the database, performing a non-SQL process and bringing the data back.This is an obstacle to carrying over the interactive elegance of the first generation of SQL-driven OLAP systems into the data mining type of analysis that is in increasing demand.A tight coupling between declarative query languages and the functions of such packages will benefit both expressiveness and performance of the analysis.

Interpretation
Having the ability to analyze Big Data is of limited value if users cannot understand the analysis.Ultimately, a decisionmaker, provided with the result of analysis, has to interpret these results.This interpretation cannot happen in a space.
Usually, it involves examining all the assumptions made and retracing the analysis.Furthermore, as we saw above, there are many possible sources of error: computer systems can have bugs, models almost always have assumptions, and results can be based on erroneous data.For all of these reasons, no responsible user will cede authority to the computer system.Rather she will try to understand, and verify, the results produced by the computer.
The computer system must make it easy for her to do so.This is particularly a challenge with Big Data due to its complexity.There are often crucial assumptions behind the data recorded.Analytical pipelines can often involve multiple steps, again with assumptions built in.The recent mortgage-related shock to the financial system dramatically underscored the need for such decision-maker diligence -rather than accept the stated solvency of a financial institution at face value, a decision-maker has to examine critically the many assumptions at multiple stages of analysis.In short, it is rarely enough to provide just the results.Rather, one must provide supplementary information that explains how each result was derived, and based upon precisely what inputs.Such supplementary information is called the provenance of the (result) data.By studying how best to capture, store, and query provenance, in conjunction with techniques to capture adequate metadata, we can create an infrastructure to provide users with the ability both to interpret analytical results obtained and to repeat the analysis with different assumptions, parameters, or data sets.

III. CHALLENGES IN BIG DATA ANALYSIS
After a see through about the multiple phases in the Big Data analysis pipeline, we now turn to some common challenges that underlie many, and sometimes all, of these phases.

Heterogeneity and Incompleteness:
When humans consume information, a great deal of heterogeneity is comfortably tolerated.In fact, the nuance and richness of natural language can provide valuable depth.However, machine analysis algorithms expect homogeneous data, and cannot understand nuance.In consequence, data must be carefully structured as a first step in (or prior to) data analysis.Even after data cleaning and error correction, some incompleteness and some errors in data are likely to remain.This incompleteness and these errors must be managed during data analysis.Doing this correctly is a challenge.

Scale
Of course, the first thing anyone thinks of with Big Data is its size.After all, the word "big" is there in the very name.Managing large and rapidly increasing volumes of data has been a challenging issue for many decades.In the past, this challenge was mitigated by processors getting faster, following Moore's law, to provide us with the resources needed to cope with increasing volumes of data.But, there is a fundamental shift underway now: data volume is scaling faster than computer resources, and CPU speeds are static.
The second dramatic shift that is underway is the move towards cloud computing, which now aggregates multiple disparate workloads with varying performance goals (e.g.interactive services demand that the data processing engine return back an answer within a fixed response time cap) into very large clusters.
This level of sharing of resources on expensive and large clusters requires new ways of determining how to run and execute data processing jobs so that we can meet the goals of each workload cost-effectively, and to deal with system failures, which occur more frequently as we operate on larger and larger clusters (that are required to deal with the rapid growth in data volumes).A third dramatic shift that is underway is the transformative change of the traditional I/O subsystem.For many decades, hard disk drives (HDDs) were used to store persistent data.HDDs had far slower random IO performance than sequential IO performance, and data processing engines formatted their data and designed their query processing methods to "work around" this limitation.But, HDDs are increasingly being replaced by solid state drives today, and other technologies such as Phase Change Memory are around the corner.
These newer storage technologies do not have the same large spread in performance between the sequential and random I/O performance, which requires a rethinking of how we design storage subsystems for data processing systems.Implications of this changing storage subsystem potentially touch every aspect of data processing, including query processing algorithms, query scheduling, database design, concurrency control methods and recovery methods.

Timeliness
The flip side of size is speed.The larger the data set to be processed, the longer it will take to analyze.The design of a system that effectively deals with size is likely also to result in a system that can process a given size of data set faster.However, it is not just this speed that is usually meant when one speaks of Velocity in the context of Big Data.Rather, there is an acquisition rate challenge as described and a timeliness challenge described next.
Given a large data set, it is often necessary to find elements in it that meet a specified criterion.In the course of data analysis, this sort of search is likely to occur repeatedly.Scanning the entire data set to find suitable elements is obviously impractical.Rather, index structures are created in advance to permit finding qualifying elements quickly.The problem is that each index structure is designed to support only some classes of criteria.With new analyses desired using Big Data, there are new types of criteria specified, and a need to devise new index structures to support such criteria.

Privacy
The privacy of data is another huge concern, and one that increases in the context of Big Data.For electronic health records, there are strict laws governing what can and cannot be done.For other data, regulations are less forceful.However, there is great public fear regarding the inappropriate use of personal data, particularly through linking of data from multiple sources.Managing privacy is effectively both a technical and a sociological problem, which must be addressed jointly from both perspectives to realize the promise of big data.

Human Collaboration
In spite of the tremendous advances made in computational analysis, there remain many patterns that humans can easily detect but computer algorithms have a hard time finding.Indeed, CAPTCHAs exploit precisely this fact to tell human web users apart from computer programs.Ideally, analytics for Big Data will not be all computationalrather it will be designed explicitly to have a human in the loop.The new sub-field of visual analytics is attempting to do this, at least with respect to the modeling and analysis phase in the pipeline.There is similar value to human input at all stages of the analysis pipeline.
In today's complex world, it often takes multiple experts from different domains to really understand what is going on.A Big Data analysis system must support input from multiple human experts, and shared exploration of results.These multiple experts may be separated in space and time when it is too expensive to assemble an entire team together in one room.The data system has to accept this distributed expert input, and support their collaboration.

IV. CONCLUSION
We have entered an era of Big Data.Through better analysis of the large volumes of data that are becoming available, there is the potential for making faster advances in many scientific disciplines and improving the profitability and success of many enterprises.However, many technical challenges described in this paper must be addressed before this potential can be realized fully.
The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error-handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result

Figure 1 :
Figure 1: The Big Data Analysis pipeline.Major steps in analysis of Big Data are shown in the flow at the top.Below are tasks challenging In this paper, each of the boxes in the above figure are considered, and discussed both what has already been done and what challenges remain as seeking to exploit Big Data.Section 1 contains the introduction of big data process pipeline, Section 2 contains the details about the phase in the processing pipeline, Section 3 explain the challenges faced in big data analysis and Section 4 concludes the research work with future directions.