In today’s world of technology, there is a ton of “big data” that needs to be processed. Handling it has required new concepts and techniques that let us explore what was previously impossible, or at least completely impractical. Some of these terms are a little difficult to understand at first, but they reflect something fundamental: the way we capture, store, and process information has changed tremendously compared to ten years ago.
In the Big Data research field, this change is called 4th Generation Databases, after the three generations defined by IBM researchers in 1983 [1]. This blog entry focuses on explaining how these four generations differ from each other, and on why this blog entry even exists in the first place. For the latter, you will have to read to the end, where I give you some examples of open issues in Big Data.
Okay, finally, let’s move on with the article!
Big data researchers generally think that big data is different from traditional data because analyzing it requires other techniques and skills. They are right, but that observation alone is not a sound basis for business decisions, for two reasons:
1) 2nd Generation Databases were also able to store vast quantities of information (in fact, scaling them up was their main focus).
2) Many Big Data applications are still based on mature statistical procedures or optimization techniques that have existed for decades, even though newly released libraries have revolutionized how they are used (see the short sketch after this list).
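To make that second point concrete, here is a minimal sketch of ordinary least squares, a procedure that predates Big Data by decades, written with a modern library (NumPy). The data and numbers are entirely made up for the example.

```python
import numpy as np

# Synthetic toy data for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=200)

# Fit y = a*x + b by least squares, a technique far older than "Big Data".
X = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated slope and intercept:", coef)   # roughly [3.0, 2.0]
```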
Big Data
“Big Data” is not the same as high-dimensional data; the latter can already be found in traditional applications (think about how hospital statistics are gathered, for example). What changes is the mathematics behind the analysis. Statistics was built with mathematics: it comes down to proving properties of algorithms or estimators under certain assumptions, and some of those assumptions can be relaxed when it comes to Big Data. New techniques should therefore allow us to prove theorems using less restrictive assumptions than before, and to deal with more complex distributions, especially non-Gaussian ones.
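To see why relaxing Gaussian assumptions matters, here is a purely synthetic toy sketch: on heavy-tailed data, the sample mean is erratic while the median stays stable.

```python
import numpy as np

# Illustration only: heavy-tailed (non-Gaussian) data break Gaussian intuitions.
rng = np.random.default_rng(42)
cauchy_sample = rng.standard_cauchy(100_000)   # heavy tails, no finite mean

print("sample mean  :", np.mean(cauchy_sample))    # erratic, dominated by outliers
print("sample median:", np.median(cauchy_sample))  # close to the true location, 0
```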
Optimization techniques rely on numerical methods. Most algorithms take advantage of the fact that an answer exists and work by iteratively improving a cost function until convergence; in other words, they assume there is a point where this cost function becomes minimal. Unfortunately, not all problems can be solved that way. Think about class-imbalanced data sets, non-convex cost functions, or computationally heavy algorithms… these are typical scenarios where statistics and optimization need to complement each other.
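Here is a minimal sketch of that iterative idea, using plain gradient descent on made-up one-dimensional functions. It is not any particular production algorithm; it just shows how a non-convex cost can trap the iteration in whichever local minimum is closest to the starting point.

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Generic iterative scheme: repeatedly step against the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Convex case: f(x) = (x - 3)^2 has a single minimum at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=10.0))        # ~3.0

# Non-convex case: f(x) = x^4 - 3x^2 + x has two local minima;
# the starting point decides which one we end up in.
grad_nonconvex = lambda x: 4 * x**3 - 6 * x + 1
print(gradient_descent(grad_nonconvex, x0=2.0, lr=0.01))       # one local minimum
print(gradient_descent(grad_nonconvex, x0=-2.0, lr=0.01))      # a different one
```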
Sampling is the first technique statistical methods need in order to work. Big Data comes with different kinds of samples – some at a vast scale – but it also requires models able to describe specific properties of the real-world process you are studying before you even apply statistical procedures (e.g., network sampling, subspace sampling). Another possible scenario consists in writing down an optimization problem over a specific model that directly accounts for the dimensionality or structure present in the data.
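Network sampling and subspace sampling deserve posts of their own, so as a much simpler stand-in, here is reservoir sampling, a classic way to draw a uniform sample from a stream of unknown length using constant memory. All names and numbers are illustrative.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length,
    using only O(k) memory -- a basic building block for sampling at scale."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)        # accept the new item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```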
Unfortunately, this is neither the place nor the time to explain what these concepts are about. Still, they are exciting because they allow us to analyze vast quantities of data with small amounts of information! Big Data also requires dealing with different notions of approximation, sparsity, and smoothness, which is possible thanks to properties of specific algorithms that make them robust to noisy or sparse inputs.
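As one concrete example of analyzing a lot of data with little information, here is a minimal count-min sketch, a standard structure for approximate frequency counts in sub-linear memory. The width, depth, and hash choice below are arbitrary rather than tuned.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts using a small, fixed amount of memory."""
    def __init__(self, width=1000, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # Never under-counts; over-counts only when hashes collide.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

cms = CountMinSketch()
for word in ["big", "data", "big", "big", "sampling"]:
    cms.add(word)
print(cms.estimate("big"))   # 3 (possibly a slight over-estimate)
```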
The last difference I will mention is the ability to impose less restrictive conditions on the distributions you can model. As a consequence, statisticians might need to work on new descriptions, e.g., non-parametric models for Linked Open Data applications or for data coming from social networks where friendship links between nodes are already known (e.g., Facebook).
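The models hinted at here are graph-specific and well beyond a blog aside, but the simplest way to see what “non-parametric” means is a kernel density estimate: no named parametric family is assumed for the data. The sketch below runs on synthetic, clearly bimodal data.

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth=0.5):
    """Minimal non-parametric density estimate on a grid of evaluation points."""
    samples = np.asarray(samples)[:, None]        # shape (n, 1)
    grid = np.asarray(grid)[None, :]              # shape (1, m)
    kernels = np.exp(-0.5 * ((grid - samples) / bandwidth) ** 2)
    kernels /= bandwidth * np.sqrt(2 * np.pi)
    return kernels.mean(axis=0)                   # average kernel height per grid point

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(3, 1.0, 500)])
grid = np.linspace(-5, 7, 25)
print(np.round(gaussian_kde(data, grid), 3))      # two bumps; no single family was fitted
```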
What’s next?
Well, here are some open issues in Big Data that will undoubtedly keep data scientists on their toes. (A year ago, I would never have been able to predict many of today’s emerging problems in Big Data applications.) 1) New types of queries, including queries that consider semantic notions. Think about the kinds of requests Google has to answer for a user: how do you represent a social network over a vast number of users so that computing properties or statistical measures stays fast enough?
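As a toy illustration of the representation question, and nothing more, here is a sparse adjacency-list view of a hypothetical friendship graph; simple statistics such as node degrees stay cheap to compute because only existing links are stored.

```python
from collections import defaultdict

# Hypothetical friendship links; in practice there would be billions of them.
edges = [("ann", "bob"), ("bob", "carl"), ("ann", "carl"), ("carl", "dana")]

adjacency = defaultdict(set)          # sparse: only store links that exist
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

# Degree per node: one of the simplest statistical measures on a graph.
degrees = {node: len(friends) for node, friends in adjacency.items()}
print(degrees)   # {'ann': 2, 'bob': 2, 'carl': 3, 'dana': 1}
```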
2) How can we use Big Data technologies while respecting privacy concerns? Remember all those articles about the NSA being able to spy on us through secret programs? If someone can access our data, how do we make sure they only learn what is OK to share? And can we design algorithms that work both on Big Data and on traditional data sets? That would be a fascinating philosophical question! Think of how much foundational work it takes to prove even that 1 + 1 = 2.
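One concrete line of work on the privacy question is differential privacy. Below is a minimal sketch of its textbook mechanism, releasing a count with Laplace noise; the data, the epsilon value, and the function names are invented for the example.

```python
import numpy as np

def private_count(values, predicate, epsilon=0.5, seed=1):
    """Release a count with Laplace noise calibrated to the query's sensitivity
    (which is 1 for a simple count). Illustrative parameters only."""
    true_count = sum(1 for v in values if predicate(v))
    rng = np.random.default_rng(seed)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [23, 31, 45, 52, 38, 29, 61, 47]
print(private_count(ages, lambda a: a >= 40))   # roughly 4, plus calibrated noise
```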
3) How can we push further our knowledge of statistical learning theory (i.e., the limits on what you can learn from input/output examples)? And don’t forget all those newer paradigms like online learning, active learning, or transfer learning; they also deserve to be studied and understood.
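To give “online learning” some substance, here is a minimal online perceptron that updates on one example at a time and never holds the full data set in memory. The toy stream and learning rate are arbitrary.

```python
import numpy as np

def online_perceptron(stream, n_features, lr=0.1):
    """Process (x, y) pairs one at a time, with y in {-1, +1}, updating only on mistakes."""
    w, b = np.zeros(n_features), 0.0
    for x, y in stream:
        if y * (np.dot(w, x) + b) <= 0:   # misclassified -> nudge the separator
            w += lr * y * x
            b += lr * y
    return w, b

# Toy stream: points whose coordinates sum to a positive number are labeled +1.
rng = np.random.default_rng(0)
stream = [(x, 1 if x.sum() > 0 else -1) for x in rng.normal(size=(1000, 2))]
print(online_perceptron(stream, n_features=2))
```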
4) What other models should be used for Big Data applications? Remember: not all problems can be solved with statistics and optimization. When no samples are available, sampling techniques fail, and we end up with models that cannot be computed efficiently (e.g., sampling from Big Data). I still think we need to find a way to deal with these issues, but how do we approach them? Are neural networks the answer, or maybe cellular automata? Can we combine different machine learning algorithms and statistical notions to better exploit Big Data properties while respecting our needs regarding privacy, cost, and queries?
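On the question of combining algorithms, the simplest possible sketch is a majority vote over the outputs of several models. This is only an illustration of the idea, not a recommendation, and the model outputs below are made up.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the labels predicted by several (hypothetical) models by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

# Pretend three very different models scored the same example:
print(majority_vote(["spam", "spam", "ham"]))   # 'spam'
```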
5) We still don’t know how to deal with imbalanced data sets very well. Remember all those techniques for imbalanced problems, like SMOTE or random oversampling? Well, it seems they are not good enough anymore. So what should we do? Can we find new ways to deal with these issues, or can we not?
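For reference, here is what naive random oversampling, one of the techniques mentioned above, actually does in a few lines: it simply duplicates minority examples until the classes are balanced. The toy data are made up.

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate random minority examples until both classes have the same size."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [t for _, t in balanced]

X = [[0.1], [0.2], [0.9], [1.1], [1.3], [1.4]]
y = ["rare", "rare", "common", "common", "common", "common"]
X_bal, y_bal = random_oversample(X, y, minority_label="rare")
print(y_bal.count("rare"), y_bal.count("common"))   # 4 4
```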
6) What about using information-theoretic lower bounds for approximating interactions in social networks? How can probability theory help us in Big Data applications? I think the future looks bright thanks to Big Data.
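To hint at what an information-theoretic quantity looks like in code, here is an empirical mutual-information estimate between two made-up activity sequences; treating it as a proxy for how strongly two users interact is purely illustrative.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in bits) between two discrete sequences."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_xy = np.mean((x == a) & (y == b))
            p_x, p_y = np.mean(x == a), np.mean(y == b)
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

# Two users who post on mostly the same days share a lot of information:
alice = [1, 0, 1, 1, 0, 1, 0, 1]
bob   = [1, 0, 1, 0, 0, 1, 0, 1]
print(round(mutual_information(alice, bob), 3))
```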
Finally, stay curious and always ask yourself what your motivation is before doing science! Why do you want to understand something? Is it because somebody told you to, or do you have a real problem that bothers you every day at work? If it’s the second reason, then keep going! You are already halfway there. If not, maybe try again later.