Data science, the emerging scientific paradigm of discovering hypotheses from large data corpora, has turned the natural progression of science on its head. Instead of painstakingly designing hypotheses and testing them, it is now possible to generate hypotheses automatically by sifting through giant data sets.
The danger of this approach is the phenomenon of multiple testing (or, more colorfully, the green jelly bean problem): if enough hypotheses are tested separately, eventually some observed effect will look statistically significant without being real. The problem is all the more serious with large, complex data, because the algorithms that generate these hypotheses can be opaque, the data itself can overwhelm our ability to process and visualize it, and the sheer number of features can defeat most procedures designed to analyze them.
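The multiple-testing effect is easy to demonstrate by simulation. The following sketch (a hedged illustration, not a prescription: the data are standard normal with no true effect, the test is a simple two-sided z-test, and all names are illustrative) shows that testing 20 null hypotheses at the 0.05 level yields a "significant" finding in most trials, while a Bonferroni-corrected threshold restores the intended family-wise error rate.

```python
import random
import math

random.seed(0)

def z_test_pvalue(sample, mu0=0.0):
    """Two-sided z-test p-value for the mean of `sample` against mu0,
    assuming known unit variance (an illustrative simplification)."""
    n = len(sample)
    z = (sum(sample) / n - mu0) * math.sqrt(n)
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

num_hypotheses = 20   # e.g. 20 jelly bean colors, none with a real effect
alpha = 0.05
trials = 1000

naive_hits = 0        # trials where at least one test looks "significant"
bonferroni_hits = 0   # same, using the Bonferroni-corrected threshold

for _ in range(trials):
    pvals = [z_test_pvalue([random.gauss(0, 1) for _ in range(30)])
             for _ in range(num_hypotheses)]
    if min(pvals) < alpha:
        naive_hits += 1
    if min(pvals) < alpha / num_hypotheses:  # Bonferroni correction
        bonferroni_hits += 1

print(f"family-wise false positive rate, uncorrected: {naive_hits / trials:.2f}")
print(f"family-wise false positive rate, Bonferroni:  {bonferroni_hits / trials:.2f}")
```

Since each uncorrected test has a 5% false positive rate, the chance that at least one of 20 independent tests fires is roughly 1 - 0.95^20, about 64%, which the simulation confirms; dividing the threshold by the number of tests brings the family-wise rate back near 5%.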
The challenge of big data analytics, therefore, is to determine what information and structure really lie in these large, feature-rich data sets, and to build models that can be evaluated efficiently and accurately, and visualized to confirm the learned phenomena.