Big Data: Ride the Bandwagon, Sell Hot Dogs at the Show, or Jump Out of the Way?

March 10, 2015
William Ray, PhD

Big Data has big potential. But can it truly tell you what you want to know?

Big Data. It’s all the rage, and if you listen to the hype, you might get the impression that it’s going to cure cancer, bring about world peace and clean your kitchen while it’s at it. All the cool kids are doing it. Should you?

To answer that question, you need to spend a bit more time than most people thinking about what Big Data is, what it isn’t and what it realistically is good for.

To paraphrase the dearly departed Douglas Adams, Big Data is big. Really big.

Unfortunately, that’s pretty much the complete definition, and even more unfortunately, it doesn’t tell you whether it’s good for anything. Insidiously, the fact that Big Data is big — and by “big” people effectively mean “bigger than anything we can figure out how to rationally deal with” — means that the basic idea of Big Data gives the average reader warm and fuzzy feelings. People tend to feel that Big Data is so large that it ought to be able to magically do more powerful things than anything “Small Data” can accomplish.

Those feelings aren’t wrong, but they’re also not particularly specific, and herein lies the rub: To decide whether Big Data can answer a question, you need to think about whether Big Data is Enough Data. For many of the questions that people hope Big Data will answer, there will never be Enough Data. To continue to paraphrase Douglas Adams, while Big Data may be really big, Enough Data is often vastly, hugely, mind-bogglingly big.

Just How Big is “Big Data?”

As a really quite small example, my group is interested in identifying networks of residues in proteins that are required for protein function, and we approach this work by looking for places where evolution has caused multiple residues to change together. For our garden-variety favorite protein, Adenylate Kinase, there’s a wealth of data out there — thousands upon thousands of evolutionary variants.

This doesn’t quite amount to Big Data, because we can in fact rationally deal with it, but that doesn’t matter. If we think about what we’d ideally like to know (all of the patterns of “this residue co-evolves with those residues that often”), Big Data won’t help us, because there will never be Enough Data. Even in our comparatively small, roughly 300-amino-acid protein, if we want to evaluate every possible pattern of interactions, we run up against the vastly, hugely, mind-bogglingly impossible wall of more than 10^70 combinations that we would need to know about.

To put that number in context, current estimates say that there are about 10^80 electrons in the entire universe. If we started writing down unique patterns of co-evolution that we observe in Adenylate Kinase onto electrons, putting a unique label on each of the billions of electrons in each of the billions of cells in each of the billions of people (and everything else) on the planet, on each of the billions of stars and planets in our galaxy, and did the same in each of the billions of other galaxies in the universe, we’d come frighteningly close to running out of electrons by the time we finished making labels for our data set.

And that’s just to collect one instance of each pattern.
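For readers who like to see the arithmetic, here is a minimal Python sketch of how fast these counts grow. It assumes, purely for illustration, that a "pattern" is an unordered group of k positions chosen from 300, and it ignores which amino acids sit at those positions. The article does not spell out how its 10^70 figure was counted, so this is not a reconstruction of that number, and the function names are hypothetical.

```python
from math import comb

N_POSITIONS = 300   # roughly the protein length used in the text
TARGET = 10**70     # the scale quoted in the text


def groups_of_size(k: int, n: int = N_POSITIONS) -> int:
    """Number of unordered groups of k positions chosen from n."""
    return comb(n, k)


def smallest_group_size_exceeding(limit: int, n: int = N_POSITIONS) -> int:
    """Smallest k for which C(n, k) on its own is larger than limit."""
    for k in range(n // 2 + 1):
        if groups_of_size(k, n) > limit:
            return k
    raise ValueError("no group size exceeds the limit")


if __name__ == "__main__":
    # Pairs and triples are manageable; larger groups are not.
    for k in (2, 3, 5, 10):
        print(f"groups of {k:>2} positions: {groups_of_size(k):.3e}")
    k70 = smallest_group_size_exceeding(TARGET)
    print(f"groups of {k70} positions alone already exceed 10^70")
```

Pairs and triples can be enumerated on a laptop. The trouble starts once the groups of interest involve dozens of positions at a time, and the count only gets worse if each position's possible amino acids are included.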

If we start talking about collecting enough replicates to reach statistical significance, especially with corrections for multiple-hypothesis testing, we’d need every one of the electrons that actually exist to each contain an entire additional universe of electrons.
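The multiple-testing point can be made concrete with a Bonferroni correction, which is one common and conservative choice. The article does not name a specific correction, so treat this as an assumed stand-in: with m hypotheses tested at an overall level alpha, each individual test has to clear alpha divided by m. The function name and the 0.05 level below are illustrative assumptions.

```python
from fractions import Fraction


def bonferroni_threshold(alpha: Fraction, n_tests: int) -> Fraction:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


if __name__ == "__main__":
    alpha = Fraction(5, 100)  # conventional 0.05 family-wise level (an assumption)
    for n_tests in (1, 1_000, 10**70):
        threshold = bonferroni_threshold(alpha, n_tests)
        print(f"{n_tests:.0e} tests -> per-test threshold {float(threshold):.1e}")
```

Exact fractions are used so that the division by 10^70 loses no precision before the result is printed. A threshold that small is why one observation of each pattern is nowhere near enough.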

No matter how big our data about this protein’s evolution gets, it will never — ever — be Enough Data to answer all the questions we’d really like to answer. And that’s just for our effectively tiny little problem with only 300 variables.

Revising Expectations

People are hoping to use Big Data to answer questions about how all of the polymorphisms in the human genome interact to affect health. Many of these questions involve not hundreds, not thousands, but millions of variables and all of their possible combinations. To unambiguously answer these questions would require sequencing so many more human beings than there are electrons in the universe that just the exponent would become incomprehensibly large.
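A rough sketch of what "just the exponent" means, under a deliberately simplified assumption: a million variants, each either present or absent, with every on/off combination counted once. Real genotypes have more than two states per site, so this undercounts. The constant and function names are hypothetical.

```python
from math import floor, log10

N_VARIANTS = 1_000_000   # "millions of variables"; one million used as an example
ELECTRON_EXPONENT = 80   # electrons in the universe, about 10^80 per the text


def digits_in_power_of_two(exponent: int) -> int:
    """Decimal digits of 2**exponent, computed from logarithms.

    Uses floating-point log10, which is fine for the sizes here but could be
    off by one when exponent * log10(2) lands extremely close to an integer.
    """
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    return floor(exponent * log10(2)) + 1


if __name__ == "__main__":
    digits = digits_in_power_of_two(N_VARIANTS)
    print(f"on/off combinations of {N_VARIANTS:,} variants: a number with {digits:,} digits")
    print(f"electrons in the universe: a number with {ELECTRON_EXPONENT + 1} digits")
```

The comparison is between lengths of numbers, not the numbers themselves: the electron count fits in 81 digits, while the combination count needs hundreds of thousands.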

Anyone who tries to tell you that Big Data will answer all of your questions “if you just collect Enough Data” isn’t lying, but they’re also not telling you just how impossible it will be to ever collect Enough Data. They’re probably already on the bandwagon, hot dog in hand. Either that, or they are trying to sell you a hot dog and a ride, hoping you don’t think too hard about just how huge Enough Data really is.

Does this mean that you should just give up on the Big Data parade and go home? Not necessarily.

While Big Data will never be Enough Data to answer those “every question” types of inquiries, it is most definitely Enough Data to answer certain questions. The somewhat annoying reality, however, is that any given Big Data dataset might not be Enough Data to answer the particular questions in which we’re interested. It won’t even necessarily be Enough Data for us to tell whether we have Enough Data to answer the question!

Faced with this, we can try to ask whatever questions we like of any Big Data we have. We could always get lucky and discover that we do have Enough Data for that question or for a question that’s similar enough to satisfy our curiosity. On the other hand, we shouldn’t be too disappointed if it turns out that our Big Data isn’t Enough Data for our question. After all, no matter how much Big Data we have, it will always be exponentially less than what Enough Data would be to answer specific questions.

More excitingly, though — and this is why you shouldn’t pack your bags and go home — we can try to ask our Big Data the questions it can answer.

Discovering What Big Data Can Tell Us

Sometimes the answers our Big Data can provide aren’t what we wanted to learn. (A recent attempt at Big Data mining on YouTube discovered that many YouTube videos contain people or cats. Hardly earthshaking.) On the other hand, sometimes our Big Data turns out to be Enough Data to answer some truly interesting questions, even if they aren’t ones that we planned.

Take, for example, the image below. We’ve built a visualization tool that lets you look at certain kinds of Big Data and survey all of the answers that are in it. This figure is showing some of our Adenylate Kinase Big(ish) Data. Focus on a handful of areas in the image that look interesting and remember them.

Big(ish) data is fed into a computer program that gives researchers new clues about how ADK may function.

Now click here to reveal some Big Data answers.

Are any of the areas you selected circled? If they are, you’ve just solved a problem with Big(ish) Data that can’t currently be solved algorithmically — you’ve found one or more of several patterns of interacting residues that are intimately tied to Adenylate Kinase function (for more information about this visualization, see stickwrld.org or this Pediatrics Nationwide infographic).

You can’t do this algorithmically because (as I’ve mentioned) there are over 10^70 possible patterns, and we can’t even begin to guess which of those 10^70 patterns to test. Our visualization tool shows us which patterns are there, though, effectively telling us what questions our Big(ish) Data about Adenylate Kinase can currently answer.

Most other Big Data areas have problems and opportunities similar to those we’re facing with our protein.

For example, if you want Big Data to tell you exactly how genetics affect disease, don’t hold your breath. There will never, in a universe of universes of universes of people, be Enough people sequenced to unambiguously answer that question for all of genetics or for all diseases. There will never even be Enough people sequenced to unambiguously answer how one particular genetic feature (in the context of all others) affects one particular disease.

But don’t be dismayed.

Those caveats aside, almost any Big Data that you have is Enough Data to contain uncountably more interesting facts than you’re aware of today. It might be Enough Data to tell you how a specific genetic element affects your disease of interest in some specific contexts, or Enough Data to tell you that genetic elements you’ve never even thought about are involved in the disease. While you’ll rarely be lucky enough for your Big Data to be Enough Data to answer your exact question, it’ll often be Enough Data to answer at least some related questions, and this may be Enough Information for you to work with. Your Big Data may also be Enough Data to show you exactly what Small Data you need to focus on and collect to answer your question precisely.

So long as you keep in mind that your Big Data can’t tell you everything, and that you can’t ever be certain you have Enough Data to tell you any specific thing with absolute certainty, Big Data can be an incredibly powerful tool to answer some questions you have and many questions you never even knew existed.

If we don’t try to force Big Data to answer the questions we want and instead just accept the answers it can give, then we’ll be looking at questions that can’t currently be answered any other way. We’ll be using Big Data the way it is meant to be used.

And that, as Douglas Adams would say, is so amazingly cool that you could keep a slab of meat in it for a month.