Big Data, Big Knowledge: marketing hype or the future of human understanding?

Data. There’s a lot of it about. If you work in the buzzword-driven world of consultancy or large corporations (including the industries of academia and government), you have the dubious pleasure of having it rammed down your throat at every opportunity. Usually, it comes packaged in eminently hashtaggable marketing soundbites.

I find it deeply reassuring that the sort of people who have an understanding of these things usually bite back. David Spiegelhalter has become (more) famous for referring to one of the recent buzzes, Big Data, as ‘bollocks’.

Is Big Data bollocks?

It comes pre-capitalised, and it is used to sell goods and services. That usually indicates a load of bollocks.

This does not mean that everything that it refers to is bollocks. As the sainted Prof. Spiegelhalter has himself pointed out, he didn’t really call everything bollocks, just the more hyped stuff. That’s the irritating thing about marketing chunder. The effective stuff has a connection to some fundamental truth. That’s what makes it stick.

They’re clever, those marketeers.

This is actually what is most frustrating about it. The components of big data (look, no capitals) are real. It’s just that, in lumping them together into a marketable whole, the simple truth is hidden. Let’s have a look at that simple truth.

We now produce a lot of data. When I say data, I mean pieces of information that have some structure, recorded in an accessible format. The degree of structure and accessibility might vary rather a lot, but the point is that it’s stuff that can (in principle) be collected, manipulated, and analysed. You don’t have to think particularly hard to imagine that there are at least four main ways (please mention more below!) in which this can become ‘big’:

  1. There’s lots of it
  2. There’s lots of it and it’s coming in really quickly
  3. There’s lots of it, but it’s not quite the data you’re after
  4. There’s lots of it, because it’s all linked

Number 1 creates a problem for our traditional analysis programs, and is a pain to visualise in a way that makes sense. Number 2 is similar, but causes additional problems in that it’s difficult to change your analytic approach on the fly: a couple of minutes (or even seconds) spent working out what to do will result in data being lost. This is particularly the case for data streams that are too large to capture efficiently.

Before we get to Number 3, it’s worth saying that Numbers 1 and 2 are less of a problem than the hype suggests. The truth is that even ‘old-fashioned’ data management approaches like SQL-based databases can cope with pretty huge data sets, and programs like SAS (the expensive commercial option) or R (which anyone can play with) can handle larger data sets than most people will ever use. Add in a bit of code-fu using things like regular expressions and you’re sorted.
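To give a flavour of the sort of code-fu I mean, here’s a minimal Python sketch. The notes and the pattern are entirely made up for illustration; it simply uses a regular expression to pull diagnosis-style codes out of messy free text.

    import re

    # Hypothetical example: pull diagnosis codes (e.g. 'C50.9') out of messy
    # free-text notes using nothing beyond the standard library.
    code_pattern = re.compile(r"\b[A-Z]\d{2}(?:\.\d{1,2})?\b")  # rough ICD-10-ish shape

    notes = [
        "Pt seen 2014-03-02, dx C50.9, referred to oncology",
        "Follow-up, no new dx, bloods normal",
        "Admitted with J18.1; discharged after 4 days",
    ]

    for note in notes:
        codes = code_pattern.findall(note)
        if codes:
            print(codes)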

Number 3 is what is most common in healthcare. It describes things like databases of health insurance claims, which are often used to investigate what happens to patients. These are data sets designed for administrative purposes that turn out to be handy for other things. Unfortunately, working out what has actually happened to a patient based solely on their insurance claims is tricky. In healthcare, this stuff is usually called Real-World Data or Real-World Evidence (note the capitals and gentle stench of buzzword), but it can become Big Data pretty easily thanks to the wonders of data linkage. Leading to…

Number 4 is increasingly important in healthcare as nations such as Sweden link health and social care data sets using social security numbers. Essentially, data get bigger when datasets link. A pretty extreme example of this is communications data. Imagine you run a telecoms company and so have information about calls between individuals. The way you store that data makes a huge difference to the size of the data set, and it changes rapidly. Now, if you combine that data with the personal information you hold about your own customers, you can start making inferences about social groupings, their income patterns and so on. You may even be able to link that to location data (even if it’s just through their sign-up postcodes) to map out the social strata of a nation. And that’s just based on your commercial data. That is a data management challenge, but also a huge potential commercial opportunity (and deeply creepy for most people).
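To make the linkage point concrete, here’s a toy Python sketch. Every number, name, and postcode in it is invented (the phone numbers come from the UK’s reserved fictional range), but it shows how little code it takes to turn a call log plus a customer table into a crude map of who talks to whom across postcode areas.

    from collections import Counter

    # Toy data: a call log and a customer table, all entirely invented.
    calls = [
        ("07700900001", "07700900002"),
        ("07700900001", "07700900003"),
        ("07700900002", "07700900003"),
    ]
    customers = {
        "07700900001": {"name": "A", "postcode": "G12"},
        "07700900002": {"name": "B", "postcode": "EH1"},
        "07700900003": {"name": "C", "postcode": "G12"},
    }

    # Link the two data sets and count calls between postcode areas: a crude
    # sketch of how linkage turns billing data into social structure.
    links = Counter()
    for caller, callee in calls:
        pair = tuple(sorted((customers[caller]["postcode"],
                             customers[callee]["postcode"])))
        links[pair] += 1

    print(links.most_common())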

The potential commercial opportunity means that people who can help you manage that data or mine your existing data sources for economically valuable information are worth cash. Lots of cash. That’s where Big Data comes in. If you have a Big Data opportunity, someone is going to be very happy to make money out of you.

Uncertainty is a hard taskmaster

Herein lies the rub.

The problem I have with this is that it’s transient. The management of data is just one aspect of good analysis. It’s a hugely important aspect, but it isn’t everything. The tools we have for managing data on our personal computers will change, even if it’s just a matter of using our computers as thin clients linked to more efficient data-management machines, as we already see with some cloud-based applications.

Quite apart from that, most of the data we’re looking at is redundant and, for most people, big data sets can be turned into much smaller data sets through some simple filtering. Essentially, we’re just in the process of getting used to handling larger data sets. Some data will remain specialist, but most will become much more manageable as the market for it becomes established and the commercial benefit of producing user-friendly tools is realised.

What won’t change, at least immediately, is the difficulty of analysis. Analysis is all about taking data and going beyond it. Going beyond the data involves embracing uncertainty, and uncertainty is a hard taskmaster.

The end of theory really is nonsense

This really is a bit of a red herring, but there are still people talking about the end of theory. The claim is that, with sufficiently large data sets, the associations in the data become strong enough that it is unnecessary to test and adapt theories. There are some real issues with this. Not least of these is the failure to distinguish between describing stuff and inferring stuff.

One of the arguments is that groups like Google manage to use their considerable data-fu to target advertising and the like without recourse to theory at all.

This is correct, but it utterly misses the point.

Producing targeted advertising is a clever trick, but it’s really about managing data rather than interpreting it. It’s about associating information rapidly using the data effluent of internet use. This is great, but it’s not really predictive beyond a very restricted period. Going significantly beyond the data requires an interpretative step, and that’s where theory rears its confusing head.

One example used by people having a pop at Big Data is Google Flu Trends. I think the criticisms are a little harsh, because using search data to predict flu activity is both neat and potentially useful. I find it interesting because it gives the lie to the ‘end of theory’ crowd. Google Flu Trends appears to use an algorithm that relies on a regression model. That means it is not theory-free; it’s just that the theory is not rammed down your throat. It doesn’t work perfectly because it’s a model, and models are imperfect. For example, a lot of what drives its target measure is missing from the model. Put simply, it is used to predict the proportion of physician visits that are due to flu-like illness, which means its output is heavily dependent on the total number of physician visits. Other illness outbreaks (e.g. winter vomiting disease) that aren’t included in the model will affect this (a fact I’m sure they consider in the larger algorithm).
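For the curious: as I understand it, the published description of Flu Trends boils down to a simple linear regression on the log-odds scale, relating the proportion of physician visits due to flu-like illness to the fraction of searches that are flu-related. Here’s a minimal Python sketch of that sort of model; the numbers are invented and this is emphatically not Google’s actual code.

    import math

    def logit(p):
        return math.log(p / (1 - p))

    # Made-up training data: weekly flu-related query fraction versus the
    # proportion of physician visits that were for influenza-like illness (ILI).
    query_frac = [0.010, 0.015, 0.020, 0.030, 0.045]
    ili_frac = [0.012, 0.018, 0.024, 0.035, 0.050]

    x = [logit(q) for q in query_frac]
    y = [logit(p) for p in ili_frac]

    # Ordinary least squares for a single predictor, done by hand.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    beta0 = my - beta1 * mx

    # Predict the ILI visit proportion for a new week's query fraction.
    new_q = 0.025
    pred = 1 / (1 + math.exp(-(beta0 + beta1 * logit(new_q))))
    print(round(pred, 4))

The model itself is tiny; the cleverness (and the theory) sits in choosing which queries to feed it, and that choice is exactly where things can drift.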

What I want to drive home is that people who need to use large data sets do rely on theory. The failures of big data approaches will come as no surprise to epidemiologists. Epidemiology is profoundly shaped by the need to make interpretative leaps because of limitations in the data, and those leaps rest on strong theoretical assumptions. The long-term success of these approaches will inevitably rely on testing those theories and adapting them. Those theories come from somewhere, and building on them relies on putting theories together.

This is where my own bit of marketing bullshit comes in: Big Knowledge.

Oh no, not more buzzword nonsense

Don’t worry. There’s nothing to buy. This is just something that interests me.

We’re currently amazed by what we can get out of large or fast-paced or complex data sets because they are relatively new to us (at least in this form). Once we have a concrete understanding of how to do this, we are left with another big question. So what?

The initial data analysis and theory testing stage is really a reduction process. It’s taking information and summarising it in manageable equations, algorithms, and mathematical structures. We’ve already seen that this does require theory, and that this is akin to standard data analysis…just bigger…obviously. It also has the same problems as standard data analysis. Things like over-interpretation of limited data, missing key mediator variables, and trying to make the data fit your beliefs.

The next steps are more interesting. They involve integration and expansion; drawing these analyses together and looking at what other insights they can provide.

Inference engines and the future that’s already here

Where we have known classes of data we can plan out analyses in advance. We can even plan out ways in which our analyses adapt to their own results, and rules for interpreting those results. In other words, we can build machines that generate simple inferences.

In the same way that statisticians provide tools that allow us to make simple inferences, mathematicians and philosophers have provided us with the logical tools to build from these local inferences into more complex items of knowledge, and even things that look a little like understanding. One of the tools computing has built on this foundation is the inference engine.

This is deeply cool, but let’s start with something simple.

There are lots of established facts. We may happily argue about what constitutes a ‘fact’ from a philosophical point of view, but most of us, most of the time, are happy to assert such things as ‘there is tea in this cup’, ‘force = mass x acceleration’, and ‘a neoplasm is an abnormal growth of tissue’. If we can structure those facts in a consistent format, it becomes possible to use them for reasoning. Consider the following:

  • A neoplasm is an abnormal growth of tissue
  • A tissue is a collection of cells with a shared origin
  • Cell division within the body occurs through mitosis
  • Aurora A kinase is involved in healthy mitosis
  • Aurora A kinase is encoded by the AURKA gene

These individual pieces of information build into something rather more interesting. In particular, if we can structure these facts in a sufficiently coherent way, a machine can put them together. In fact, a machine can put a lot of facts together and manipulate them across long logic chains that would confuse most humans. That makes machines capable of processing facts in a way that generates hypotheses.
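To show how little machinery this needs, here’s a minimal Python sketch of forward chaining over facts stored as (subject, relation, object) triples. The facts echo the list above; the relation names and the two rules are my own inventions for illustration, not any standard ontology.

    # A minimal sketch of forward chaining over facts stored as
    # (subject, relation, object) triples.
    facts = {
        ("neoplasm", "is_abnormal_growth_of", "tissue"),
        ("tissue", "is_collection_of", "cells"),
        ("cell_division", "occurs_through", "mitosis"),
        ("aurora_a_kinase", "involved_in", "mitosis"),
        ("AURKA", "encodes", "aurora_a_kinase"),
    }

    def apply_rules(facts):
        derived = set()
        for (a, r1, b) in facts:
            for (c, r2, d) in facts:
                # If a gene encodes a protein and that protein is involved in a
                # process, infer that the gene is linked to the process.
                if r1 == "encodes" and r2 == "involved_in" and b == c:
                    derived.add((a, "linked_to", d))
                # If something is linked to a process, and another activity
                # occurs through that process, link it to that activity too.
                if r1 == "linked_to" and r2 == "occurs_through" and b == d:
                    derived.add((a, "linked_to", c))
        return derived

    # Keep applying the rules until no new facts appear (forward chaining).
    while True:
        new = apply_rules(facts) - facts
        if not new:
            break
        facts |= new

    print(sorted(f for f in facts if f[1] == "linked_to"))

A real inference engine (Prolog-style backward chaining, description-logic reasoners, and so on) is far more sophisticated, but the principle is the same: structured facts plus rules yield new, checkable statements such as ‘AURKA is linked to cell division’.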

To illustrate this, let us suppose that someone called Ian has a mutation of the AURKA gene. We may immediately start to infer that Ian has a high risk of cancer, which may be true. However, what we are doing is taking information about one level of our understanding and using it to infer something several levels removed. We are also fixating on the information in front of us. I could equally add a statement to the list above saying:

  • Most genetic mutations have little or no effect on gene function

If I add this in, the additional leaps we have made subconsciously become more apparent. What I’m saying is that machines, such as inference engines, may help us avoid common pitfalls in reasoning. Moreover, machines are relatively good at holding multiple possibilities in their working memory simultaneously. An inference engine can go even further and tell us what we should observe if each of these possibilities is true, particularly if it has lots of related inferential constructs. In this case, a machine would probably concentrate on hypotheses at the same level as the original observation:

  • Ian has a mutation of the AURKA gene
  • Mutations either affect gene expression or they do not
  • Mutations either affect transcription or they do not
  • Mutations either affect the structure of the gene product or they do not

Each of these statements is testable, and so a machine can produce low-level hypotheses that we can actually check. These, in turn, can help us to update our beliefs about higher-level hypotheses using linking facts (there’s a toy sketch of this after the list):

  • In order for mutations to affect gene function, they must either affect the production of the substance encoded by the gene, or they must alter that substance itself
  • Therefore: If the mutation does not affect gene expression, transcription, or the structure of the gene product, it is unlikely that it will affect gene function
  • In turn, it is unlikely that it will affect mitosis etc.
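To see what ‘updating our beliefs’ might look like in practice, here’s a toy Bayesian sketch in Python with completely invented numbers. The strict logical version above would set the probability to zero outright once all three possibilities are excluded; the probabilistic version merely allows for the fact that each low-level test can miss a real effect.

    # A toy Bayesian update with invented numbers: how a negative result in
    # each low-level test chips away at the high-level hypothesis that the
    # mutation affects gene function at all.
    prior = 0.10  # invented prior probability that the mutation affects function

    # Invented probabilities of a negative test result if the mutation
    # does / does not affect gene function.
    p_neg_given_effect = 0.30     # tests sometimes miss real effects
    p_neg_given_no_effect = 0.95

    posterior = prior
    for test in ("expression", "transcription", "product structure"):
        # Bayes' rule after each negative result, treating tests as independent.
        num = p_neg_given_effect * posterior
        den = num + p_neg_given_no_effect * (1 - posterior)
        posterior = num / den
        print(f"after negative {test} test: P(affects function) = {posterior:.3f}")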

This is a lot easier than trying to perform objective analysis, even using methods like analysis of competing hypotheses, on high-level information. This is profoundly useful for science in two ways:

  1. It allows complex scientific research to focus on relatively isolated pieces of information
  2. It allows the integration of new knowledge into the corpus of human knowledge rapidly and efficiently

So what?

The big deal about all of this is that, as I said already, we have a lot of facts that we have assembled over the years. Once upon a time, we assembled these facts in oral histories, then books. When that became cumbersome, we assembled them in libraries. Then we added card indexes. Now we have the internet and distributed, searchable information with varying levels of indexing.

When we had a small number of books to worry about, we brought ideas together by extracting them individually and processing them in our own minds. As the knowledge grew, we had to become specialised and look at smaller and smaller areas in more and more detail. A lot of great insights then came when different areas of expertise were brought together, but it has become more and more difficult to do this efficiently.

We need a way to make knowledge scalable. And this is it.

The methods associated with big data are enablers of a greater change, towards an organised knowledge that allows our brains to rest a little and choose their battles. If that sounds like passing responsibility for thought onto computers, relax. We’ve used calculators for ages, and try doing Gibbs sampling on a napkin. When you have to figure something out, you should spend time thinking about it, but I bet you also check the internet, a textbook, or even a colleague.

That’s all this is; humans using external information systems to enhance their cognitive ability.

Getting a bit more concrete

I’m a big fan of something I’m referring to as disease modelling, which is basically taking what we understand at the biochemical, the physiological, and the clinical level and putting it together into predictive models. The leap comes when these are used to describe the interplay of features in individual health and disease.

Okay, so that’s kind of a grand vision. Wahay, how exciting. Who cares? The internet’s full of grand visions. What it needs is some concrete reality and application. Here’s a concrete example from the real world.

In the real world, 23andMe got into some trouble with the FDA because they used the sort of information provided by groups like the National Library of Medicine through databases like ClinVar to tell individuals about health risks.

23andMe looked at reported associations between genotype and clinical phenotype. It has to be said, they did it brilliantly, and the data appear to have been represented pretty well. Unfortunately, they didn’t jump through the necessary regulatory hoops for FDA approval. I’m guessing that they didn’t because they couldn’t. You see, your genotype interacts with all sorts of other risks because of the length of the causal chain (as I’ve mentioned before). There are even specialist healthcare professionals, clinical geneticists, employed to identify and explain these complex situations. A database with a single mapping step to aggregate information and no data on individual risk, the mode of action of genes, or their expression patterns just won’t cut it.

The thing is, there is information on biochemical pathways in a structured format, there is structured information capturing the likely effects of genetic variants on gene expression and function, and there is information about the location and function of genes. Put it together using more than simple association, and you have proper, predictive modelling. Modelling you can use to develop hypotheses that you can test as an ordinary human being. Modelling that incorporates environmental and lifestyle risk alongside genetic risk, including (shock horror) the interaction of the two.
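As a sketch of what I mean (with entirely invented coefficients, and nothing like a validated clinical model), here’s the simplest possible version in Python: a logistic model in which genetic and lifestyle risk interact rather than just adding up.

    import math

    def risk(variant, smoker):
        # Toy logistic risk model with entirely invented coefficients.
        # Baseline log-odds, two main effects, and an interaction term: the
        # variant matters more in smokers than in non-smokers.
        log_odds = (-4.0
                    + 0.8 * variant
                    + 1.2 * smoker
                    + 0.9 * variant * smoker)
        return 1 / (1 + math.exp(-log_odds))

    for variant in (0, 1):
        for smoker in (0, 1):
            print(f"variant={variant} smoker={smoker} risk={risk(variant, smoker):.3f}")

The interaction term is exactly the ‘shock horror’ bit: the effect of the variant depends on the exposure, and vice versa, which is precisely what a single genotype-to-phenotype mapping can’t capture.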

That’s pretty cool.

You can probably tell I’m no great expert, so please tell me why I’m wrong. These are nice ideas, but I want to make them actually happen, and that needs criticisms and suggestions.
