The great caffeine conundrum

More and more of us are routinely logging data about our health and wellbeing. How can we use that data to inform and adapt our behaviour? How can we avoid over-interpreting what we experience? Here’s a real-world example.

There are lots of little changes that, we are told, could improve our health, wealth, and happiness in small, incremental ways. The problem is, in part, knowing which ones will work for us, which ones work for other people, and which ones are utter nonsense.

We can get a good head start on assessing what is likely (or not likely) to work using the same sort of tools that are used for evidence-based medicine. In particular, organisations like the Cochrane Collaboration produce reviews of evidence that can tell us about the relative efficacy of different interventions, many of which are non-pharmaceutical. There are even databases. For example, the Database of Abstracts of Reviews of Effectiveness (DARE) at the University of York is pretty cool, although its funding is due to end in January 2015.

We don’t need to rely on other people’s reviews, though. Looking at evidence can be fairly straightforward and there’s no reason why we can’t do most of the searching ourselves. A bit of time spent searching the US National Library of Medicine databases can be very informative.

Unfortunately, even the best review cannot always tell us what will work for us. There are a number of very good reasons for this:

  • Aggregate effects – Systematic reviews and meta-analyses are generally based around average effects in the population. That means that they are useful for telling us what to do for a group of people, but they only give us limited information about individual responses
  • Selective reporting – People spend a lot of time and effort promoting their worldview. Even in the scientific literature, a worrying number of publications appear to fit with the authors’ preconceived ideas. In other words, people often select the evidence that fits their ideas and ignore other data. I would include in here much of the misuse of statistics
  • Unclear reporting – A lot of material is written in a way that makes it very difficult to understand. We’re all guilty of this to some extent, but technical publications really do take the biscuit. Systematic reviews, which should be nice and clear, are now so obsessed with demonstrating objectivity that they can often bury the relevant material deep inside the text about their methods

With formal medical treatments of existing illnesses, the gap between evidence-based medicine and the effects we expect in an individual should be filled by clinical expertise. An important role of the clinician is to take their expertise about the way humans work and interpret the evidence to make an informed guess about what will work for an individual patient.

For other types of intervention, it can be difficult to find someone to bridge the gap.

In reality, when the risks of an intervention are apparently very small and when our professional guides are uncertain, we may just experiment on ourselves. The problem here is that we know what to expect and often second-guess the outcome of our experiment.

What about the caffeine?

This brings us neatly to my brother and coffee. My brother (usually) drinks loads of it, and he decided to take a break and see what would happen. In fact, he decided to steer clear of all caffeine for a bit and see what happened.

This is important. He is not normal. He drinks more coffee than most people, so stopping caffeine consumption should have a greater effect than in most people.

Stage 1: Why are you doing it?

Stopping caffeine consumption could have lots of effects on my brother. It could affect his digestion, give him withdrawal symptoms (e.g. headaches, lethargy), impair his social life, or influence his home life by making him irritable. In an ideal world, I would have listed out all these possible effects and planned some data collection activities to look at them.

But I found out late in the day, so I could only use the data that he collected routinely or could remember. That’s always a challenge, so the best thing is to concentrate on the basics: Why are you doing it?

In my brother’s case, he was cutting out caffeine for some weird body-composition experiment that I didn’t have access to. However, he was also interested to see if it improved his sleep. That gives us a broad initial question:

Does removing caffeine from his diet improve my brother’s sleep?

He told me that his sleep was better. We’ll return to this later.

Stage 2: Get some data

There are lots of intricacies to data collection. There are problems relying on memories of events, there are wish-fulfilment biases, measurement errors, silly psychometric mistakes, and scales that affect the thing they’re trying to measure. It’s a minefield.

Don’t worry about that. You should worry about it if you’re choosing what treatments to provide to terminally ill patients. If you’re making a personal decision about whether to drink coffee, who cares? You need to be confident in the data yourself, and you should be critical (as we’ll see in a moment), but don’t be too daft. You can always change your mind.

My brother had an advantage. He’d been routinely collecting information for some time using a bit of wearable tech called a Jawbone UP. That’s not an endorsement, it’s just a fact. The UP records sleep quality.

That last statement is, of course, nonsense. A measure of sleep quality would probably be a questionnaire of some sort. The precise claim on the UP website is:

UP uses Actigraphy to track your sleep, monitoring your micro movements to determine whether you are awake, in light sleep, or in deep sleep

This translates as an accelerometer that can measure movement, combined with some filters to allow an assessment of whether sleep is ‘light’ or ‘deep’. There’s also an overall ‘sleep quality’ measure that is, presumably, calculated by little pixies in the wristband with a detailed knowledge of human physiology. Or not.

There’s a breakdown of the data available on their forums.

In summary, my guess is ‘light sleep’ means a bit of stirring and ‘deep sleep’ means immobility of near comatose proportions. There’s also a column of data for ‘REM sleep’, but that just contains zeroes for some reason. A lab-controlled environment with professional measurement it ain’t. But it’s probably better than asking him about his sleep a month later.

Stage 3: Look at the data

Always look at data. That’s what it’s there for. Here’s the data:


Stage 4: Question the data

There are some immediate observations that come from the graphs above:

  1. There are a lot of gaps in the data – This is due to the Jawbone UP breaking and being sent away for repair/replacement. Given that, I’m happy to treat the data as missing completely at random, which makes life a lot easier
  2. The sleep quality measure peaks out at 100 fairly regularly – This is called a ceiling effect, and makes the data less useful. Given that and the fact that I don’t understand the measure itself, I’m tempted to avoid analysing it
  3. The sleep duration data does appear to vary around some fairly consistent average values
  4. The big one – I can’t see any dramatic change in sleep pattern
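
The first two observations are easy to put numbers on by counting the missing nights and the nights stuck at the maximum score. This is only a sketch: the data frame `sleep_demo`, its columns `date` and `quality`, and all the values in it are made up for illustration and are not the UP export format.

```r
# Made-up example data; the column names are assumptions, not the UP export.
sleep_demo <- data.frame(
  date    = as.Date("2000-01-01") + c(0, 1, 2, 5, 6, 7),
  quality = c(100, 87, 100, 92, 100, 78)
)

# Gaps: nights inside the date range that have no record at all.
alldates  <- seq(min(sleep_demo$date), max(sleep_demo$date), by = "day")
missing   <- alldates[!(alldates %in% sleep_demo$date)]
n_missing <- length(missing)

# Ceiling effect: share of recorded nights sitting at the maximum score.
ceilingshare <- mean(sleep_demo$quality == 100)
```

If a large share of nights sit at 100, the measure cannot distinguish a good night from a very good one, which is the reason for setting it aside.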

Let’s concentrate on that final observation. When I first got this data, I thought it would be a great opportunity to show a good analytic approach. After all, it’s pretty obvious that stopping caffeine will improve sleep. Right?

Hmmm. The thing is, I don’t actually know when my brother drinks his coffee. Maybe he’s a morning addict and it’s all out of his system by the time he goes to bed. Maybe the caffeine-free diet doesn’t benefit his sleep at all. We have to accept that, maybe, the caffeine isn’t really affecting his sleep.

I still haven’t mentioned when my brother stopped caffeine. His last caffeine was on 27th September. That’s quite late in the graphs above, so maybe that’s hiding some of the effect, if there is one. It also means that we only have a little data after the change. Here’s some detail from that period:


Overall, that makes me very doubtful that we have enough reliable data to come to a firm conclusion.

All is not lost, though.

We have discovered something fundamental about the human condition:

Humans see what they want, and expect, to see

My brother saw an effect, and I believed him, because it is logical that removing caffeine would affect sleep. What matters is that we have looked at the data, questioned it, and adjusted our expectations. That puts us in a position to ask realistic questions of the data.

Stage 5: Turn the question into an analytic question

One of the many things that puts people off engaging in research, I think, is that it seems over-complicated: excessively complex statistical tests look at data in ways that are not immediately intuitive. Sometimes, this complexity is necessary or appropriate. Most of the time, it isn’t.

We started with a fairly simple question:

Does removing caffeine from his diet improve my brother’s sleep?

I’m now going to try and turn that into an analytic question. That’s a bit of jargon I use in teaching to distinguish between normal questions expressed in natural language and the slightly more specific questions we ask statistically. Basically, it means taking a question and expressing it in a way that allows us to explicitly assess hypotheses about the world (Bayesians note use of the word assess instead of test – please don’t get shouty).

In this case, we could start by refining two aspects of the question:

  • Removing caffeine – In this case, we can only infer the effects of removing caffeine generally (i.e. any time my brother were to stop caffeine) from the specific instance we know about (the 27th September onwards)
  • Improve – There are several ways, just looking at the sleep time data, that we could cut this. We could say ‘improve’ means ‘more sleep’. We could say it means ‘more deep sleep’, or we could say it means ‘a greater proportion of sleep is deep’. I didn’t know which one was important to my brother, so I asked. He said, ‘more deep sleep’ is his relevant definition
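
The three candidate definitions of ‘improve’ are each one line of arithmetic. A minimal sketch, assuming nightly totals in minutes; the data frame `nights` and its columns `light` and `deep` are hypothetical names with made-up values.

```r
# Made-up nightly totals in minutes; the names are hypothetical.
nights <- data.frame(
  light = c(240, 260, 250),
  deep  = c(200, 180, 210)
)

total_sleep <- nights$light + nights$deep   # 'more sleep'
deep_sleep  <- nights$deep                  # 'more deep sleep'
deep_share  <- nights$deep / total_sleep    # 'a greater proportion of sleep is deep'
```

The three can disagree: a night with more deep sleep in minutes can still have a smaller share of deep sleep if total sleep rose by more. That is why it was worth asking which one mattered.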

So, our initial analytic question is:

Did my brother experience more deep sleep per night once he had removed caffeine than he experienced previously?

As I’m sure you can see, there are two important features of this question:

  1. I could go on refining it forever (e.g. defining what I mean by ‘more’) without really making it a better question
  2. It doesn’t quite get at the point of interest

The second point there is key. The question as it currently stands is all about one event. We want to take information about that one event and make inferences from it about other events of the same class. In other words, we want to be able to say that, because there was a difference this time, we have reason to believe that there will be differences on future occasions. This inferential step should, in my opinion, be explicitly stated:

Given the change (if any) in deep sleep per night my brother experienced when he stopped taking caffeine, what can we infer about the effects of stopping caffeine on my brother’s deep sleep?

This change does make the question more complex, and a bit less transparent. For this reason, I like to answer the simple version first, then go on to the complex version. People are more likely to understand the analysis that way, I am less likely to make a silly mistake, and the world becomes just a little bit more straightforward.

Do note that, for a formal statistical analysis, you would need to be more precise because the analysis has to be open to independent interpretation. However, this is an analysis to help you make decisions, not to convince other people that your decisions are correct.

Stage 6: Do the analysis

As I said, we’ll start with the simple analysis, which is intended to answer the simple question. If we compare the average duration of deep sleep before and after stopping caffeine, we will see that the difference is not exactly astounding; the means are 198 and 213 minutes, and the medians are 197 and 208 minutes respectively.
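
For anyone following along with their own data, the comparison is just a pair of subtractions. A small sketch; the helper name `compare_periods` is my own invention, and the last line uses only the summary values quoted above rather than the raw data.

```r
# Difference in mean and median between two numeric vectors (after - before).
compare_periods <- function(before, after) {
  c(mean_diff   = mean(after) - mean(before),
    median_diff = median(after) - median(before))
}

# Using the summary values quoted above rather than the raw data:
quoted_diff <- c(mean_diff = 213 - 198, median_diff = 208 - 197)
quoted_diff
```

That gives differences of 15 minutes in the mean and 11 minutes in the median.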

If we look at the actual data plotted (rather crudely – a boxplot would be better), we can see that there is substantial overlap between the values after stopping caffeine and the values before stopping caffeine:

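
The boxplot mentioned above takes one call in R. A sketch only: the two vectors here are made-up deep-sleep minutes so that the example runs on its own, and they are not my brother’s data.

```r
# Made-up deep-sleep minutes for illustration only; not the real data.
before <- c(170, 185, 190, 197, 200, 205, 215, 230)
after  <- c(190, 200, 208, 215, 225)

# Side-by-side boxplots make the overlap between the two periods easy to see.
bp <- boxplot(list(`With caffeine` = before, `Without caffeine` = after),
              ylab = "Deep sleep (minutes)")
```

If the two boxes overlap heavily, as they do in the real data, that is already a strong hint that any effect is small relative to the night-to-night variation.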

Based on this alone, my brother might have enough information to make a decision about his caffeine use. A rational person would say that the effect, if any, is probably around 10-15 minutes of additional deep sleep a night without caffeine, and even that is rather uncertain.

Let’s see if we can be a bit more precise than that.

One way of answering our question is to use bootstrapping. This involves repeatedly drawing random samples, with replacement, from the data we already have and using them to make inferences about a larger data set. These data are ideal for this approach because:

  • We have lots of baseline data – That is, we have a lot of data from when my brother was using caffeine, so we can generate reliable samples from this
  • I don’t want to make distributional assumptions – The data look fairly normally distributed around a mean value, but I don’t want to make any assumptions about this
  • I don’t want to lose individual detail – This is important. I don’t need to generalise my findings to other people, just to my brother’s sleep more generally. Given the amount of data we have, I want to use it and capture his idiosyncrasies rather than work around them

As we are using bootstrapping, we can make the analytic question even more specific:

If I took any random sample of ten days’ sleep data from the period before he stopped caffeine, what is the chance that I would see a pattern at least as extreme as my brother saw?

I’m going to do this analysis in R, which is free and can be used by anyone. There are packages available that allow us to do this analysis without coding, but I’m doing it long-hand because it is both easy and more transparent. I also want to show that statistical programming should be kept as simple as possible so that a mere mortal may read it.

First of all, we open the programme and set our working directory:
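
The exact command depends on where your file lives, so treat this as a sketch: the folder name `~/sleepdata` is a hypothetical placeholder, not a path from my machine.

```r
# Hypothetical folder; replace it with wherever your CSV file is saved.
datadir <- "~/sleepdata"

# Only change directory if the folder exists, so the line is safe to run as-is.
if (dir.exists(datadir)) setwd(datadir)
getwd()  # shows the directory R is currently working in
```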


Then we import data from before caffeine cessation. In this case I have it as a nice comma separated file:

withcaff <- read.csv("deepsleepbeforecaffeine.csv", header=TRUE)

This gives you a thing called a data frame object, which won’t work properly later on. Therefore, we need to change it into a vector (i.e. a simple list of numbers):

withcaff <- as.vector(withcaff[,1])

As you can see, I am calling this data object withcaff.

We are interested in looking at the likelihood of observing our data given what we already know about my brother’s sleep patterns. Specifically, we want to know what the likelihood is of seeing the mean and median deep sleep times we observed after caffeine cessation. To do that, R needs to know what the median and mean values were:

criticalmean <- c(213)
criticalmedian <- c(208)

What we’re going to do now is find out the likelihood of observing these values or more extreme ones given the data we have on my brother’s sleep before he stopped taking caffeine.

We now specify the number of samples we want to take. I’m going to take a wild stab and go for 10,000.

A <- c(10000)

Then, we create a couple of objects to hold the results of our sampling:

meanvalues <- c()
medianvalues <- c()
length(meanvalues) <- A
length(medianvalues) <- A

Once we have done this, we can simply repeatedly sample from the data we have and record the mean and median values in our vectors. Our post-caffeine sample was of 10 days’ sleep, so I will take lots of samples of size 10.

I am using a for loop to do this. It simply runs a set of commands a set number of times. In this case, it is running it A times (i.e. 10,000 times). For loops are often considered bad practice in R, but I don’t care – it’s clearer doing it this way and I can wait a few seconds for the analysis to run:

for (i in 1:A) {
  # draw 10 nights at random, with replacement, from the with-caffeine data
  randomsamples <- sample(withcaff, size=10, replace=TRUE)
  medianvalues[i] <- median(randomsamples)
  meanvalues[i] <- mean(randomsamples)
}
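
For anyone who does object to the loop, the same resampling can be written without one. A minimal sketch, assuming made-up baseline values: `withcaff_demo`, `reps` and the `_demo` result names are hypothetical and exist only so the example runs on its own.

```r
# Made-up baseline values so the example runs on its own; not the real data.
withcaff_demo <- c(170, 185, 190, 197, 200, 205, 215, 230)

set.seed(42)   # makes the random samples repeatable
reps <- 10000  # number of resamples, matching A in the loop version

# Each column of `samples` is one resample of 10 nights, drawn with replacement.
samples <- replicate(reps, sample(withcaff_demo, size = 10, replace = TRUE))

meanvalues_demo   <- colMeans(samples)
medianvalues_demo <- apply(samples, 2, median)
```

Both versions do the same job. The loop spells out each step; this one is shorter but asks the reader to know what `replicate` returns.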

Now we have the data, we can look at the likelihood of seeing the sleep times we actually saw, by just counting:

pmedian <- (sum(medianvalues>=criticalmedian))/A
pmean <- (sum(meanvalues>=criticalmean))/A

The pmedian I get is 0.2268.

This can be interpreted as:

Given what we know about my brother’s sleep, our best guess of the chance that any ten day period has a median deep sleep value of at least 208 minutes is approximately 0.23, or 23 percent.

The pmean I get is 0.1167.
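
It is fair to ask whether 10,000 samples was enough. One rough check is the simulation error of an estimated proportion, which is the usual standard error for a proportion. The sketch below uses only the two values reported above; `nsamples` and `mc_se` are names I have made up for the example.

```r
nsamples <- 10000
p <- c(median = 0.2268, mean = 0.1167)

# Standard error of a proportion estimated from nsamples independent resamples.
mc_se <- sqrt(p * (1 - p) / nsamples)
round(mc_se, 4)
```

Both come out below half a percentage point, so re-running the analysis should not move the answers much. Note that this only covers the randomness of the resampling. It says nothing about the much larger uncertainty that comes from having just ten nights without caffeine.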

We could also look at the results of our analysis using a histogram:

hist(medianvalues, breaks=50)
abline(v=criticalmedian, col='red')
hist(meanvalues, breaks=50)
abline(v=criticalmean, col='red')

The histograms look like this:



The net result of all of this is that stopping caffeine may have affected my brother’s deep sleep a bit, but we cannot be very sure about this.

Stage 7: Next steps

My brother started taking caffeine again. I could have discussed with him how much more data he would need to make a more certain decision. This is actually quite an easy thing to do, particularly with a bootstrapping approach. However, he decided against bothering. Why?
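
To give a flavour of how that discussion might have gone, here is one way to sketch it. Everything in it is an assumption: the baseline is simulated rather than real, the true effect is set to 15 minutes, the cut-off is the conventional 5 percent, and `power_for_n` is a name I have made up.

```r
# Simulated baseline of deep-sleep minutes; made up, NOT the real data.
set.seed(1)
baseline <- round(rnorm(100, mean = 198, sd = 30))

# Estimated chance that n caffeine-free nights, each `shift` minutes better
# than baseline, would give a mean beyond the top `alpha` share of means
# resampled from the baseline alone.
power_for_n <- function(baseline, n, shift, reps = 500, boot = 2000, alpha = 0.05) {
  nullmeans  <- replicate(boot, mean(sample(baseline, size = n, replace = TRUE)))
  cutoff     <- quantile(nullmeans, 1 - alpha)
  aftermeans <- replicate(reps, mean(sample(baseline, size = n, replace = TRUE)) + shift)
  mean(aftermeans > cutoff)
}

# Compare 10, 30 and 60 caffeine-free nights under the assumed 15-minute effect.
chance <- sapply(c(10, 30, 60), function(n) power_for_n(baseline, n, shift = 15))
chance
```

The numbers depend entirely on the assumed effect and the assumed night-to-night spread, so the useful part is the comparison across sample sizes rather than any single value.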

Well, he did have some withdrawal effects: a little weight gain and irritability. These may well have passed with time (I would certainly expect them to). More importantly, though, he just didn’t see the benefit in proportion to the disbenefit (or cost) of stopping caffeine. Ultimately, that balance is what ends up guiding our decisions. The role of statistics in this situation should be to inform that decision.


I hope this has shown you some approaches to looking at data about yourself. It is perfectly possible to analyse this data at home and use those analyses to inform your own decisions. The statistics involved do not need to be complex.

In this particular case, we could see the data clearly and honestly and use that to temper the preconceptions we had about the effects of caffeine to help my brother make a better decision that he could justify to himself.

What do you think? How do you look at data from your wearables?

9 comments on “The great caffeine conundrum”
  1. Haploflow says:

    This article also highlights a very important point that everyone should consider when making an intervention: what should I measure and why?

    Caffeine has multiple pathways of action, known and probably unknown. Are we measuring the right things? Are sleep quality and quantity the only factors? This is true of all interventions – multiple pathways, multiple points of measurement. Also, each individual will vary in the order of preference in what they think is important to their wellbeing.

    • Monkey says:

      This is a hugely important point, and thanks for making it.

      As I wrote this, I noticed that the issue of measurement and outcomes comes out quite strongly. You’ll see from some of the other posts that the issue of where on the causal chain you measure, and how you measure, is something that is both challenging and important.

      Sleep, for a lot of people, would be the relevant outcome. For my brother, it wasn’t the main outcome of interest. However, even if sleep were the main outcome of interest (e.g. in a clinical sleep study) we should still have some theory about the mode of action for caffeine. We can then use the sleep data to assess our theory, as well as other intermediate measures.

      The really cool thing is that, if we select the right measures, we might be able to identify people who would, and would not, benefit from reducing caffeine. For example, we might look at the effect of timing of caffeine use, or we might look at adenosine receptors (one site of action for caffeine) or even caffeine metabolites to see if people that metabolise it better experience fewer sleep problems.

      This may all sound a bit like you need lots of kit, but there are things you can do at home. For example, the metabolites of caffeine appear to be involved in heart rate effects. Given that, it may be possible to correlate changes in heart rate following consumption with sleep effects on the same day.

      This all gets too complicated very quickly, which is why I suggest concentrating on what you want to achieve and why you think that will happen. This allows you to build questions that are targeted and can actually be answered sensibly.

      Thanks for reading – let me know if there’s anything you’d particularly like to read about!

      • Thomas says:

        In some people with ADD/ADHD, stimulants such as caffeine can have a calming effect, enough so that they can fall asleep. The principle is that in a person who has ADD, their thoughts are jumping around from one to another, constantly engaging the brain in mental activity. But stimulants help a person with ADD concentrate on just one or a few things, so that they can concentrate on relaxing more, to the point where they fall asleep more easily. Also, CNS stimulants like caffeine sometimes have the opposite effect in children, but I haven’t heard of a theory as to why.

        • Monkey says:

          I haven’t seen a comprehensive explanation of the paradoxical effects of stimulants in ADD/ADHD. They’ve been reported since before the Second World War, so I’m comfortable assuming that they are real!

          It’s worth noting that things like sleep-wake cycles and stimulation/arousal are often non-linear because they are controlled by multiple interacting systems. Typically, these involve quite complex nerve pathways between the cortex (the external part of the brain) and the subcortical areas (the bits in the middle near the brainstem) that are involved in basic life functions.

          In case the idea of competing systems is counterintuitive, the way I think about it is that small children are often highly over-active just before they fall asleep to the extent that you can almost see the arousal and sleepiness systems competing, resulting in clumsy running about etc just before sleep. Indeed, they’re often most overactive when they’re most tired, presumably because the two (or more) balancing systems get out of step with each other.

          I don’t know, but I suspect the apparently paradoxical effects of stimulants are related to this. This might also explain why there appear to be subgroups of people with ADD/ADHD who do not respond to standard stimulant therapy – different people might have difficulties with different parts of the interacting systems.

          Sorry I can’t be more useful – if anyone else has any suggestions, please say!

  2. RJ says:

    Fantastic article Monkey G.

    I had the Jawbone Up for a while, but by the third time it broke down, I just asked for a refund. However, I have been using an app called sleep cycle, which allows you to tag the sleep data with up to 5 customised items, e.g. ate late, exercise etc. I wonder how easy it would be to design an open source experiment where you can get participants to tag their sleep data with pre-determined attributes before getting them to send in their results to be analysed in an aggregate form. I wonder how you could remove the biases that would come out of such public data?

    Anyway, just thinking out aloud.


    • Monkey says:

      Thanks for the comment, RJ. This is a great idea and brilliantly simple.

      The issue of bias does matter, but I think there are ways around it. One approach is to concentrate on exploratory analysis (or epidemiological methods) that is primarily hypothesis-generating. This is one of the underlying strengths of large data sets – effects that are very difficult to see in individuals (because of all the other factors influencing sleep, performance etc.) can be seen in a large sample. For example, we may look at the variability of people’s sleep and note that people who eat late tend to get less sleep. There could be all sorts of reasons for this (because association is not the same as causation), but it creates a testable hypothesis.

      Here’s the thing: If people are monitoring lots of different parameters and agree to be part of an experiment, we can ask them to eat at different times without necessarily telling them which particular consequence we are interested in. This means that we can run a proper randomized, controlled, experiment.

      It’s not perfect – blinding people to the intervention won’t work, for example. However, we can minimize the biases present.

      I quite like the idea of collecting lots of data routinely and using it to generate hypotheses that are then tested, and replicated, amongst a sample of users.

      Actually, thinking about it, you could even go further. You could tell some people why you are doing it, and keep other people blind to the purpose. That would allow you to estimate the size of bias present, which would be pretty neat.

      The underlying point is, as you say, having data entered in a consistent format so that you can collate it across people.

      • Johan Lindén says:

        Interesting article and comments. Will check out this blog further. I am also thinking of taking a stats class again. But maybe I should outsource my stats questions when needed.

  3. Brian Toomey says:

    This is great. Really fun read, and interesting data.

    Three small comments:

    1.) I spoke with Seth Roberts extensively about this before his passing, and he convinced me that a qualitative metric of how rested you feel with 3 significant digits would outperform trackers, and alleviate the ceiling effect. It’s lovely to collect data in an automated way like a Jawbone would, but I am not sure keeping it charged is actually less effort than jotting down a three digit number would be. Switching to this metric might give you more reactive and richer data.

    2.) I think it’s a mistake to assume that the effects of caffeine are monotonic. See for an excellent take on hormesis on non-monotonicity of most phenomena. It’s possible that smaller doses are better than zero or larger doses.

    3.) You might also look at timing effects. For me, caffeine before and after noon are night and day different (literally I guess), and, say, 50mg caffeine at 9am and 4pm are totally different. Seth Roberts found this for Vitamin D as well.

    • Monkey says:

      Thanks for a brilliantly useful and insightful comment, Brian.
      I think I agree about self-reported sleep data. It is important to have some anchor point to avoid bias, but that’s manageable. Sometimes, it feels like we use tech for the sake of it when simpler solutions would be just as good.
      Similarly, I agree on the issues of monotonicity and timing. Thanks for mentioning them. We take non-monotonic relationships for granted in drug pharmacokinetics, then forget them when talking about exercise or lifestyle interventions.

      Anyway, great comments, and thanks for adding them.
