Is the pain worth it?: Can Rcpp speed up Passing Bablok Regression?

Background

R dogma is that for loops are bad because they are slow, but this is not the case in C++. I had never programmed a line of C++ until last week, but my beloved firstborn just started university and is enrolled in a C++ intro course, so I thought I would try to learn some and see if it would speed up Passing Bablok regression.

Passing Bablok Regression

As mentioned in the past, the field of Clinical Chemistry has a peculiar devotion to Passing Bablok regression… and hey, why not?

Here is the code for a minimal implementation of Passing Bablok regression as discussed in this paper. This is the scale-invariant version.
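The original code chunk is not reproduced here, so what follows is a minimal sketch of the idea written against the classical Passing Bablok formulation (all pairwise slopes, shifted median); the function name PB.reg and the handling of ties are illustrative and may differ in detail from the scale-invariant version in the paper.

```r
# Sketch of Passing-Bablok regression: slope is the shifted median of all
# pairwise slopes, intercept is the median of y - b*x.
PB.reg <- function(x, y) {
  n <- length(x)
  S <- numeric(0)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      dx <- x[j] - x[i]
      dy <- y[j] - y[i]
      if (dx != 0) {                   # pairs with identical x skipped here for simplicity
        s <- dy / dx
        if (s != -1) S <- c(S, s)      # slopes of exactly -1 are excluded
      }
    }
  }
  S <- sort(S)
  N <- length(S)
  K <- sum(S < -1)                     # offset for slopes below -1
  if (N %% 2 == 1) {
    b <- S[(N + 1) / 2 + K]
  } else {
    b <- mean(S[N / 2 + K + c(0, 1)])  # some formulations use a geometric mean here
  }
  a <- median(y - b * x)
  c(intercept = a, slope = b)
}
```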

So let’s make some fake data and see what we get:
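The fake-data chunk is also missing; something like this would do (the seed, sample size, bias and noise level are illustrative choices, not necessarily those used originally):

```r
set.seed(42)
x <- runif(100, 1, 10)                      # "reference method" results
y <- 1.1 * x + 0.5 + rnorm(100, sd = 0.4)   # proportional and constant bias plus noise
PB.reg(x, y)
```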

Just to sanity check, we can get the coefficients from the mcr package.
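The sanity check can be done with mcreg() from the mcr package; a minimal call might look like the following (the exact call and output from the original post are not shown, and getCoefficients() is assumed to be the accessor):

```r
library(mcr)
mcr.fit <- mcreg(x, y, method.reg = "PaBa")   # Passing-Bablok via the mcr package
getCoefficients(mcr.fit)                      # intercept and slope with confidence limits
```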

Ok, looks good.

Rcpp

The Rcpp package permits the compilation and execution of C++ code from within R. This can be good for computationally intensive tasks. Here is my child-like attempt at a recapitulation of the R script above.
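The listing for PB_Reg.cpp is not reproduced here; below is one way the same loop-heavy algorithm might look in Rcpp. The function name pb_reg and the details are a sketch mirroring the R version above, not necessarily the file from the original post.

```cpp
#include <Rcpp.h>
#include <algorithm>
#include <vector>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector pb_reg(NumericVector x, NumericVector y) {
  int n = x.size();
  std::vector<double> S;                       // pairwise slopes
  for (int i = 0; i < n - 1; ++i) {
    for (int j = i + 1; j < n; ++j) {
      double dx = x[j] - x[i], dy = y[j] - y[i];
      if (dx != 0.0) {
        double s = dy / dx;
        if (s != -1.0) S.push_back(s);         // slopes of exactly -1 are excluded
      }
    }
  }
  std::sort(S.begin(), S.end());
  int N = S.size();
  int K = 0;                                   // offset for slopes below -1
  for (int k = 0; k < N; ++k) if (S[k] < -1.0) ++K;
  double b = (N % 2 == 1) ?
    S[(N + 1) / 2 + K - 1] :                   // -1 converts to 0-based indexing
    0.5 * (S[N / 2 + K - 1] + S[N / 2 + K]);
  std::vector<double> res(n);                  // intercept: median of y - b*x
  for (int i = 0; i < n; ++i) res[i] = y[i] - b * x[i];
  std::sort(res.begin(), res.end());
  double a = (n % 2 == 1) ? res[n / 2] : 0.5 * (res[n / 2 - 1] + res[n / 2]);
  return NumericVector::create(Named("intercept") = a, Named("slope") = b);
}
```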

Saving this to a local directory as PB_Reg.cpp, we can compile it and call the function into the R session as follows:
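Assuming the file is in the working directory, the compile-and-load step is a one-liner with sourceCpp():

```r
library(Rcpp)
sourceCpp("PB_Reg.cpp")   # compiles the C++ and exposes pb_reg() to the R session
```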

And run it:
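The call is just the exported function applied to the same fake data (output omitted here):

```r
pb_reg(x, y)
```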

Looks good! But is it faster? Like really faster? Let’s find out. And let’s also compare it to what the compiler package’s byte-compiler produces:
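The timings shown below can be produced with rbenchmark; the call here is a reconstruction of what the original chunk presumably did, with cmpfun() from the compiler package supplying the byte-compiled comparator:

```r
library(rbenchmark)
library(compiler)
PB.reg.cmp <- cmpfun(PB.reg)        # byte-compiled version of the plain R function

benchmark("Rscript"          = PB.reg(x, y),
          "Compiler Package" = PB.reg.cmp(x, y),
          "Rcpp"             = pb_reg(x, y),
          replications = 100,
          order = "relative")
```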

              test replications elapsed relative user.self sys.self user.child sys.child
3             Rcpp          100   0.024    1.000     0.024    0.000          0         0
2 Compiler Package          100   0.214    8.917     0.214    0.000          0         0
1          Rscript          100   0.217    9.042     0.213    0.004          0         0

So the C++ code is about 9x faster than the R source code. Not bad, but in this case it was probably not worth the effort. Oh well. I learned something. Incidentally, the compiler package did not help much in this case, as the byte-compiled function is only marginally faster than the raw R source. I have seen better results from compiler in other cases.

Final thought

Speaking of speed:

“Free yourself, like a gazelle from the hand of the hunter, like a bird from the snare of the fowler.”

Proverbs 6:5


Mining Your Routine Data for Reference Intervals: Hoffmann, Bhattacharya and Maximum Likelihood

Background

Let me preface this by saying that I am not making a recommendation to use the Hoffmann method, nor am I advocating for reference interval mining from routine data. There are many challenges associated with this kind of effort; that's for another post, I think. However, I am going to show how one does the calculations for two methods I have seen used: the Hoffmann method and the Bhattacharya method. Then I will show how to do this using the mixtools package in R, which uses the expectation-maximization (EM) algorithm to obtain maximum-likelihood estimates.

The Concept

When you look at histograms of routine clinical data from all comers, on some occasions the data will form a bimodal-looking distribution made up of the putatively sick and the putatively well. If you could statistically determine the distribution of the well subjects, then you could, in principle, determine the reference interval without performing a reference interval study. We can all dream, right?

All three of the approaches I show assume that the two distributions are Gaussian. This is almost never true. But for the purposes of the calculations, I will provide each approach with data that meets the assumptions it makes. So, let's make a fake bimodal distribution and see how each method does. We will assume equal numbers of sick and well so that the bimodal distribution is obvious. One will have \(\mu_1 = 2\) and \(\sigma_1 = 0.5\) and the other will have \(\mu_2 = 6\) and \(\sigma_2 = 2\). The expected normal range for this population is based on \(\mu_1\) and \(\sigma_1\), namely \(2 - 0.5 \times 1.96\) to \(2 + 0.5 \times 1.96\), or about 1 to 3.
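The simulation chunk is not shown; a stand-in with the stated parameters might look like this (the seed and the 10,000-per-group sample size are illustrative choices):

```r
set.seed(10)
n <- 10000
well <- rnorm(n, mean = 2, sd = 0.5)   # putatively well
sick <- rnorm(n, mean = 6, sd = 2)     # putatively sick
dat  <- c(well, sick)                  # equal mixture of the two
hist(dat, breaks = 50, xlab = "Result", main = "Mock bimodal distribution")
```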

[Figure: histogram of the simulated bimodal distribution]

To illustrate how the two populations add, you can plot one in green and one in pink. The overlap shows in a yucky brown.
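One way to draw the overlay with base graphics (the colours are chosen to approximate the description):

```r
# Overlay the two component histograms using transparent fills
hist(well, breaks = 50, col = rgb(0, 1, 0, 0.4), xlim = range(dat),
     xlab = "Result", main = "The two components and their overlap")
hist(sick, breaks = 50, col = rgb(1, 0, 0.5, 0.4), add = TRUE)
```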

[Figure: the two component distributions overlaid in green and pink, with the overlap in brown]

Hoffmann

In 1963 Robert Hoffmann proposed a simple graphical approach to this problem, and use of his method is alive and well—see here for example. The method assumes that both modes are Gaussian and that if one eye-balls (yes…the paper says “eye-fit”) the first linear-looking portion of the cumulative distribution function (CDF) as plotted on normal probability paper and finds its intersection with the lines y = 0.025 and y = 0.975, one can impute the normal range.

What do I plot for Hoffmann: a QQ-plot or the CDF?

It is very important to understand that the use of normal probability paper, as Hoffmann described, is essential because it produces a normal probability plot. As he says,

“This special graph paper serves the useful purpose of 'straightening out' a cumulative gaussian distribution. It forms a straight line.”

A CDF plotted on a linear scale is sigmoidal. This is not what we want. We want a normal probability plot, which is just a special case of the QQ-plot where the comparator distribution is the normal distribution. Inadvertently plotting a plain old CDF will not produce correct estimates of the lower and upper limits of normal (ie \(\mu \pm 1.96\sigma\)). The reason I emphasize this is that I have seen this error made in a number of reference interval papers (though not in the one cited above, which gets it right). The importance of the distinction becomes not-very-subtle when you apply the Hoffmann approach to a pure Gaussian distribution. In short, use of the CDF in linear space generates erroneous results, as we will show later on.

The Correct Approach

Here is the standard r-base normal QQ-plot of our mock data set:
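In base R this is a one-liner (qqline() added for reference):

```r
qqnorm(dat)
qqline(dat)
```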

[Figure: base-R normal QQ-plot of the mock data]

To prevent reader confusion, I am going to present the plots the way Hoffmann originally showed them, so I will put the patient data on the x-axis. It doesn't change anything. You can do it as you like.
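To mimic Hoffmann's orientation, one can plot the sorted results against the corresponding standard normal quantiles; a sketch:

```r
z <- qnorm(ppoints(length(dat)))          # theoretical standard normal quantiles
plot(sort(dat), z, xlab = "Result", ylab = "Normal quantile (z)")
```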

[Figure: Hoffmann-style normal probability plot with the patient results on the x-axis]

From this you can see that there is an obviously linear section between about x = 0 and x = 2 (and, with the eye of faith, a second one after x = 6). This is what Hoffmann calls the “eye-fit”. Since the first linear section is attributable to the first of the two normal distributions which form the overall distribution, we can use it to determine the properties of the first distribution. If I look only at the data between x = 0 and x = 2, I am more or less guaranteed to be in the first linear section. You don't have to kill yourself to identify exactly where the linearity ends, because the density of the points is highest near the middle of the linear section and this will weight the regression for you.

Next, if I extend this line to find its intersection with y = -1.96 and y = 1.96 (ie the z-scores corresponding to the limits of normal, namely the 2.5th and 97.5th centiles), I can estimate the reference interval by dropping perpendicular lines from the two respective intersections. Here is what I get:
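For completeness, here is a sketch of the calculation behind the figure; the cut-offs of 0 and 2 follow the eye-fit described above, and the exact numbers will depend on the simulated data, so it will not reproduce the values quoted below exactly:

```r
hoff <- data.frame(x = sort(dat), z = z)
lin  <- subset(hoff, x > 0 & x < 2)          # the eye-fit linear section
fit  <- lm(z ~ x, data = lin)
b0 <- unname(coef(fit)[1]); b1 <- unname(coef(fit)[2])
(c(-1.96, 1.96) - b0) / b1                   # Hoffmann estimates of the reference limits
```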

[Figure: regression through the linear section, with intersections at y = ±1.96 and the imputed reference limits]

So the Hoffmann reference interval becomes 1.11 to 3.70 which you can compare to the expected values of about 1 and 3 based on the random data. Not the greatest but not bad.

What not to do

Let's apply the correct approach to the Hoffmann method (the QQ-plot) and the incorrect approach (the CDF on a linear scale) to a pseudorandom sample (n = 10,000) from the standard normal distribution, which has a mean of 0 and a standard deviation of 1. Therefore the central 95% or “normal range” for this distribution will be -1.96 to 1.96. I will plot regression lines through the linear part of each curve and find the respective intersections with the appropriate horizontal lines.
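A sketch of this comparison (the choice of |x| < 0.5 as the “linear portion” is an illustrative one):

```r
set.seed(1)
s <- sort(rnorm(10000))                      # standard normal sample
q <- qnorm(ppoints(length(s)))               # theoretical quantiles for the QQ-plot
p <- ppoints(length(s))                      # empirical cumulative proportions for the CDF
idx <- abs(s) < 0.5                          # central, visibly linear region

qq.fit  <- lm(q[idx] ~ s[idx])
cdf.fit <- lm(p[idx] ~ s[idx])

(c(-1.96, 1.96) - coef(qq.fit)[1]) / coef(qq.fit)[2]     # close to +/- 1.96, as it should be
(c(0.025, 0.975) - coef(cdf.fit)[1]) / coef(cdf.fit)[2]  # roughly +/- 1.2, much too narrow
```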

[Figure: QQ-plot and linear-scale CDF of the standard normal sample, with fitted lines and their intersections]

The QQ-plot generates estimates of the limits of normal, \(\mu \pm 1.96\sigma\), of about \(\pm 1.96\), as it should. You can easily show that the same procedure on the CDF intersects the lines \(y = \alpha/2\) and \(y = 1 - \alpha/2\) at values of \(\pm (1 - \alpha)\sqrt{\pi/2}\,\sigma\), which is about \(\pm 1.19\) for \(\sigma = 1\) and \(\alpha = 0.05\). This erroneous estimate is shown with the pink vertical lines. So the Hoffmann method does not work if one attempts to extend the linear portion of the CDF plotted in linear space, and it will produce estimates of \(\sigma\) that are about 40% too low in this case. If you're putting this all together, this is because the CDF is well away from its linear portion when the cumulative proportions are 0.025 and 0.975—not so for a QQ-plot. If you see a “Hoffmann plot” constructed from a sigmoidal CDF plotted on a linear scale, something is wrong.

Bhattacharya

This method is based on a much more highly cited paper published in Biometrics in 1967 by C.G. Bhattacharya. Loosely speaking, Bhattacharya's method determines the parameter estimates \(\mu_i\) and \(\sigma_i\) from the change in the logarithm of the counts from bin to bin, which for a Gaussian component is a linear function of \(x\). It was originally intended as a graphical method and so it also involves some human eye-balling.

We will need the log of the counts from the histogram. When we store the results of a histogram in R, we have the counts automatically.
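A histogram object in R stores the counts and bin midpoints, so we just keep it rather than plotting it:

```r
hst <- hist(dat, breaks = 50, plot = FALSE)
head(hst$counts)   # frequency in each bin
head(hst$mids)     # midpoint of each bin
```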

We can now calculate the log of the counts (denoted \(y\)) and \(\Delta \log(y)\) from bin to bin. We put these in a dataframe along with the counts and the midpoints of the bins. The bin width, chosen to be a constant \(h\), is the distance between the midpoints of adjacent bins.
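Something along these lines builds the working data frame (bins with zero counts give a log count of -Inf and should not fall inside the fitted sections):

```r
bhatt <- data.frame(mid = hst$mids, counts = hst$counts, log.y = log(hst$counts))
bhatt$delta.log.y <- c(diff(bhatt$log.y), NA)   # change in log(y) from each bin to the next
h.width <- diff(hst$mids)[1]                    # the constant bin width, h
```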

Now let's plot \(\Delta log(y)\) as a function of the midpoints of the bins. I also number all the points to facilitate the next step.
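And the plot, with the points numbered:

```r
plot(bhatt$mid, bhatt$delta.log.y,
     xlab = "Bin midpoint", ylab = expression(Delta * log(y)))
text(bhatt$mid, bhatt$delta.log.y, labels = seq_len(nrow(bhatt)), pos = 3, cex = 0.7)
abline(h = 0, lty = 2)
```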

[Figure: Δlog(y) versus bin midpoint, with the points numbered]

We can see from the figure that there are two sections where the plot shows a downsloping line: one between points 2 to 6 and another between points 10 to 21. How straight these lines appear is affected by how wide your bins are so if you get lines that are hard to discern, you can try making fewer bins.

In any case, using Bhattacharya's notation, the next step in the procedure is to draw regression lines through the \(r^{\text{th}}\) linear section and determine the intercept \(\hat{\lambda}_r\) with the x-axis. Bhattacharya intended this as a graphical procedure and advises,

“While matching the straight line it is better to fit closely to the points where the frequency is large even if the apparent discrepancy becomes somewhat large where the frequency is small.”

Since we are doing this by calculation, we can take his advice by weighting the linear regressions according to the counts. This allows the determination of the \(\hat{\mu}_r\) by:

\[\hat{\mu}_r = \hat{\lambda}_r + h/2\]
and also the determination of \(\hat{\sigma}_r\) by:

\[\hat{\sigma}^2_r = -h/\text{slope}_r - h^2/12\]
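A sketch of the calculation, using count-weighted regressions through the two downsloping sections; the point ranges 2–6 and 10–21 come from the figure in the post, and with a different binning the indices would differ:

```r
bhatt.component <- function(idx) {
  fit <- lm(delta.log.y ~ mid, data = bhatt[idx, ], weights = counts)
  b0  <- unname(coef(fit)[1]); b1 <- unname(coef(fit)[2])
  lambda <- -b0 / b1                             # x-intercept of the fitted line
  mu     <- lambda + h.width / 2
  sigma  <- sqrt(-h.width / b1 - h.width^2 / 12)
  c(mu = mu, sigma = sigma)
}
est1 <- bhatt.component(2:6)                     # first (well) component
est2 <- bhatt.component(10:21)                   # second (sick) component
est1["mu"] + c(-1.96, 1.96) * est1["sigma"]      # normal range from the first component
```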

And here are the results we get:

mu Values   sigma Values   Normal Range Limits
     2.06           0.59                  0.90
     6.25           1.83                  3.21

And here is what it all looks like

[Figure: the Bhattacharya fit and resulting estimates overlaid on the data]

In this demonstration there are only two Gaussian distributions to resolve, but the method is not limited to two components at all; if there are more, there will be more downsloping lines crossing the x-axis. So we get normal range estimates of 0.90 and 3.21, which compare much better with the expected values of about 1 and 3. We also get good estimates of \(\mu_2=\) 6.3 and \(\sigma_2=\) 1.8, which compare well with the true values of 6 and 2 used to generate the data.

Bhattacharya also provides a means of calculating the mixing proportion of the two distributions—that is, the proportions of patients in the well and sick populations. We don't need that here, so I omit it.

Gaussian Mixture Model

In R there are a lot of ways to approach the separation of mixtures of distributions using maximum likelihood. Here I am using a function from the mixtools package that is particularly easy to use. The concept of using maximum likelihood for mining your reference interval is not new (see this paper) but many would be intimidated by the math required to do it from scratch.

With R this is pretty easy, but please be cautioned that real data does not play as nicely as the data in this demonstration (even more so for Hoffmann and Bhattacharya) and it is unlikely that you will get smashing results unless your data fits the assumptions of the model.

In any case, the fit itself takes only a couple of lines.
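A minimal call might look like this (the seed for the EM starting values and the which.min() bookkeeping are illustrative additions, since the component ordering is not guaranteed):

```r
library(mixtools)
set.seed(1)
mix.fit <- normalmixEM(dat, k = 2)        # two-component Gaussian mixture fitted by EM
mix.fit$mu                                # component means
mix.fit$sigma                             # component standard deviations
i <- which.min(mix.fit$mu)                # the component representing the well population
mix.fit$mu[i] + c(-1.96, 1.96) * mix.fit$sigma[i]   # normal-range estimate
```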

The fit gives very good parameter estimates indeed! Estimates of \(\mu_1\) and \(\mu_2\) are 2.01 and 6.19 respectively, and estimates of \(\sigma_1\) and \(\sigma_2\) are 0.52 and 1.97 respectively.

Looking at this graphically:
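One way to overlay the fitted components on the histogram with base graphics:

```r
hist(dat, breaks = 50, freq = FALSE, xlab = "Result", main = "EM mixture fit")
curve(mix.fit$lambda[1] * dnorm(x, mix.fit$mu[1], mix.fit$sigma[1]),
      add = TRUE, col = "blue", lwd = 2)
curve(mix.fit$lambda[2] * dnorm(x, mix.fit$mu[2], mix.fit$sigma[2]),
      add = TRUE, col = "red", lwd = 2)
```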

[Figure: fitted mixture components overlaid on the histogram of the data]

So the normal range estimate from the EM method is 1.00 to 3.03, which is pretty fantastic.

Summary of Results

                         LLN    ULN
Raw Random Data         1.03   2.98
Hoffmann                1.11   3.70
Bhattacharya            0.90   3.21
mixtools EM – winner!   1.00   3.03

It's not too hard to figure out which one of these approaches works best. But what do you do if your patient data distribution is obviously not a mixture of Gaussians (ie when the distributions look skewed)? There are ways to handle this in R, but I will cover that another time, maybe in a paper.

Conclusion

  • Three methods of estimating the normal range from a mixture of Gaussians have been presented.
  • The Hoffmann method performs OK if you use a QQ-plot.
  • The Hoffmann method does not work for CDFs plotted on a linear scale.
  • The Bhattacharya method performs better but still requires some human oversight.
  • The normalmixEM() function from the mixtools package performs very well without any human oversight.
  • These results do not imply that any of these approaches will perform well on real patient data for which the components of the overall distribution are not likely to be Gaussian. Caution advised.

Parting Thought

Please don't fall on the wrong side of God's mixture separation procedures for wheat and chaff.

Said, John the Baptist, “But after me comes one who is more powerful than I, whose sandals I am not worthy to carry. He will baptize you with the Holy Spirit and fire. His winnowing fork is in his hand, and he will clear his threshing floor, gathering his wheat into the barn and burning up the chaff with unquenchable fire.”

Matt 3:11–12

Compare Tube Types with R – Repeated Measures ANOVA

Background

Sometimes we might want to compare three or four tube types for a particular analyte on a group of patients, or we might want to see if a particular analyte is stable over time in aliquoted samples. In these experiments we are essentially doing the multivariable analogue of the paired t-test. In the tube-type experiment, the factor that differs between the (‘paired’) groups is the container: serum separator tubes (SST), EDTA plasma tubes, plasma separator tubes (PST) etc. In a stability experiment, the factor that differs is storage duration.

Since this is a fairly common clinical lab experiment, I thought I would just jot down how this is accomplished in R – though I must confess I know just about \(\lim_{x\to0}x\) about statistics. In any case, the statistical test is a repeated-measures ANOVA and this is one way to do it (there are many) including an approach to the post-hoc testing.

Some Fake Data to Work With

I’m going to make some fake data. I tried to dig up the data from an experiment I did as a resident but alas, I think the raw data died on an old laptop. But fake data will do for demonstration purposes. Let’s suppose we are looking at parathyroid hormone (PTH) in three different blood collection tubes: SST, EDTA and PST. For the sake of argument, let’s say that we collect samples from 20 patients simultaneously and we analyze them all as per our usual process. This means that each patient has three samples of material that should be otherwise identical apart from the effects of the collection container.
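The simulation code itself is not shown and the actual values appear in the table below; a sketch in the same spirit would be (the seed, means and biases here are illustrative and will not reproduce the table exactly):

```r
set.seed(2020)
n.subj <- 20
sst  <- round(runif(n.subj, 5, 45), 1)                # SST results
pst  <- round(sst * rnorm(n.subj, 1.02, 0.04), 1)     # PST: small proportional difference
edta <- round(sst * rnorm(n.subj, 1.12, 0.05), 1)     # EDTA: larger proportional difference
tube.data <- data.frame(Subject = 1:n.subj, SST = sst, PST = pst, EDTA = edta)
```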

This is the way we usually express (and receive) data like this in an Excel spreadsheet:

Subject SST PST EDTA
1 17.5 18.1 19.9
2 15.1 15.7 20.0
3 29.0 29.2 32.9
4 5.7 6.2 6.4
5 25.0 26.1 27.0
6 25.7 26.4 29.0
7 41.2 40.8 48.1
8 20.4 22.1 24.3
9 28.7 26.9 36.0
10 11.0 13.9 13.7
11 32.4 31.9 36.9
12 44.5 49.2 57.4
13 16.2 17.1 15.7
14 21.7 24.1 26.3
15 38.8 36.8 42.6
16 34.4 34.0 44.2
17 12.6 12.1 14.1
18 19.8 20.9 25.4
19 19.9 18.2 23.0
20 35.4 37.4 34.1

This Excel-ish way of storing the data is referred to as the “datawide” format for obvious reasons.

Gather the Grain

As it turns out, this is not the way we want to store data for the statistical analyses of interest. What we want is to have the tube type in a single column, because this is the factor that differs within the subjects. We want to gather() or melt() the data (depending on your package of choice) to look like the table below (a code sketch of the reshaping follows after the table):

Subject Tube.Type value
1 SST 17.5
2 SST 15.1
3 SST 29.0
4 SST 5.7
5 SST 25.0
6 SST 25.7
7 SST 41.2
8 SST 20.4
9 SST 28.7
10 SST 11.0
11 SST 32.4
12 SST 44.5
13 SST 16.2
14 SST 21.7
15 SST 38.8
16 SST 34.4
17 SST 12.6
18 SST 19.8
19 SST 19.9
20 SST 35.4
1 PST 18.1
2 PST 15.7
3 PST 29.2
4 PST 6.2
5 PST 26.1
6 PST 26.4
7 PST 40.8
8 PST 22.1
9 PST 26.9
10 PST 13.9
11 PST 31.9
12 PST 49.2
13 PST 17.1
14 PST 24.1
15 PST 36.8
16 PST 34.0
17 PST 12.1
18 PST 20.9
19 PST 18.2
20 PST 37.4
1 EDTA 19.9
2 EDTA 20.0
3 EDTA 32.9
4 EDTA 6.4
5 EDTA 27.0
6 EDTA 29.0
7 EDTA 48.1
8 EDTA 24.3
9 EDTA 36.0
10 EDTA 13.7
11 EDTA 36.9
12 EDTA 57.4
13 EDTA 15.7
14 EDTA 26.3
15 EDTA 42.6
16 EDTA 44.2
17 EDTA 14.1
18 EDTA 25.4
19 EDTA 23.0
20 EDTA 34.1

Now we see that there is a column for tube type and a column for the PTH results, which we can name accordingly. You can see why this is called the “datalong” format.
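The reshaping itself might look like this with tidyr, assuming the wide data frame is called tube.data (the rest of the post refers to the long version as tube.data.2):

```r
library(tidyr)
tube.data.2 <- gather(tube.data, key = "Tube.Type", value = "PTH", -Subject)
tube.data.2$Tube.Type <- factor(tube.data.2$Tube.Type, levels = c("SST", "PST", "EDTA"))
tube.data.2$Subject   <- factor(tube.data.2$Subject)   # Subject must be a factor for aov()
```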

Visualize

Summarize the data:
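For example, with dplyr:

```r
library(dplyr)
tube.data.2 %>%
  group_by(Tube.Type) %>%
  summarise(mean = mean(PTH), sd = sd(PTH), median = median(PTH), n = n())
```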

Let’s just have a quick look graphically:
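A quick per-subject line plot with ggplot2 (an illustrative choice; the original figures are not shown):

```r
library(ggplot2)
ggplot(tube.data.2, aes(x = Tube.Type, y = PTH, group = Subject)) +
  geom_line(alpha = 0.4) +
  geom_point()
```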

[Figure: PTH results for each subject by tube type]

And as a boxplot with the points overtop:
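And a sketch of the boxplot with the individual points on top:

```r
ggplot(tube.data.2, aes(x = Tube.Type, y = PTH)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1, alpha = 0.5)
```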

[Figure: boxplot of PTH by tube type with the individual points overlaid]

Separate the Wheat from the Chaff

Now we want to make comparisons to see if these are different. To accomplish this, we will use the aov() function. This requires us to have data formatted “datalong” as it is in the tube.data.2 dataframe.
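The call, as described in the next paragraph (the object name tube.aov is illustrative):

```r
# Repeated-measures ANOVA: Tube.Type is the within-subject factor
tube.aov <- aov(PTH ~ Tube.Type + Error(Subject/Tube.Type), data = tube.data.2)
summary(tube.aov)
```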

If you are like me, this syntax is confusing. But it goes like this: PTH is a function of Tube.Type, which is straightforward, hence the PTH ~ Tube.Type bit. The error term has the Subject in front of the / and the factor that differs within the subjects (Tube.Type) after the /. That’s my grade 2 explanation from reading this and this and this.

This tells us that there is a difference between the groups but it does not specify where the difference is.

I can’t see the difference. Can you see the difference?

Sorry – I just had to make a pop-culture reference to this. We want to be specific about where the differences are without making a Type I error, which might arise if we blindly charge ahead and do multiple paired t-tests. One easy way to accomplish this is to use the pairwise.t.test() function, which corrects for multiple comparisons; you can choose from a number of p-value adjustment methods. This requires the “response vector”, which is PTH, and the “grouping factor”, which is the tube type.
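A sketch of the call; the Holm adjustment is the function's default, and the post does not say which method it used, so treat that choice as illustrative. The long data must be ordered consistently by Subject within each tube type for paired = TRUE to pair the right observations:

```r
pairwise.t.test(tube.data.2$PTH, tube.data.2$Tube.Type,
                paired = TRUE, p.adjust.method = "holm")
```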

This is pretty easy to understand. There are statistically significant differences between EDTA and PST (p = 0.00083) and between EDTA and SST (p = 0.00008), but none between SST and PST (p = 0.35033).

Conclusion

This is a non-statistician’s approach to tube-type comparisons, which is also applicable to analyte stability studies. It is a one-way repeated measures ANOVA with one within-subjects factor. There is a great deal more to say on the matter by people who know much more; see the citations in the links provided above.

God probably uses datawide format

All the nations will be gathered before him, and he will separate the people one from another as a shepherd separates the sheep from the goats. He will put the sheep on his right and the goats on his left.

(Matt 25:32-33)