Lab Medicine Stats – The Lab-R-torian

Is the pain worth it?: Can Rcpp speed up Passing Bablok Regression?

September 10, 2019September 10, 2019 dtholmes@mail.ubc.ca

Background

R dogma is that for loops are bad because they are slow but this is not the case in C++. I had never programmed a line of C++ as of last week but my beloved firstborn started university last week and is enrolled in a C++ intro course, so I thought I would try to learn some and see if it would speed up Passing Bablok regression.

Passing Bablok Regression

As mentioned in the past, the field of Clinical Chemistry has a peculiar devotion to Passing Bablok regression… and hey, why not?

Here is the code for a minimal implementation of Passing Bablok regression as discussed in this paper. This is the scale-invariant version.

PB.reg <- function(x,y){
  lx <- length(x)
  l <- lx*(lx - 1)/2
  k <- 0
  S <- rep(NA, lx)
  for (i in 1:(lx - 1)) {
    for (j in (i + 1):lx) {
            k <- k + 1
            S[k] <- (y[i] - y[j])/(x[i] - x[j])
    }
  }
  S.sort <- sort(S)
  N <- length(S.sort)
  neg <- length(subset(S.sort,S.sort < 0))
  K <- floor(neg/2)
  if (N %% 2 == 1) {
      b <- S.sort[(N+1)/2+K]
  } else {
      b <- sqrt(S.sort[N / 2 + K]*S.sort[N / 2 + K + 1])
  }
  a <- median(y - b * x)
  return(as.vector(c(a,b)))
}

PB.reg <- function(x,y){

lx <- length(x)

l <- lx*(lx - 1)/2

k <- 0

S <- rep(NA, lx)

for (i in 1:(lx - 1)) {

for (j in (i + 1):lx) {

k <- k + 1

S[k] <- (y[i] - y[j])/(x[i] - x[j])

}

S.sort <- sort(S)

N <- length(S.sort)

neg <- length(subset(S.sort,S.sort < 0))

K <- floor(neg/2)

if (N %% 2 == 1) {

b <- S.sort[(N+1)/2+K]

} else {

b <- sqrt(S.sort[N / 2 + K]*S.sort[N / 2 + K + 1])

}

a <- median(y - b * x)

return(as.vector(c(a,b)))

}

So let’s make some fake data and see what we get:

set.seed(314)
x <- seq(1, 100, length.out = 100)
y <- 1.1* x + 20 + x*rnorm(100,0,0.10)
reg <- PB.reg(x,y)
round(reg,2)

set.seed(314)

x <- seq(1, 100, length.out = 100)

y <- 1.1* x + 20 + x*rnorm(100,0,0.10)

reg <- PB.reg(x,y)

round(reg,2)

## [1] 18.90  1.14

1	## [1] 18.90 1.14

Just to sanity check, we can get the coefficients from the mcr() package.

library(mcr)
reg.mcr <- mcreg(x,y)
round(reg.mcr@glob.coef,2)

library(mcr)

reg.mcr <- mcreg(x,y)

round(reg.mcr@glob.coef,2)

## [1] 18.90  1.14

1	## [1] 18.90 1.14

Ok, looks good.

Rcpp

The Rcpp package permits compiling and execution of C++ code from within R. This can be good for computationally intensive tasks. Here is my child-like attempt at recapitulation of the R script above.

#include <iostream>
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>
#include <numeric>
#include "Rcpp.h"
using namespace std;
using namespace Rcpp;
// [[Rcpp::plugins("cpp11")]]
// [[Rcpp::export]]

vector <double> PB(vector<double> x,vector<double> y){
  int lx=x.size();
  int l=lx*(lx-1)/2;
  int k=-1;
  vector <double> S(l,numeric_limits<double>::quiet_NaN());
  vector <double> sortS;
  vector <double> neg;
  double b;
  double a;
  vector <double> aVec(x.size());
  vector <double> coef(2);

  // Calculate the components of the S Matrix
  for(int i=0; i<lx-1;i++){
    for(int j=i+1; j<lx;j++){
        k+=1;
        S[k]=(y[i]-y[j])/(x[i]-x[j]);
    }
  }

  // Sort the S Matrix
  sort(S.begin(),S.end());

  // Delete any undefined values from S
  for(vector <double> :: iterator it = S.begin(); it != S.end(); ++it){
      if (!isnan(*it)) sortS.push_back((*it));
  }

  // Put all the negative values into a vector called neg
  for(vector <double> :: iterator it = S.begin(); it != S.end(); ++it){
      if ((*it)<0) neg.push_back((*it));
  }

  // Calculate N and K
  int K=int(floor(float(neg.size()/2)));
  int N=sortS.size();

  // Calculate the slope
  if (N%2==1) {
    b=S[static_cast<int> ((N+1)/2+K-1)];
  } else {
    b=sqrt(S[static_cast<int> (N/2+K-1)]*S[static_cast<int> (N/2+K)]);
  }

  // Make a vector aVec of the estimates of the intercept
  for(int i=0; i<x.size(); i++){
      aVec[i]=y[i]-b*x[i];
  }
  sort(aVec.begin(),aVec.end());

  // Calculate the median from the sorted values of aVec
  if(x.size()%2==1){
    a=aVec[static_cast<int>((x.size()+1.0)/2.0-1.0)];
  }else{
    a=(aVec[static_cast<int> (x.size()/2.0-1.0)] + aVec[static_cast<int> (x.size()/2.0)])/2.0;
  }

  // Report results
  coef[0]=a;
  coef[1]=b;
  return coef;
}

int main(){
  return 0;
}

#include <iostream>

#include <vector>

#include <cmath>

#include <limits>

#include <algorithm>

#include <numeric>

#include "Rcpp.h"

using namespace std;

using namespace Rcpp;

// [[Rcpp::plugins("cpp11")]]

// [[Rcpp::export]]

vector <double> PB(vector<double> x,vector<double> y){

int lx=x.size();

int l=lx*(lx-1)/2;

int k=-1;

vector <double> S(l,numeric_limits<double>::quiet_NaN());

vector <double> sortS;

vector <double> neg;

double b;

double a;

vector <double> aVec(x.size());

vector <double> coef(2);

// Calculate the components of the S Matrix

for(int i=0; i<lx-1;i++){

for(int j=i+1; j<lx;j++){

k+=1;

S[k]=(y[i]-y[j])/(x[i]-x[j]);

}

// Sort the S Matrix

sort(S.begin(),S.end());

// Delete any undefined values from S

for(vector <double> :: iterator it = S.begin(); it != S.end(); ++it){

if (!isnan(*it)) sortS.push_back((*it));

}

// Put all the negative values into a vector called neg

for(vector <double> :: iterator it = S.begin(); it != S.end(); ++it){

if ((*it)<0) neg.push_back((*it));

}

// Calculate N and K

int K=int(floor(float(neg.size()/2)));

int N=sortS.size();

// Calculate the slope

if (N%2==1) {

b=S[static_cast<int> ((N+1)/2+K-1)];

} else {

b=sqrt(S[static_cast<int> (N/2+K-1)]*S[static_cast<int> (N/2+K)]);

}

// Make a vector aVec of the estimates of the intercept

for(int i=0; i<x.size(); i++){

aVec[i]=y[i]-b*x[i];

}

sort(aVec.begin(),aVec.end());

// Calculate the median from the sorted values of aVec

if(x.size()%2==1){

a=aVec[static_cast<int>((x.size()+1.0)/2.0-1.0)];

}else{

a=(aVec[static_cast<int> (x.size()/2.0-1.0)] + aVec[static_cast<int> (x.size()/2.0)])/2.0;

}

// Report results

coef[0]=a;

coef[1]=b;

return coef;

}

int main(){

return 0;

}

Saving this to a local directory as PB_Reg.cpp, we can compile it and call the function into the R session as follows:

library(Rcpp)
sourceCpp("PB_Reg.cpp")

1 2	library(Rcpp) sourceCpp("PB_Reg.cpp")

And run it:

round(PB(x,y),2)

1	round(PB(x,y),2)

## [1] 18.90  1.14

1	## [1] 18.90 1.14

Looks good! But is it faster? Like really faster? Let’s find out. And let’s also compare it to what the compiler() package produces:

library(compiler)
PB.reg.cmpfun <- cmpfun(PB.reg)

library(rbenchmark)
bmark <- benchmark(
  Rscript = PB.reg(x,y),
  `Compiler Package` = PB.reg.cmpfun(x,y),
  Rcpp = PB(x,y),
  replications = 100,
  order = "relative"
)
knitr::kable(bmark)

library(compiler)

PB.reg.cmpfun <- cmpfun(PB.reg)

library(rbenchmark)

bmark <- benchmark(

Rscript = PB.reg(x,y),

`Compiler Package` = PB.reg.cmpfun(x,y),

Rcpp = PB(x,y),

replications = 100,

order = "relative"

)

knitr::kable(bmark)

	test	replications	elapsed	relative	user.self	sys.self
3	Rcpp	100	0.024	1.000	0.024	0.000
2	Compiler Package	100	0.214	8.917	0.214	0.000
1	Rscript	100	0.217	9.042	0.213	0.004

So this makes C++ code about 9x faster than R source code. Not bad but in this case it was probably not worth the effort. Oh well. I learned something. Incidentally, the compiler() package did not help much in this case as it is only marginally better than raw R source. I have seen better results from compiler() in other cases.

Final thought

Speaking of speed:

“Free yourself, like a gazelle from the hand of the hunter, like a bird from the snare of the fowler.”

Proverbs 6:5

Mining Your Routine Data for Reference Intervals: Hoffmann, Bhattacharya and Maximum Likelihood

September 5, 2017February 1, 2018 dtholmes@mail.ubc.ca

Background

Let me preface this by saying I am not making a recommendation to use the Hoffmann method. Neither am I advocating for reference interval mining from routine data. There are many challenges associated with this kind of effort. That's for another post I think. However, I am going to how one does the calculations for two methods I have seen used: the Hoffmann Method and the Bhattacharya Method. Then I will show how to do this using the mixtools package in R which uses the expectation maximum algorithm to determine the maximum likelihood.

The Concept

When you look at histograms of routine clinical data from allcomers, on some occasions the data will form a bimodal looking distribution formed by the putatively sick and well. If you could statistically determine the distribution of the well subjects, then you could, in principle, determine the reference interval without performing a reference interval study. We can all dream, right?

All three of the approaches I show assume that the two distributions are Gaussian. This is almost never true. But for the purposes of the calculations, I will provide each approach data that meets the assumptions it makes. So, let's make a fake bimodal distribution and see how each method does. We will assume equal numbers of sick and well so that the bimodal distribution is obvious. One will have $\mu_1 = 2$ and $\sigma_1 = 0.5$ and the other will have $\mu_2 = 6$ and $\sigma_2 = 2$. The expected normal range for this population is based on $\mu_1$ and $\sigma_1$ and is $2 – 0.5 \times 1.96$ and $2 + 0.5 \times 1.96$ or about 1–3.

#two Gaussian distributions with means of 2 and 6 respectively and sd's of 1 and 2
set.seed(10) #to make sure you generate the same data
mode1 <- rnorm(1000,2,0.5)
mode1 <- mode1[mode1 > 0] #get rid of negative results
mode2 <- rnorm(1000,6,2)
mode2 <- mode2[mode2 > 0] 
d <- sort(c(mode1,mode2))
dhist <- hist(d,
              breaks = c(seq(0,20,0.25),100),
              xlim = c(0,10),
              main = "Histogram of Patient Results",
              xlab = "Concentration of Analyte")

#two Gaussian distributions with means of 2 and 6 respectively and sd's of 1 and 2

set.seed(10) #to make sure you generate the same data

mode1 <- rnorm(1000,2,0.5)

mode1 <- mode1[mode1 > 0] #get rid of negative results

mode2 <- rnorm(1000,6,2)

mode2 <- mode2[mode2 > 0]

d <- sort(c(mode1,mode2))

dhist <- hist(d,

breaks = c(seq(0,20,0.25),100),

xlim = c(0,10),

main = "Histogram of Patient Results",

xlab = "Concentration of Analyte")

plot of chunk unnamed-chunk-1

To illustrate how the two populations add you can plot one in green and one in pink. The overlap shows in a yucky brown.

hist(d,
     breaks = c(seq(0,10,0.25),100),
     freq = TRUE,
     xlim = c(0,10),
     main = "Histogram of Patient Results",
     xlab = "Concentration of Analyte")
hist(mode1, breaks = c(seq(0,10,0.25),100), add = TRUE, col = rgb(0,1,0,0.3), freq = TRUE)
hist(mode2, breaks = c(seq(0,10,0.25),100), add = TRUE, col = rgb(1,0,0,0.3), freq = TRUE)

hist(d,

breaks = c(seq(0,10,0.25),100),

freq = TRUE,

xlim = c(0,10),

main = "Histogram of Patient Results",

xlab = "Concentration of Analyte")

hist(mode1, breaks = c(seq(0,10,0.25),100), add = TRUE, col = rgb(0,1,0,0.3), freq = TRUE)

hist(mode2, breaks = c(seq(0,10,0.25),100), add = TRUE, col = rgb(1,0,0,0.3), freq = TRUE)

plot of chunk unnamed-chunk-2

Hoffmann

In 1963 Robert Hoffmann proposed a simple graphical approach to this problem and use of his method is alive and well—see here for example. The method assumes that both modes are Gaussian and that if one eye-balls (yes…the paper says “eye-fit”) the first linear-looking portion of the cumulative probability distribution (CDF) function as plotted on normal probability paper and finds its intersection with the lines y = 0.025 and y = 0.975, one can impute the normal range.

What do I plot for Hoffmann: a QQ-plot or the CDF?

It is very important to understand that the use normal probability paper, as Hoffmann described, was mandatory because it produces a normal probablity plot. As he says,

“This special graph paper serves the useful purpose of 'straightening out' a cumulative gaussian distribution. It forms a straight line.”

A CDF plotted on linear scale is sigmoidal. This is not what we want. We want a normal probability plot which is just a special case of the QQ-plot where the comparator distribution is the normal distribution. Inadvertently plotting a plain old CDF will not produce correct estimates of the lower and upper limits of normal (ie $\mu \pm 1.96\sigma$). The reason I emphasize this is that I have seen this error made in a number of reference interval papers—but not the one I cited above—it is correct. The importance of the distinction becomes not-very-subtle when you apply the Hoffmann approach to a pure Gaussian distribution. In short, use of the CDF in linear space generates erroneous results as we will show later on.

The Correct Approach

Here is the standard r-base normal QQ-plot of our mock data set:

qqnorm(d, type = "l")

1 2	qqnorm(d, type = "l")

plot of chunk unnamed-chunk-3

To prevent reader confusion, I am going present the plots the way Hoffmann originally showed them. So I will put the patient data on the x-axis. It doesn't change anything. You can do it as you like.

my.qq <- qqnorm(d, datax = TRUE, type = "l", ylab = "Patient Results", xlab = "Quantiles of the Normal Distribution")

1 2	my.qq <- qqnorm(d, datax = TRUE, type = "l", ylab = "Patient Results", xlab = "Quantiles of the Normal Distribution")

plot of chunk unnamed-chunk-4

From this you can see that there is obviously linear section between about x = 0 to x = 2 (and with the eye of faith, there is a second after x = 6). This is what Hoffmann calls the “eye-fit”. Since the first linear section is attributable to the first of the two normal distributions which form the overall distribution, we can use the it to determine properties of the first distribution. If I look only at the data between x = 0 and x = 2, I am sort-of guaranteed to be in the first linear section. You don't have to kill yourself to correctly identify where the linearity ends because the density of the points should be highest near the middle of the linear section and this will weight the regression for you.

Next if I extend this line to find its intersection with y = -1.96 and y = 1.96 (ie the z-scores corresponding the limits of normal, namely the 2.5th and 97.5th centiles), I can estimate the reference interval, by dropping perpendicular lines from the two respective intersections. Here is what I get:

#get regression line - it's linear from about from 0 to 2
my.qq <- as.data.frame(my.qq)
linear.bit <- subset(my.qq, x <= 2)
#get the regression line of the linear section
reg <- lm(y ~ x, data = linear.bit)
plot(y ~ x,
     data = my.qq,
     type = "l",
     ylab = "Quantiles of the Normal Distribution",
     xlab = "Patient Results")
abline(reg, col = "red")
abline(h = c(-1.96,1.96), lty = 2)
uln.hoff <- unname((1.96 - coef(reg)[1])/coef(reg)[2])
lln.hoff <- unname((-1.96 - coef(reg)[1])/coef(reg)[2])
abline(v = c(lln.hoff,uln.hoff), lty = 2)

#get regression line - it's linear from about from 0 to 2

my.qq <- as.data.frame(my.qq)

linear.bit <- subset(my.qq, x <= 2)

#get the regression line of the linear section

reg <- lm(y ~ x, data = linear.bit)

plot(y ~ x,

data = my.qq,

type = "l",

ylab = "Quantiles of the Normal Distribution",

xlab = "Patient Results")

abline(reg, col = "red")

abline(h = c(-1.96,1.96), lty = 2)

uln.hoff <- unname((1.96 - coef(reg)[1])/coef(reg)[2])

lln.hoff <- unname((-1.96 - coef(reg)[1])/coef(reg)[2])

abline(v = c(lln.hoff,uln.hoff), lty = 2)

plot of chunk unnamed-chunk-5

lln.hoff

lln.hoff

## [1] 1.105849

1	## [1] 1.105849

uln.hoff

uln.hoff

## [1] 3.699254

1	## [1] 3.699254

So the Hoffmann reference interval becomes 1.11 to 3.70 which you can compare to the expected values of about 1 and 3 based on the random data. Not the greatest but not bad.

What not to do

Let's apply the correct approach to the Hoffmann method (QQ-Plot) and incorrect approach (CDF on a linear scale) to a pseudorandom sampling (n=10,000) of the standard normal distribution, which has a mean of 0 and a standard deviation of 1. Therefore the central 95% or “normal range” for this distribution will be -1.96 to 1.96. I will plot regression lines through the linear part of each curve and find the respective intersections with the appropriate horizontal lines.

# QQ-Norm plot of standard normal distribution
z <- sort(rnorm(10000,0,1))
par(mfrow = c(1,2))
my.qq <- qqnorm(z, type = "l", datax = TRUE, plot.it = FALSE)
plot(my.qq,
     ylim = c(-2.1,2.1),
     type = "l",
     main = "Normal QQ-plot",
     ylab = "Quantiles of the Normal Distribution",
     xlab = "Sample Quantiles")
qqline(z, col = "blue")
abline(h=c(1.96,-1.96), col = "blue")
abline(v = c(1.96,-1.96), col = "blue", lty = 2) #lower and upper limits are -2 and 2.

# CDF of standard normal distribution
my.ecdf <- ecdf(z)
df <- data.frame(z = z, y = my.ecdf(z))
plot(y ~ z, data = df, type = "l", main = "Cumulative Normal Distribution")
abline(v = c(1.96,-1.96), col = "blue", lty = 2)
abline(h = c(0.025,0.975), col = "blue")
linear.bit <- subset(df, z > -.5 & z < 0.5)
abline(lm(y ~ z, data = linear.bit), col = "blue")
abline(v = c(-(0.5 - 0.05/2)*sqrt(2*pi),(0.5 - 0.05/2)*sqrt(2*pi)), col  = rgb(1,0,0,0.5) )

# QQ-Norm plot of standard normal distribution

z <- sort(rnorm(10000,0,1))

par(mfrow = c(1,2))

my.qq <- qqnorm(z, type = "l", datax = TRUE, plot.it = FALSE)

plot(my.qq,

ylim = c(-2.1,2.1),

type = "l",

main = "Normal QQ-plot",

ylab = "Quantiles of the Normal Distribution",

xlab = "Sample Quantiles")

qqline(z, col = "blue")

abline(h=c(1.96,-1.96), col = "blue")

abline(v = c(1.96,-1.96), col = "blue", lty = 2) #lower and upper limits are -2 and 2.

# CDF of standard normal distribution

my.ecdf <- ecdf(z)

df <- data.frame(z = z, y = my.ecdf(z))

plot(y ~ z, data = df, type = "l", main = "Cumulative Normal Distribution")

abline(v = c(1.96,-1.96), col = "blue", lty = 2)

abline(h = c(0.025,0.975), col = "blue")

linear.bit <- subset(df, z > -.5 & z < 0.5)

abline(lm(y ~ z, data = linear.bit), col = "blue")

abline(v = c(-(0.5 - 0.05/2)*sqrt(2*pi),(0.5 - 0.05/2)*sqrt(2*pi)), col = rgb(1,0,0,0.5) )

plot of chunk unnamed-chunk-6

The QQ-plot generates estimates the limits of normal, $\mu \pm 1.96\sigma$, as about $\pm 1.96$ as it should. You can easily show that the same procedure on the CDF intersects the lines $y = \alpha /2$ and $y = 1 – \alpha/2$ at values of $\pm (1 – \alpha) \sqrt{\pi/2} \sigma$ which is about $\pm 1.19$ for $\sigma = 1$ and $\alpha = 0.05$. This erroneous estimate is shown with the pink vertical lines. So the Hoffmann method does not work if one attempts to extend the linear portion of the CDF if it is plotted in linear space and it will produce estimates of $\sigma$ that are 40% too low in this case. If you're puting this all together, this because the CDF is well away from its linear portion when the cumulative proportions are 0.025 and 0.975—not so for a QQ-plot. If you see a “Hoffmann plot” constructed from a sigmoidal CDF plotted on a linear scale, something is wrong.

Bhattacharya

This method is based on a much more highly cited paper in Biometrics published in 1967 by C.G. Bhattacharya. Loosely speaking, the method of Bhattacharya determines the parameter estimates of $\mu_i$ and $\sigma_i$ from the slope of the log of the distribution function. It was originally intended as a graphical method and so it also involves some human eye-balling.

We will need the log of the counts from the histogram. When we store the results of a histogram in R, we have the counts automatically.

str(dhist)

1 2	str(dhist)

## List of 6
##  $ breaks  : num [1:82] 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 ...
##  $ counts  : int [1:81] 2 4 5 19 49 100 140 210 174 180 ...
##  $ density : num [1:81] 0.004 0.008 0.01 0.038 0.098 ...
##  $ mids    : num [1:81] 0.125 0.375 0.625 0.875 1.125 ...
##  $ xname   : chr "d"
##  $ equidist: logi FALSE
##  - attr(*, "class")= chr "histogram"

## List of 6

## $ breaks : num [1:82] 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 ...

## $ counts : int [1:81] 2 4 5 19 49 100 140 210 174 180 ...

## $ density : num [1:81] 0.004 0.008 0.01 0.038 0.098 ...

## $ mids : num [1:81] 0.125 0.375 0.625 0.875 1.125 ...

## $ xname : chr "d"

## $ equidist: logi FALSE

## - attr(*, "class")= chr "histogram"

We can now calculate the log of the counts (denoted $y$) and $\Delta log(y)$ from bin to bin. We put these in a dataframe along with the counts and the midpoints of the bins. The bin width, which is chosen to be constant $h$, is the distance between the midpoints of each bin.

#alter the number of breaks to make the linear sections more obvious.
dhist <- hist(d,
              breaks = 30,
             plot = FALSE)
ly <- log(dhist$counts)
dly <- diff(ly)
df <- data.frame(xm = dhist$mids[-length(dhist$mids)],
                 ly = dly,
                 counts = dhist$counts[-length(dhist$mids)])
h <- diff(df$xm)[1]

#alter the number of breaks to make the linear sections more obvious.

dhist <- hist(d,

breaks = 30,

plot = FALSE)

ly <- log(dhist$counts)

dly <- diff(ly)

df <- data.frame(xm = dhist$mids[-length(dhist$mids)],

ly = dly,

counts = dhist$counts[-length(dhist$mids)])

h <- diff(df$xm)[1]

Now let's plot $\Delta log(y)$ as a function of the midpoints of the bins. I also number all the points to facilitate the next step.

plot(ly ~ xm, data = df, xlim = c(0,10))
abline(h = 0)
abline(v = df$xm, lty = 2, col = "#00000080")

#number all the points
library(calibrate)
num.df <- na.omit(df)
textxy(num.df$xm, num.df$ly, 1:nrow(num.df),
       row.names(num.df),
       cex = 0.8,
       offset = 1,
       col = "blue")

plot(ly ~ xm, data = df, xlim = c(0,10))

abline(h = 0)

abline(v = df$xm, lty = 2, col = "#00000080")

#number all the points

library(calibrate)

num.df <- na.omit(df)

textxy(num.df$xm, num.df$ly, 1:nrow(num.df),

row.names(num.df),

cex = 0.8,

offset = 1,

col = "blue")

plot of chunk unnamed-chunk-9

We can see from the figure that there are two sections where the plot shows a downsloping line: one between points 2 to 6 and another between points 10 to 21. How straight these lines appear is affected by how wide your bins are so if you get lines that are hard to discern, you can try making fewer bins.

In any case, using Bhattacharya's notation, the next step in the procedure is to draw regression lines through the $r_{th}$ linear section and determine the intercept $\hat{\lambda}_r$ with the x-axis. Bhattacharya intended this as a graphical procedure and advises,

“While matching the straight line it is better to fit closely to the points where the frequency is large even if the apparent discrepancy becomes somewhat large where the frequency is small.”

Since we are doing this by calculation, we can take his advice by weighting the linear regressions according to the counts. This allows the determination of the $\hat{\mu}_r$ by:

\[\hat{\mu}_r = \hat{\lambda}_r + h/2\]
and also the determination of $\hat{\sigma}_r$ by:

\[\hat{\sigma}^2_r = -h/ \text{slope}_r – h^2/12\]

#linear section 1
linear.bit1 <- subset(df[2:6,])
lm1 <- lm(ly ~ xm, data = linear.bit1, weights = linear.bit1$counts)
lambda1 <- -coef(lm1)[1]/coef(lm1)[2]
mu1 <- lambda1 + h/2
sigma1 <- sqrt(-h/coef(lm1)[2] - h^2/12) 

#linear section 2
linear.bit2 <- subset(df[10:21,])
lm2 <- lm(ly ~ xm, data = linear.bit2, weights = linear.bit2$counts)
lambda2 <- -coef(lm2)[1]/coef(lm2)[2]
mu2 <- lambda2 + h/2
sigma2 <- sqrt(-h/coef(lm2)[2] - h^2/12)

#normal range limits
lln.bhat <- qnorm(0.025,mu1, sigma1) 
uln.bhat <- qnorm(0.975,mu1, sigma1)

#linear section 1

linear.bit1 <- subset(df[2:6,])

lm1 <- lm(ly ~ xm, data = linear.bit1, weights = linear.bit1$counts)

lambda1 <- -coef(lm1)[1]/coef(lm1)[2]

mu1 <- lambda1 + h/2

sigma1 <- sqrt(-h/coef(lm1)[2] - h^2/12)

#linear section 2

linear.bit2 <- subset(df[10:21,])

lm2 <- lm(ly ~ xm, data = linear.bit2, weights = linear.bit2$counts)

lambda2 <- -coef(lm2)[1]/coef(lm2)[2]

mu2 <- lambda2 + h/2

sigma2 <- sqrt(-h/coef(lm2)[2] - h^2/12)

#normal range limits

lln.bhat <- qnorm(0.025,mu1, sigma1)

uln.bhat <- qnorm(0.975,mu1, sigma1)

And here are the results we get:

mu Values	sigma Values	Normal Range Limits
2.06	0.59	0.90
6.25	1.83	3.21

And here is what it all looks like

plot(ly ~ xm, data = df, xlim = c(0,10))
abline(h = 0)
abline(v = df$xm, lty = 2, col = "#00000080")
abline(lm1, col = "green")
abline(lm2, col = "red")

plot(ly ~ xm, data = df, xlim = c(0,10))

abline(h = 0)

abline(v = df$xm, lty = 2, col = "#00000080")

abline(lm1, col = "green")

abline(lm2, col = "red")

plot of chunk unnamed-chunk-12

In this demonstration, there are only two Gaussian distributions to resolve, but the method is not limited to the resolution of two Gaussian curves at all. If there are more, there will be more downsloping lines crossing the x-axis. So we get normal range estimates of 0.90 and 3.21 which compare much better with the expected values of about 1 and 3. We also get good estimates of $\mu_2=$ 6.3 and $\sigma_2=$ 1.8 which are about 6 and 2 respectively in our data set.

Bhattacharya also provides a means of calculating the mixing proportion of the two distributions—that is, the proportions of patients in the sick and abnormal populations. We don't need that here so I omit it.

Gaussian Mixture Model

In R there are a lot of ways to approach the separation of mixtures of distributions using maximum likelihood. Here I am using a function from the mixtools package that is particularly easy to use. The concept of using maximum likelihood for mining your reference interval is not new (see this paper) but many would be intimidated by the math required to do it from scratch.

With R, this is pretty easy but please be cautioned that real data does not play as nice as the data in this demonstration (even moreso for Hoffmann and Bhattacharya) and it is unlikely that you will get smashing results unless your data fits the assumptions of the model.

In any case,

#Gaussian Mixed Model - the right way to do this
library(mixtools)
fit <- normalmixEM(d, k = 2) #try to fit two Gaussians

#Gaussian Mixed Model - the right way to do this

library(mixtools)

fit <- normalmixEM(d, k = 2) #try to fit two Gaussians

## number of iterations= 28

1	## number of iterations= 28

summary(fit)

1 2	summary(fit)

## summary of normalmixEM object:
##          comp 1   comp 2
## lambda 0.519121 0.480879
## mu     2.014404 6.186571
## sigma  0.518210 1.966858
## loglik at estimate:  -3961.014

## summary of normalmixEM object:

## comp 1 comp 2

## lambda 0.519121 0.480879

## mu 2.014404 6.186571

## sigma 0.518210 1.966858

## loglik at estimate: -3961.014

which gives very good parameter estimates indeed! Estimates of $\mu_1$ and $\mu_2$ are 2.01 and 6.19 respectively and estimates of $\sigma_1$ and $\sigma_2$ are 0.52 and 1.97 respectively.

Looking at this graphically:

hist(d, freq = FALSE, breaks = 50, main = "Histogram of Patient Results")
#show the respective curves
lines(d,fit$lambda[1]*dnorm(d,fit$mu[1],fit$sigma[1]), col = "green")
lines(d,fit$lambda[2]*dnorm(d,fit$mu[2],fit$sigma[2]), col = "red")

hist(d, freq = FALSE, breaks = 50, main = "Histogram of Patient Results")

#show the respective curves

lines(d,fit$lambda[1]*dnorm(d,fit$mu[1],fit$sigma[1]), col = "green")

lines(d,fit$lambda[2]*dnorm(d,fit$mu[2],fit$sigma[2]), col = "red")

plot of chunk unnamed-chunk-14

#find the 2.5th 97.5th percentile from the mixed model fit
lln.EM <- qnorm(0.025,fit$mu[1], fit$sigma[1]) 
lln.EM

#find the 2.5th 97.5th percentile from the mixed model fit

lln.EM <- qnorm(0.025,fit$mu[1], fit$sigma[1])

lln.EM

## [1] 0.9987315

1	## [1] 0.9987315

uln.EM <- qnorm(0.975,fit$mu[1], fit$sigma[1]) 
uln.EM

uln.EM <- qnorm(0.975,fit$mu[1], fit$sigma[1])

uln.EM

## [1] 3.030077

1	## [1] 3.030077

So the normal range estimate from EM method is 1.00 to 3.03 which is pretty fantastic.

Summary of Results

	LLN	ULN
Raw Random Data	1.03	2.98
Hoffmann	1.11	3.70
Bhattacharya	0.90	3.21
mixtools EM – winner!	1.00	3.03

It's not too hard to figure out which one of these approaches works best. But what do you do if your patient data distribution is obviously not a mixture of Gaussians (ie when the distributions look skewed)? There are ways to do this in R for this but I will cover that another time–maybe in a paper.

Conclusion

Three methods of estimating the normal range from a mixture of Gaussians have been presented.
The Hoffmann method performs OK if you use a QQ-plot.
The Hoffmann method does not work for CDFs plotted on a linear scale.
The Bhattacharya method performs better but still requires some human oversight.
The normalmixEM() function from the mixtools package performs very well without any human oversight.
These results do not imply that any of these approaches will perform well on real patient data for which the components of the overall distribution are not likely to be Gaussian. Caution advised.

Parting Thought

Please don't fall on the wrong side of God's mixture separation procedures for wheat and chaff.

Said, John the Baptist, “But after me comes one who is more powerful than I, whose sandals I am not worthy to carry. He will baptize you with the Holy Spirit and fire. His winnowing fork is in his hand, and he will clear his threshing floor, gathering his wheat into the barn and burning up the chaff with unquenchable fire.”

Matt 3:11–12

Compare Tube Types with R – Repeated Measures ANOVA

August 21, 2017February 23, 2019 dtholmes@mail.ubc.ca

Background

Sometimes we might want to compare three or four tube types for a particular analyte on a group of patients or we might want to see if a particular analyte is stable over time in aliqioted samples. In these experiments are essentially doing the multivariable analogue of the paired t-test. In the tube-type experiment, the factor that is differing between the (‘paired’) groups is the container: serum separator tubes (SST), EDTA plasma tubes, plasma separator tubes (PST) etc. In a stability experiment, the factor that is differing is storage duration.

Since this is a fairly common clinical lab experiment, I thought I would just jot down how this is accomplished in R – though I must confess I know just about $\lim_{x\to0}x$ about statistics. In any case, the statistical test is a repeated-measures ANOVA and this is one way to do it (there are many) including an approach to the post-hoc testing.

Some Fake Data to Work With

I’m going to make some fake data. I tried to dig up the data from an experiment I did as a resident but alas, I think the raw data died on an old laptop. But fake data will do for demonstration purposes. Let’s suppose we are looking at parathyroid hormone (PTH) in three different blood collection tubes: SST, EDTA and PST. For the sake of argument, let’s say that we collect samples from 20 patients simultaneously and we anlayze them all as per our usual process. This means that each patient has three samples of material that should be otherwise identical outside of the effects of the collection contained.

library(magrittr)
set.seed(100) #to force the same pseudo-random each time
#data in pmol/L
#induce some heteroscedastic error
SST <- runif(20,3,50)  
PST <- 1.03*SST + rnorm(20,0,0.1)*SST #set the data up to show no difference
EDTA <- 1.15*SST + rnorm(20,0,0.1)*SST  #set the data up to show a difference
tube.data <- data.frame(SST,PST,EDTA) %>% round(.,1)
tube.data <- data.frame(Subject = factor(1:20), tube.data)

library(magrittr)

set.seed(100) #to force the same pseudo-random each time

#data in pmol/L

#induce some heteroscedastic error

SST <- runif(20,3,50)

PST <- 1.03*SST + rnorm(20,0,0.1)*SST #set the data up to show no difference

EDTA <- 1.15*SST + rnorm(20,0,0.1)*SST #set the data up to show a difference

tube.data <- data.frame(SST,PST,EDTA) %>% round(.,1)

tube.data <- data.frame(Subject = factor(1:20), tube.data)

This is the way we usually express (and receive) data like this in an Excel spreadsheet:

Subject	SST	PST	EDTA
1	17.5	18.1	19.9
2	15.1	15.7	20.0
3	29.0	29.2	32.9
4	5.7	6.2	6.4
5	25.0	26.1	27.0
6	25.7	26.4	29.0
7	41.2	40.8	48.1
8	20.4	22.1	24.3
9	28.7	26.9	36.0
10	11.0	13.9	13.7
11	32.4	31.9	36.9
12	44.5	49.2	57.4
13	16.2	17.1	15.7
14	21.7	24.1	26.3
15	38.8	36.8	42.6
16	34.4	34.0	44.2
17	12.6	12.1	14.1
18	19.8	20.9	25.4
19	19.9	18.2	23.0
20	35.4	37.4	34.1

This Excel-ish way of storing the data is referred to as the “datawide” format for obvious reasons.

Gather the Grain

As it turns out this is not the way that we want to store data to do the statistical analyses of interest. What we want to do is have the tube type in a single column because this is the factor that is different within the subjects. We want to gather() or melt() the data (depending on your package of choice) to be like so:

library(tidyr)
tube.data.2 <- gather(tube.data, key = "Subject")
tube.data.2 %>% kable()

library(tidyr)

tube.data.2 <- gather(tube.data, key = "Subject")

tube.data.2 %>% kable()

Subject	Subject	value
1	SST	17.5
2	SST	15.1
3	SST	29.0
4	SST	5.7
5	SST	25.0
6	SST	25.7
7	SST	41.2
8	SST	20.4
9	SST	28.7
10	SST	11.0
11	SST	32.4
12	SST	44.5
13	SST	16.2
14	SST	21.7
15	SST	38.8
16	SST	34.4
17	SST	12.6
18	SST	19.8
19	SST	19.9
20	SST	35.4
1	PST	18.1
2	PST	15.7
3	PST	29.2
4	PST	6.2
5	PST	26.1
6	PST	26.4
7	PST	40.8
8	PST	22.1
9	PST	26.9
10	PST	13.9
11	PST	31.9
12	PST	49.2
13	PST	17.1
14	PST	24.1
15	PST	36.8
16	PST	34.0
17	PST	12.1
18	PST	20.9
19	PST	18.2
20	PST	37.4
1	EDTA	19.9
2	EDTA	20.0
3	EDTA	32.9
4	EDTA	6.4
5	EDTA	27.0
6	EDTA	29.0
7	EDTA	48.1
8	EDTA	24.3
9	EDTA	36.0
10	EDTA	13.7
11	EDTA	36.9
12	EDTA	57.4
13	EDTA	15.7
14	EDTA	26.3
15	EDTA	42.6
16	EDTA	44.2
17	EDTA	14.1
18	EDTA	25.4
19	EDTA	23.0
20	EDTA	34.1

Now we see that there is a column for tube-type and a column for the PTH results which we can name accordingly. You can see why this called the “datalong” format.

names(tube.data.2) <- c("Subject", "Tube.Type", "PTH")
tube.data.2$Tube.Type <- as.factor(tube.data.2$Tube.Type) #turns tube type into factor

names(tube.data.2) <- c("Subject", "Tube.Type", "PTH")

tube.data.2$Tube.Type <- as.factor(tube.data.2$Tube.Type) #turns tube type into factor

Visualize

Summarize the data:

summary(tube.data)

1 2	summary(tube.data)

##     Subject        SST             PST             EDTA      
##  1      : 1   Min.   : 5.70   Min.   : 6.20   Min.   : 6.40  
##  2      : 1   1st Qu.:17.18   1st Qu.:17.85   1st Qu.:19.98  
##  3      : 1   Median :23.35   Median :25.10   Median :26.65  
##  4      : 1   Mean   :24.75   Mean   :25.36   Mean   :28.85  
##  5      : 1   3rd Qu.:32.90   3rd Qu.:32.42   3rd Qu.:36.23  
##  6      : 1   Max.   :44.50   Max.   :49.20   Max.   :57.40  
##  (Other):14

## Subject SST PST EDTA

## 1 : 1 Min. : 5.70 Min. : 6.20 Min. : 6.40

## 2 : 1 1st Qu.:17.18 1st Qu.:17.85 1st Qu.:19.98

## 3 : 1 Median :23.35 Median :25.10 Median :26.65

## 4 : 1 Mean :24.75 Mean :25.36 Mean :28.85

## 5 : 1 3rd Qu.:32.90 3rd Qu.:32.42 3rd Qu.:36.23

## 6 : 1 Max. :44.50 Max. :49.20 Max. :57.40

## (Other):14

Let’s just have a quick look graphically:

library(mcr)
plot(mcreg(SST, EDTA,
           method.reg = "PaBa",
           mref.name = "SST",
           mtest.name = "EDTA"))

library(mcr)

plot(mcreg(SST, EDTA,

method.reg = "PaBa",

mref.name = "SST",

mtest.name = "EDTA"))

plot of chunk unnamed-chunk-6

plot(mcreg(SST, PST,
           method.reg = "PaBa",
           mref.name = "SST",
           mtest.name = "PST"))

plot(mcreg(SST, PST,

method.reg = "PaBa",

mref.name = "SST",

mtest.name = "PST"))

plot of chunk unnamed-chunk-6

And as a boxplot with the points overtop:

boxplot(PTH ~ Tube.Type,
        data = tube.data.2,
        col = c("purple", "lightgreen", "gold"))
stripchart(PTH ~ Tube.Type,
           vertical = TRUE,
           data = tube.data.2, 
           method = "jitter",
           add = TRUE,
           pch = 20,
           col = rgb(0,0,0,0.5))

boxplot(PTH ~ Tube.Type,

data = tube.data.2,

col = c("purple", "lightgreen", "gold"))

stripchart(PTH ~ Tube.Type,

vertical = TRUE,

data = tube.data.2,

method = "jitter",

add = TRUE,

pch = 20,

col = rgb(0,0,0,0.5))

plot of chunk unnamed-chunk-7

Separate the Wheat from the Chaff

Now we want to make comparisons to see if these are different. To accomplish this, we will use the aov() function. This requires us to have data formatted “datalong” as it is in the tube.data.2 dataframe.

fit <- aov(PTH ~ Tube.Type + Error(Subject/Tube.Type), data=tube.data.2)

1 2	fit <- aov(PTH ~ Tube.Type + Error(Subject/Tube.Type), data=tube.data.2)

If you are like me, this syntax is confusing. But it goes like this. PTH is a function of Tube.Type which is straight forward–hence the PTH ~ Tube.Type bit. The error term has the Subject in front of the / and the factor that is different within the subjects (Tube.Type) after the /. That’s my grade 2 explanation from reading this and this and this.

summary(fit)

1 2	summary(fit)

## 
## Error: Subject
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 19   7307   384.6               
## 
## Error: Subject:Tube.Type
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## Tube.Type  2  195.9   97.97   22.47 3.63e-07 ***
## Residuals 38  165.7    4.36                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Error: Subject

## Df Sum Sq Mean Sq F value Pr(>F)

## Residuals 19 7307 384.6

## Error: Subject:Tube.Type

## Df Sum Sq Mean Sq F value Pr(>F)

## Tube.Type 2 195.9 97.97 22.47 3.63e-07 ***

## Residuals 38 165.7 4.36

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This tells us that there is a difference between the groups but it does not specify where the difference is.

I can’t see the difference. Can you see the difference?

Sorry – I just had to make a pop-culture reference to this. We want to be specific about where the differences are without making a Type I error which might arise if we blindly charge ahead and do multiple paired t-tests. One easy way to accomplish this is to use the pairwise.t.test() function which does corrections for multiple comparisons. You can choose from a number of approaches for adjustment for pairwise comparison. This requires the “response vector” which is PTH and the “grouping factor” which is the tube type.

# choices for p.adjust.method are: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
pwt <- pairwise.t.test(tube.data.2$PTH, tube.data.2$Tube.Type, p.adj = "bonferroni", paired = TRUE)
pwt

# choices for p.adjust.method are: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")

pwt <- pairwise.t.test(tube.data.2$PTH, tube.data.2$Tube.Type, p.adj = "bonferroni", paired = TRUE)

pwt

## 
##  Pairwise comparisons using paired t tests 
## 
## data:  tube.data.2$PTH and tube.data.2$Tube.Type 
## 
##     EDTA    PST    
## PST 0.00083 -      
## SST 7.9e-05 0.35033
## 
## P value adjustment method: bonferroni

## Pairwise comparisons using paired t tests

## data: tube.data.2$PTH and tube.data.2$Tube.Type

## EDTA PST

## PST 0.00083 -

## SST 7.9e-05 0.35033

## P value adjustment method: bonferroni

This is pretty easy to understand. There are statistically significant differences found between the EDTA and PST (p = 0.00083) and the EDTA and PST (p = 0.00008) but none between SST and PST (p = 0.35033).

Conclusion

Non-statistician’s approach to tube-type comparisons, which is also applicable to analyte stability studies. This is a one-way repeated measures ANOVA with one within-subjects factor. There is a great deal more to say on the matter by people who know much more in the citations in the links provided above.

God probably uses datawide format

All the nations will be gathered before him, and he will separate the people one from another as a shepherd separates the sheep from the goats. He will put the sheep on his right and the goats on his left.

(Matt 25:32-33)