CLSI – The Lab-R-torian

Make Easy Heatmaps to Visualize your Turnaround Times

September 8, 2016 dtholmes@mail.ubc.ca

The Problem

In two previous posts, I discussed visualizing your turnaround times (TATs). These posts are here and here. One other nice way to visualize your TAT is by means of a heatmap. In particular, we would like to look at the TAT for every hour of the week in a single figure. This manner of dataviz bling seems to be particularly attractive to managers because it costs you $0 to do this with R, but with commercial tools like Tableau, you'd have to pay a fortune and, as with Excel, your report would not be readily reproducible. Further, to make it autogenerate a PDF would mean you had to fork out more money for a report-generation module. Pffft.

The Data

We're going to read in a year's worth of order times and result times for a stat immunoassay test offered to a particular ward. The data, as I've formatted it, has two columns, ord and res.

test.data <- read.csv("test_data.csv")
head(test.data)

##                   ord                 res
## 1 2015-01-01 13:24:00 2015-01-01 14:29:00
## 2 2015-01-01 06:16:00 2015-01-01 07:43:00
## 3 2015-01-01 06:32:00 2015-01-01 07:43:00
## 4 2015-01-01 06:32:00 2015-01-01 07:43:00
## 5 2015-01-01 12:12:00 2015-01-01 13:13:00
## 6 2015-01-01 12:12:00 2015-01-01 13:13:00

Now, of course, we want to look at data collected over a long period of time so that we can be sure that the observations we make are not simply an artifact of recent instrument downtime, maintenance, or whoever happened to be running the instrument. This is why I chose a year's worth of data. We are going to visualize the median order-to-file TAT for this test.

Formatting and Calculations

To calculate the hourly medians, we'll need to be able to label every TAT with the day it was run and the hour in the day it was run. This is pretty easy with the lubridate package. We'll do four things:

We'll convert the dates to POSIXct objects
We'll use the difftime() function to calculate the TATs
We'll use the wday() function to determine which day of the week the specimen was run on
We'll pull out the hour of the day on which it was run with the format() function.

library("dplyr")
library("lubridate")
library("fields")
library("magrittr")

test.data$ord <- ymd_hms(test.data$ord)
test.data$res <- ymd_hms(test.data$res)
test.data <- mutate(test.data,otf = difftime(res,ord,units="min"))
test.data <- mutate(test.data,dow = wday(ord))
test.data <- mutate(test.data,hod = as.numeric(format(test.data$ord, "%H")))

And now the data will look like this:

head(test.data)

##                   ord                 res     otf dow hod
## 1 2015-01-01 13:24:00 2015-01-01 14:29:00 65 mins   5  13
## 2 2015-01-01 06:16:00 2015-01-01 07:43:00 87 mins   5   6
## 3 2015-01-01 06:32:00 2015-01-01 07:43:00 71 mins   5   6
## 4 2015-01-01 06:32:00 2015-01-01 07:43:00 71 mins   5   6
## 5 2015-01-01 12:12:00 2015-01-01 13:13:00 61 mins   5  12
## 6 2015-01-01 12:12:00 2015-01-01 13:13:00 61 mins   5  12

where the order-to-file TAT is in the otf column, the day-of-week is in the dow column and the hour-of-day is in the hod column. Now we can cycle through the days of the week and the hours of the day and calculate the year's median TAT for each hour, storing it in a matrix:

#prepare an empty matrix
heat.data <- matrix(rep(NA,7*24),nrow = 7, ncol = 24)
#loop over the days and hours and calculate the median TAT
for(i in 1:7){
  for(j in 0:23){
    heat.data[i,j+1] <- subset(test.data, test.data$dow==i & test.data$hod==j)$otf %>% median
  }
}
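If you don't have a comparable extract to hand, the whole calculation can be tried on made-up data. The sketch below is only an illustration: the simulated order times, the 60 to 90 minute TATs and the names sim, dow_f, hod_f and heat.sim are assumptions of mine, not the real data set. It uses base R only and swaps the double loop for tapply(), which should produce the same kind of 7 x 24 matrix of medians.

```r
set.seed(1)  # reproducible fake data
n <- 5000
# hypothetical order times spread over 2015 (UTC avoids daylight-saving gaps)
start <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC")
ord <- start + runif(n, 0, 365 * 24 * 3600)
# hypothetical TATs of 60 to 90 minutes
res <- ord + runif(n, 60, 90) * 60
sim <- data.frame(ord = ord, res = res)

# order-to-file TAT in minutes, as a plain number
sim$otf <- as.numeric(difftime(sim$res, sim$ord, units = "mins"))
# day of week with Sunday = 1, matching the default of lubridate's wday()
sim$dow <- as.POSIXlt(sim$ord)$wday + 1
# hour of day, 0 to 23
sim$hod <- as.POSIXlt(sim$ord)$hour

# fixing the factor levels keeps any empty day/hour cell in the table as NA
dow_f <- factor(sim$dow, levels = 1:7)
hod_f <- factor(sim$hod, levels = 0:23)
heat.sim <- tapply(sim$otf, list(dow_f, hod_f), median)

dim(heat.sim)  # 7 rows (days) by 24 columns (hours)
```

Swapping median for another summary function changes what the matrix holds without touching the rest of the code.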

Making the Heatmap

There are many ways to make the heatmap but I am particularly fond of the appearance of surface plots made with the fields package.

image.plot(1:7,seq(from=0.5, to=23.5, by = 1),heat.data,axes=FALSE, 
           xlab = "Day of Week", ylab = "Hour of Day", ylim=c(0,24))
# the following pointless command is necessary to make the custom axis labels non-transparent
# google revealed this among a number of other workarounds.
points(0,0)
# now these will display properly
axis(side=1, at=1:7, labels=as.character(wday(1:7, label=TRUE)), las=2, cex.axis = 0.8)
axis(side=2, at= 0:24, labels=0:24, las=1, cex.axis=0.8)

plot of chunk unnamed-chunk-5

Overlay Printed Times

We can see that there is a morning slowdown that is particularly bad on Saturday. But what if we wanted to know the exact value for these eye-catching problem times? We'd have trouble, unless we overlaid some text.

It turns out that if you use white printing, you can't read the numbers when the background colour is yellow and green. There is a 64 colour gradient used in the image.plot() function, so I calculated which integers in 0–64 were the problem and found the TATs that would correspond. It turned out that colours 20–45 out of the 64 colours in the gradient are the problem. By this means, I can make the printing black over the yellows and greens but white everywhere else:

image.plot(1:7,seq(from=0.5, to=23.5, by = 1),heat.data,axes=FALSE, 
           xlab = "Day of Week", ylab = "Hour of Day", ylim=c(0,24))
points(0,0) #random command that resets par
axis(side=1, at=1:7, labels=as.character(wday(1:7, label=TRUE)), las=2, cex.axis = 0.8)
axis(side=2, at= 0:24, labels=0:24, las=1, cex.axis=0.8)

# calculate the lowest and highest TAT
min.z <- min(heat.data)
max.z <- max(heat.data)
# determine which TAT's will have yellow to green shading
z.yellows <- min.z + (max.z - min.z)/64*c(20,45) 
# print the labels
for(i in 1:7){
  for(j in 1:24){
    if((heat.data[i,j] > z.yellows[1])&(heat.data[i,j] < z.yellows[2])){
      text(i,j-0.5,heat.data[i,j], col="black", cex = 0.8)
    }else{
      text(i,j-0.5,heat.data[i,j], col="white", cex = 0.8)     
    }
  }
}

plot of chunk unnamed-chunk-6
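The 20 to 45 cut-off was found by inspection for the default palette. A different technique, sketched here as an assumption rather than what was used for the figure, is to choose black or white text from the brightness of each cell's fill colour. The helper names cell_colour() and label_colour(), the luminance weights (the common Rec. 601 ones), the 0.5 threshold and the example palette are all my choices, and the binning only roughly imitates what an image plot does.

```r
# map each value to one colour of a palette, roughly the way an image plot bins them
cell_colour <- function(z, palette, zlim = range(z, na.rm = TRUE)) {
  n <- length(palette)
  idx <- floor((z - zlim[1]) / (zlim[2] - zlim[1]) * n) + 1
  palette[pmin(pmax(idx, 1), n)]  # clamp so the maximum falls in the last bin
}

# black text on bright fills, white text on dark fills
label_colour <- function(cols, threshold = 0.5) {
  rgb01 <- col2rgb(cols) / 255
  lum <- 0.299 * rgb01["red", ] + 0.587 * rgb01["green", ] + 0.114 * rgb01["blue", ]
  unname(ifelse(lum > threshold, "black", "white"))
}

# an example 64-colour palette; substitute whichever one the plot uses
pal <- colorRampPalette(c("darkblue", "cyan", "yellow", "red"))(64)
z <- c(55, 70, 85, 100, 120)  # some example TATs in minutes
fills <- cell_colour(z, pal)
data.frame(z, fill = fills, text = label_colour(fills))
```

Inside the labelling loop, the result of label_colour() for a cell's fill would then be passed as the col argument of text(), so no cut-offs need to be recalculated if the palette changes.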

So, that is not too bad, and if you wanted to look at the 75th percentile instead you would only have to adjust the heat.data calculation as follows:

#prepare an empty matrix
heat.data <- matrix(rep(NA,7*24),nrow = 7, ncol = 24)
#loop over the days and hours and calculate the 75th percentile TAT
for(i in 1:7){
  for(j in 0:23){
    heat.data[i,j+1] <- subset(test.data, test.data$dow==i & test.data$hod==j)$otf %>% quantile(.,probs=0.75)
  }
}

And this is what you will get.

plot of chunk unnamed-chunk-8

Hmmm…we'd better look at Saturday morning, 6 am. I hope you have found this helpful.

And as for heat

“He will sit as a refiner and purifier of silver”

Malachi 3:3

Make Bland Altman Plots with Marginal Histograms using ggExtra

August 29, 2016 (updated August 30, 2016) dtholmes@mail.ubc.ca

The Problem

As you know, in Clinical Chemistry we are not always writing a major paper but sometimes just preparing a short report to answer a technical question that we've encountered at work. For shorter papers, journals often have more stringent rules about how many figures you can submit and sometimes even forbid multipanelled figures. In these situations, we might want to cram a little more into our figure than we might otherwise. In a recent submission, I wanted to produce a difference plot of immunoassay results before and after storage but I also wanted to show the distribution of the results using a histogram–but this would have counted as two separate figures.

However, thanks to some fine work by Dean Attali of UBC Department of Statistics where he works with R-legend Jenny Bryan, it is quite easy to add marginal histograms to a Bland Altman (or any other scatter) plot using the ggExtra package.

How To

Let's make some fake data for a Bland Altman plot. Let's pretend that we are measuring the same quantity by immunoassay at baseline and after 1 year of storage at -80 degrees. We'll add some heteroscedastic error and create some apparent degradation of about 20%:

set.seed(10) #make predictable random data
baseline <- rlnorm(100, 0, 1)
post <- 0.8*baseline + rnorm(100, 0, 0.10*baseline)
plot(baseline,post)
abline(lm(post ~ baseline))
abline(0, 1, col="red", lty = 2)

plot of chunk unnamed-chunk-1

Or, if we plot this in the ggplot() paradigm:

library(ggplot2)
my.data <- data.frame(baseline, post)
ggplot(my.data, aes(x=baseline, y=post)) +
    theme_bw() + 
    geom_point(shape=1) +    # Use hollow circles
    geom_smooth(method=lm) +  # Add linear regression line 
    geom_abline(slope = 1, intercept = 0, linetype = 2, colour = "red")

plot of chunk unnamed-chunk-2

Now we will prepare the difference data:

diff <- (post - baseline)
diffp <- (post - baseline)/baseline*100
sd.diff <- sd(diff)
sd.diffp <- sd(diffp)
my.data <- data.frame(baseline, post, diff, diffp)
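The means and standard deviations computed here are what the horizontal reference lines of a difference plot represent. As a small check, assuming the same simulated data, the bias and the conventional mean ± 2 SD limits can be tabulated directly. The helper name ba_limits() is hypothetical.

```r
set.seed(10)  # same fake data as above
baseline <- rlnorm(100, 0, 1)
post <- 0.8 * baseline + rnorm(100, 0, 0.10 * baseline)

# bias (mean difference) and mean +/- 2 SD limits for a vector of differences
ba_limits <- function(d) {
  c(lower = mean(d) - 2 * sd(d), bias = mean(d), upper = mean(d) + 2 * sd(d))
}

ba_limits(post - baseline)                     # absolute differences
ba_limits((post - baseline) / baseline * 100)  # percent differences
```

Since the data were simulated with a 20% loss, the bias of the percent differences should land near -20.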

In standard Bland Altman plots, one plots the difference between methods against the average of the methods, but in this case, the x-axis should be the baseline result, because that is the closest thing we have to the truth.

library(ggExtra)
diffplot <- ggplot(my.data, aes(baseline, diff)) + 
  geom_point(size=2, colour = rgb(0,0,0, alpha = 0.5)) + 
  theme_bw() + 
  #when the +/- 2SD lines will fall outside the default plot limits 
  #they need to be pre-stated explicitly to make the histogram line up properly. 
  #Thanks to commenter for noticing this.
  ylim(mean(my.data$diff) - 3*sd.diff, mean(my.data$diff) + 3*sd.diff) +
  geom_hline(yintercept = 0, linetype = 3) +
  geom_hline(yintercept = mean(my.data$diff)) +
  geom_hline(yintercept = mean(my.data$diff) + 2*sd.diff, linetype = 2) +
  geom_hline(yintercept = mean(my.data$diff) - 2*sd.diff, linetype = 2) +
  ylab("Difference pre and post Storage (mg/L)") +
  xlab("Baseline Concentration (mg/L)")

#And now for the magic - we'll use 25 bins
ggMarginal(diffplot, type="histogram", bins = 25)

plot of chunk unnamed-chunk-4

So that is the difference plot for the absolute difference. We can also obviously do the percent difference.

diffplotp <- ggplot(my.data, aes(baseline, diffp)) + 
  geom_point(size=2, colour = rgb(0,0,0, alpha = 0.5)) + 
  theme_bw() + 
  geom_hline(yintercept = 0, linetype = 3) +
  geom_hline(yintercept = mean(my.data$diffp)) +
  geom_hline(yintercept = mean(my.data$diffp) + 2*sd.diffp, linetype = 2) +
  geom_hline(yintercept = mean(my.data$diffp) - 2*sd.diffp, linetype = 2) +
  ylab("Difference pre and post Storage (%)") +
  xlab("Baseline Concentration (mg/L)")


ggMarginal(diffplotp, type="histogram", bins = 25)

plot of chunk unnamed-chunk-5

Kickin' it Old School

You can also do this in a non-ggplot() paradigm using base plotting utilities as described in this R-bloggers post.

Conclusion

And that, friends, is a way of squishing in a histogram of your sample concentrations into your difference plot which allows you to graphically display your sampling distribution and justify whether you would use parametric or non-parametric statistics to assess the extent of loss of immunoreactivity from storage.
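One way to back up that parametric versus non-parametric decision, offered as a sketch rather than a recommendation, is to test the differences for normality and then pick the paired test accordingly. The Shapiro-Wilk test and the 0.05 cut-off are my assumptions; a histogram like the marginal one remains the more informative check.

```r
set.seed(10)
baseline <- rlnorm(100, 0, 1)
post <- 0.8 * baseline + rnorm(100, 0, 0.10 * baseline)
d <- post - baseline

sw <- shapiro.test(d)  # a small p-value suggests the differences are not normal
if (sw$p.value < 0.05) {
  result <- wilcox.test(post, baseline, paired = TRUE)  # non-parametric
} else {
  result <- t.test(post, baseline, paired = TRUE)       # parametric
}
result$method
result$p.value
```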

And speaking of scatterplots

“…then the Lord your God will restore your fortunes and have compassion on you and gather you again from all the nations where he scattered you.”
Deut 30:3

A Shiny App for Passing Bablok and Deming Regression

August 15, 2016 dtholmes@mail.ubc.ca

Background

Back in 2011 I was not aware of any tool in R for Passing Bablok (PB) regression, a form of robust regression described in a series of three papers in Clinical Chemistry and Laboratory Medicine (then J Clin Chem and Biochem) available here, here and here. For reasons that are not entirely clear to me, this regression methodology is favoured by clinical chemists but seems largely ignored by other disciplines. However, since reviewers for clinical chemistry journals will demand the use of PB regression, it seemed expeditious to me to code it in R. This is what spawned a small project for a piece of software to do PB (and Deming and ordinary least squares) regression using a self-contained executable that could be downloaded, unzipped on a Windows desktop and just run. You can download it here, and instructions for installation and use are here and here respectively. The calculations are all done in R, the GUI is built with Python and Py-Qt4 and the executable with cx_freeze. I made it run without an installer because hospital IT often refuse to install software that has not been officially vetted and purchased. The tool was a lot more popular than I anticipated, now having about 2000 downloads. In any case, maintenance, upgrades, bug fixing and dealing with operating system updates that break things (like OSX El Capitan's security policies) are no fun, so a Shiny-based solution to the same problem makes a lot of sense.

Update

Since 2011, statisticians at Roche Diagnostics have programmed the mcr package for PB and Deming regression. There are also the MethComp package and the deming package from the Mayo Clinic, both of which offer PB regression.

Shiny App

Enter Burak Bahar, a like-minded Clinical Pathologist who is currently doing a fellowship at Yale. He liked my cp-R program but he saw the need for a web-based equivalent.

Burak and his wife Ayse, also a physician, have coded a Shiny App for doing Deming, PB and least squares regression in R which is capable of producing publication quality figures and provides all the regression statistics you would need for method-validation or publication. It can also produce a regression report in PDF, Word or HTML. The dynamic duo of the Bahar-MDs deserve all credit here, as my only contribution was suggestions related to usability. This project was presented at the 2016 American Association of Clinical Chemistry meeting in Philadelphia.

The app URL is bahar.shinyapps.io/method_compare. Go to the data tab on the left and then cut and paste your data from a spreadsheet program. Shortcuts CTRL-C (copy) and CTRL-V (paste) work natively in the table. The table is pre-populated with some random data for demonstration purposes. Once your data is pasted in, click on the Plots tab and choose the Bland-Altman or Scatter Plot.

Example

Here is an image generated with the Bahar Shiny app using method comparison data obtained from St. Paul's Hospital Laboratory in migrating from Siemens Immulite 2000 XPi to Roche Cobas e601 for Calcitonin determination. Don't worry, we did more than 33 comparisons–I am just showing the low end.

Try adjusting some of the plot parameters. The figures will update in real time. Thanks to Burak and Ayse Bahar for your work!

(Dan's) Parting Thought

There are straight lines that matter a lot more than regression.

I will make justice the measuring line and righteousness the plumb line
(Isa 28:17)

Deming and Passing Bablok Regression in R

September 14, 2015 (updated September 21, 2015) dtholmes@mail.ubc.ca

Regression Methods

In this post we will be discussing how to perform Passing Bablok and Deming regression in R. Those who work in Clinical Chemistry know that these two approaches are required by the journals in the field. The idiosyncratic affection for these two forms of regression appears to be historical but this is something unlikely to change in my lifetime–hence the need to cover it here.

Along the way, we shall touch on the ways in which Deming and Passing Bablok differ from ordinary least squares (OLS) and from one another.

Creating some random data

Let's start by making some heteroscedastic random data that we can use for regression. We will use the command set.seed() to begin with because by this means, the reader can generate the same random data as the post. This function takes any number you wish as its argument, but if you set the same seed, you will get the same random numbers. We will generate 100 random $x$ values in the uniform distribution and then an accompanying 100 random $y$ values with proportional bias, constant bias and random noise that increases with $x$. I have added a bit of non–linearity because we do see this a fair bit in our work.

set.seed(20)
x <- runif(100,0,100)
y <- 1.10*x - 0.001*x^2 + rnorm(100,0,1)*(2 + 0.05*x) + 15

The constants I chose are arbitrary. I chose them to produce something resembling a comparison of, say, two automated immunoassays.

Let's quickly produce a scatter plot to see what our data looks like:

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")

plot of chunk unnamed-chunk-2

Residuals in OLS

OLS regression minimizes the sum of squared residuals. In the case of OLS, the residual of a point is defined as the vertical distance from that point to the regression line. The regression line is chosen so that the sum of the squares of the residuals is minimal.

OLS regression assumes that there is no error in the $x$–axis values and that there is no heteroscedasticity, that is, the scatter of $y$ is constant. Neither of these assumptions is true in the case of bioanalytical method comparisons. In contrast, for calibration curves in mass–spectrometry, a linear response is plotted as a function of pre–defined calibrator concentration. This means that the $x$–axis has very little error and so OLS regression is an appropriate choice (though I doubt that the assumption about homoscedasticity is generally met).

OLS is part of R's base package. We can find the OLS regression line using lm() and we will store the results in the variable lin.reg.

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")
lin.reg <- lm(y~x)
abline(lin.reg, col="blue")

plot of chunk unnamed-chunk-3

Just to demonstrate the point about residuals graphically, the following shows them in vertical red lines.

plot of chunk unnamed-chunk-4
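A minimal sketch of how such a figure could be drawn, assuming the same x, y and lin.reg as above: segments() joins each point to its fitted value with a vertical line. The small helper ssr() is hypothetical and is only there to check numerically that nudging the slope away from the OLS estimate increases the sum of squared residuals.

```r
set.seed(20)
x <- runif(100, 0, 100)
y <- 1.10 * x - 0.001 * x^2 + rnorm(100, 0, 1) * (2 + 0.05 * x) + 15
lin.reg <- lm(y ~ x)

plot(x, y, main = "OLS Residuals", xlab = "Current Method", ylab = "New Method")
abline(lin.reg, col = "blue")
# vertical line from each observed point to its fitted value
segments(x, y, x, fitted(lin.reg), col = "red")

# sum of squared vertical residuals for any candidate line
ssr <- function(a, b) sum((y - (a + b * x))^2)
ols <- coef(lin.reg)
ssr(ols[1], ols[2])         # the minimum
ssr(ols[1], ols[2] + 0.01)  # a slightly different slope does worse
```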

Deming Regression

Deming regression differs from OLS regression in that it does not make the assumption that the $x$ values are free of error. It (more or less) defines the residual as the perpendicular distance from a point to its fitted value on the regression line.
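For the unweighted case there is a closed-form solution, so the idea can be sketched in a few lines of base R. This is an illustration, not the implementation of any package: the function name deming_fit() is hypothetical, and delta here is the ratio of the $y$ error variance to the $x$ error variance, which a given package may define the other way round. With delta = 1 the fit is orthogonal regression.

```r
# Closed-form Deming regression.
# delta = (error variance of y) / (error variance of x); 1 gives orthogonal regression.
deming_fit <- function(x, y, delta = 1) {
  sxx <- var(x)
  syy <- var(y)
  sxy <- cov(x, y)
  slope <- (syy - delta * sxx + sqrt((syy - delta * sxx)^2 + 4 * delta * sxy^2)) /
    (2 * sxy)
  c(Intercept = mean(y) - slope * mean(x), Slope = slope)
}

set.seed(20)
x <- runif(100, 0, 100)
y <- 1.10 * x - 0.001 * x^2 + rnorm(100, 0, 1) * (2 + 0.05 * x) + 15

deming_fit(x, y)   # errors allowed in both x and y
coef(lm(y ~ x))    # OLS for comparison: errors in y only
```

As delta grows, meaning that nearly all of the error is attributed to $y$, the Deming slope tends towards the OLS slope.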

Deming regression does not come as part of R's base package but can be performed using the MethComp and mcr packages. In this case, we will use the latter. If not already installed, you must install the mcr package with install.packages("mcr").

Then to perform Deming regression, we will load the mcr library and execute the following using the mcreg() command, storing the output in the variable dem.reg.

library(mcr)
dem.reg <- mcreg(x,y, method.reg = "Deming")

By performing the str() command on dem.reg, we can see that the regression parameters are stored in the slot @para. Because the authors have used an S4 object as the output of their function, we don't address output as we would in lists (with a $), but rather with an @.

str(dem.reg)

## Formal class 'MCResultResampling' [package "mcr"] with 21 slots
##   ..@ glob.coef  : num [1:2] 15.58 1.04
##   ..@ glob.sigma : num [1:2] 0.8165 0.0147
##   ..@ xmean      : num 46.8
##   ..@ nsamples   : int 999
##   ..@ nnested    : num 25
##   ..@ B0         : num [1:999] 15.9 15.4 16 16.1 15.6 ...
##   ..@ B1         : num [1:999] 1.01 1.04 1.02 1.04 1.03 ...
##   ..@ sigmaB0    : num [1:999] 0.794 0.766 0.846 0.815 0.737 ...
##   ..@ sigmaB1    : num [1:999] 0.0141 0.0142 0.0155 0.0141 0.0135 ...
##   ..@ MX         : num [1:999] 46.8 45.9 45.4 48.9 45.5 ...
##   ..@ bootcimeth : chr "quantile"
##   ..@ rng.seed   : num NA
##   ..@ rng.kind   : chr NA
##   ..@ data       :'data.frame':  100 obs. of  3 variables:
##   .. ..$ sid: Factor w/ 100 levels "S1","S10","S100",..: 1 13 24 35 46 57 68 79 90 2 ...
##   .. ..$ x  : num [1:100] 87.8 76.9 27.9 52.9 96.3 ...
##   .. ..$ y  : num [1:100] 110.8 93.5 45.6 76.6 116.6 ...
##   ..@ para       : num [1:2, 1:4] 15.58 1.04 NA NA 14.45 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "Intercept" "Slope"
##   .. .. ..$ : chr [1:4] "EST" "SE" "LCI" "UCI"
##   ..@ mnames     : chr [1:2] "Method1" "Method2"
##   ..@ regmeth    : chr "Deming"
##   ..@ cimeth     : chr "bootstrap"
##   ..@ error.ratio: num 1
##   ..@ alpha      : num 0.05
##   ..@ weight     : Named num [1:100] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:100] "S1" "S2" "S3" "S4" ...

dem.reg@para

##                EST SE       LCI       UCI
## Intercept 15.57790 NA 14.446677 16.810321
## Slope      1.03658 NA  1.006434  1.066066
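If S4 objects are unfamiliar, a toy example may help. The class name ToyFit and its contents are made up for illustration (the numbers are simply rounded from the output above); only the @ and slot() accessors are the point. It also shows why indexing the matrix with [1:2] picks out the intercept and slope: R stores a matrix column by column, so the first two elements are the EST column.

```r
library(methods)  # attached by default in interactive R; explicit for Rscript

# a made-up S4 class with a single matrix slot, loosely imitating @para
setClass("ToyFit", representation(para = "matrix"))

est <- matrix(c(15.58, 1.04, NA, NA, 14.45, 1.01, 16.81, 1.07),
              nrow = 2,
              dimnames = list(c("Intercept", "Slope"), c("EST", "SE", "LCI", "UCI")))
fit <- new("ToyFit", para = est)

fit@para                  # slots are reached with @, not $
slot(fit, "para")         # the functional equivalent
fit@para[1:2]             # column-major order: intercept, then slope
fit@para["Slope", "EST"]  # indexing by name is harder to get wrong
```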

The intercept and slope are stored in dem.reg@para[1] and dem.reg@para[2] respectively. Therefore, we can add the regression line as follows:

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")
abline(dem.reg@para[1:2], col = "blue")

plot of chunk unnamed-chunk-7

To emphasize how the residuals are different from OLS we can plot them as before:

plot of chunk unnamed-chunk-8

We present the figure above for instructional purposes only. The usual way to present a residuals plot is to show the same picture rotated until the line is horizontal–this is a slight simplification but is essentially what is happening:

plot of chunk unnamed-chunk-9
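A minimal sketch of how perpendicular residuals like these could be drawn in base R. As an assumption made for self-containment, it fits the line with prcomp() rather than with mcreg(): the first principal component gives the orthogonal regression line, which corresponds to unweighted Deming regression with an error ratio of 1. It also sets asp = 1, because perpendicular segments only look perpendicular when both axes share a scale.

```r
set.seed(20)
x <- runif(100, 0, 100)
y <- 1.10 * x - 0.001 * x^2 + rnorm(100, 0, 1) * (2 + 0.05 * x) + 15

# the first principal component gives the direction of the orthogonal regression line
pc <- prcomp(cbind(x, y))
b <- pc$rotation["y", 1] / pc$rotation["x", 1]  # slope
a <- mean(y) - b * mean(x)                      # intercept

# foot of the perpendicular from each point onto the line y = a + b * x
x.foot <- (x + b * (y - a)) / (1 + b^2)
y.foot <- a + b * x.foot

plot(x, y, asp = 1, main = "Perpendicular Residuals",
     xlab = "Current Method", ylab = "New Method")
abline(a, b, col = "blue")
segments(x, y, x.foot, y.foot, col = "red")
```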

Ratio of Variances

It is important to mention that if one knows that the $x$–axis method is subject to a different amount of random analytical variability than the $y$–axis method, one should provide the ratio of the variances of the two methods to mcreg(). In general, this requires us to have “CV” data from precision studies already available. Another approach is to perform every analysis in duplicate by both methods and use the data to estimate this ratio.

If the two methods have similar analytical variability throughout the analytical range, the default value of 1 is appropriate. But suppose that the ratio of the variance of the $x$–axis method to that of the $y$–axis method was 1.2; we could provide this in the regression call by setting the error.ratio parameter. The resulting regression parameters will be slightly different.

mcreg(x,y, method.reg = "Deming", error.ratio = 1.2)@para

##                 EST SE      LCI       UCI
## Intercept 15.534921 NA 14.39904 16.777065
## Slope      1.037499 NA  1.00792  1.067316
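Because the ratio is one of variances, a ratio of CVs has to be squared first, and even that shortcut only holds when the two methods report similar concentrations. The sketch below shows the arithmetic and, assuming every sample was run in duplicate by both methods, one way the ratio might be estimated from the duplicates. The function name dup_var(), the simulated duplicates and the error SDs of 1.2 and 1.0 are all hypothetical; as I read the mcr documentation, error.ratio is the squared measurement error of the $x$ (reference) method over that of the $y$ (test) method, but check help("mcreg") before relying on the direction.

```r
# analytical variance from duplicate measurements: sum(d^2) / (2 * n)
dup_var <- function(rep1, rep2) {
  d <- rep1 - rep2
  sum(d^2) / (2 * length(d))
}

# CVs of 6% and 5% at similar concentrations: CV ratio 1.2, variance ratio 1.44
(0.06 / 0.05)^2

# simulated duplicates: x method with error SD 1.2, y method with error SD 1.0
set.seed(1)
truth <- runif(200, 20, 80)
x1 <- truth + rnorm(200, 0, 1.2)
x2 <- truth + rnorm(200, 0, 1.2)
y1 <- truth + rnorm(200, 0, 1.0)
y2 <- truth + rnorm(200, 0, 1.0)

ratio <- dup_var(x1, x2) / dup_var(y1, y2)
ratio  # an estimate of 1.2^2 = 1.44, subject to sampling noise
```

The estimate could then be supplied to the regression call as error.ratio = ratio.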

Weighting

In the case of heteroscedastic data, it would be customary to weight the regression which in the case of the mcr package is weighted as $1/x^2$. This means that having 0's in your $x$–data will cause the calculation to “crump”. In any case, if we wanted weighted regression parameters we would make the call:

w.dem.reg <- mcreg(x,y, method.reg = "WDeming")

## The global.sigma is calculated with Linnet's method

w.dem.reg@para

##                 EST SE       LCI       UCI
## Intercept 13.788450 NA 12.858803 14.861006
## Slope      1.088119 NA  1.058042  1.116879

And plotting both on the same figure:

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")
abline(dem.reg@para[1:2], col = "blue")
abline(w.dem.reg@para[1:2], col = "green")
legend("topleft", c("Deming","Weighted Deming"), lty=c(1,1), col = c("blue","green"))

plot of chunk unnamed-chunk-12

Passing Bablok

Passing Bablok regression is not performed by the minimization of residuals. Rather, all possible pairs of $x$–$y$ points are determined and slopes are calculated using each pair of points. Work–arounds are undertaken for pairs of points that generate infinite slopes and other peculiarities. In any case, the median of the $\frac{N(N-1)}{2}$ possible slopes becomes the final slope estimate and the corresponding intercept can be calculated. With regard to weighted Passing Bablok regression, I’d like to acknowledge commenter glen_b for bringing to my attention that there is a paradigm for calculating the weighted median of pairwise slopes. See the comment section for a discussion.
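To make the procedure concrete, here is a minimal base R sketch of the unweighted estimator as I understand it: pairwise slopes, removal of undefined slopes and slopes of exactly -1, and a median shifted by the number of slopes below -1. The function name pb_fit() is hypothetical, confidence intervals are omitted, and ties in $x$ are handled in the simplest possible way, so treat it as an illustration and use a package for real work.

```r
pb_fit <- function(x, y) {
  idx <- combn(length(x), 2)        # all pairs i < j
  dx <- x[idx[2, ]] - x[idx[1, ]]
  dy <- y[idx[2, ]] - y[idx[1, ]]
  s <- dy / dx                      # pairwise slopes (may be +/-Inf)
  s <- s[!is.nan(s) & s != -1]      # drop identical points (0/0) and slopes of -1
  s <- sort(s)
  n <- length(s)
  k <- sum(s < -1)                  # offset that shifts the median
  slope <- if (n %% 2 == 1) {
    s[(n + 1) / 2 + k]
  } else {
    mean(s[n / 2 + k + 0:1])
  }
  c(Intercept = median(y - slope * x), Slope = slope)
}

# a clean line with one gross outlier at the end
x <- 1:20
y <- c(1:19, 10)
pb_fit(x, y)      # the shifted median of slopes is not moved by the outlier
coef(lm(y ~ x))   # OLS is dragged down by it
```

The combn() call generates every pair of points, which is why the cost grows roughly with the square of the number of observations.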

Passing Bablok regression takes a lot of computational time as the number of points grows, so expect some delays on data sets larger than $N=100$ if you are using an ordinary computer. To get the Passing Bablok regression equation, we just change the method.reg parameter:

PB.reg <- mcreg(x,y, method.reg = "PaBa")
PB.reg@para

##                 EST SE       LCI       UCI
## Intercept 14.684463 NA 13.648554 16.495846
## Slope      1.046021 NA  1.015893  1.075632

and the procedures to plot this regression are identical. The mcreg() function does have an option for Passing Bablok regression on large data sets. See the instructions by typing help("mcreg") in the R terminal.

Outlier Effects

As a consequence of the means by which the slope is determined, the Passing Bablok method is relatively resistant to the effect of outlier(s) as compared to OLS and Deming. To demonstrate this, we can add an outlier to some data scattered about the line $y=x$ and show how all three methods are affected.

x <- 1:20
y <- c(1:19,10) + rnorm(20,0,0.5)

plot of chunk unnamed-chunk-15

Because of this outlier, the OLS slope drops to 0.84, the Deming slope to 0.91, while the Passing Bablok is much better off at 0.99.

Generating a Pretty Plot

The code authors of the mcr package have created a feature such that if you put the regression model inside the plot function, you can quickly generate a figure for yourself that has all the required information on it. For example,

plot(PB.reg)

plot of chunk unnamed-chunk-16

But this out–of–the–box figure is not very customizable and you may want it to appear differently for your publication. Never fear. There is a solution. The MCResult.plot() function offers complete customization of the figure so that you can show it exactly as you wish for your publication.

MCResult.plot(PB.reg, equal.axis = TRUE, x.lab = "x method", y.lab = "y method",
              points.col = "#68228B60", points.pch = 19, ci.area = TRUE,
              ci.area.col = "#0000FF50", main = "My Passing Bablok Regression",
              sub = "", add.grid = FALSE, points.cex = 1)

custom mcr plot

In this example, I have created semi–transparent “darkorchid4” (hex = #68228B) points and a semi–transparent blue (hex = #0000FF) confidence band of the regression. Maybe darkorchid would not be my first choice for a publication after all, but it demonstrates the customization. Additionally, I have suppressed my least favourite features of the default plot method. Specifically, the sub="" term removes the sentence at the bottom margin and the add.grid = FALSE prevents the grid from being plotted. Enter help(MCResult.plot) for the complete low–down on customization.

Conclusion

We have seen how to perform Deming and Passing Bablok regression in the R programming language and have touched on how the methods differ “under the hood”. We have used the mcr package to perform the regressions and have shown how you can beautify your plot.

The reader should have a look at the rlm() function in the MASS package and the rq() function in the quantreg package to see other robust (outlier–resistant) regression approaches. A good tutorial can be found here.
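As a taste of the first of these, here is a hedged sketch using rlm() with its default settings (a Huber M-estimator) on the same kind of outlier-contaminated data used in the Outlier Effects section. The seed is my own choice, so the numbers will not match the slopes quoted there.

```r
library(MASS)  # a recommended package that ships with standard R installations

set.seed(1)
x <- 1:20
y <- c(1:19, 10) + rnorm(20, 0, 0.5)  # the last point is a gross outlier

ols <- lm(y ~ x)
rob <- rlm(y ~ x)  # iteratively down-weights points with large residuals

coef(ols)["x"]     # pulled away from 1 by the outlier
coef(rob)["x"]     # expected to sit closer to 1
tail(rob$w, 1)     # final weight given to the outlier
```

The w component holds the final weights from the iterative fit, so it doubles as a quick way to see which points the method treated as suspicious.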

I hope that makes it easy for you.

-Dan

May all your paths (and regressions) be straight:

Trust in the Lord with all your heart
and lean not on your own understanding;
in all your ways submit to him,
and he will make your paths straight.

Proverbs 3:5-6
