laboratory – The Lab-R-torian

Mining Your Routine Data for Reference Intervals: Hoffmann, Bhattacharya and Maximum Likelihood

September 5, 2017February 1, 2018 dtholmes@mail.ubc.ca

Background

Let me preface this by saying I am not making a recommendation to use the Hoffmann method. Neither am I advocating for reference interval mining from routine data. There are many challenges associated with this kind of effort. That's for another post I think. However, I am going to how one does the calculations for two methods I have seen used: the Hoffmann Method and the Bhattacharya Method. Then I will show how to do this using the mixtools package in R which uses the expectation maximum algorithm to determine the maximum likelihood.

The Concept

When you look at histograms of routine clinical data from allcomers, on some occasions the data will form a bimodal looking distribution formed by the putatively sick and well. If you could statistically determine the distribution of the well subjects, then you could, in principle, determine the reference interval without performing a reference interval study. We can all dream, right?

All three of the approaches I show assume that the two distributions are Gaussian. This is almost never true. But for the purposes of the calculations, I will provide each approach data that meets the assumptions it makes. So, let's make a fake bimodal distribution and see how each method does. We will assume equal numbers of sick and well so that the bimodal distribution is obvious. One will have $\mu_1 = 2$ and $\sigma_1 = 0.5$ and the other will have $\mu_2 = 6$ and $\sigma_2 = 2$. The expected normal range for this population is based on $\mu_1$ and $\sigma_1$ and is $2 – 0.5 \times 1.96$ and $2 + 0.5 \times 1.96$ or about 1–3.

#two Gaussian distributions with means of 2 and 6 respectively and sd's of 1 and 2
set.seed(10) #to make sure you generate the same data
mode1 <- rnorm(1000,2,0.5)
mode1 <- mode1[mode1 > 0] #get rid of negative results
mode2 <- rnorm(1000,6,2)
mode2 <- mode2[mode2 > 0] 
d <- sort(c(mode1,mode2))
dhist <- hist(d,
              breaks = c(seq(0,20,0.25),100),
              xlim = c(0,10),
              main = "Histogram of Patient Results",
              xlab = "Concentration of Analyte")

#two Gaussian distributions with means of 2 and 6 respectively and sd's of 1 and 2

set.seed(10) #to make sure you generate the same data

mode1 <- rnorm(1000,2,0.5)

mode1 <- mode1[mode1 > 0] #get rid of negative results

mode2 <- rnorm(1000,6,2)

mode2 <- mode2[mode2 > 0]

d <- sort(c(mode1,mode2))

dhist <- hist(d,

breaks = c(seq(0,20,0.25),100),

xlim = c(0,10),

main = "Histogram of Patient Results",

xlab = "Concentration of Analyte")

plot of chunk unnamed-chunk-1

To illustrate how the two populations add you can plot one in green and one in pink. The overlap shows in a yucky brown.

hist(d,
     breaks = c(seq(0,10,0.25),100),
     freq = TRUE,
     xlim = c(0,10),
     main = "Histogram of Patient Results",
     xlab = "Concentration of Analyte")
hist(mode1, breaks = c(seq(0,10,0.25),100), add = TRUE, col = rgb(0,1,0,0.3), freq = TRUE)
hist(mode2, breaks = c(seq(0,10,0.25),100), add = TRUE, col = rgb(1,0,0,0.3), freq = TRUE)

hist(d,

breaks = c(seq(0,10,0.25),100),

freq = TRUE,

xlim = c(0,10),

main = "Histogram of Patient Results",

xlab = "Concentration of Analyte")

hist(mode1, breaks = c(seq(0,10,0.25),100), add = TRUE, col = rgb(0,1,0,0.3), freq = TRUE)

hist(mode2, breaks = c(seq(0,10,0.25),100), add = TRUE, col = rgb(1,0,0,0.3), freq = TRUE)

plot of chunk unnamed-chunk-2

Hoffmann

In 1963 Robert Hoffmann proposed a simple graphical approach to this problem and use of his method is alive and well—see here for example. The method assumes that both modes are Gaussian and that if one eye-balls (yes…the paper says “eye-fit”) the first linear-looking portion of the cumulative probability distribution (CDF) function as plotted on normal probability paper and finds its intersection with the lines y = 0.025 and y = 0.975, one can impute the normal range.

What do I plot for Hoffmann: a QQ-plot or the CDF?

It is very important to understand that the use normal probability paper, as Hoffmann described, was mandatory because it produces a normal probablity plot. As he says,

“This special graph paper serves the useful purpose of 'straightening out' a cumulative gaussian distribution. It forms a straight line.”

A CDF plotted on linear scale is sigmoidal. This is not what we want. We want a normal probability plot which is just a special case of the QQ-plot where the comparator distribution is the normal distribution. Inadvertently plotting a plain old CDF will not produce correct estimates of the lower and upper limits of normal (ie $\mu \pm 1.96\sigma$). The reason I emphasize this is that I have seen this error made in a number of reference interval papers—but not the one I cited above—it is correct. The importance of the distinction becomes not-very-subtle when you apply the Hoffmann approach to a pure Gaussian distribution. In short, use of the CDF in linear space generates erroneous results as we will show later on.

The Correct Approach

Here is the standard r-base normal QQ-plot of our mock data set:

qqnorm(d, type = "l")

1 2	qqnorm(d, type = "l")

plot of chunk unnamed-chunk-3

To prevent reader confusion, I am going present the plots the way Hoffmann originally showed them. So I will put the patient data on the x-axis. It doesn't change anything. You can do it as you like.

my.qq <- qqnorm(d, datax = TRUE, type = "l", ylab = "Patient Results", xlab = "Quantiles of the Normal Distribution")

1 2	my.qq <- qqnorm(d, datax = TRUE, type = "l", ylab = "Patient Results", xlab = "Quantiles of the Normal Distribution")

plot of chunk unnamed-chunk-4

From this you can see that there is obviously linear section between about x = 0 to x = 2 (and with the eye of faith, there is a second after x = 6). This is what Hoffmann calls the “eye-fit”. Since the first linear section is attributable to the first of the two normal distributions which form the overall distribution, we can use the it to determine properties of the first distribution. If I look only at the data between x = 0 and x = 2, I am sort-of guaranteed to be in the first linear section. You don't have to kill yourself to correctly identify where the linearity ends because the density of the points should be highest near the middle of the linear section and this will weight the regression for you.

Next if I extend this line to find its intersection with y = -1.96 and y = 1.96 (ie the z-scores corresponding the limits of normal, namely the 2.5th and 97.5th centiles), I can estimate the reference interval, by dropping perpendicular lines from the two respective intersections. Here is what I get:

#get regression line - it's linear from about from 0 to 2
my.qq <- as.data.frame(my.qq)
linear.bit <- subset(my.qq, x <= 2)
#get the regression line of the linear section
reg <- lm(y ~ x, data = linear.bit)
plot(y ~ x,
     data = my.qq,
     type = "l",
     ylab = "Quantiles of the Normal Distribution",
     xlab = "Patient Results")
abline(reg, col = "red")
abline(h = c(-1.96,1.96), lty = 2)
uln.hoff <- unname((1.96 - coef(reg)[1])/coef(reg)[2])
lln.hoff <- unname((-1.96 - coef(reg)[1])/coef(reg)[2])
abline(v = c(lln.hoff,uln.hoff), lty = 2)

#get regression line - it's linear from about from 0 to 2

my.qq <- as.data.frame(my.qq)

linear.bit <- subset(my.qq, x <= 2)

#get the regression line of the linear section

reg <- lm(y ~ x, data = linear.bit)

plot(y ~ x,

data = my.qq,

type = "l",

ylab = "Quantiles of the Normal Distribution",

xlab = "Patient Results")

abline(reg, col = "red")

abline(h = c(-1.96,1.96), lty = 2)

uln.hoff <- unname((1.96 - coef(reg)[1])/coef(reg)[2])

lln.hoff <- unname((-1.96 - coef(reg)[1])/coef(reg)[2])

abline(v = c(lln.hoff,uln.hoff), lty = 2)

plot of chunk unnamed-chunk-5

lln.hoff

lln.hoff

## [1] 1.105849

1	## [1] 1.105849

uln.hoff

uln.hoff

## [1] 3.699254

1	## [1] 3.699254

So the Hoffmann reference interval becomes 1.11 to 3.70 which you can compare to the expected values of about 1 and 3 based on the random data. Not the greatest but not bad.

What not to do

Let's apply the correct approach to the Hoffmann method (QQ-Plot) and incorrect approach (CDF on a linear scale) to a pseudorandom sampling (n=10,000) of the standard normal distribution, which has a mean of 0 and a standard deviation of 1. Therefore the central 95% or “normal range” for this distribution will be -1.96 to 1.96. I will plot regression lines through the linear part of each curve and find the respective intersections with the appropriate horizontal lines.

# QQ-Norm plot of standard normal distribution
z <- sort(rnorm(10000,0,1))
par(mfrow = c(1,2))
my.qq <- qqnorm(z, type = "l", datax = TRUE, plot.it = FALSE)
plot(my.qq,
     ylim = c(-2.1,2.1),
     type = "l",
     main = "Normal QQ-plot",
     ylab = "Quantiles of the Normal Distribution",
     xlab = "Sample Quantiles")
qqline(z, col = "blue")
abline(h=c(1.96,-1.96), col = "blue")
abline(v = c(1.96,-1.96), col = "blue", lty = 2) #lower and upper limits are -2 and 2.

# CDF of standard normal distribution
my.ecdf <- ecdf(z)
df <- data.frame(z = z, y = my.ecdf(z))
plot(y ~ z, data = df, type = "l", main = "Cumulative Normal Distribution")
abline(v = c(1.96,-1.96), col = "blue", lty = 2)
abline(h = c(0.025,0.975), col = "blue")
linear.bit <- subset(df, z > -.5 & z < 0.5)
abline(lm(y ~ z, data = linear.bit), col = "blue")
abline(v = c(-(0.5 - 0.05/2)*sqrt(2*pi),(0.5 - 0.05/2)*sqrt(2*pi)), col  = rgb(1,0,0,0.5) )

# QQ-Norm plot of standard normal distribution

z <- sort(rnorm(10000,0,1))

par(mfrow = c(1,2))

my.qq <- qqnorm(z, type = "l", datax = TRUE, plot.it = FALSE)

plot(my.qq,

ylim = c(-2.1,2.1),

type = "l",

main = "Normal QQ-plot",

ylab = "Quantiles of the Normal Distribution",

xlab = "Sample Quantiles")

qqline(z, col = "blue")

abline(h=c(1.96,-1.96), col = "blue")

abline(v = c(1.96,-1.96), col = "blue", lty = 2) #lower and upper limits are -2 and 2.

# CDF of standard normal distribution

my.ecdf <- ecdf(z)

df <- data.frame(z = z, y = my.ecdf(z))

plot(y ~ z, data = df, type = "l", main = "Cumulative Normal Distribution")

abline(v = c(1.96,-1.96), col = "blue", lty = 2)

abline(h = c(0.025,0.975), col = "blue")

linear.bit <- subset(df, z > -.5 & z < 0.5)

abline(lm(y ~ z, data = linear.bit), col = "blue")

abline(v = c(-(0.5 - 0.05/2)*sqrt(2*pi),(0.5 - 0.05/2)*sqrt(2*pi)), col = rgb(1,0,0,0.5) )

plot of chunk unnamed-chunk-6

The QQ-plot generates estimates the limits of normal, $\mu \pm 1.96\sigma$, as about $\pm 1.96$ as it should. You can easily show that the same procedure on the CDF intersects the lines $y = \alpha /2$ and $y = 1 – \alpha/2$ at values of $\pm (1 – \alpha) \sqrt{\pi/2} \sigma$ which is about $\pm 1.19$ for $\sigma = 1$ and $\alpha = 0.05$. This erroneous estimate is shown with the pink vertical lines. So the Hoffmann method does not work if one attempts to extend the linear portion of the CDF if it is plotted in linear space and it will produce estimates of $\sigma$ that are 40% too low in this case. If you're puting this all together, this because the CDF is well away from its linear portion when the cumulative proportions are 0.025 and 0.975—not so for a QQ-plot. If you see a “Hoffmann plot” constructed from a sigmoidal CDF plotted on a linear scale, something is wrong.

Bhattacharya

This method is based on a much more highly cited paper in Biometrics published in 1967 by C.G. Bhattacharya. Loosely speaking, the method of Bhattacharya determines the parameter estimates of $\mu_i$ and $\sigma_i$ from the slope of the log of the distribution function. It was originally intended as a graphical method and so it also involves some human eye-balling.

We will need the log of the counts from the histogram. When we store the results of a histogram in R, we have the counts automatically.

str(dhist)

1 2	str(dhist)

## List of 6
##  $ breaks  : num [1:82] 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 ...
##  $ counts  : int [1:81] 2 4 5 19 49 100 140 210 174 180 ...
##  $ density : num [1:81] 0.004 0.008 0.01 0.038 0.098 ...
##  $ mids    : num [1:81] 0.125 0.375 0.625 0.875 1.125 ...
##  $ xname   : chr "d"
##  $ equidist: logi FALSE
##  - attr(*, "class")= chr "histogram"

## List of 6

## $ breaks : num [1:82] 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 ...

## $ counts : int [1:81] 2 4 5 19 49 100 140 210 174 180 ...

## $ density : num [1:81] 0.004 0.008 0.01 0.038 0.098 ...

## $ mids : num [1:81] 0.125 0.375 0.625 0.875 1.125 ...

## $ xname : chr "d"

## $ equidist: logi FALSE

## - attr(*, "class")= chr "histogram"

We can now calculate the log of the counts (denoted $y$) and $\Delta log(y)$ from bin to bin. We put these in a dataframe along with the counts and the midpoints of the bins. The bin width, which is chosen to be constant $h$, is the distance between the midpoints of each bin.

#alter the number of breaks to make the linear sections more obvious.
dhist <- hist(d,
              breaks = 30,
             plot = FALSE)
ly <- log(dhist$counts)
dly <- diff(ly)
df <- data.frame(xm = dhist$mids[-length(dhist$mids)],
                 ly = dly,
                 counts = dhist$counts[-length(dhist$mids)])
h <- diff(df$xm)[1]

#alter the number of breaks to make the linear sections more obvious.

dhist <- hist(d,

breaks = 30,

plot = FALSE)

ly <- log(dhist$counts)

dly <- diff(ly)

df <- data.frame(xm = dhist$mids[-length(dhist$mids)],

ly = dly,

counts = dhist$counts[-length(dhist$mids)])

h <- diff(df$xm)[1]

Now let's plot $\Delta log(y)$ as a function of the midpoints of the bins. I also number all the points to facilitate the next step.

plot(ly ~ xm, data = df, xlim = c(0,10))
abline(h = 0)
abline(v = df$xm, lty = 2, col = "#00000080")

#number all the points
library(calibrate)
num.df <- na.omit(df)
textxy(num.df$xm, num.df$ly, 1:nrow(num.df),
       row.names(num.df),
       cex = 0.8,
       offset = 1,
       col = "blue")

plot(ly ~ xm, data = df, xlim = c(0,10))

abline(h = 0)

abline(v = df$xm, lty = 2, col = "#00000080")

#number all the points

library(calibrate)

num.df <- na.omit(df)

textxy(num.df$xm, num.df$ly, 1:nrow(num.df),

row.names(num.df),

cex = 0.8,

offset = 1,

col = "blue")

plot of chunk unnamed-chunk-9

We can see from the figure that there are two sections where the plot shows a downsloping line: one between points 2 to 6 and another between points 10 to 21. How straight these lines appear is affected by how wide your bins are so if you get lines that are hard to discern, you can try making fewer bins.

In any case, using Bhattacharya's notation, the next step in the procedure is to draw regression lines through the $r_{th}$ linear section and determine the intercept $\hat{\lambda}_r$ with the x-axis. Bhattacharya intended this as a graphical procedure and advises,

“While matching the straight line it is better to fit closely to the points where the frequency is large even if the apparent discrepancy becomes somewhat large where the frequency is small.”

Since we are doing this by calculation, we can take his advice by weighting the linear regressions according to the counts. This allows the determination of the $\hat{\mu}_r$ by:

\[\hat{\mu}_r = \hat{\lambda}_r + h/2\]
and also the determination of $\hat{\sigma}_r$ by:

\[\hat{\sigma}^2_r = -h/ \text{slope}_r – h^2/12\]

#linear section 1
linear.bit1 <- subset(df[2:6,])
lm1 <- lm(ly ~ xm, data = linear.bit1, weights = linear.bit1$counts)
lambda1 <- -coef(lm1)[1]/coef(lm1)[2]
mu1 <- lambda1 + h/2
sigma1 <- sqrt(-h/coef(lm1)[2] - h^2/12) 

#linear section 2
linear.bit2 <- subset(df[10:21,])
lm2 <- lm(ly ~ xm, data = linear.bit2, weights = linear.bit2$counts)
lambda2 <- -coef(lm2)[1]/coef(lm2)[2]
mu2 <- lambda2 + h/2
sigma2 <- sqrt(-h/coef(lm2)[2] - h^2/12)

#normal range limits
lln.bhat <- qnorm(0.025,mu1, sigma1) 
uln.bhat <- qnorm(0.975,mu1, sigma1)

#linear section 1

linear.bit1 <- subset(df[2:6,])

lm1 <- lm(ly ~ xm, data = linear.bit1, weights = linear.bit1$counts)

lambda1 <- -coef(lm1)[1]/coef(lm1)[2]

mu1 <- lambda1 + h/2

sigma1 <- sqrt(-h/coef(lm1)[2] - h^2/12)

#linear section 2

linear.bit2 <- subset(df[10:21,])

lm2 <- lm(ly ~ xm, data = linear.bit2, weights = linear.bit2$counts)

lambda2 <- -coef(lm2)[1]/coef(lm2)[2]

mu2 <- lambda2 + h/2

sigma2 <- sqrt(-h/coef(lm2)[2] - h^2/12)

#normal range limits

lln.bhat <- qnorm(0.025,mu1, sigma1)

uln.bhat <- qnorm(0.975,mu1, sigma1)

And here are the results we get:

mu Values	sigma Values	Normal Range Limits
2.06	0.59	0.90
6.25	1.83	3.21

And here is what it all looks like

plot(ly ~ xm, data = df, xlim = c(0,10))
abline(h = 0)
abline(v = df$xm, lty = 2, col = "#00000080")
abline(lm1, col = "green")
abline(lm2, col = "red")

plot(ly ~ xm, data = df, xlim = c(0,10))

abline(h = 0)

abline(v = df$xm, lty = 2, col = "#00000080")

abline(lm1, col = "green")

abline(lm2, col = "red")

plot of chunk unnamed-chunk-12

In this demonstration, there are only two Gaussian distributions to resolve, but the method is not limited to the resolution of two Gaussian curves at all. If there are more, there will be more downsloping lines crossing the x-axis. So we get normal range estimates of 0.90 and 3.21 which compare much better with the expected values of about 1 and 3. We also get good estimates of $\mu_2=$ 6.3 and $\sigma_2=$ 1.8 which are about 6 and 2 respectively in our data set.

Bhattacharya also provides a means of calculating the mixing proportion of the two distributions—that is, the proportions of patients in the sick and abnormal populations. We don't need that here so I omit it.

Gaussian Mixture Model

In R there are a lot of ways to approach the separation of mixtures of distributions using maximum likelihood. Here I am using a function from the mixtools package that is particularly easy to use. The concept of using maximum likelihood for mining your reference interval is not new (see this paper) but many would be intimidated by the math required to do it from scratch.

With R, this is pretty easy but please be cautioned that real data does not play as nice as the data in this demonstration (even moreso for Hoffmann and Bhattacharya) and it is unlikely that you will get smashing results unless your data fits the assumptions of the model.

In any case,

#Gaussian Mixed Model - the right way to do this
library(mixtools)
fit <- normalmixEM(d, k = 2) #try to fit two Gaussians

#Gaussian Mixed Model - the right way to do this

library(mixtools)

fit <- normalmixEM(d, k = 2) #try to fit two Gaussians

## number of iterations= 28

1	## number of iterations= 28

summary(fit)

1 2	summary(fit)

## summary of normalmixEM object:
##          comp 1   comp 2
## lambda 0.519121 0.480879
## mu     2.014404 6.186571
## sigma  0.518210 1.966858
## loglik at estimate:  -3961.014

## summary of normalmixEM object:

## comp 1 comp 2

## lambda 0.519121 0.480879

## mu 2.014404 6.186571

## sigma 0.518210 1.966858

## loglik at estimate: -3961.014

which gives very good parameter estimates indeed! Estimates of $\mu_1$ and $\mu_2$ are 2.01 and 6.19 respectively and estimates of $\sigma_1$ and $\sigma_2$ are 0.52 and 1.97 respectively.

Looking at this graphically:

hist(d, freq = FALSE, breaks = 50, main = "Histogram of Patient Results")
#show the respective curves
lines(d,fit$lambda[1]*dnorm(d,fit$mu[1],fit$sigma[1]), col = "green")
lines(d,fit$lambda[2]*dnorm(d,fit$mu[2],fit$sigma[2]), col = "red")

hist(d, freq = FALSE, breaks = 50, main = "Histogram of Patient Results")

#show the respective curves

lines(d,fit$lambda[1]*dnorm(d,fit$mu[1],fit$sigma[1]), col = "green")

lines(d,fit$lambda[2]*dnorm(d,fit$mu[2],fit$sigma[2]), col = "red")

plot of chunk unnamed-chunk-14

#find the 2.5th 97.5th percentile from the mixed model fit
lln.EM <- qnorm(0.025,fit$mu[1], fit$sigma[1]) 
lln.EM

#find the 2.5th 97.5th percentile from the mixed model fit

lln.EM <- qnorm(0.025,fit$mu[1], fit$sigma[1])

lln.EM

## [1] 0.9987315

1	## [1] 0.9987315

uln.EM <- qnorm(0.975,fit$mu[1], fit$sigma[1]) 
uln.EM

uln.EM <- qnorm(0.975,fit$mu[1], fit$sigma[1])

uln.EM

## [1] 3.030077

1	## [1] 3.030077

So the normal range estimate from EM method is 1.00 to 3.03 which is pretty fantastic.

Summary of Results

	LLN	ULN
Raw Random Data	1.03	2.98
Hoffmann	1.11	3.70
Bhattacharya	0.90	3.21
mixtools EM – winner!	1.00	3.03

It's not too hard to figure out which one of these approaches works best. But what do you do if your patient data distribution is obviously not a mixture of Gaussians (ie when the distributions look skewed)? There are ways to do this in R for this but I will cover that another time–maybe in a paper.

Conclusion

Three methods of estimating the normal range from a mixture of Gaussians have been presented.
The Hoffmann method performs OK if you use a QQ-plot.
The Hoffmann method does not work for CDFs plotted on a linear scale.
The Bhattacharya method performs better but still requires some human oversight.
The normalmixEM() function from the mixtools package performs very well without any human oversight.
These results do not imply that any of these approaches will perform well on real patient data for which the components of the overall distribution are not likely to be Gaussian. Caution advised.

Parting Thought

Please don't fall on the wrong side of God's mixture separation procedures for wheat and chaff.

Said, John the Baptist, “But after me comes one who is more powerful than I, whose sandals I am not worthy to carry. He will baptize you with the Holy Spirit and fire. His winnowing fork is in his hand, and he will clear his threshing floor, gathering his wheat into the barn and burning up the chaff with unquenchable fire.”

Matt 3:11–12

Non-Linear Regression: Application to Monoclonal Peak Integration in Serum Protein Electrophoresis

August 28, 2017August 28, 2017 dtholmes@mail.ubc.ca

Background

At the AACC meeting recently, there was an enthusiastic discussion of standardization of reporting for serum protein electrophoresis (SPEP) presented by a working group headed up by Dr. Chris McCudden and Dr. Ron Booth, both of the University of Ottawa. One of the discussions pertained to how monoclonal bands, especially small ones, should be integrated. While many use the default manual vertical gating or “drop” method offered by Sebia's Phoresis software, Dr. David Keren was discussing the value of tangent skimming as a more repeatable and effective means of monoclonal protein quantitation. He was also discussing some biochemical approaches distinguishing monoclonal proteins from the background gamma proteins.

The drop method is essentially an eye-ball approach to where the peak starts and ends and is represented by the vertical lines and the enclosed shaded area.

plot of chunk unnamed-chunk-1

The tangent skimming approach is easier to make reproducible. In the mass spectrometry world it is a well-developed approach with a long history and multiple algorithms in use. This is apparently the book. However, when tangent skimming is employed in SPEP, unless I am mistaken, it seems to be done by eye. The integration would look like this:

plot of chunk unnamed-chunk-2

During the discussion it was point out that peak deconvolution of the monoclonal protein from the background gamma might be preferable to either of the two described procedures. By this I mean integration as follows:

plot of chunk unnamed-chunk-3

There was discussion this procedure is challenging for number of reasons. Further, it should be noted that there will only likely be any clinical value in a deconvolution approach when the concentration of the monoclonal protein is low enough that manual integration will show poor repeatability, say < 5 g/L = 0.5 g/dL.

Easy Peaks

Fitting samples with larger monoclonal peaks is fairly easy. Fitting tends to converge nicely and produce something meaningful. For example, using the approach I am about to show below, an electropherogram like this:

plot of chunk unnamed-chunk-4

with a gamma region looking like this:

plot of chunk unnamed-chunk-5

can be deconvoluted with straightforward non-linear regression (and no baseline subtraction) to yield this:

plot of chunk unnamed-chunk-6

and the area of the green monoclonal peak is found to be 5.3%.

More Difficult Peaks

What is more challenging is the problem of small monoclonals buried in normal $\gamma$-globulins. These could be difficult to integrate using a tangent skimming approach, particularly without image magnification. For the remainder of this post we will use a gel with a small monoclonal in the fast gamma region shown at the arrow.

plot of chunk unnamed-chunk-7

Getting the Data

EP data can be extracted from the PDF output from any electrophoresis software. This is not complicated and can be accomplished with pdf2svg or Inkscape and some Linux bash scripting. I'm sure we can get it straight from the instrument but it is not obvious to me how to do this. One could also rescan a gel and use ImageJ to produce a densitometry scan which is discussed in the ImageJ documentation and on YouTube. ImageJ also has a macro language for situations where the same kind of processing is done repeatedly.

Smoothing

The data has 10284 pairs of (x,y) data. But if you blow up on it and look carefully you find that it is a series of staircases.

plot(y~x, data = head(ep.data,100), type = "o", cex = 0.5)

1 2	plot(y~x, data = head(ep.data,100), type = "o", cex = 0.5)

plot of chunk unnamed-chunk-8

It turns out that this jaggedness significantly impairs attempts to numerically identify the peaks and valleys. So, I smoothed it a little using the handy rle() function to identify the midpoint of each step. This keeps the total area as close to its original value as possible–though this probably does not matter too much.

ep.rle <- rle(ep.data$y)
stair.midpoints <- cumsum(ep.rle$lengths) - floor(ep.rle$lengths/2)
ep.data.sm <- ep.data[stair.midpoints,]
plot(y~x, data = head(ep.data,300), type = "o", cex = 0.5)
points(y~x, data = head(ep.data.sm,300), type = "o", cex = 0.5, col = "red")

ep.rle <- rle(ep.data$y)

stair.midpoints <- cumsum(ep.rle$lengths) - floor(ep.rle$lengths/2)

ep.data.sm <- ep.data[stair.midpoints,]

plot(y~x, data = head(ep.data,300), type = "o", cex = 0.5)

points(y~x, data = head(ep.data.sm,300), type = "o", cex = 0.5, col = "red")

plot of chunk unnamed-chunk-9

Now that we are satisfied that the new data is OK, I will overwrite the original dataframe.

ep.data <- ep.data.sm

1 2	ep.data <- ep.data.sm

Transformation

The units on the x and y-axes are arbitrary and come from page coordinates of the PDF. We can normalize the scan by making the x-axis go from 0 to 1 and by making the total area 1.

library(Bolstad) #A package containing a function for Simpon's Rule integration
ep.data$x <- ep.data$x/max(ep.data$x)
A.tot <- sintegral(ep.data$x,ep.data$y)$value
ep.data$y <- ep.data$y/A.tot

#sanity check
sintegral(ep.data$x,ep.data$y)$value

library(Bolstad) #A package containing a function for Simpon's Rule integration

ep.data$x <- ep.data$x/max(ep.data$x)

A.tot <- sintegral(ep.data$x,ep.data$y)$value

ep.data$y <- ep.data$y/A.tot

#sanity check

sintegral(ep.data$x,ep.data$y)$value

## [1] 1

## [1] 1

plot(y~x, data = ep.data, type = "l")

1 2	plot(y~x, data = ep.data, type = "l")

plot of chunk unnamed-chunk-11

Find Extrema

Using the findPeaks function from the quantmod package we can find the minima and maxima:

library(quantmod)
ep.max <- findPeaks(ep.data$y)
plot(y~x, data = ep.data, type = "l", main = "Maxima")
abline(v = ep.data$x[ep.max], col = "red", lty = 2)

library(quantmod)

ep.max <- findPeaks(ep.data$y)

plot(y~x, data = ep.data, type = "l", main = "Maxima")

abline(v = ep.data$x[ep.max], col = "red", lty = 2)

plot of chunk unnamed-chunk-12

ep.min <- findValleys(ep.data$y)
plot(y~x, data = ep.data, type = "l", main = "Minima")
abline(v = ep.data$x[ep.min], col = "blue", lty = 2)

ep.min <- findValleys(ep.data$y)

plot(y~x, data = ep.data, type = "l", main = "Minima")

abline(v = ep.data$x[ep.min], col = "blue", lty = 2)

plot of chunk unnamed-chunk-12

Not surprisingly, there are some extraneous local extrema that we do not want. I simply manually removed them. Generally, this kind of thing could be tackled with more smoothing of the data prior to analysis.

ep.max <- ep.max[-1]
ep.min <- ep.min[-c(1,length(ep.min))]

ep.max <- ep.max[-1]

ep.min <- ep.min[-c(1,length(ep.min))]

Fitting

Now it's possible with the nls() function to fit the entire SPEP with a series of Gaussian curves simultaneously. It works just fine (provided you have decent initial estimates of $\mu_i$ and $\sigma_i$) but there is no particular clinical value to fitting the albumin, $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ domains with Gaussians. What is of interest is separately quantifying the two peaks in $\gamma$ with two separate Gaussians so let's isolate the $\gamma$ region based on the location of the minimum between $\beta_2$ and $\gamma$.

Isolate the $\gamma$ Region

gamma.ind <- max(ep.min):nrow(ep.data)
gamma.data <- data.frame(x = ep.data$x[gamma.ind], y = ep.data$y[gamma.ind])
plot(y ~ x, gamma.data, type  = "l")

gamma.ind <- max(ep.min):nrow(ep.data)

gamma.data <- data.frame(x = ep.data$x[gamma.ind], y = ep.data$y[gamma.ind])

plot(y ~ x, gamma.data, type = "l")

plot of chunk unnamed-chunk-14

Attempt Something that Ultimately Does Not Work

At first I thought I could just throw two normal distributions at this and it would work. However, it does not work well at all and this kind of not-so-helpful fit turns out to happen a fair bit. I use the nls() function here which is easy to call. It requires a functional form which I set to be:

\[y = C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big) + C_2 \exp \Big({-\frac{(x-\mu_2)^2}{2\sigma_2^2}}\Big)\]

where $\mu_1$ is the $x$ location of the first peak in $\gamma$ and $\mu_2$ is the $x$ location of the second peak in $\gamma$. The estimates of $\sigma_1$ and $\sigma_2$ can be obtained by trying to estimate the full-width-half-maximum (FWHM) of the peaks, which is related to $\sigma$ by

\[FWHM_i = 2 \sqrt{2\ln2} \times \sigma_i = 2.355 \times \sigma_i\]

I had to first make a little function that returns the respective half-widths at half-maximum and then uses them to estimate the $FWHM$. Because the peaks are poorly resolved, it also tries to get the smallest possible estimate returning this as FWHM2.

FWHM.finder <- function(ep.data, mu.index){
  peak.height <- ep.data$y[mu.index]
  fxn.for.roots <- ep.data$y - peak.height/2
  indices <- 1:nrow(ep.data)
  root.indices <- which(diff(sign(fxn.for.roots))!=0)
  tmp <- c(root.indices,mu.index) %>% sort
  tmp2 <- which(tmp == mu.index)
  first.root <- root.indices[tmp2 -1]
  second.root <- root.indices[tmp2]
  HWHM1 <- ep.data$x[mu.index] - ep.data$x[first.root]
  HWHM2 <- ep.data$x[second.root] - ep.data$x[mu.index]
  FWHM <- HWHM2 + HWHM1
  FWHM2 = 2*min(c(HWHM1,HWHM2))
  return(list(HWHM1 = HWHM1,HWHM2 = HWHM2,FWHM = FWHM,FWHM2 = FWHM2))
}

FWHM.finder <- function(ep.data, mu.index){

peak.height <- ep.data$y[mu.index]

fxn.for.roots <- ep.data$y - peak.height/2

indices <- 1:nrow(ep.data)

root.indices <- which(diff(sign(fxn.for.roots))!=0)

tmp <- c(root.indices,mu.index) %>% sort

tmp2 <- which(tmp == mu.index)

first.root <- root.indices[tmp2 -1]

second.root <- root.indices[tmp2]

HWHM1 <- ep.data$x[mu.index] - ep.data$x[first.root]

HWHM2 <- ep.data$x[second.root] - ep.data$x[mu.index]

FWHM <- HWHM2 + HWHM1

FWHM2 = 2*min(c(HWHM1,HWHM2))

return(list(HWHM1 = HWHM1,HWHM2 = HWHM2,FWHM = FWHM,FWHM2 = FWHM2))

}

The peak in the $\gamma$ region was obtained previously:

plot(y ~ x, gamma.data, type  = "l")
gamma.max <- findPeaks(gamma.data$y)
abline(v = gamma.data$x[gamma.max])

plot(y ~ x, gamma.data, type = "l")

gamma.max <- findPeaks(gamma.data$y)

abline(v = gamma.data$x[gamma.max])

plot of chunk unnamed-chunk-16

and from them $\mu_1$ is determined to be 0.7. We have to guess where the second peak is, which is at about $x=0.75$ and has an index of 252 in the gamma.data dataframe.

gamma.data[252,]

1 2	gamma.data[252,]

##             x         y
## 252 0.7487757 0.6381026

1 2	## x y ## 252 0.7487757 0.6381026

#append the second peak
gamma.max <- c(gamma.max,252)
gamma.mu <- gamma.data$x[gamma.max]
gamma.mu

#append the second peak

gamma.max <- c(gamma.max,252)

gamma.mu <- gamma.data$x[gamma.max]

gamma.mu

## [1] 0.6983350 0.7487757

1	## [1] 0.6983350 0.7487757

plot(y ~ x, gamma.data, type  = "l")
abline(v = gamma.data$x[gamma.max])

plot(y ~ x, gamma.data, type = "l")

abline(v = gamma.data$x[gamma.max])

plot of chunk unnamed-chunk-17

Now we can find the estimates of the standard deviations:

#find the FWHM estimates of sigma_1 and sigma_2:
FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.data)
gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355

#find the FWHM estimates of sigma_1 and sigma_2:

FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.data)

gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355

The estimates of $\sigma_1$ and $\sigma_2$ are now obtained. The estimates of $C_1$ and $C_2$ are just the peak heights.

peak.heights <- gamma.data$y[gamma.max]

1 2	peak.heights <- gamma.data$y[gamma.max]

We can now use nls() to determine the fit.

fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) +
                  C2*exp(-(x-mean2)**2/(2 * sigma2**2))),
           data = gamma.data,
           start = list(mean1 = gamma.mu[1],
                        mean2 = gamma.mu[2],
                        sigma1 = gamma.sigma[1],
                        sigma2 = gamma.sigma[2],
                        C1 = peak.heights[1],
                        C2 = peak.heights[2]),
           algorithm = "port")

fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) +

C2*exp(-(x-mean2)**2/(2 * sigma2**2))),

data = gamma.data,

start = list(mean1 = gamma.mu[1],

mean2 = gamma.mu[2],

sigma1 = gamma.sigma[1],

sigma2 = gamma.sigma[2],

C1 = peak.heights[1],

C2 = peak.heights[2]),

algorithm = "port")

Determining the fitted values of our unknown coefficients:

dffit <- data.frame(x=seq(0, 1 , 0.001))
dffit$y <- predict(fit, newdata=dffit)

fit.sum <- summary(fit)
fit.sum #show the fitted coefficients

dffit <- data.frame(x=seq(0, 1 , 0.001))

dffit$y <- predict(fit, newdata=dffit)

fit.sum <- summary(fit)

fit.sum #show the fitted coefficients

## 
## Formula: y ~ (C1 * exp(-(x - mean1)^2/(2 * sigma1^2)) + C2 * exp(-(x - 
##     mean2)^2/(2 * sigma2^2)))
## 
## Parameters:
##         Estimate Std. Error t value Pr(>|t|)    
## mean1  0.7094793  0.0003312 2142.23   <2e-16 ***
## mean2  0.7813900  0.0007213 1083.24   <2e-16 ***
## sigma1 0.0731113  0.0002382  306.94   <2e-16 ***
## sigma2 0.0250850  0.0011115   22.57   <2e-16 ***
## C1     0.6983921  0.0018462  378.29   <2e-16 ***
## C2     0.0819704  0.0032625   25.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01291 on 611 degrees of freedom
## 
## Algorithm "port", convergence message: both X-convergence and relative convergence (5)

## Formula: y ~ (C1 * exp(-(x - mean1)^2/(2 * sigma1^2)) + C2 * exp(-(x -

## mean2)^2/(2 * sigma2^2)))

## Parameters:

## Estimate Std. Error t value Pr(>|t|)

## mean1 0.7094793 0.0003312 2142.23 <2e-16 ***

## mean2 0.7813900 0.0007213 1083.24 <2e-16 ***

## sigma1 0.0731113 0.0002382 306.94 <2e-16 ***

## sigma2 0.0250850 0.0011115 22.57 <2e-16 ***

## C1 0.6983921 0.0018462 378.29 <2e-16 ***

## C2 0.0819704 0.0032625 25.12 <2e-16 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Residual standard error: 0.01291 on 611 degrees of freedom

## Algorithm "port", convergence message: both X-convergence and relative convergence (5)

coef.fit <- fit.sum$coefficients[,1]
mu.fit <- coef.fit[1:2]
sigma.fit <- coef.fit[3:4]
C.fit <- coef.fit[5:6]

coef.fit <- fit.sum$coefficients[,1]

mu.fit <- coef.fit[1:2]

sigma.fit <- coef.fit[3:4]

C.fit <- coef.fit[5:6]

And now we can plot the fitted results against the original results:

#original
plot(y ~ x, data = gamma.data, type = "l", main = "This is Garbage") 
#overall fit
lines(y ~ x, data = dffit, col ="red", cex = 0.2) 
legend("topright", lty = c(1,1,1), col = c("black", "green", "blue","red"), c("Scan", "Monoclonal", "Gamma", "Sum"))
#components of the fit
for(i in 1:2){
  x <- dffit$x
  y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2))
  lines(x,y, col = i + 2)
}

#original

plot(y ~ x, data = gamma.data, type = "l", main = "This is Garbage")

#overall fit

lines(y ~ x, data = dffit, col ="red", cex = 0.2)

legend("topright", lty = c(1,1,1), col = c("black", "green", "blue","red"), c("Scan", "Monoclonal", "Gamma", "Sum"))

#components of the fit

for(i in 1:2){

x <- dffit$x

y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2))

lines(x,y, col = i + 2)

}

plot of chunk unnamed-chunk-22

And this is garbage. The green curve is supposed to be the monoclonal peak, the blue curve is supposed to be the $\gamma$ background, and the red curve is their sum, the overall fit. This is a horrible failure.

Subsequently, I tried fixing the locations of $\mu_1$ and $\mu_2$ but this also yielded similar nonsensical fitting. So, with a lot of messing around trying different functions like the lognormal distribution, the Bi-Gaussian distribution and the Exponentially Modified Gaussian distribution, and applying various arbitrary weighting functions, and simultaneously fitting the other regions of the SPEP, I concluded that nothing could predictably produce results that represented the clinical reality.

I thought maybe the challenge to obtain a reasonable fit related to the sloping baseline, so I though I would try to remove it. I will model the baseline in the most simplistic manner possible: as a sloped line.

Baseline Removal

I will arbitrarily define the tail of the $\gamma$ region to be those values having $y \leq 0.02$. Then I will connect the first (x,y) point from the $\gamma$ region and connect it to the tail.

gamma.tail <- filter(gamma.data, y <= 0.02) 
baseline.data <- rbind(gamma.data[1,],gamma.tail)
names(baseline.data) <- c("x","y")
baseline.fun <- approxfun(baseline.data)
plot(y~x, data = gamma.data, type = "l")
lines(baseline.data$x,baseline.fun(baseline.data$x), col = "blue")

gamma.tail <- filter(gamma.data, y <= 0.02)

baseline.data <- rbind(gamma.data[1,],gamma.tail)

names(baseline.data) <- c("x","y")

baseline.fun <- approxfun(baseline.data)

plot(y~x, data = gamma.data, type = "l")

lines(baseline.data$x,baseline.fun(baseline.data$x), col = "blue")

plot of chunk unnamed-chunk-24

Now we can define a new dataframe gamma.no.base that has the baseline removed:

gamma.no.base <- data.frame(x = gamma.data$x, y = gamma.data$y - baseline.fun(gamma.data$x))
plot(y~x, data = gamma.data, type = "l")
lines(y ~ x, data = gamma.no.base, lty = 2)
gamma.max <- findPeaks(gamma.no.base$y)[1:2] #rejects a number of extraneous peaks
abline(v = gamma.no.base$x[gamma.max])

gamma.no.base <- data.frame(x = gamma.data$x, y = gamma.data$y - baseline.fun(gamma.data$x))

plot(y~x, data = gamma.data, type = "l")

lines(y ~ x, data = gamma.no.base, lty = 2)

gamma.max <- findPeaks(gamma.no.base$y)[1:2] #rejects a number of extraneous peaks

abline(v = gamma.no.base$x[gamma.max])

plot of chunk unnamed-chunk-25

The black is the original $\gamma$ and the dashed has the baseline removed. This becomes and easy fit.

#Estimate the Ci
peak.heights <- gamma.no.base$y[gamma.max]
#Estimate the mu_i
gamma.mu <- gamma.no.base$x[gamma.max] #the same values as before
#Estimate the sigma_i from the FWHM
FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.no.base)
gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355

#Perform the fit
fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) +
                  C2*exp(-(x-mean2)**2/(2 * sigma2**2))),
           data = gamma.no.base,
           start = list(mean1 = gamma.mu[1],
                        mean2 = gamma.mu[2],
                        sigma1 = gamma.sigma[1],
                        sigma2 = gamma.sigma[2],
                        C1 = peak.heights[1],
                        C2 = peak.heights[2]),
           algorithm = "port")

#Plot the fit
dffit <- data.frame(x=seq(0, 1 , 0.001))
dffit$y <- predict(fit, newdata=dffit)
fit.sum <- summary(fit)
coef.fit <- fit.sum$coefficients[,1]
mu.fit <- coef.fit[1:2]
sigma.fit <- coef.fit[3:4]
C.fit <- coef.fit[5:6]

plot(y ~ x, data = gamma.no.base, type = "l")
legend("topright", lty = c(1,1,1), col = c("black", "green", "blue","red"), c("Scan", "Monoclonal", "Gamma", "Sum"))
lines(y ~ x, data = dffit, col ="red", cex = 0.2)
for(i in 1:2){
  x <- dffit$x
  y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2))
  lines(x,y, col = i + 2)
}

#Estimate the Ci

peak.heights <- gamma.no.base$y[gamma.max]

#Estimate the mu_i

gamma.mu <- gamma.no.base$x[gamma.max] #the same values as before

#Estimate the sigma_i from the FWHM

FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.no.base)

gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355

#Perform the fit

fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) +

C2*exp(-(x-mean2)**2/(2 * sigma2**2))),

data = gamma.no.base,

start = list(mean1 = gamma.mu[1],

mean2 = gamma.mu[2],

sigma1 = gamma.sigma[1],

sigma2 = gamma.sigma[2],

C1 = peak.heights[1],

C2 = peak.heights[2]),

algorithm = "port")

#Plot the fit

dffit <- data.frame(x=seq(0, 1 , 0.001))

dffit$y <- predict(fit, newdata=dffit)

fit.sum <- summary(fit)

coef.fit <- fit.sum$coefficients[,1]

mu.fit <- coef.fit[1:2]

sigma.fit <- coef.fit[3:4]

C.fit <- coef.fit[5:6]

plot(y ~ x, data = gamma.no.base, type = "l")

legend("topright", lty = c(1,1,1), col = c("black", "green", "blue","red"), c("Scan", "Monoclonal", "Gamma", "Sum"))

lines(y ~ x, data = dffit, col ="red", cex = 0.2)

for(i in 1:2){

x <- dffit$x

y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2))

lines(x,y, col = i + 2)

}

plot of chunk unnamed-chunk-26

Lo and behold…something that is not completely insane. The green is the monoclonal, the blue is the $\gamma$ background and the red is their sum, that is, the overall fit. A better fit could now we sought with weighting or with a more flexible distribution shape. In any case, the green peak is now easily determined. Since

\[\int_{-\infty}^{\infty} C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big)dx = \sqrt{2\pi}\sigma C_1\]

A.mono <- sqrt(2*pi)*sigma.fit[1]*C.fit[1] %>% unname() 
A.mono <- round(A.mono,3)
A.mono

A.mono <- sqrt(2*pi)*sigma.fit[1]*C.fit[1] %>% unname()

A.mono <- round(A.mono,3)

A.mono

## sigma1 
##  0.024

1 2	## sigma1 ## 0.024

So this peak is 2.4% of the total area. Now, of course, this assumes that nothing under the baseline is attributable to the monoclonal peak and all belongs to normal $\gamma$-globulins, which is very unlikely to be true. However, the drop and tangent skimming methods also make assumptions about how the area under the curve contributes to the monoclonal protein. The point is to try to do something that will produce consistent results that can be followed over time. Obviously, if you thought there were three peaks in the $\gamma$-region, you'd have to set up your model accordingly.

All about that Base(line)

There are obviously better ways to model the baseline because this approach of a linear baseline is not going to work in situations where, for example, there is a small monoclonal in fast $\gamma$ dwarfed by normal $\gamma$-globulins. That is, like this:

plot of chunk unnamed-chunk-28

Something curvilinear or piecewise continuous and flexible enough for more circumstances is generally required.

There is also no guarantee that baseline removal, whatever the approach, is going to be a good solution in other circumstances. Given the diversity of monoclonal peak locations, sizes and shapes, I suspect one would need a few different approaches for different circumstances.

Conclusions

The data in the PDFs generated by EP software are processed (probably with splining or similar) followed by the stair-stepping seen above. It would be better to work with raw data from the scanner.
- This is particularly important if you are using nls() because nls() does not play nice with data having no noise (“Do not use nls on artificial 'zero-residual' data”)
Integrating monoclonal peaks under the $\gamma$ baseline (or $\beta$) is unlikely to be a one-size-fits all approach and may require application of a number of strategies to get meaningful results.
- Basline removal might be helpful at times.
Peak integration will require human adjudication.
While most monoclonal peaks show little skewing, better fitting is likely to be obtained with distributions that afford some skewing.
MASSFIX may soon make this entire discussion irrelevant.

Parting Thought

On the matter of fitting

In bringing many sons and daughters to glory, it was fitting that God, for whom and through whom everything exists, should make the pioneer of their salvation perfect through what he suffered.

Heb 2:10

Compare Tube Types with R – Repeated Measures ANOVA

August 21, 2017February 23, 2019 dtholmes@mail.ubc.ca

Background

Sometimes we might want to compare three or four tube types for a particular analyte on a group of patients or we might want to see if a particular analyte is stable over time in aliqioted samples. In these experiments are essentially doing the multivariable analogue of the paired t-test. In the tube-type experiment, the factor that is differing between the (‘paired’) groups is the container: serum separator tubes (SST), EDTA plasma tubes, plasma separator tubes (PST) etc. In a stability experiment, the factor that is differing is storage duration.

Since this is a fairly common clinical lab experiment, I thought I would just jot down how this is accomplished in R – though I must confess I know just about $\lim_{x\to0}x$ about statistics. In any case, the statistical test is a repeated-measures ANOVA and this is one way to do it (there are many) including an approach to the post-hoc testing.

Some Fake Data to Work With

I’m going to make some fake data. I tried to dig up the data from an experiment I did as a resident but alas, I think the raw data died on an old laptop. But fake data will do for demonstration purposes. Let’s suppose we are looking at parathyroid hormone (PTH) in three different blood collection tubes: SST, EDTA and PST. For the sake of argument, let’s say that we collect samples from 20 patients simultaneously and we anlayze them all as per our usual process. This means that each patient has three samples of material that should be otherwise identical outside of the effects of the collection contained.

library(magrittr)
set.seed(100) #to force the same pseudo-random each time
#data in pmol/L
#induce some heteroscedastic error
SST <- runif(20,3,50)  
PST <- 1.03*SST + rnorm(20,0,0.1)*SST #set the data up to show no difference
EDTA <- 1.15*SST + rnorm(20,0,0.1)*SST  #set the data up to show a difference
tube.data <- data.frame(SST,PST,EDTA) %>% round(.,1)
tube.data <- data.frame(Subject = factor(1:20), tube.data)

library(magrittr)

set.seed(100) #to force the same pseudo-random each time

#data in pmol/L

#induce some heteroscedastic error

SST <- runif(20,3,50)

PST <- 1.03*SST + rnorm(20,0,0.1)*SST #set the data up to show no difference

EDTA <- 1.15*SST + rnorm(20,0,0.1)*SST #set the data up to show a difference

tube.data <- data.frame(SST,PST,EDTA) %>% round(.,1)

tube.data <- data.frame(Subject = factor(1:20), tube.data)

This is the way we usually express (and receive) data like this in an Excel spreadsheet:

Subject	SST	PST	EDTA
1	17.5	18.1	19.9
2	15.1	15.7	20.0
3	29.0	29.2	32.9
4	5.7	6.2	6.4
5	25.0	26.1	27.0
6	25.7	26.4	29.0
7	41.2	40.8	48.1
8	20.4	22.1	24.3
9	28.7	26.9	36.0
10	11.0	13.9	13.7
11	32.4	31.9	36.9
12	44.5	49.2	57.4
13	16.2	17.1	15.7
14	21.7	24.1	26.3
15	38.8	36.8	42.6
16	34.4	34.0	44.2
17	12.6	12.1	14.1
18	19.8	20.9	25.4
19	19.9	18.2	23.0
20	35.4	37.4	34.1

This Excel-ish way of storing the data is referred to as the “datawide” format for obvious reasons.

Gather the Grain

As it turns out this is not the way that we want to store data to do the statistical analyses of interest. What we want to do is have the tube type in a single column because this is the factor that is different within the subjects. We want to gather() or melt() the data (depending on your package of choice) to be like so:

library(tidyr)
tube.data.2 <- gather(tube.data, key = "Subject")
tube.data.2 %>% kable()

library(tidyr)

tube.data.2 <- gather(tube.data, key = "Subject")

tube.data.2 %>% kable()

Subject	Subject	value
1	SST	17.5
2	SST	15.1
3	SST	29.0
4	SST	5.7
5	SST	25.0
6	SST	25.7
7	SST	41.2
8	SST	20.4
9	SST	28.7
10	SST	11.0
11	SST	32.4
12	SST	44.5
13	SST	16.2
14	SST	21.7
15	SST	38.8
16	SST	34.4
17	SST	12.6
18	SST	19.8
19	SST	19.9
20	SST	35.4
1	PST	18.1
2	PST	15.7
3	PST	29.2
4	PST	6.2
5	PST	26.1
6	PST	26.4
7	PST	40.8
8	PST	22.1
9	PST	26.9
10	PST	13.9
11	PST	31.9
12	PST	49.2
13	PST	17.1
14	PST	24.1
15	PST	36.8
16	PST	34.0
17	PST	12.1
18	PST	20.9
19	PST	18.2
20	PST	37.4
1	EDTA	19.9
2	EDTA	20.0
3	EDTA	32.9
4	EDTA	6.4
5	EDTA	27.0
6	EDTA	29.0
7	EDTA	48.1
8	EDTA	24.3
9	EDTA	36.0
10	EDTA	13.7
11	EDTA	36.9
12	EDTA	57.4
13	EDTA	15.7
14	EDTA	26.3
15	EDTA	42.6
16	EDTA	44.2
17	EDTA	14.1
18	EDTA	25.4
19	EDTA	23.0
20	EDTA	34.1

Now we see that there is a column for tube-type and a column for the PTH results which we can name accordingly. You can see why this called the “datalong” format.

names(tube.data.2) <- c("Subject", "Tube.Type", "PTH")
tube.data.2$Tube.Type <- as.factor(tube.data.2$Tube.Type) #turns tube type into factor

names(tube.data.2) <- c("Subject", "Tube.Type", "PTH")

tube.data.2$Tube.Type <- as.factor(tube.data.2$Tube.Type) #turns tube type into factor

Visualize

Summarize the data:

summary(tube.data)

1 2	summary(tube.data)

##     Subject        SST             PST             EDTA      
##  1      : 1   Min.   : 5.70   Min.   : 6.20   Min.   : 6.40  
##  2      : 1   1st Qu.:17.18   1st Qu.:17.85   1st Qu.:19.98  
##  3      : 1   Median :23.35   Median :25.10   Median :26.65  
##  4      : 1   Mean   :24.75   Mean   :25.36   Mean   :28.85  
##  5      : 1   3rd Qu.:32.90   3rd Qu.:32.42   3rd Qu.:36.23  
##  6      : 1   Max.   :44.50   Max.   :49.20   Max.   :57.40  
##  (Other):14

## Subject SST PST EDTA

## 1 : 1 Min. : 5.70 Min. : 6.20 Min. : 6.40

## 2 : 1 1st Qu.:17.18 1st Qu.:17.85 1st Qu.:19.98

## 3 : 1 Median :23.35 Median :25.10 Median :26.65

## 4 : 1 Mean :24.75 Mean :25.36 Mean :28.85

## 5 : 1 3rd Qu.:32.90 3rd Qu.:32.42 3rd Qu.:36.23

## 6 : 1 Max. :44.50 Max. :49.20 Max. :57.40

## (Other):14

Let’s just have a quick look graphically:

library(mcr)
plot(mcreg(SST, EDTA,
           method.reg = "PaBa",
           mref.name = "SST",
           mtest.name = "EDTA"))

library(mcr)

plot(mcreg(SST, EDTA,

method.reg = "PaBa",

mref.name = "SST",

mtest.name = "EDTA"))

plot of chunk unnamed-chunk-6

plot(mcreg(SST, PST,
           method.reg = "PaBa",
           mref.name = "SST",
           mtest.name = "PST"))

plot(mcreg(SST, PST,

method.reg = "PaBa",

mref.name = "SST",

mtest.name = "PST"))

plot of chunk unnamed-chunk-6

And as a boxplot with the points overtop:

boxplot(PTH ~ Tube.Type,
        data = tube.data.2,
        col = c("purple", "lightgreen", "gold"))
stripchart(PTH ~ Tube.Type,
           vertical = TRUE,
           data = tube.data.2, 
           method = "jitter",
           add = TRUE,
           pch = 20,
           col = rgb(0,0,0,0.5))

boxplot(PTH ~ Tube.Type,

data = tube.data.2,

col = c("purple", "lightgreen", "gold"))

stripchart(PTH ~ Tube.Type,

vertical = TRUE,

data = tube.data.2,

method = "jitter",

add = TRUE,

pch = 20,

col = rgb(0,0,0,0.5))

plot of chunk unnamed-chunk-7

Separate the Wheat from the Chaff

Now we want to make comparisons to see if these are different. To accomplish this, we will use the aov() function. This requires us to have data formatted “datalong” as it is in the tube.data.2 dataframe.

fit <- aov(PTH ~ Tube.Type + Error(Subject/Tube.Type), data=tube.data.2)

1 2	fit <- aov(PTH ~ Tube.Type + Error(Subject/Tube.Type), data=tube.data.2)

If you are like me, this syntax is confusing. But it goes like this. PTH is a function of Tube.Type which is straight forward–hence the PTH ~ Tube.Type bit. The error term has the Subject in front of the / and the factor that is different within the subjects (Tube.Type) after the /. That’s my grade 2 explanation from reading this and this and this.

summary(fit)

1 2	summary(fit)

## 
## Error: Subject
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 19   7307   384.6               
## 
## Error: Subject:Tube.Type
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## Tube.Type  2  195.9   97.97   22.47 3.63e-07 ***
## Residuals 38  165.7    4.36                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Error: Subject

## Df Sum Sq Mean Sq F value Pr(>F)

## Residuals 19 7307 384.6

## Error: Subject:Tube.Type

## Df Sum Sq Mean Sq F value Pr(>F)

## Tube.Type 2 195.9 97.97 22.47 3.63e-07 ***

## Residuals 38 165.7 4.36

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This tells us that there is a difference between the groups but it does not specify where the difference is.

I can’t see the difference. Can you see the difference?

Sorry – I just had to make a pop-culture reference to this. We want to be specific about where the differences are without making a Type I error which might arise if we blindly charge ahead and do multiple paired t-tests. One easy way to accomplish this is to use the pairwise.t.test() function which does corrections for multiple comparisons. You can choose from a number of approaches for adjustment for pairwise comparison. This requires the “response vector” which is PTH and the “grouping factor” which is the tube type.

# choices for p.adjust.method are: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
pwt <- pairwise.t.test(tube.data.2$PTH, tube.data.2$Tube.Type, p.adj = "bonferroni", paired = TRUE)
pwt

# choices for p.adjust.method are: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")

pwt <- pairwise.t.test(tube.data.2$PTH, tube.data.2$Tube.Type, p.adj = "bonferroni", paired = TRUE)

pwt

## 
##  Pairwise comparisons using paired t tests 
## 
## data:  tube.data.2$PTH and tube.data.2$Tube.Type 
## 
##     EDTA    PST    
## PST 0.00083 -      
## SST 7.9e-05 0.35033
## 
## P value adjustment method: bonferroni

## Pairwise comparisons using paired t tests

## data: tube.data.2$PTH and tube.data.2$Tube.Type

## EDTA PST

## PST 0.00083 -

## SST 7.9e-05 0.35033

## P value adjustment method: bonferroni

This is pretty easy to understand. There are statistically significant differences found between the EDTA and PST (p = 0.00083) and the EDTA and PST (p = 0.00008) but none between SST and PST (p = 0.35033).

Conclusion

Non-statistician’s approach to tube-type comparisons, which is also applicable to analyte stability studies. This is a one-way repeated measures ANOVA with one within-subjects factor. There is a great deal more to say on the matter by people who know much more in the citations in the links provided above.

God probably uses datawide format

All the nations will be gathered before him, and he will separate the people one from another as a shepherd separates the sheep from the goats. He will put the sheep on his right and the goats on his left.

(Matt 25:32-33)

Parse an Online Table into an R Dataframe – Westgard’s Biological Variation Database

August 14, 2017August 14, 2017 dtholmes@mail.ubc.ca

Background

From time to time I have wanted to bring an online table into an R dataframe. While in principle, the data can be cut and paste into Excel, sometimes the table is very large and sometimes the columns get goofed up in the process. Fortunately, there are a number of R tools for accomplishing this. I am just going to show one approach using the rvest package. The rvest package also makes it possible to interact with forms on webpages to request specific material which can then be scraped. I think you will see the potential if you look here.

In our (simple) case, we will apply this process to Westgard's desirable assay specifications as shown on his website. The goal is to parse out the biological variation tables, get them into a dataframe and the write to csv or xlsx.

Reading in the Data

The first thing to do is to load the rvest and httr packages and define an html session with the html_session() function.

library(rvest)
library(httr)
wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("LabRtorian"))

library(rvest)

library(httr)

wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("LabRtorian"))

Now looking at the webpage, you can see that there are 8 columns in the tables of interest. So, we will define an empty dataframe with 8 columns.

#define empty table to hold all the content
biotable = data.frame(matrix(NA,0, 8))

#define empty table to hold all the content

biotable = data.frame(matrix(NA,0, 8))

We need to know which part of the document to scrape. This is a little obscure, but following the instructions in this post, we can determine that the xpaths we need are:

/html/body/div[1]/div[3]/div/main/article/div/table[1]

/html/body/div[1]/div[3]/div/main/article/div/table[2]

/html/body/div[1]/div[3]/div/main/article/div/table[3]

…

etc.

There are 8 such tables in the whole webpage. We can define a character vector for these as such:

xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

1 2	xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

Now we make a loop to scrape the 8 tables and with each iteration of the loop, append the scraped subtable to the main dataframe called biotable using the rbind() function. We have to use the parameter fill = TRUE in the html_table() function because the table does not happen to always a uniform number of columns.

for (j in 1:8){                
  subtable <- wg %>%
  read_html() %>%
  html_nodes(xpath =  xpaths[j]) %>%
  html_table(., fill = TRUE) 
  subtable <- subtable[[1]]
  biotable <- rbind(biotable,subtable)
}

for (j in 1:8){

subtable <- wg %>%

read_html() %>%

html_nodes(xpath = xpaths[j]) %>%

html_table(., fill = TRUE)

subtable <- subtable[[1]]

biotable <- rbind(biotable,subtable)

}

Clean Up

Now that we have the raw data out, we can have a quick look at it:

X1	X2	X3	X4	X5	X6	X7	X8
	Analyte	Number of Papers	Biological Variation	Biological Variation	Desirable specification	Desirable specification	Desirable specification
	Analyte	Number of Papers	CVI	CVg	I(%)	B(%)	TE(%)
S-	11-Desoxycortisol	2	21.3	31.5	10.7	9.5	27.1
S-	17-Hydroxyprogesterone	2	19.6	50.4	9.8	13.5	29.7
U-	4-hydroxy-3-methoximandelate (VMA)	1	22.2	47.0	11.1	13.0	31.3
S-	5' Nucleotidase	2	23.2	19.9	11.6	7.6	26.8
U-	5'-Hydroxyindolacetate, concentration	1	20.3	33.2	10.2	9.7	26.5
S-	α1-Acid Glycoprotein	3	11.3	24.9	5.7	6.8	16.2
S-	α1-Antichymotrypsin	1	13.5	18.3	6.8	5.7	16.8
S-	α1-Antitrypsin	3	5.9	16.3	3.0	4.3	9.2

We can see that we need define column names and we need to get rid of some rows containing extraneous column header information. There are actually 8 such sets of headers to remove.

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")
names(biotable) <- table.header

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")

names(biotable) <- table.header

Let's now find rows we don't want and remove them.

for.removal <- grep("Analyte", biotable$Analyte)
biotable <- biotable[-for.removal,]

for.removal <- grep("Analyte", biotable$Analyte)

biotable <- biotable[-for.removal,]

You will find that the table has missing data which is written as “- – -”. This should be now replaced by NA and the column names should be assigned to sequential integers. Also, we will remove all the minus signs after the specimen type. I'm not sure what they add.

biotable[biotable == "---"] <- NA
row.names(biotable) <- 1:nrow(biotable)
biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

biotable[biotable == "---"] <- NA

row.names(biotable) <- 1:nrow(biotable)

biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

Check it Out

Just having another look at the first 10 rows:

Sample	Analyte	NumPapers	CVI	CVG	I	B	TE
S	11-Desoxycortisol	2	21.3	31.5	10.7	9.5	27.1
S	17-Hydroxyprogesterone	2	19.6	50.4	9.8	13.5	29.7
U	4-hydroxy-3-methoximandelate (VMA)	1	22.2	47.0	11.1	13.0	31.3
S	5' Nucleotidase	2	23.2	19.9	11.6	7.6	26.8
U	5'-Hydroxyindolacetate, concentration	1	20.3	33.2	10.2	9.7	26.5
S	α1-Acid Glycoprotein	3	11.3	24.9	5.7	6.8	16.2
S	α1-Antichymotrypsin	1	13.5	18.3	6.8	5.7	16.8
S	α1-Antitrypsin	3	5.9	16.3	3.0	4.3	9.2
S	α1-Globulins	2	11.4	22.6	5.7	6.3	15.7
U	α1-Microglobulin, concentration, first morning	1	33.0	58.0	16.5	16.7	43.9

Now examining the structure:

str(biotable)

1 2	str(biotable)

## 'data.frame':    370 obs. of  8 variables:
##  $ Sample   : chr  "S" "S" "U" "S" ...
##  $ Analyte  : chr  "11-Desoxycortisol" "17-Hydroxyprogesterone" "4-hydroxy-3-methoximandelate (VMA)" "5' Nucleotidase" ...
##  $ NumPapers: chr  "2" "2" "1" "2" ...
##  $ CVI      : chr  "21.3" "19.6" "22.2" "23.2" ...
##  $ CVG      : chr  "31.5" "50.4" "47.0" "19.9" ...
##  $ I        : chr  "10.7" "9.8" "11.1" "11.6" ...
##  $ B        : chr  "9.5" "13.5" "13.0" "7.6" ...
##  $ TE       : chr  "27.1" "29.7" "31.3" "26.8" ...

## 'data.frame': 370 obs. of 8 variables:

## $ Sample : chr "S" "S" "U" "S" ...

## $ Analyte : chr "11-Desoxycortisol" "17-Hydroxyprogesterone" "4-hydroxy-3-methoximandelate (VMA)" "5' Nucleotidase" ...

## $ NumPapers: chr "2" "2" "1" "2" ...

## $ CVI : chr "21.3" "19.6" "22.2" "23.2" ...

## $ CVG : chr "31.5" "50.4" "47.0" "19.9" ...

## $ I : chr "10.7" "9.8" "11.1" "11.6" ...

## $ B : chr "9.5" "13.5" "13.0" "7.6" ...

## $ TE : chr "27.1" "29.7" "31.3" "26.8" ...

It's kind-of undesirable to have numbers as characters so…

#convert appropriate columns to numeric
biotable[,3:8] <- lapply(biotable[3:8], as.numeric)

#convert appropriate columns to numeric

biotable[,3:8] <- lapply(biotable[3:8], as.numeric)

Write the Data

Using the xlsx package, you can output the table to an Excel file in the current working directory.

library(xlsx)
write.xlsx(biotable,
            file = "Westgard_Biological_Variation.xlsx",
            row.names = FALSE)

library(xlsx)

write.xlsx(biotable,

file = "Westgard_Biological_Variation.xlsx",

row.names = FALSE)

If you are having trouble getting xlsx to install, then just write as csv.

write.csv(biotable,
            file = "Westgard_Biological_Variation.csv",
            row.names = FALSE)

write.csv(biotable,

file = "Westgard_Biological_Variation.csv",

row.names = FALSE)

Conclusion

You can now use the same general approach to parse any table you have web access to, no mater how small or big it is. Here is a complete script in one place:

library(httr)
library(rvest)
library(xlsx)

wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("yournamehere"))
xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

#define empty dataframe
biotable = data.frame(matrix(NA,0, 8))

#loop over the 8 html tables
for (j in 1:8){                
  subtable <- wg %>%
  read_html() %>%
  html_nodes(xpath =  xpaths[j] ) %>%
  html_table(., fill = TRUE) 
  subtable <- subtable[[1]]
  biotable <- rbind(biotable,subtable)
}

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")
names(biotable) <- table.header

#remove extraneous rows
for.removal <- grep("Analyte", biotable$Analyte)
biotable <- biotable[-for.removal,]

#make missing data into NA
biotable[ biotable == "---" ] <- NA
row.names(biotable) <- 1:nrow(biotable)

#convert appropriate columns to numeric
biotable[,3:8] <- lapply(biotable[3:8], as.numeric)

#get rid of minus signs in column 1
biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

write.xlsx(biotable,
            file = "Westgard_Biological_Variation.xlsx",
            row.names = FALSE)

write.csv(biotable,
            file = "Westgard_Biological_Variation.csv",
            row.names = FALSE)

library(httr)

library(rvest)

library(xlsx)

wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("yournamehere"))

xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

#define empty dataframe

biotable = data.frame(matrix(NA,0, 8))

#loop over the 8 html tables

for (j in 1:8){

subtable <- wg %>%

read_html() %>%

html_nodes(xpath = xpaths[j] ) %>%

html_table(., fill = TRUE)

subtable <- subtable[[1]]

biotable <- rbind(biotable,subtable)

}

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")

names(biotable) <- table.header

#remove extraneous rows

for.removal <- grep("Analyte", biotable$Analyte)

biotable <- biotable[-for.removal,]

#make missing data into NA

biotable[ biotable == "---" ] <- NA

row.names(biotable) <- 1:nrow(biotable)

#convert appropriate columns to numeric

biotable[,3:8] <- lapply(biotable[3:8], as.numeric)

#get rid of minus signs in column 1

biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

write.xlsx(biotable,

file = "Westgard_Biological_Variation.xlsx",

row.names = FALSE)

write.csv(biotable,

file = "Westgard_Biological_Variation.csv",

row.names = FALSE)

Parting Thought on Tables

You prepare a table before me in the presence of my enemies. You anoint my head with oil; my cup overflows.

(Psalm 23:5)

Determine the CV of a Calculated Lab Reportable – Bioavailable Testosterone

August 7, 2017August 7, 2017 dtholmes@mail.ubc.ca

Background

At the AACC meeting last week, some of my friends were bugging me that I had not made a blog post in 10 months. Without getting into it too much, let's just say I can blame Cerner. Thanks also to a prod from a friend, here is an approach to a fairly common problem.

We all report calculated quantities out of our laboratories–quantities such as LDL cholesterol, non-HDL cholesterol, aldosterone:renin ratio, free testosterone, eGFR etc. How does one determine the precision (i.e. imprecision) of a calculated quantity. While earlier in my life, I might go to the trouble of trying to do such calculations analytically using the rules of error propagation, in my later years, I am more pragmatic and I'm happy to use a computational approach.

In this example, we will model the precision in calculated bioavailable testosterone (CBAT). Without explanation, I provide an R function for CBAT (and free testosterone) where testosterone is reported in nmol/L, sex hormone binding globulin (SHBG) is reported in nmol/L, and albumin is reported in g/L. Using the Vermeulen Equation as discussed in this publication, you can calculate CBAT as follows:

cbat <- function(TT,SHBG,ALB = 43){
    Kalb <- 3.6*10^4
    Kshbg <- 10^9
    N <- 1 + Kalb*ALB/69000
    a <- N*Kshbg
    b <- N + Kshbg*(SHBG - TT)/10^9
    c <- -TT/10^9
    FT <- (-b + sqrt(b^2 - 4*a*c))/(2*a)*10^9
    cbat <- N*FT
    return(list(free.T = FT, cbat = cbat))
}

cbat <- function(TT,SHBG,ALB = 43){

Kalb <- 3.6*10^4

Kshbg <- 10^9

N <- 1 + Kalb*ALB/69000

a <- N*Kshbg

b <- N + Kshbg*(SHBG - TT)/10^9

c <- -TT/10^9

FT <- (-b + sqrt(b^2 - 4*a*c))/(2*a)*10^9

cbat <- N*FT

return(list(free.T = FT, cbat = cbat))

}

To sanity-check this, we can use this online calculator. Taking a typical male testosterone of 20 nmol/L, an SHBG of 50 nmol/L and an albumin of 43 g/L, we get the following:

cbat(20,50)

1 2	cbat(20,50)

## $free.T
## [1] 0.3273049
## 
## $cbat
## [1] 7.670319

## $free.T

## [1] 0.3273049

## $cbat

## [1] 7.670319

which is confirmed by the online calculator. Because the function is vectorized, we an submit a vector of testosterone results and SHBG results and get a vector of CBAT results.

cbat(c(10,20,30), c(40,50,60))

1 2	cbat(c(10,20,30), c(40,50,60))

## $free.T
## [1] 0.1738837 0.3273049 0.4661380
## 
## $cbat
## [1]  4.074926  7.670319 10.923842

## $free.T

## [1] 0.1738837 0.3273049 0.4661380

## $cbat

## [1] 4.074926 7.670319 10.923842

Precision of Components

We now need some precision data for the three components. However, in our lab, we just substitute 43 g/L for the albumin, so we will leave that term out of the analysis and limit our precision calculation to testosterone and SHBG. This will allow us to present the precision as surface plots as a function of total testosterone and SHBG.

We do testosterone by LC-MS/MS using Deborah French's method. In the last three months, the precision has been 3.9% at 0.78 nmol/L, 5.5% at 6.7 nmol/L, 5.2% at 18.0 nmol/L, and 6.0% at 28.2 nmol/L. We are using the Roche Cobas e601 SHBG method which, according to the package insert, has precision of 1.8% at 14.9 nmol/L, 2.1 % at 45.7 nmol/L, and 4.0% at 219 nmol/L.

cv.tt <- c(3.9, 5.5, 5.2, 6.0)
conc.tt <- c(0.78, 6.7, 18.0, 28.2)
tt.df <- data.frame(conc.tt,cv.tt)

plot(cv.tt ~ conc.tt, data = tt.df,
                    main = "Precision Profile of Testosterone",
                    xlab = "Testosterone (nmol/L)",
                    ylab = "CV Testosterone (%)",
                    ylim = c(0,8),
                    type = "o")

cv.tt <- c(3.9, 5.5, 5.2, 6.0)

conc.tt <- c(0.78, 6.7, 18.0, 28.2)

tt.df <- data.frame(conc.tt,cv.tt)

plot(cv.tt ~ conc.tt, data = tt.df,

main = "Precision Profile of Testosterone",

xlab = "Testosterone (nmol/L)",

ylab = "CV Testosterone (%)",

ylim = c(0,8),

type = "o")

plot of chunk unnamed-chunk-4

cv.shbg <- c(1.8, 2.1, 4.0)
conc.shbg <- c(14.9,45.7,219)
shbg.df <- data.frame(cv.shbg, conc.shbg)
plot(cv.shbg ~ conc.shbg, data = shbg.df,
                    main = "Precision Profile of SHBG",
                    xlab = "SHBG (nmol/L)",
                    ylab = "CV SHGB (%)",
                    ylim = c(0,5),
                    type = "o")

cv.shbg <- c(1.8, 2.1, 4.0)

conc.shbg <- c(14.9,45.7,219)

shbg.df <- data.frame(cv.shbg, conc.shbg)

plot(cv.shbg ~ conc.shbg, data = shbg.df,

main = "Precision Profile of SHBG",

xlab = "SHBG (nmol/L)",

ylab = "CV SHGB (%)",

ylim = c(0,5),

type = "o")

plot of chunk unnamed-chunk-4

Build Approximation Functions

We will want to generate linear interpolations of these precision profiles. Generally, we might watnt to use non-linear regression to do this but I will just linearly interpolate with the approxfun() function. This will allow us to just call a function to get the approximate CV at concentrations other than those for which we have data.

tt.fun <- approxfun(x = tt.df$conc.tt, y = tt.df$cv.tt)
shbg.fun <- approxfun(x = shbg.df$conc.shbg, y = shbg.df$cv.shbg)

tt.fun <- approxfun(x = tt.df$conc.tt, y = tt.df$cv.tt)

shbg.fun <- approxfun(x = shbg.df$conc.shbg, y = shbg.df$cv.shbg)

Now, if we want to know the precision of SHBG at, say, 100 nmol/L, we can just write,

shbg.fun(100)

1 2	shbg.fun(100)

## [1] 2.695326

1	## [1] 2.695326

to obtain our precision result.

Random Simulation

Now let's build a grid of SHBG and total testosterone (TT) values at which we will calculate the precision for CBAT.

shbg <- seq(from = 15, to = 200, by = 5)
tt <- seq(from = 1, to = 28, by = 1)

shbg <- seq(from = 15, to = 200, by = 5)

tt <- seq(from = 1, to = 28, by = 1)

At each point on the grid, we will have to generate, say, 100000 random TT values and 100000 random SHBG values with the appropriate precision and then calculate the expected precision of CBAT at those concentrations.

Let's do this for a single pair of concentrations by way of example modelling the random analytical error as Gaussian using the rnorm() function.

# [SHBG] = 15 nmol/L
# [TT] = 5.0 nmol/L
set.seed(100) #just to get consistent results
rng.tt <- rnorm(100000, mean = 5.0, sd = tt.fun(5.0)/100*5.0)
rng.shbg <- rnorm(100000, mean = 15, sd = shbg.fun(15)/100*15)
rng.cbat <- cbat(rng.tt, rng.shbg)
cv.cbat <- sd(rng.cbat$cbat)/mean(rng.cbat$cbat)*100
cv.cbat

# [SHBG] = 15 nmol/L

# [TT] = 5.0 nmol/L

set.seed(100) #just to get consistent results

rng.tt <- rnorm(100000, mean = 5.0, sd = tt.fun(5.0)/100*5.0)

rng.shbg <- rnorm(100000, mean = 15, sd = shbg.fun(15)/100*15)

rng.cbat <- cbat(rng.tt, rng.shbg)

cv.cbat <- sd(rng.cbat$cbat)/mean(rng.cbat$cbat)*100

cv.cbat

## [1] 5.30598

1	## [1] 5.30598

So, we can build the process of calculating the CV of CBAT into a function as follows:

cbat.cv <- function(TT, SHBG, N = 100000){
  rng.tt <- rnorm(N, mean = TT, sd = tt.fun(TT)/100*TT)
  rng.shbg <- rnorm(N, mean = SHBG, sd = shbg.fun(SHBG)/100*SHBG)
  rng.cbat <- cbat(rng.tt, rng.shbg)
  cv <- sd(rng.cbat$cbat)/mean(rng.cbat$cbat)*100
  return(cv)
}

cbat.cv <- function(TT, SHBG, N = 100000){

rng.tt <- rnorm(N, mean = TT, sd = tt.fun(TT)/100*TT)

rng.shbg <- rnorm(N, mean = SHBG, sd = shbg.fun(SHBG)/100*SHBG)

rng.cbat <- cbat(rng.tt, rng.shbg)

cv <- sd(rng.cbat$cbat)/mean(rng.cbat$cbat)*100

return(cv)

}

Now, we can make a matrix of the data for presenting a plot, calculating the CV and appending it to the dataframe.

cv.grid <- expand.grid(tt, shbg)
names(cv.grid) <- c("tt", "shbg")
cv.grid$cv.cbat <- mapply(cbat.cv, cv.grid$tt, cv.grid$shbg)

cv.grid <- expand.grid(tt, shbg)

names(cv.grid) <- c("tt", "shbg")

cv.grid$cv.cbat <- mapply(cbat.cv, cv.grid$tt, cv.grid$shbg)

Now make plot using the wireframe() function.

library(lattice)
wireframe(cv.cbat ~ tt*shbg, data = cv.grid,
          xlab = "Testo \n (nmol/L)",
          ylab = "SHBG \n (nmol/L)",
          zlab = "CV \n (%)",
          drape = TRUE,
          colorkey = TRUE,
          col.regions = colorRampPalette(c("blue", "red", "yellow"))(100),
          scales = list(arrows=FALSE,cex=.5,tick.number = 10)
          )

library(lattice)

wireframe(cv.cbat ~ tt*shbg, data = cv.grid,

xlab = "Testo \n (nmol/L)",

ylab = "SHBG \n (nmol/L)",

zlab = "CV \n (%)",

drape = TRUE,

colorkey = TRUE,

col.regions = colorRampPalette(c("blue", "red", "yellow"))(100),

scales = list(arrows=FALSE,cex=.5,tick.number = 10)

)

plot of chunk unnamed-chunk-11

This shows us that the CV of CBAT ranges from about 4–8% over the TT and SHBG ranges we have looked at.

Conclusion

We have determined the CV of calculated bioavailable testosterone using random number simulations using empirical CV data and produced a surface plot of CV. This allows us to comment on the CV of this lab reportable as a function of the two variables by which it is determined.

Parting Thought on Monte Carlo Simulations

The die is cast into the lap, but its every decision is from the LORD.

(Prov 16:33)

Count The Mondays in a Time Interval with Lubridate

November 25, 2015November 25, 2015 dtholmes@mail.ubc.ca

Recently, while working on quantifying the inpatient workload volume of routine tests as a function of the day of the week, I needed to be able to count the number of Mondays, Tuesdays, etc in a time–interval so I could calculate the average volume for each weekday in a time–interval.

The lubridate package makes this a very easy thing to do. Suppose the first date in your series is 21-May-2015 and the last date is 19-Aug-2015.

library(lubridate)
startDate <- dmy("21-May-2015")
endDate <- dmy("19-Aug-2015")

library(lubridate)

startDate <- dmy("21-May-2015")

endDate <- dmy("19-Aug-2015")

Now build a sequence between the dates:

myDates <-seq(from = startDate, to = endDate, by = "days")
head(myDates)

myDates <-seq(from = startDate, to = endDate, by = "days")

head(myDates)

## [1] "2015-05-21 UTC" "2015-05-22 UTC" "2015-05-23 UTC" "2015-05-24 UTC"
## [5] "2015-05-25 UTC" "2015-05-26 UTC"

1 2	## [1] "2015-05-21 UTC" "2015-05-22 UTC" "2015-05-23 UTC" "2015-05-24 UTC" ## [5] "2015-05-25 UTC" "2015-05-26 UTC"

The function wday() tells you which weekday a date corresponds to with Sunday being 1, Monday being 2 etc.

wday(startDate)

1 2	wday(startDate)

## [1] 5

## [1] 5

This means that 2015-05-21 was a Thursday. To get the abbreviation, you can enter:

wday(startDate, label = TRUE)

1 2	wday(startDate, label = TRUE)

## [1] Thurs
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

1 2	## [1] Thurs ## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

and to get the full name of the day:

wday(startDate, label = TRUE, abbr = FALSE)

1 2	wday(startDate, label = TRUE, abbr = FALSE)

## [1] Thursday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

1 2	## [1] Thursday ## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Leap years are accounted for:

wday(dmy("29-Feb-1504"), label = TRUE, abbr = FALSE)

1 2	wday(dmy("29-Feb-1504"), label = TRUE, abbr = FALSE)

## [1] Monday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

1 2	## [1] Monday ## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

So, we can use this as follows to find the Mondays:

which(wday(myDates)==2)

1 2	which(wday(myDates)==2)

##  [1]  5 12 19 26 33 40 47 54 61 68 75 82 89

1	## [1] 5 12 19 26 33 40 47 54 61 68 75 82 89

So the whole code to count them is:

startDate <- dmy("21-May-2015")
endDate <- dmy("19-Aug-2015")
myDates <-seq(from = startDate, to = endDate, by = "days")
length(which(wday(myDates)==2))

startDate <- dmy("21-May-2015")

endDate <- dmy("19-Aug-2015")

myDates <-seq(from = startDate, to = endDate, by = "days")

length(which(wday(myDates)==2))

## [1] 13

## [1] 13

I was born on August 04, 1971. This was a Wednesday. How many Wednesdays since I was born?

startDate <- dmy("04-Aug-1971")
endDate <- dmy("25-Nov-2015")
myDates <-seq(from = startDate, to = endDate, by = "days")
length(which(wday(myDates, label = TRUE)=="Wed"))

startDate <- dmy("04-Aug-1971")

endDate <- dmy("25-Nov-2015")

myDates <-seq(from = startDate, to = endDate, by = "days")

length(which(wday(myDates, label = TRUE)=="Wed"))

## [1] 2313

1	## [1] 2313

Which means, today I am 2312 weeks old today! Hurray. This is not a typo. The time interval is flanked by Wednesdays so there is one more Wednesday than the number of weeks in the interval. I thank my first–year calculus prof for beating this into me with reference to Simpson's Rule numerical integration.

Hope that comes in handy.

-Dan

Teach us to number our days, that we may gain a heart of wisdom. Psalm 90:12.

NA NA NA NA, Hey Hey Hey, Goodbye

September 5, 2015September 7, 2015 dtholmes@mail.ubc.ca

Removing NA’s from a Data Frame in R

The Problem

Suppose you are doing a method comparison for which some results are above or below the linear range of your assay(s). Generally, these will appear in your spreadsheet (gasp!) program as $< x$ or $> y$ or, in the case of our mass spectrometer, “No Peak”. When you read these data into R using read.csv(), R will turn then into factors, which I personally find super–annoying and which inspired this conference badge (see bottom right) as I learned from University of British Columbia prof Jenny Bryan.

For this reason, when we read the data in, it is convenient to choose the option stringsAsFactors = FALSE. In doing so, the data will be treated as strings and be in the character class. But for regression comparison purposes, we need to make the data numeric and all of the $< x$ and $> y$ results will be converted to NA. In this post, we want to address a few questions that follow:

How do we find all the NA results?
How can we replace them with a numeric (like 0)?
How can we rid ourselves of rows containing NA?

Finding NA's

Let's read in the data which comes from a method comparison of serum aldosterone between our laboratory and Russ Grant's laboratory (LabCorp) published here. I'll read in the data with stringsAsFactors = FALSE. These are aldosterone results in pmol/L. To convert to ng/dL, divide by 27.7.

myData<-read.csv("Comparison.csv", sep=",", stringsAsFactors = FALSE)
str(myData)

1 2	myData<-read.csv("Comparison.csv", sep=",", stringsAsFactors = FALSE) str(myData)

## 'data.frame':    96 obs. of  3 variables:
##  $ Sample.Num: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Aldo.Us   : chr  "462.3" "433.2" "37.7" "137.7" ...
##  $ Aldo.Them : num  457.2 418.1 42.1 133.9 27.4 ...

## 'data.frame': 96 obs. of 3 variables:

## $ Sample.Num: int 1 2 3 4 5 6 7 8 9 10 ...

## $ Aldo.Us : chr "462.3" "433.2" "37.7" "137.7" ...

## $ Aldo.Them : num 457.2 418.1 42.1 133.9 27.4 ...

head(myData)

1	head(myData)

##   Sample.Num Aldo.Us Aldo.Them
## 1          1   462.3     457.2
## 2          2   433.2     418.1
## 3          3    37.7      42.1
## 4          4   137.7     133.9
## 5          5    29.4      27.4
## 6          6   552.1     639.7

## Sample.Num Aldo.Us Aldo.Them

## 1 1 462.3 457.2

## 2 2 433.2 418.1

## 3 3 37.7 42.1

## 4 4 137.7 133.9

## 5 5 29.4 27.4

## 6 6 552.1 639.7

You can see the problem immediately, our data (“Aldo.Us”) is a character vector. This is not good for regression. Why did this happen? We can find out:

myData$Aldo.Us

1	myData$Aldo.Us

##  [1] "462.3"   "433.2"   "37.7"    "137.7"   "29.4"    "552.1"   "41.6"   
##  [8] "158.7"   "1198"    "478.4"   "160.7"   "167.9"   "211.6"   "493.3"  
## [15] "195.6"   "649.8"   "644"     "534.1"   "212.7"   "413.3"   "150.7"  
## [22] "451.2"   "25.8"    "118.8"   "496.1"   "486.1"   "846.8"   "139.9"  
## [29] "No Peak" "98.3"    "113.8"   "230.7"   "530.2"   "26.6"    "390.3"  
## [36] "782.8"   "886.7"   "83.4"    "44"      "71.2"    "657"     "321.6"  
## [43] "188.6"   "451.2"   "485.3"   "No Peak" "144.9"   "249.6"   "682"    
## [50] "601.9"   "330.5"   "216.6"   "500.3"   "20.5"    "271.5"   "196.7"  
## [57] "309.4"   "235.7"   "171.7"   "124.9"   "293.6"   "345.4"   "243.5"  
## [64] "75.1"    "508.3"   "442.4"   "531.3"   "317.4"   "647.9"   "562"    
## [71] "366.5"   "37.1"    "231.6"   "73.7"    "526.3"   "No Peak" "165.6"  
## [78] "105.8"   "77.8"    "211.6"   "125.8"   "76.5"    "58.2"    "111.9"  
## [85] "238.5"   "31.6"    "156.8"   "191.7"   "402.5"   "108.9"   "183.7"  
## [92] "314.4"   "90"      "98.9"    "144.9"   "971.4"

## [1] "462.3" "433.2" "37.7" "137.7" "29.4" "552.1" "41.6"

## [8] "158.7" "1198" "478.4" "160.7" "167.9" "211.6" "493.3"

## [15] "195.6" "649.8" "644" "534.1" "212.7" "413.3" "150.7"

## [22] "451.2" "25.8" "118.8" "496.1" "486.1" "846.8" "139.9"

## [29] "No Peak" "98.3" "113.8" "230.7" "530.2" "26.6" "390.3"

## [36] "782.8" "886.7" "83.4" "44" "71.2" "657" "321.6"

## [43] "188.6" "451.2" "485.3" "No Peak" "144.9" "249.6" "682"

## [50] "601.9" "330.5" "216.6" "500.3" "20.5" "271.5" "196.7"

## [57] "309.4" "235.7" "171.7" "124.9" "293.6" "345.4" "243.5"

## [64] "75.1" "508.3" "442.4" "531.3" "317.4" "647.9" "562"

## [71] "366.5" "37.1" "231.6" "73.7" "526.3" "No Peak" "165.6"

## [78] "105.8" "77.8" "211.6" "125.8" "76.5" "58.2" "111.9"

## [85] "238.5" "31.6" "156.8" "191.7" "402.5" "108.9" "183.7"

## [92] "314.4" "90" "98.9" "144.9" "971.4"

Ahhh…it's the dreaded “No Peak”. This is what the mass spectrometer has put in its data file. So, let's force everything to numeric:

myData$Aldo.Us <- as.numeric(myData$Aldo.Us)

1	myData$Aldo.Us <- as.numeric(myData$Aldo.Us)

## Warning: NAs introduced by coercion

1	## Warning: NAs introduced by coercion

We see the warnings about the introduction of NAs. And we get:

myData$Aldo.Us

1	myData$Aldo.Us

##  [1]  462.3  433.2   37.7  137.7   29.4  552.1   41.6  158.7 1198.0  478.4
## [11]  160.7  167.9  211.6  493.3  195.6  649.8  644.0  534.1  212.7  413.3
## [21]  150.7  451.2   25.8  118.8  496.1  486.1  846.8  139.9     NA   98.3
## [31]  113.8  230.7  530.2   26.6  390.3  782.8  886.7   83.4   44.0   71.2
## [41]  657.0  321.6  188.6  451.2  485.3     NA  144.9  249.6  682.0  601.9
## [51]  330.5  216.6  500.3   20.5  271.5  196.7  309.4  235.7  171.7  124.9
## [61]  293.6  345.4  243.5   75.1  508.3  442.4  531.3  317.4  647.9  562.0
## [71]  366.5   37.1  231.6   73.7  526.3     NA  165.6  105.8   77.8  211.6
## [81]  125.8   76.5   58.2  111.9  238.5   31.6  156.8  191.7  402.5  108.9
## [91]  183.7  314.4   90.0   98.9  144.9  971.4

## [1] 462.3 433.2 37.7 137.7 29.4 552.1 41.6 158.7 1198.0 478.4

## [11] 160.7 167.9 211.6 493.3 195.6 649.8 644.0 534.1 212.7 413.3

## [21] 150.7 451.2 25.8 118.8 496.1 486.1 846.8 139.9 NA 98.3

## [31] 113.8 230.7 530.2 26.6 390.3 782.8 886.7 83.4 44.0 71.2

## [41] 657.0 321.6 188.6 451.2 485.3 NA 144.9 249.6 682.0 601.9

## [51] 330.5 216.6 500.3 20.5 271.5 196.7 309.4 235.7 171.7 124.9

## [61] 293.6 345.4 243.5 75.1 508.3 442.4 531.3 317.4 647.9 562.0

## [71] 366.5 37.1 231.6 73.7 526.3 NA 165.6 105.8 77.8 211.6

## [81] 125.8 76.5 58.2 111.9 238.5 31.6 156.8 191.7 402.5 108.9

## [91] 183.7 314.4 90.0 98.9 144.9 971.4

summary(myData$Aldo.Us)

1	summary(myData$Aldo.Us)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    20.5   118.8   230.7   305.5   478.4  1198.0       3

1 2	## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 20.5 118.8 230.7 305.5 478.4 1198.0 3

Now we have 3 NAs. We want to find them and get rid of them. From the screen we could figure out where the NAs were and manually replace them. This is OK on such a small data set but when you start dealing with data sets having thousands or millions of rows, approaches like this are impractical. So, let's do it right.

If we naively try to use an equality we find out nothing.

which(myData$Aldo.Us==NA)

1	which(myData$Aldo.Us==NA)

## integer(0)

1	## integer(0)

Hunh? Whasgoinon?

This occurs because NA means “unknown”. Think about it this way. If one patient's result is NA and another patient's result is NA, then are the results equal? No, they are not (necessarily) equal, they are both unknown and so the comparison should be unknown also. This is why we do not get a result of TRUE when we ask the following question:

NA==NA

NA==NA

## [1] NA

## [1] NA

So, when we ask R if unknown #1 is equal to unknown #2, it responds with “I dunno.”, or “NA”. So if we want to find the NAs, we should inquire as follows:

is.na(myData$Aldo.Us)

1	is.na(myData$Aldo.Us)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [23] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE

## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [45] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE

## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

or, for less verbose output:

which(is.na(myData$Aldo.Us))

1	which(is.na(myData$Aldo.Us))

## [1] 29 46 76

1	## [1] 29 46 76

Hey Hey! Ho Ho! Those NAs have got to go!

Now we know where they are, in rows 29, 46, and 76. We can replace them with 0, which is OK but may pose problems if we use weighted regression (i.e. if we have a 0 in the x-data and we weight data by 1/x). Alternatively, we can delete the rows entirely.

To replace them with 0, we can write:

myData$Aldo.Us[which(is.na(myData$Aldo.Us))] <- 0

1	myData$Aldo.Us[which(is.na(myData$Aldo.Us))] <- 0

and this is equivalent:

myData$Aldo.Us[is.na(myData$Aldo.Us)] <- 0

1	myData$Aldo.Us[is.na(myData$Aldo.Us)] <- 0

To remove the whole corresponding row, we can write:

myDataBeGoneNA <- myData[-which(is.na(myData$Aldo.Us)),]

1	myDataBeGoneNA <- myData[-which(is.na(myData$Aldo.Us)),]

or:

myDataBeGoneNA <- myData[!is.na(myData$Aldo.Us),]

1	myDataBeGoneNA <- myData[!is.na(myData$Aldo.Us),]

Complete Cases

What if there were NA's hiding all over the place in multiple columns and we wanted to banish any row containing one or more NA? In this case, the complete.cases() function is one way to go:

complete.cases(myData)

1	complete.cases(myData)

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [34]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [45]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [56]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [67]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [78]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [89]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

## [23] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE

## [34] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

## [45] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

## [56] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

## [67] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE

## [78] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

## [89] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

This function shows us which rows have no NAs (the ones with TRUE as the result) and which rows have NAs (the three with FALSE). We can banish all rows containing any NAs generally as follows:

myDataBeGoneNA <- myData[complete.cases(myData),]

1	myDataBeGoneNA <- myData[complete.cases(myData),]

This data set now has 93 rows:

nrow(myDataBeGoneNA)

1	nrow(myDataBeGoneNA)

## [1] 93

## [1] 93

You could peruse the excluded data like this:

myData[!complete.cases(myData),]

1	myData[!complete.cases(myData),]

##    Sample.Num Aldo.Us Aldo.Them
## 29         29      NA       6.6
## 46         46      NA       7.0
## 76         76      NA       5.7

## Sample.Num Aldo.Us Aldo.Them

## 29 29 NA 6.6

## 46 46 NA 7.0

## 76 76 NA 5.7

na.omit()

Another way to remove incomplete cases is the na.omit() function (as Dr. Shannon Haymond pointed out to me). So this works too:

myDataBeGoneNA <- na.omit(myData)

1	myDataBeGoneNA <- na.omit(myData)

Row Numbers are Actually Names

In all of these approaches, you will notice something peculiar. Even though we have excluded the three rows, the row numbering still appears to imply that there are 96 rows:

tail(myDataBeGoneNA)

1	tail(myDataBeGoneNA)

##    Sample.Num Aldo.Us Aldo.Them
## 91         91   183.7     170.4
## 92         92   314.4     307.6
## 93         93    90.0     214.0
## 94         94    98.9      75.1
## 95         95   144.9     129.3
## 96         96   971.4     807.7

## Sample.Num Aldo.Us Aldo.Them

## 91 91 183.7 170.4

## 92 92 314.4 307.6

## 93 93 90.0 214.0

## 94 94 98.9 75.1

## 95 95 144.9 129.3

## 96 96 971.4 807.7

but if you check the dimensions, there are 93 rows:

nrow(myDataBeGoneNA)

1	nrow(myDataBeGoneNA)

## [1] 93

## [1] 93

Why? This is because the row numbers are not row numbers; they are numerical row names. When you exclude a row, none of the other row names change. This was bewildering to me in the beginning. I thought my exclusions had failed somehow.

Now we can move on

Once this is done, you can go on and do your regression, which, in this case, looks like this.

Comparison of Serum Aldosterone

Finally, if you are ever wondering what fraction of your data is comprised of NA, rather than the absolute number, you can do this as follows:

mean(is.na(myData$Aldo.Us))

1	mean(is.na(myData$Aldo.Us))

## [1] 0.03125

1	## [1] 0.03125

If you applied this to the whole dataframe, you get the fraction of NA's in the whole dataframe (again–thank you Shannon):

mean(is.na(myData))

1	mean(is.na(myData))

## [1] 0.01041667

1	## [1] 0.01041667

Final Thought:

is.na(newunderthesun)

1	is.na(newunderthesun)

## [1] TRUE

1	## [1] TRUE

Ecclesiastes 1:9.

-Dan

Unit Converter

September 2, 2015September 5, 2015 Stephen Master

Introduction

Dan continues to crank out book chapter-length posts, which probably means that I should jump in before getting further behind…so here we go.

In the next few posts, I’d like to cover some work to help you to process aggregated proficiency testing (PT) data. Interpreting PT data from groups such as the College of American Pathologists (CAP) is, of course, a fundamental task for lab management. Comparing your lab’s results to peer group data from other users of the same instrumentation helps to ensure that your patients receive consistent results, and it provides at least a crude measure to ensure that your instrument performance is “in the ballpark”. Of course, many assays show significant differences between instrument models and manufacturers that can lead to results that are not comparable as a patient moves from institution to institution (or when your own lab changes instruments!). There are a number of standardization and harmonization initiatives underway (see http://harmonization.net, for example) to address this, and understanding which assays show significant bias compared to benchmark studies or national guidelines is a critical task for laboratorians. All of this is further complicated by the fact that sample matrix can significantly affect assay results, and sample commutability is one important reason why we can’t just take, say, CAP PT survey results (not counting the accuracy-based surveys) and determine which assays aren’t harmonized.

However.

With all of those caveats, it can still be useful to look through PT data in a systematic way to compare instruments. Ideally, we’d like to have everything in an R-friendly format that would allow us to ask systematic questions about data (things like “for how many assays does instrument X differ from instrument Y by >30% using PT material?”, or “how many PT materials give good concordance across all manufacturers?”). If we have good, commutable, accuracy-based testing materials, we can do even better. The first task is all of this fun, however, is getting the data into a format that R is happy with; no one I know likes the idea of retyping numbers from paper reports. I’m hoping to talk more about this in a future post, as there are lots of fun R text processing issues lurking here. In the mean time, though, we have a much more modest preliminary task to tackle.

Simple unit conversion

I’m currently staring at a CAP PT booklet. It happens to be D-dimer, but you can pick your own favorite analyte (and PT provider, for that matter). Some of the results are in ng/mL, some are ug/mL, and one is in mg/L. Let’s create an R function that allows us to convert between sets of comparable units. Now, although I know that Dan is in love with SI units (#murica), we’ll start by simply converting molar→molar and gravimetric→gravimetric. Yes, we can add fancy analyte-by-analyte conversion tables in the future…but right now we just want to get things on the same scale. In the process, we’ll cover three useful R command families.

First of all, we should probably decide how we want the final function to look. I’m thinking of something like this:

results <- labunit.convert(2.3, "mg/dL", "g/L")
results

## [1] 0.023

…which converts 2.3 mg/dL to 0.023 g/L. We should also give ourselves bonus points if we can make it work with vectors. For example, we may have this data frame:

mydata

##   Value   Units Target.Units
## 1  2.30    g/dL         mg/L
## 2 47.00 nmol/mL      mmol/dL
## 3  0.19    IU/L        mIU/L

and we would like to be able to use our function like this:

labunit.convert(mydata$Value, mydata$Units, mydata$Target.Units)

## [1] 2.3e+04 4.7e-03 1.9e+02

We should also handle things that are simpler

labunit.convert(0.23, "g", "mg")

## [1] 230

Getting started

Now that we know where we’re going, let’s start by writing a function that just converts between two units and returns the log difference. We’ll call this function convert.one.unit(), and it will take two arguments:

convert.one.unit("mg", "ng")

## [1] 6

Basically, we want to take a character variable (like, say, “dL”) and break it into two pieces: the metric prefix (“d”) and the base unit (“L”). If it isn’t something we recognize, the function should quit and complain (you could also make it return ‘NA’ and just give a warning instead, but we’ll hold off on that for now). We’ll start with a list of things that we want to recognize.

convert.one.unit <- function (unitin, unitout) {
  metric.prefixes <- c("y", "z", "a", "f", "p", "n", "u", "m", "c", "d", "", "da", "h", "k", "M", "G", "T", "P", "E", "Z", "Y")
  metric.logmultipliers <- c(-24, -21, -18, -15, -12, -9, -6, -3, -2, -1, 0, 1, 2, 3, 6, 9, 12, 15, 18, 21, 24)
  units.for.lab <- c("mol", "g", "L", "U", "IU")

Notice that the metric.prefixes variable contains the appropriate one- or two-character prefixes, and metric.logmultipliers has the corresponding log multiplier (for example, metric.prefixes[8] = “m”, and metric.logmultipliers[8] is -3). It’s also worth noting the "" (metric.prefixes[11]), which corresponds to a log multiplier of 0. The fact that "" is a zero-length string instead of a null means that we can search for it in a vector…which will be very handy!

And now for some regular expressions

This is the point where we tackle the first of the three command families that I told you about. If you’re not familiar with “regular expressions” in R or another language (Perl, Python, whatever), this is your entry point into some very useful text searching capabilities. Basically, a regular expression is a way of specifying a search for a matching text pattern, and it’s used with a number of R commands (grep(), grepl(), gsub(), regexpr(), regexec(), etc.). We’ll use gsub() as an example, since it’s one that many people are familiar with. Suppose that I have the character string “This is not a test”, and I want to change it to “This is a test”. I can feed gsub() a pattern that I want to recognize and some text that I want to use to replace the pattern. For example:

my.string <- "This is not a test"
my.altered.string <- gsub("not a ", "", my.string)   # replace "not a " with an empty string, ""
my.altered.string

## [1] "This is test"

That’s fine as far as it goes, but we will drive ourselves crazy if we’re limited to explicit matches. What if, for example, we also to also recognize “This is not…a test”, or “This is not my kind of a test”? We could write three different gsub statements, but that would get old fairly quickly. Instead of exactly matching the text, we’ll use a pattern. A regular expression that will match all three of our input statements is "not.+a ", so we can do the following:

gsub("not.+a ", "", "This is not a test")

## [1] "This is test"

gsub("not.+a ", "", "This is not my kind of a test")

## [1] "This is test"

You can read the regular expression "not.+a " as “match the letters ‘not’ followed by a group of one or more characters (denoted by the special symbol ‘.’) followed by an ‘a’”. You can find some very nice tutorials on regular expressions through Google, but for the purposes of this brief lesson I’ll give you a mini-cheat sheet that probably handles 90% of the regular expressions that I have to write:

Special Character	Meaning
.	match any character
\d	match any digit
\D	match anything that isn’t a digit
\s	match white space
\S	match anything that isn’t white space
\t	match a tab (less important in R, since you usually already have things in a data frame)
^	match the beginning of the string (i.e. “^Bob” matches “Bob my uncle” but not “Uncle Bob”)
$	match the end of the string
*	match the previous thing when it occurs 0 or more times
+	match the previous thing when it occurs 1 or more times
?	match the previous thing when it occurs 0 or 1 times
( .. )	(parentheses) enclose a group of choices or a particular substring in the match
\|	match this OR that (e.g. “(Bob\|Pete)” matches “Dr. Bob Smith” or “Dr. Pete Jones” but not “Dr. Sam Jones”

It’s also important to remember for things like "\d" that R uses backslashes as the escape character…so you actually have to write a double backslash, like this: "\\d". A regular expression to match one or more digits would be "\\d+".

OK, back to work. Our next step is to remove all white space from the unit text (we want "dL" to be handled the same way as " dL" or "dL "), so we’ll add the following lines:

  unitin <- gsub("\\s", "", unitin)
  unitout <- gsub("\\s", "", unitout)

See what we’ve done? We asked gsub() to replace every instance of white space (the regular expression is "\\s") with "". Easy.

Paste, briefly

Next, we want to put together a regular expression that will detect any of our metric.prefixes or units.for.lab. To save typing, we’ll do it with paste(), the second of our three R command families for the day. You probably already know about paste(), but if not, it’s basically the way to join R character variables into one big string. paste("Hi", "there") gives “Hi there” (paste() defaults to joining things with a space), paste("Super", "cali", "fragi", "listic", sep="") changes the separator to "" and gives us “Supercalifragilistic”. paste0() does the same thing as paste(..., sep=""). The little nuance that it’s worth noting today is that we are going to join together elements from a single vector rather than a bunch of separate variables…so we need to use the collapse = "..." option, where we set collapse to whatever character we want. You remember from the last section that | (OR) lets us put a bunch of alternative matches into our regular expression, so we will join all of the prefixes like this:

  prefix.combo <- paste0(metric.prefixes, collapse = "|")
  prefix.combo

## [1] "y|z|a|f|p|n|u|m|c|d||da|h|k|M|G|T|P|E|Z|Y"

What we’re really after is a regular expression that matches the beginning of the string, followed by 0 or 1 matches to one of the prefixes, followed by a match to one of the units. Soooo…

  prefix.combo <- paste0(metric.prefixes, collapse = "|")
  unit.combo <- paste0(units.for.lab, collapse = "|")
  
  unit.search <- paste0("^(", prefix.combo, ")?(", unit.combo, ")$")

  unit.search

## [1] "^(y|z|a|f|p|n|u|m|c|d||da|h|k|M|G|T|P|E|Z|Y)?(mol|g|L|U|IU)$"

So much nicer than trying to type that by hand. Next we’ll do actual pattern matching using the regexec() command. regexec(), as the documentation so nicely states, returns a list of vectors of substring matches. This is useful, since it means that we’ll get one match for the prefix (in the first set of parentheses of our regular expression), and one match for the units (in the second set of parentheses of our regular expression). I don’t want to belabor the details of this, but if we feed the output of regexec() to the regmatches() command, we can pull out one string for our prefix and another for our units. Since these are returned as a list, we’ll also use unlist() to coerce our results into one nice vector. If the length of that vector is 0, indicating no match, an error is generated.

  match.unit.in <- unlist(regmatches(unitin, regexec(unit.search, unitin)))
  match.unit.out <- unlist(regmatches(unitout, regexec(unit.search, unitout)))
  
  if (length(match.unit.in) == 0) stop(paste0("Can't parse input units (", unitin, ")"))
  if (length(match.unit.out) == 0) stop(paste0("Can't parse output units (", unitout, ")"))

If we were to take a closer look look at match.unit.in, we would see that the first entry is the full match, the second entry is the prefix match, and the third entry is the unit match. To make sure that the units agree (i.e. that we’re not trying to convert grams into liters or something similar), we use:

  if (match.unit.in[3] != match.unit.out[3]) stop("Base units don't match")

…and then finish by using the match() command to find the index in the metric.prefixes vector corresponding to the correct prefix (note that if there’s no prefix matched, it matches the "" entry of the vector–very handy). That index allows us to pull out the corresponding log multiplier, and we then return the difference to get a conversion factor. Our final function looks like this1:

convert.one.unit <- function (unitin, unitout) {
  # the prefix codes for the metric system
  metric.prefixes <- c("y", "z", "a", "f", "p", "n", "u", "m", "c", "d", "", "da", "h", "k", "M", "G", "T", "P", "E", "Z", "Y")
  # ...and their corresponding log multipliers
  metric.logmultipliers <- c(-24, -21, -18, -15, -12, -9, -6, -3, -2, -1, 0, 1, 2, 3, 6, 9, 12, 15, 18, 21, 24)
  # The units that we'd like to detect.  I guess we could add distance, but that's not too relevant to most of the analytes that I can think of
  units.for.lab <- c("mol", "g", "L", "U", "IU")

  # remove white space
  unitin <- gsub("\\s", "", unitin)
  unitout <- gsub("\\s", "", unitout)
  
  # build the pieces of our regular expression...
  prefix.combo <- paste0(metric.prefixes, collapse = "|")
  unit.combo <- paste0(units.for.lab, collapse = "|")

  # ...and stitch it all together
  unit.search <- paste0("^(", prefix.combo, ")?(", unit.combo, ")$")

  # identify the matches
  match.unit.in <- unlist(regmatches(unitin, regexec(unit.search, unitin)))
  match.unit.out <- unlist(regmatches(unitout, regexec(unit.search, unitout)))
  
  if (length(match.unit.in) == 0) stop(paste0("Can't parse input units (", unitin, ")"))
  if (length(match.unit.out) == 0) stop(paste0("Can't parse output units (", unitout, ")"))
  
  if (match.unit.in[3] != match.unit.out[3]) stop("Base units don't match")
  
  # get the appropriate log multipliers
  logmult.in <- metric.logmultipliers[match(match.unit.in[2], metric.prefixes)]
  logmult.out <- metric.logmultipliers[match(match.unit.out[2], metric.prefixes)]
  
  # return the appropriate (log) conversion factor
  return(logmult.in - logmult.out)
}


# Try it out
convert.one.unit("mL","L")

## [1] -3

‘Apply’-ing yourself

We’re actually most of the way there now. The final family of commands that we’d like to use is apply(), with various flavors that allow you to repeatedly apply (no surprise) a function to many entries of a variable. Dan mentioned this in his last post. He also mentioned not understanding the bad press that for loops get when they’re small. I completely agree with him, but the issue tends to arise when you’re used to a language like C (yes, I know we’re talking about compiled vs. interpreted in that case), where your loops are blazingly fast. You come to R and try nested loops that run from 1:10000, and then you have to go for coffee. lapply(), mapply(), mapply(), apply(), etc. have advantages in the R world. Might as well go with the flow on this one.

We’re going to make a convert.multiple.units() function that takes unitsin and unitsout vectors, binds them together as two columns, and then runs apply() to feed them to convert.one.unit(). Because apply() lets us interate a function over either dimension of a matrix, we can bind the two columns (a vector of original units and a vector of target units) and then iterate over each pair by rows (that’s what the 1 means as the second argument of apply(): it applies the function by row). If the anonymous function syntax throws you off…let us know in the comments, and we’ll cover it some time. For now, just understand that the last part of the line feeds values to the convert.one.unit()function.

convert.multiple.units <- function (unitsin, unitsout) {
  apply(cbind(unitsin, unitsout), 1, function (x) {convert.one.unit(x[1], x[2])})
}

Finally, we’ll go back to our original labunit.convert() function. Our overall plan is to split each unit by recognizing the “/” character using strsplit(). This returns a list of vectors of split groups (i.e. “mg/dL” becomes the a list where the first element is a character vector (“mg”, “dl”)). We then make sure that the lengths match (i.e. if the input is “mg/dL” and the output if “g/mL” that’s OK, but if the output is “g” then that’s a problem), obtain all the multipliers, and then add them all up. We add because they’re logs…and actually we mostly subtract, because we’re dividing. For cuteness points, we return 2*x[1] - sum(x), which will accurately calculate not only conversions like mg→g and mg/dL→g/L, but will even do crazy stuff like U/g/L→mU/kg/dL. Don’t ask me why you’d want to do that, but it works. The final multiplier is used to convert the vector of values (good for you if you notice that we didn’t check to make sure that the length of the values vector matched the unitsin vector…but we can always recycle our values that way).

labunit.convert <- function (values, unitsin, unitsout) {
  insep <- strsplit(unitsin, "/")
  outsep <- strsplit(unitsout, "/")

  lengthsin <- sapply(insep, length)
  lengthsout <- sapply(outsep, length)
  
  if (!all(lengthsin == lengthsout)) stop("Input and output units can't be converted")

  multipliers <- mapply(convert.multiple.units, insep, outsep)
  
  final.multiplier <- apply(t(multipliers), 1, function (x) {2*x[1] - sum(x)})
  
  return(values * 10^final.multiplier)
}

OK, enough. Back over to you, Dan. We now have a piece of code that we can use when we start comparing PT data from different instruments. That’s the immediate plan for future posts2, and before long there may even be an entry with nice graphics like those of my Canadian colleague.

-SRM

I received a request to convert “G/L” to “M/mL”, which was interpreted as converting billions/L to millions/mL. This requires changing our convert.one.unit() function to handle a “no units” case. Actually, it’s not as difficult as it sounds; if we just add an empty string (i.e. "") to the end of the units.for.lab vector, our regular expression does the right thing. Your edited line would read units.for.lab <- c("mol", "g", "L", "U", "IU", ""). The reason this works, incidentally, is that there’s no overlap (except "") between the prefixes and the units, so the pattern match doesn’t have a chance to be confused.↩
Following Dan’s lead, I should point out a major caveat to any such plans is James 4:13-15. Double extra credit if you are interested enough to look it up.↩

A Closer Look at TAT Time Dependence

August 28, 2015September 2, 2015 dtholmes@mail.ubc.ca

The Problem

We want to have a closer look at the time–dependence of turn around times (TATs). In particular, we would like to see if there is a significant trend in TAT over time (improvement or deterioration) and we would like the data to inform us of slowdowns and potentially unexpected problems that occur throughout each week. This should allow us to identify areas of the pre-analytical and/or analytical process phlebotomists that require attention.

My interest in this topic (which in past seemed entirely banal) came from the frustration of receiving monthly TAT reports showing spaghetti plots produced in Excel. In examining these figures is was entirely unclear to me whether any observed changes in the median (the only measure of central tendency provided) represented stochastic behaviour or a real problem. Utlimately, we want to be able to identify real problems in the preanalytical and analytical process but to do this, we need to visualize the data in a more sophisticated manner.

To do this, we are going to look at order–to–file times for a whole year for a nameless test X. You should be able to modify this approach to the manner in which your data is provided to you.

The real data was a little dirty but I have pre–cleaned it—this will have to be the topic of another post. In short, I purged the cancelled tests, removed duplicate records and limited my analysis to stat tests based on a stat flag that is stored in the laboratory information system (LIS). I won’t discuss this process here. The buffed–up file is named “2014_and_All_Clean.txt”. This happens to be a tab–delimited txt file. For this reason, I used read.delim() rather than read.csv(). These are basically the same function with different defaults for the seperator–one uses a comma and the other uses a tab. Please see our first post on TAT to understand how we are using the lubridate function ymd_hm().

Loading the Data

library(lubridate)
myData <- read.delim(file = "2014_and_All_clean.txt")
myData$ordered <- ymd_hm(myData$ordered)
myData$collected <- ymd_hm(myData$collected)
myData$received <- ymd_hm(myData$received)
myData$resulted <- ymd_hm(myData$resulted)
#confirm success
head(myData)

library(lubridate)

myData <- read.delim(file = "2014_and_All_clean.txt")

myData$ordered <- ymd_hm(myData$ordered)

myData$collected <- ymd_hm(myData$collected)

myData$received <- ymd_hm(myData$received)

myData$resulted <- ymd_hm(myData$resulted)

#confirm success

head(myData)

##               ordered           collected            received
## 1 2014-01-01 17:53:00 2014-01-01 17:54:00 2014-01-01 18:08:00
## 2 2014-01-01 15:10:00 2014-01-01 15:19:00 2014-01-01 15:21:00
## 3 2014-01-01 17:07:00 2014-01-01 17:15:00 2014-01-01 17:17:00
## 4 2014-01-01 18:20:00 2014-01-01 18:30:00 2014-01-01 18:35:00
## 5 2014-01-01 11:19:00 2014-01-01 11:25:00 2014-01-01 11:29:00
## 6 2014-01-01 11:00:00 2014-01-01 11:08:00 2014-01-01 11:11:00
##              resulted
## 1 2014-01-01 18:45:00
## 2 2014-01-01 16:33:00
## 3 2014-01-01 18:09:00
## 4 2014-01-01 19:45:00
## 5 2014-01-01 13:33:00
## 6 2014-01-01 11:47:00

## ordered collected received

## 1 2014-01-01 17:53:00 2014-01-01 17:54:00 2014-01-01 18:08:00

## 2 2014-01-01 15:10:00 2014-01-01 15:19:00 2014-01-01 15:21:00

## 3 2014-01-01 17:07:00 2014-01-01 17:15:00 2014-01-01 17:17:00

## 4 2014-01-01 18:20:00 2014-01-01 18:30:00 2014-01-01 18:35:00

## 5 2014-01-01 11:19:00 2014-01-01 11:25:00 2014-01-01 11:29:00

## 6 2014-01-01 11:00:00 2014-01-01 11:08:00 2014-01-01 11:11:00

## resulted

## 1 2014-01-01 18:45:00

## 2 2014-01-01 16:33:00

## 3 2014-01-01 18:09:00

## 4 2014-01-01 19:45:00

## 5 2014-01-01 13:33:00

## 6 2014-01-01 11:47:00

Now we want to look at a TAT. As in our first post on this topic, we will look at the order–to–file time.

otf <- difftime(myData$resulted, myData$ordered,units = "min")
myData <- cbind(myData,otf)

1 2	otf <- difftime(myData$resulted, myData$ordered,units = "min") myData <- cbind(myData,otf)

Sanity Check

Let’s just have a quick look at this to make sure nothing crazy is happening.

hist(as.numeric(myData$otf),xlim = c(0,200),breaks = 150, col = "orange", xlab = "TAT for X (min)", main = "Histogram of TAT for X")

1	hist(as.numeric(myData$otf),xlim = c(0,200),breaks = 150, col = "orange", xlab = "TAT for X (min)", main = "Histogram of TAT for X")

summary(as.numeric(myData$otf))

1	summary(as.numeric(myData$otf))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   51.00   62.00   70.85   77.00 1656.00

1 2	## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 3.00 51.00 62.00 70.85 77.00 1656.00

Some Nutty Stuff

We do note one thing—there is a sample with a TAT of 1656 min. This is a little crazy so we could investigate those samples to see if this is real (because of a lost sample) or an artifact of an add–on analysis being misidentified as a stat or some other similar nonsensical event.

If you wanted to list all of these extreme outliers for the year, you could do so like this:

library("dplyr")

1	library("dplyr")

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Attaching package: 'dplyr'

## The following objects are masked from 'package:lubridate':

## intersect, setdiff, union

## The following objects are masked from 'package:stats':

## filter, lag

## The following objects are masked from 'package:base':

## intersect, setdiff, setequal, union

head(arrange(myData,desc(otf)),10)

1	head(arrange(myData,desc(otf)),10)

##                ordered           collected            received
## 1  2014-09-21 21:50:00 2014-09-21 21:58:00 2014-09-21 22:04:00
## 2  2014-09-07 14:59:00 2014-09-08 12:00:00 2014-09-08 12:04:00
## 3  2014-03-18 13:45:00 2014-03-18 14:10:00 2014-03-18 14:29:00
## 4  2014-04-01 01:48:00 2014-04-01 02:03:00 2014-04-01 02:06:00
## 5  2014-04-01 02:21:00 2014-04-01 02:28:00 2014-04-01 02:31:00
## 6  2014-09-04 21:35:00 2014-09-04 21:45:00 2014-09-04 22:08:00
## 7  2014-10-08 10:38:00 2014-10-08 10:42:00 2014-10-08 10:46:00
## 8  2014-05-25 13:23:00 2014-05-25 13:35:00 2014-05-25 13:45:00
## 9  2014-08-06 16:50:00 2014-08-06 17:09:00 2014-08-06 17:15:00
## 10 2014-08-02 11:52:00 2014-08-02 21:30:00 2014-08-02 21:44:00
##               resulted       otf
## 1  2014-09-23 01:26:00 1656 mins
## 2  2014-09-08 13:26:00 1347 mins
## 3  2014-03-19 11:17:00 1292 mins
## 4  2014-04-01 17:29:00  941 mins
## 5  2014-04-01 16:20:00  839 mins
## 6  2014-09-05 11:07:00  812 mins
## 7  2014-10-08 22:41:00  723 mins
## 8  2014-05-26 01:20:00  717 mins
## 9  2014-08-07 03:46:00  656 mins
## 10 2014-08-02 22:44:00  652 mins

## ordered collected received

## 1 2014-09-21 21:50:00 2014-09-21 21:58:00 2014-09-21 22:04:00

## 2 2014-09-07 14:59:00 2014-09-08 12:00:00 2014-09-08 12:04:00

## 3 2014-03-18 13:45:00 2014-03-18 14:10:00 2014-03-18 14:29:00

## 4 2014-04-01 01:48:00 2014-04-01 02:03:00 2014-04-01 02:06:00

## 5 2014-04-01 02:21:00 2014-04-01 02:28:00 2014-04-01 02:31:00

## 6 2014-09-04 21:35:00 2014-09-04 21:45:00 2014-09-04 22:08:00

## 7 2014-10-08 10:38:00 2014-10-08 10:42:00 2014-10-08 10:46:00

## 8 2014-05-25 13:23:00 2014-05-25 13:35:00 2014-05-25 13:45:00

## 9 2014-08-06 16:50:00 2014-08-06 17:09:00 2014-08-06 17:15:00

## 10 2014-08-02 11:52:00 2014-08-02 21:30:00 2014-08-02 21:44:00

## resulted otf

## 1 2014-09-23 01:26:00 1656 mins

## 2 2014-09-08 13:26:00 1347 mins

## 3 2014-03-19 11:17:00 1292 mins

## 4 2014-04-01 17:29:00 941 mins

## 5 2014-04-01 16:20:00 839 mins

## 6 2014-09-05 11:07:00 812 mins

## 7 2014-10-08 22:41:00 723 mins

## 8 2014-05-26 01:20:00 717 mins

## 9 2014-08-07 03:46:00 656 mins

## 10 2014-08-02 22:44:00 652 mins

which gives you the TAT of the 10 (or whatever number you prefer) worst specimens for the year. Obviously when you do this kind of analysis on your own data, you will retain the specimen ID in the data set and you could explore what is going on here–whether these are add–ons etc. You discover interesting things when you dig into your data.

Time Dependence

But we are interested in time-dependence of the TAT, so let’s look at a scatterplot of the whole year.

start <- ceiling_date(min(myData$collected))
finish <- start+days(356)
plot(myData$collected,myData$otf, pch = 19, xlim = c(start,finish),col = "#00000020",cex = 0.5, ylim = c(0,200), ylab = "TAT of X (min)", xlab = "Date")

start <- ceiling_date(min(myData$collected))

finish <- start+days(356)

plot(myData$collected,myData$otf, pch = 19, xlim = c(start,finish),col = "#00000020",cex = 0.5, ylim = c(0,200), ylab = "TAT of X (min)", xlab = "Date")

So, that’s pretty hard to draw inferences from. We can see that there are some outliers with inconceivably low TAT. We will have to investigate what is going on with those collections but not right now. These outliers will not affect the non–parametric measures of central tendency.

Tunnelling Down

Let’s have a look at one week.

finish <- start + days(7)
ticks <- seq(from = start, to = finish, by = "days")
plot(myData$collected,myData$otf, pch = 19, xlim = c(start,finish), col = "#00000020", cex = 0.5, ylim = c(0,200), ylab = "TAT of X (min)", xlab = "", xaxt = "n")
axis.POSIXct(side = 1, ticks, at = ticks, las = 2, cex.axis = 0.6, col.axis = "gray30", format = "%b %d %Y")
mtext("Date of analysis", side = 1, line = 4)

finish <- start + days(7)

ticks <- seq(from = start, to = finish, by = "days")

plot(myData$collected,myData$otf, pch = 19, xlim = c(start,finish), col = "#00000020", cex = 0.5, ylim = c(0,200), ylab = "TAT of X (min)", xlab = "", xaxt = "n")

axis.POSIXct(side = 1, ticks, at = ticks, las = 2, cex.axis = 0.6, col.axis = "gray30", format = "%b %d %Y")

mtext("Date of analysis", side = 1, line = 4)

See the first post on this topic for more information about the plotting parameters.

We can see there is a definite (and unsurprising) periodicity in the number of tests per hour. We can look at “volumes” another time. What we want to do now is look for time–dependence in the TAT so we can ultimately investigate what days of the week and times of the day are worse. But we don’t want to do this for one week—we want to do this for all weeks in the year. It would be nice, for example to plot all the Sundays, Mondays, Tuesdays etc overlapping and then see if we can see day–of–week and time–of–day trends.

Some More Lubridate Magic

Therefore, we need to assign every point in our myData dataframe a day of the week. The lubridate function wday() does this for us.

start

start

## [1] "2014-01-01 02:39:00 UTC"

1	## [1] "2014-01-01 02:39:00 UTC"

wday(start, label = TRUE)

1	wday(start, label = TRUE)

## [1] Wed
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

1 2	## [1] Wed ## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

#if you want them as numbers, just leave the label option out.
wday(start)

1 2	#if you want them as numbers, just leave the label option out. wday(start)

## [1] 4

## [1] 4

So, January 1, 2014 was a Wednesday, which is the 4th day of the week. Let’s assign the day of the week for all our days and then bind this to our data.

weekday <- wday(myData$collected)
myData <- cbind(myData,weekday)
head(myData)

weekday <- wday(myData$collected)

myData <- cbind(myData,weekday)

head(myData)

##               ordered           collected            received
## 1 2014-01-01 17:53:00 2014-01-01 17:54:00 2014-01-01 18:08:00
## 2 2014-01-01 15:10:00 2014-01-01 15:19:00 2014-01-01 15:21:00
## 3 2014-01-01 17:07:00 2014-01-01 17:15:00 2014-01-01 17:17:00
## 4 2014-01-01 18:20:00 2014-01-01 18:30:00 2014-01-01 18:35:00
## 5 2014-01-01 11:19:00 2014-01-01 11:25:00 2014-01-01 11:29:00
## 6 2014-01-01 11:00:00 2014-01-01 11:08:00 2014-01-01 11:11:00
##              resulted      otf weekday
## 1 2014-01-01 18:45:00  52 mins       4
## 2 2014-01-01 16:33:00  83 mins       4
## 3 2014-01-01 18:09:00  62 mins       4
## 4 2014-01-01 19:45:00  85 mins       4
## 5 2014-01-01 13:33:00 134 mins       4
## 6 2014-01-01 11:47:00  47 mins       4

## ordered collected received

## 1 2014-01-01 17:53:00 2014-01-01 17:54:00 2014-01-01 18:08:00

## 2 2014-01-01 15:10:00 2014-01-01 15:19:00 2014-01-01 15:21:00

## 3 2014-01-01 17:07:00 2014-01-01 17:15:00 2014-01-01 17:17:00

## 4 2014-01-01 18:20:00 2014-01-01 18:30:00 2014-01-01 18:35:00

## 5 2014-01-01 11:19:00 2014-01-01 11:25:00 2014-01-01 11:29:00

## 6 2014-01-01 11:00:00 2014-01-01 11:08:00 2014-01-01 11:11:00

## resulted otf weekday

## 1 2014-01-01 18:45:00 52 mins 4

## 2 2014-01-01 16:33:00 83 mins 4

## 3 2014-01-01 18:09:00 62 mins 4

## 4 2014-01-01 19:45:00 85 mins 4

## 5 2014-01-01 13:33:00 134 mins 4

## 6 2014-01-01 11:47:00 47 mins 4

Now let’s plot all of the Monday–data for the whole year and look at the with–day–trends for Mondays. We are going to convert all of the TATs and times at which they are collected to decimal numbers so we don’t run into any hassles. (Yes, I ran into hassles when I did not do this.)

This little function accomplishes this for us:

#convert the time to decimal number of hours since midnight for simplicity in plotting
timeconvert = function(t){hour(t)+minute(t)/60}

1 2	#convert the time to decimal number of hours since midnight for simplicity in plotting timeconvert = function(t){hour(t)+minute(t)/60}

mondayData <- subset(myData,weekday == 2)
mondayTimes <- timeconvert(mondayData$collected)
mondayData <- cbind(mondayData,times = mondayTimes)
mondayData$otf <- as.numeric(mondayData$otf)
head(mondayData)

mondayData <- subset(myData,weekday == 2)

mondayTimes <- timeconvert(mondayData$collected)

mondayData <- cbind(mondayData,times = mondayTimes)

mondayData$otf <- as.numeric(mondayData$otf)

head(mondayData)

##                 ordered           collected            received
## 225 2014-01-06 23:35:00 2014-01-06 23:30:00 2014-01-07 00:09:00
## 226 2014-01-06 11:02:00 2014-01-06 11:10:00 2014-01-06 11:14:00
## 227 2014-01-06 14:08:00 2014-01-06 14:20:00 2014-01-06 14:24:00
## 228 2014-01-06 10:06:00 2014-01-06 10:16:00 2014-01-06 10:19:00
## 229 2014-01-06 14:42:00 2014-01-06 15:09:00 2014-01-06 15:16:00
## 230 2014-01-06 20:02:00 2014-01-06 20:17:00 2014-01-06 20:22:00
##                resulted otf weekday    times
## 225 2014-01-07 00:46:00  71       2 23.50000
## 226 2014-01-06 12:42:00 100       2 11.16667
## 227 2014-01-06 16:02:00 114       2 14.33333
## 228 2014-01-06 11:44:00  98       2 10.26667
## 229 2014-01-06 16:04:00  82       2 15.15000
## 230 2014-01-06 20:55:00  53       2 20.28333

## ordered collected received

## 225 2014-01-06 23:35:00 2014-01-06 23:30:00 2014-01-07 00:09:00

## 226 2014-01-06 11:02:00 2014-01-06 11:10:00 2014-01-06 11:14:00

## 227 2014-01-06 14:08:00 2014-01-06 14:20:00 2014-01-06 14:24:00

## 228 2014-01-06 10:06:00 2014-01-06 10:16:00 2014-01-06 10:19:00

## 229 2014-01-06 14:42:00 2014-01-06 15:09:00 2014-01-06 15:16:00

## 230 2014-01-06 20:02:00 2014-01-06 20:17:00 2014-01-06 20:22:00

## resulted otf weekday times

## 225 2014-01-07 00:46:00 71 2 23.50000

## 226 2014-01-06 12:42:00 100 2 11.16667

## 227 2014-01-06 16:02:00 114 2 14.33333

## 228 2014-01-06 11:44:00 98 2 10.26667

## 229 2014-01-06 16:04:00 82 2 15.15000

## 230 2014-01-06 20:55:00 53 2 20.28333

So this seems to have worked and now we can make a scatter plot.

Monday Monday, So Good to Me?

plot(mondayData$times,mondayData$otf, pch = 19, col = "#00000020", cex = 0.5, ylim = c(0,200), xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")
axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

1 2	plot(mondayData$times,mondayData$otf, pch = 19, col = "#00000020", cex = 0.5, ylim = c(0,200), xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)") axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

But now for the interesting part. We want to see how the median TAT is related to the time of day. We might want to look at, say, the running median over one–hour window all day long. Notice that I have made the times, t, go from 0.5 to 23.5 because these are the only times for which a 60 min moving median can be calculated. Otherwise we’d have this really annoying situation where we’d have to fetch data from the last half-hour of Sunday and the first half-hour of Tuesday. I don’t need that level of perfectionism at present.

plot(mondayData$times,mondayData$otf, pch = 19, col = "#00000020",cex = 0.5, ylim = c(30,100),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")
axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

#60 min moving median calculation:
#Create a point at every minute of the day
t <- seq(from = 0.5,to = 23.5, by = 1/60)
#create and empty vector to store the data
med.tats <- vector()

for(i in 1:length(t)){
  #get all points within an hour of the minute
  tats <- subset(mondayData,(mondayData$times >= (t[i]-0.5)) & (mondayData$times < (t[i]+0.5)))
  #alternatively using filter() from the dplyr package
  #tats <- filter(mondayData, times >= (t[i] - 0.5) & times < (t[i] + 0.5))
  med.tats[i] <- median(tats$otf)
}

lines(t,med.tats, col = "blue")
grid(NA,NULL)

plot(mondayData$times,mondayData$otf, pch = 19, col = "#00000020",cex = 0.5, ylim = c(30,100),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")

axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

#60 min moving median calculation:

#Create a point at every minute of the day

t <- seq(from = 0.5,to = 23.5, by = 1/60)

#create and empty vector to store the data

med.tats <- vector()

for(i in 1:length(t)){

#get all points within an hour of the minute

tats <- subset(mondayData,(mondayData$times >= (t[i]-0.5)) & (mondayData$times < (t[i]+0.5)))

#alternatively using filter() from the dplyr package

#tats <- filter(mondayData, times >= (t[i] - 0.5) & times < (t[i] + 0.5))

med.tats[i] <- median(tats$otf)

}

lines(t,med.tats, col = "blue")

grid(NA,NULL)

Removing the For–ness

Many R folks don’t like for–loops and would rather use the apply() family of functions. I’m not sure I always understand contempt towards loops for small simple tasks but if you wanted to accomplish the same looping task without using a for–loop, you could do as follows:

t <- seq(from = 0.5, to = 23.5, by = 1/60)
movingmed<-function(t,x){
  tats <- filter(x, times >= (t - 0.5) & times < (t + 0.5))
  median(tats$otf)
}
med.tats <- sapply(t, movingmed, x=mondayData)

t <- seq(from = 0.5, to = 23.5, by = 1/60)

movingmed<-function(t,x){

tats <- filter(x, times >= (t - 0.5) & times < (t + 0.5))

median(tats$otf)

}

med.tats <- sapply(t, movingmed, x=mondayData)

Smoothing

This approach is reasonable but the problem (I have found) is that it is computationally expensive on large data sets. For this reason, it is nice to use a canned smoothing algorithm like LOWESS which is much faster. The parameter f of the lowess function has a default of 2/3 which in our case results in a fit that is way–too smoothed. I played around with f until I got something that more or less tracked with the 60–min moving median. There are many approaches to smoothing–don’t get lost in the vortex.

Lowess Smoothing

plot(mondayData$times,mondayData$otf, pch = 19, col = "#00000020",cex = 0.5, ylim = c(30,100),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")
axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")
#running median
#create a point at every minute of the day
t <- seq(from = 0.5,to = 23.5, by = 1/60)
med.tats <- vector()
for(i in 1:length(t)){
  #get all points within an hour of the minute
  tats <- subset(mondayData,(mondayData$times >= (t[i] - 0.5)) & (mondayData$times < (t[i] + 0.5)))
  #alternatively using filter() from the dplyr package
  #tats <- filter(mondayData, times >= (t[i] - 0.5) & times < (t[i] + 0.5))
  med.tats[i] <- median(tats$otf)
}
lines(t,med.tats, col = "blue")
grid(col = "black")
mondayFit <- lowess(mondayData$times,mondayData$otf,f = 0.05)
lines(mondayFit,col = "red")

plot(mondayData$times,mondayData$otf, pch = 19, col = "#00000020",cex = 0.5, ylim = c(30,100),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")

axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

#running median

#create a point at every minute of the day

t <- seq(from = 0.5,to = 23.5, by = 1/60)

med.tats <- vector()

for(i in 1:length(t)){

#get all points within an hour of the minute

tats <- subset(mondayData,(mondayData$times >= (t[i] - 0.5)) & (mondayData$times < (t[i] + 0.5)))

#alternatively using filter() from the dplyr package

#tats <- filter(mondayData, times >= (t[i] - 0.5) & times < (t[i] + 0.5))

med.tats[i] <- median(tats$otf)

}

lines(t,med.tats, col = "blue")

grid(col = "black")

mondayFit <- lowess(mondayData$times,mondayData$otf,f = 0.05)

lines(mondayFit,col = "red")

So, that’s cool. Now lets loop over all the days of the week and make plots for each day.

#create a 2x4 plot window
par(mfrow = c(2,4))
#loop over days
for (i in 1:7){
  #make the lowess fit
  mydayData <- subset(myData,weekday == i)
  mydayTimes <- timeconvert(mydayData$collected)
  mydayData <- cbind(mydayData,times = mydayTimes)
  mydayFit <- lowess(mydayData$times,mydayData$otf,f = 0.05)
  plot(NA,NA,ylim = c(30,100), xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")
  axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")
  
  #running median
  #create a point at every minute of the day
  t <- seq(from = 0.5,to = 23.5, by = 1/60)
  med.tats <- vector()
  for(j in 1:length(t)){
    #get all points within an hour of the minute
    tats <- subset(mydayData,(mydayData$times >= (t[j] - 0.5)) & (mydayData$times < (t[j] + 0.5)))
    #alternatively using filter() from the dplyr package
    #tats <- filter(mydayData, times >= (t[j] - 0.5) & times < (t[j] + 0.5))
    med.tats[j] <- median(tats$otf)
  }
  lines(t,med.tats, col = "blue")
  grid(col = "black")
  mydayFit <- lowess(mydayData$times,mydayData$otf,f = 0.05)
  lines(mydayFit,col = "red")
  
  #put in horizontal gridding
  grid(NA,NULL)
  #you could add a legend on all the plots if you wanted
  #legend("bottomright", c("moving median","lowess"), lty = c(1,1), col = c("blue","red", byt = "n"))
}

#create a 2x4 plot window

par(mfrow = c(2,4))

#loop over days

for (i in 1:7){

#make the lowess fit

mydayData <- subset(myData,weekday == i)

mydayTimes <- timeconvert(mydayData$collected)

mydayData <- cbind(mydayData,times = mydayTimes)

mydayFit <- lowess(mydayData$times,mydayData$otf,f = 0.05)

plot(NA,NA,ylim = c(30,100), xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")

axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

#running median

#create a point at every minute of the day

t <- seq(from = 0.5,to = 23.5, by = 1/60)

med.tats <- vector()

for(j in 1:length(t)){

#get all points within an hour of the minute

tats <- subset(mydayData,(mydayData$times >= (t[j] - 0.5)) & (mydayData$times < (t[j] + 0.5)))

#alternatively using filter() from the dplyr package

#tats <- filter(mydayData, times >= (t[j] - 0.5) & times < (t[j] + 0.5))

med.tats[j] <- median(tats$otf)

}

lines(t,med.tats, col = "blue")

grid(col = "black")

mydayFit <- lowess(mydayData$times,mydayData$otf,f = 0.05)

lines(mydayFit,col = "red")

#put in horizontal gridding

grid(NA,NULL)

#you could add a legend on all the plots if you wanted

#legend("bottomright", c("moving median","lowess"), lty = c(1,1), col = c("blue","red", byt = "n"))

}

Now, let’s overplot all the lowess fits on a single graph and see what practical observations we can make. I have increased the lowess() smoothing to make things easier to look at.

par(mfrow = c(1,1))
plot(0,0,ylim = c(40,80),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")
axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")
for (i in 1:7){
  mydayData <- subset(myData,weekday == i)
  mydayTimes <- timeconvert(mydayData$collected)
  mydayData <- cbind(mydayData,times = mydayTimes)
  mydayFit <- lowess(mydayData$times,mydayData$otf,f = 0.1)
  lines(mydayFit, col = rainbow(7)[i], ylim = c(40,80),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)", main = paste("TAT for",wday(i,label = TRUE),"in 2014"))
}
legend("bottomright",as.character(wday(1:7,label = TRUE)), col = rainbow(7), lty = rep(1,7), cex = 0.5)

par(mfrow = c(1,1))

plot(0,0,ylim = c(40,80),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")

axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

for (i in 1:7){

mydayData <- subset(myData,weekday == i)

mydayTimes <- timeconvert(mydayData$collected)

mydayData <- cbind(mydayData,times = mydayTimes)

mydayFit <- lowess(mydayData$times,mydayData$otf,f = 0.1)

lines(mydayFit, col = rainbow(7)[i], ylim = c(40,80),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)", main = paste("TAT for",wday(i,label = TRUE),"in 2014"))

}

legend("bottomright",as.character(wday(1:7,label = TRUE)), col = rainbow(7), lty = rep(1,7), cex = 0.5)

Observations

We can immediately see some issues. Weekends in the early hours of the morning are bad. 8 am is bad across all days. Noon is generally problematic and particularly so on Saturdays. There is also a slowdown in mid–afternoon and in the early evenings. Saturday midnight is the most problematic time, although the endpoints of the figure have fewer local weighting points and their confidence intervals are wider. This is something we can cover another time.

Remember, also, this is only the median we have looked at. Other horrors may be lurking in the 90th percentile.

Next time what we will do is move all of this TAT visualization to a 3D representation so we can more easily spot the problematic times.

-Dan

The lot is cast into the lap, but its every decision is from the LORD.

Proverbs 16:33

Generating Meaningful Turaround Time Plots for Clinical Laboratory Medicine

August 11, 2015August 11, 2015 dtholmes@mail.ubc.ca

The Problem

It is standard practice in Clinical Laboratory Medicine to monitor turn around times (TATs) for high volume tests like potassium (K), Troponin (Tn) and Hemoglobin (Hb). The term TAT is typically understood to mean “the time elapsed from when the doctor orders the test to the time the result is available in the Laboratory Information System (LIS)”. This of course does not take into account the lag between the result availability and the time when the physician logs in to view it and respond, but let’s just say that we are not there yet.

Traditionally, some dedicated soul would take .csv extracts from the LIS and do laborious things in Excel to generate the median TAT for the month for each test and each lab location for which they were responsible. Not only is it impossible to automate such a process, it is entirely manual and produces fairly uninformative output since (at least at our site) only medians were generated.

What really frustrates physicians is not where the median goes each month, it is the behaviour of, say, the 90th percentile of TAT or the outliers. These are the ones they remember.

R allows us to produce a much more informative figure in an automatable fashion. I provide here an example of a TAT figure for Hb with some statistical metric included.

Look at the Data

Let’s start by reading in our data and looking at how it is structured.

myData<-read.csv(file = "Hb_TAT_data.csv",header = TRUE)
str(myData)
head(myData)

myData<-read.csv(file = "Hb_TAT_data.csv",header = TRUE)

str(myData)

head(myData)

## 'data.frame':    4497 obs. of  6 variables:
##  $ specimenID: int  4221 5281 5308 5320 5356 5374 5375 5376 5241 5270 ...
##  $ ordered   : Factor w/ 4244 levels "2015-06-30 23:15",..: 4 68 86 92 115 126 127 128 46 61 ...
##  $ collected : Factor w/ 4245 levels "2015-07-01 00:02",..: 2 72 88 95 114 126 128 129 46 65 ...
##  $ received  : Factor w/ 4098 levels "2015-07-01 00:07",..: 2 69 81 88 107 119 120 122 45 62 ...
##  $ resulted  : Factor w/ 3765 levels "2015-07-01 00:11",..: 2 70 79 86 104 113 114 116 42 63 ...
##  $ result    : int  126 110 134 117 135 113 109 111 106 129 ...

## specimenID ordered collected received
## 1 4221 2015-06-30 23:28 2015-07-01 00:08 2015-07-01 00:29
## 2 5281 2015-07-01 14:12 2015-07-01 14:25 2015-07-01 14:30
## 3 5308 2015-07-01 15:56 2015-07-01 16:03 2015-07-01 16:10
## 4 5320 2015-07-01 16:57 2015-07-01 17:12 2015-07-01 17:20
## 5 5356 2015-07-01 20:00 2015-07-01 20:07 2015-07-01 20:18
##6 5374 2015-07-01 21:37 2015-07-01 21:40 2015-07-01 21:44
 resulted result
1 2015-07-01 00:37 126
2 2015-07-01 14:50 110
3 2015-07-01 16:16 134
4 2015-07-01 17:23 117
5 2015-07-01 20:23 135
6 2015-07-01 21:53 113

## 'data.frame': 4497 obs. of 6 variables:

## $ specimenID: int 4221 5281 5308 5320 5356 5374 5375 5376 5241 5270 ...

## $ ordered : Factor w/ 4244 levels "2015-06-30 23:15",..: 4 68 86 92 115 126 127 128 46 61 ...

## $ collected : Factor w/ 4245 levels "2015-07-01 00:02",..: 2 72 88 95 114 126 128 129 46 65 ...

## $ received : Factor w/ 4098 levels "2015-07-01 00:07",..: 2 69 81 88 107 119 120 122 45 62 ...

## $ resulted : Factor w/ 3765 levels "2015-07-01 00:11",..: 2 70 79 86 104 113 114 116 42 63 ...

## $ result : int 126 110 134 117 135 113 109 111 106 129 ...

## specimenID ordered collected received

## 1 4221 2015-06-30 23:28 2015-07-01 00:08 2015-07-01 00:29

## 2 5281 2015-07-01 14:12 2015-07-01 14:25 2015-07-01 14:30

## 3 5308 2015-07-01 15:56 2015-07-01 16:03 2015-07-01 16:10

## 4 5320 2015-07-01 16:57 2015-07-01 17:12 2015-07-01 17:20

## 5 5356 2015-07-01 20:00 2015-07-01 20:07 2015-07-01 20:18

##6 5374 2015-07-01 21:37 2015-07-01 21:40 2015-07-01 21:44

resulted result

1 2015-07-01 00:37 126

2 2015-07-01 14:50 110

3 2015-07-01 16:16 134

4 2015-07-01 17:23 117

5 2015-07-01 20:23 135

6 2015-07-01 21:53 113

In this simplified anonymized data set we can see that we have 4497 observations with all of the necessary time points to calculate the turnaround times of the preanalytical and analytical processes. For the sake of this example, let’s focus on the order-to-file time.

We are going to need to handle the dates, for which there is only one package worth discussing, namely lubridate.

library(lubridate)

1	library(lubridate)

Basic Data Preparation

The first thing we need to do is to convert the order, collect, receive and result times to lubridate objects (i.e. time and date objects) so that we can do some algebra on them. We can see from the structure of myData that the order, collect, receive and result time points are in the format “YYYY-MM-DD HH:MM”. Therefore we can use the lubridate function ymd_hm() to perform the conversion.

myData$ordered<-ymd_hm(myData$ordered)
myData$collected<-ymd_hm(myData$collected)
myData$received<-ymd_hm(myData$received)
myData$resulted<-ymd_hm(myData$resulted)

myData$ordered<-ymd_hm(myData$ordered)

myData$collected<-ymd_hm(myData$collected)

myData$received<-ymd_hm(myData$received)

myData$resulted<-ymd_hm(myData$resulted)

Applying str() again to myData, you will see that the dates and times are now POSIXct, that is, they are now dates and times. This allows use to calculate the order-to-file TAT, we can do with the difftime() function exporting the result in minutes. We will also append the order-to-file (otf) TAT to the dataframe and do some quick sanity-checking.

Sanity Check

otf<-difftime(myData$resulted,myData$ordered,units = "min")
myData<-cbind(myData,otf)
summary(as.numeric(myData$otf))
hist(as.numeric(myData$otf), main = "Histogram of Hb TATs", breaks = 60, col = "darkred",xlim = c(0,200), xlab = "Order to File in Minutes")

otf<-difftime(myData$resulted,myData$ordered,units = "min")

myData<-cbind(myData,otf)

summary(as.numeric(myData$otf))

hist(as.numeric(myData$otf), main = "Histogram of Hb TATs", breaks = 60, col = "darkred",xlim = c(0,200), xlab = "Order to File in Minutes")

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   22.00   30.00   36.58   43.00  694.00

1 2	## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 3.00 22.00 30.00 36.58 43.00 694.00

This looks reasonable, so we can proceed with a TAT scatterplot.

Scatterplot

plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT",xlab = "Date",ylab = "TAT (min)")

1	plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT",xlab = "Date",ylab = "TAT (min)")

Beautifying

This is kind-of problematic because we really want to focus on results in the 0-200 minute range. There are some wild-outliers as occurs in real life because of instrument down-time, add-ons, etc. We can leave this matter for the present. Notice that I have displayed every day on the x-axis because this will allow us to investigate any problems we see. So we will adjust the ylim and we will also make the plot points semitransparent by using hexidecimal colour codes followed by a fractional transparency expressed in hexidecimal. Black is “#000000” and “20” is hexidecimal for 32 which is 32/256 or 12.5% opacity.

#make the points semistranparent and a little smaller
plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT",xlab = "Date of Analysis",ylab = "TAT (min)", ylim = c(0,200), col = "#00000020",cex = 0.5)

1 2	#make the points semistranparent and a little smaller plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT",xlab = "Date of Analysis",ylab = "TAT (min)", ylim = c(0,200), col = "#00000020",cex = 0.5)

We’ll accept the fact that we know that there are a number of outliers. We could easily have a plot that displayed them or a tabular summary of them.

Now we will need to prepare the vector of daily medians, 10th and 90th percentiles to plot. We will loop through each day of the month and then calculate the statistics for that day.

#calculate the start and end date
#fist collection in the month's reporting is always collected in the previous month so ceiling forces startDate to be first day of the month we are interested in. Same is true for endDate.
startDate <- ceiling_date(min(myData$ordered),"day")
endDate <- ceiling_date(max(myData$ordered),"day")
days <- seq (from = startDate, to = endDate, by = 'days')
#when we plot stastics, we want them at mid day
middays <- days + hours(12)
tenth<-vector()
fiftieth<-vector()
ninetieth<-vector()

for ( i in seq_along(days) ) {
  daysData<-subset(myData,myData$ordered >= days[i]&myData$ordered<days[i+1])
  tenth[i]<-quantile(daysData$otf,probs = 0.10)
  fiftieth[i]<-median(daysData$otf)
  ninetieth[i]<-quantile(daysData$otf,probs = 0.90)
}

quantileData<-data.frame(middays,tenth,fiftieth,ninetieth)

plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT",xlab = "Date of Analysis", ylab = "TAT (min)", ylim = c(0,200), col = "#00000020",cex = 0.5)
lines(quantileData$middays,quantileData$tenth, col = "red")
lines(quantileData$middays,quantileData$ninetieth, col = "red")
lines(quantileData$middays,quantileData$fiftieth, col = "blue")

#calculate the start and end date

#fist collection in the month's reporting is always collected in the previous month so ceiling forces startDate to be first day of the month we are interested in. Same is true for endDate.

startDate <- ceiling_date(min(myData$ordered),"day")

endDate <- ceiling_date(max(myData$ordered),"day")

days <- seq (from = startDate, to = endDate, by = 'days')

#when we plot stastics, we want them at mid day

middays <- days + hours(12)

tenth<-vector()

fiftieth<-vector()

ninetieth<-vector()

for ( i in seq_along(days) ) {

daysData<-subset(myData,myData$ordered >= days[i]&myData$ordered<days[i+1])

tenth[i]<-quantile(daysData$otf,probs = 0.10)

fiftieth[i]<-median(daysData$otf)

ninetieth[i]<-quantile(daysData$otf,probs = 0.90)

}

quantileData<-data.frame(middays,tenth,fiftieth,ninetieth)

plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT",xlab = "Date of Analysis", ylab = "TAT (min)", ylim = c(0,200), col = "#00000020",cex = 0.5)

lines(quantileData$middays,quantileData$tenth, col = "red")

lines(quantileData$middays,quantileData$ninetieth, col = "red")

lines(quantileData$middays,quantileData$fiftieth, col = "blue")

But this is not all that easy to look at. First, it’s kind-of ugly and second, if we find a problem date, we can’t read it from the figure. So let’s start by fixing the x-axis labels:

plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT", xlab = "", ylab = "TAT (min)", ylim = c(0,200), col = "#00000020",cex = 0.4,xaxt = "n")
#don't plot first day of next month on axis
axis.POSIXct(side = 1, quantileData$middays[1:length(quantileData$middays)-1],at = quantileData$middays[1:length(quantileData$middays)-1], las = 2, cex.axis = 0.6, col.axis = "gray30", format = "%b %d %Y")
#allows me to move the xlab down manually so as not to overwrite the dates.
mtext("Date of analysis", side = 1, line = 4)

plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT", xlab = "", ylab = "TAT (min)", ylim = c(0,200), col = "#00000020",cex = 0.4,xaxt = "n")

#don't plot first day of next month on axis

axis.POSIXct(side = 1, quantileData$middays[1:length(quantileData$middays)-1],at = quantileData$middays[1:length(quantileData$middays)-1], las = 2, cex.axis = 0.6, col.axis = "gray30", format = "%b %d %Y")

#allows me to move the xlab down manually so as not to overwrite the dates.

mtext("Date of analysis", side = 1, line = 4)

To paint the central 80% as a band, we will need to use the polygon() function. I am going to write a function to which and x-vector and two y-vectors is supplied which then fills the area between then with a supplied color. Naturally, the three vectors must have the same length.

#colours in the space between two curves.
fillitin<-function(x,ymin,ymax,colour){
  for (i in 1:length(x)){
    #define the x coordinates of the vertices of the polygon
    xvert<-c(x[i],x[i],x[i+1],x[i+1])
    #define the y coordinates of the vertices of the polygon
    yvert<-c(ymin[i],ymax[i],ymax[i+1],ymin[i+1])
    polygon(xvert,yvert, col = colour, border = NA)
  }
}

#now add these effects to the existing figure
fillitin(middays,tenth,ninetieth,"#FF000020")
lines(quantileData$middays,quantileData$fiftieth, col = "blue")
lines(quantileData$middays,quantileData$tenth, col = "red")
lines(quantileData$middays,quantileData$ninetieth, col = "red")

#colours in the space between two curves.

fillitin<-function(x,ymin,ymax,colour){

for (i in 1:length(x)){

#define the x coordinates of the vertices of the polygon

xvert<-c(x[i],x[i],x[i+1],x[i+1])

#define the y coordinates of the vertices of the polygon

yvert<-c(ymin[i],ymax[i],ymax[i+1],ymin[i+1])

polygon(xvert,yvert, col = colour, border = NA)

}

#now add these effects to the existing figure

fillitin(middays,tenth,ninetieth,"#FF000020")

lines(quantileData$middays,quantileData$fiftieth, col = "blue")

lines(quantileData$middays,quantileData$tenth, col = "red")

lines(quantileData$middays,quantileData$ninetieth, col = "red")

Final Product

Now we should just finish it off with a legend.

legend("topright",c("Median","Central 80%"),lty = c(1,1),col = c("blue","red"),inset = .05)

1	legend("topright",c("Median","Central 80%"),lty = c(1,1),col = c("blue","red"),inset = .05)

And that is a little more informative. There are many features you could add from this point – like smoothing, statistical analysis, outlier report. You could also loop over different tests, examine both the preanalytical and analytical processes at different locations, and produce a pdf report using MarkDown for all the institutions you look after.

-Dan

Background

The Concept

Hoffmann

What do I plot for Hoffmann: a QQ-plot or the CDF?

The Correct Approach

What not to do

Bhattacharya

Gaussian Mixture Model

Summary of Results

Conclusion

Background

Easy Peaks

More Difficult Peaks

Getting the Data

Smoothing

Transformation

Find Extrema

Fitting

Isolate the \(\gamma\) Region

Attempt Something that Ultimately Does Not Work

Baseline Removal

All about that Base(line)

Conclusions

Background

Some Fake Data to Work With

Gather the Grain

Visualize

Separate the Wheat from the Chaff

I can’t see the difference. Can you see the difference?

Conclusion

Background

Reading in the Data

Clean Up

Check it Out

Write the Data

Conclusion

Background

Precision of Components

Build Approximation Functions

Random Simulation

Conclusion

Teach us to number our days, that we may gain a heart of wisdom. Psalm 90:12.

Removing NA’s from a Data Frame in R

The Problem

Finding NA's

Hey Hey! Ho Ho! Those NAs have got to go!

Complete Cases

na.omit()

Row Numbers are Actually Names

Now we can move on

Final Thought:

Introduction

Simple unit conversion

Getting started

And now for some regular expressions

Paste, briefly

‘Apply’-ing yourself

The Problem

Loading the Data

Sanity Check

Some Nutty Stuff

Time Dependence

Tunnelling Down

Some More Lubridate Magic

Monday Monday, So Good to Me?

Removing the For–ness

Smoothing

Lowess Smoothing

Observations

The lot is cast into the lap, but its every decision is from the LORD.

Proverbs 16:33

The Problem

Look at the Data

Basic Data Preparation

Sanity Check

Scatterplot

Beautifying

Final Product

“The LORD detests dishonest scales, but accurate weights find favor with him.”

Proverbs 11:1