Let me preface this by saying I am not making a recommendation to use the Hoffman method. Neither am I advocating for reference interval mining from routine data. There are many challenges associated with this kind of effort. That's for another post I think. However, I am going to how one does the calculations for two methods I have seen used: the Hoffman Method and the Bhattacharya Method. Then I will show how to do this using the mixtools package in R which uses the expectation maximum algorithm to determine the maximum likelihood.
When you look at histograms of routine clinical data from allcomers, on some occasions the data will form a bimodal looking distribution formed by the putatively sick and well. If you could statistically determine the distribution of the well subjects, then you could, in principle, determine the reference interval without performing a reference interval study. We can all dream, right?
All three of the approaches I show assume that the two distributions are Gaussian. This is almost never true. But for the purposes of the calculations, I will provide each approach data that meets the assumptions it makes. So, let's make a fake bimodal distribution and see how each method does. We will assume equal numbers of sick and well so that the bimodal distribution is obvious. One will have \(\mu_1 = 2\) and \(\sigma_1 = 0.5\) and the other will have \(\mu_2 = 6\) and \(\sigma_2 = 2\). The expected normal range for this population is based on \(\mu_1\) and \(\sigma_1\) and is \(2 – 0.5 \times 1.96\) and \(2 + 0.5 \times 1.96\) or about 1–3.
#two Gaussian distributions with means of 2 and 6 respectively and sd's of 1 and 2 set.seed(10) #to make sure you generate the same data mode1 <- rnorm(1000,2,0.5) mode1 <- mode1[mode1 > 0] #get rid of negative results mode2 <- rnorm(1000,6,2) mode2 <- mode2[mode2 > 0] d <- sort(c(mode1,mode2)) dhist <- hist(d, breaks = c(seq(0,20,0.25),100), xlim = c(0,10), main = "Histogram of Patient Results", xlab = "Concentration of Analyte")
To illustrate how the two populations add you can plot one in green and one in pink. The overlap shows in a yucky brown.
hist(d, breaks = c(seq(0,10,0.25),100), freq = TRUE, xlim = c(0,10), main = "Histogram of Patient Results", xlab = "Concentration of Analyte") hist(mode1, breaks = c(seq(0,10,0.25),100), add = TRUE, col = rgb(0,1,0,0.3), freq = TRUE) hist(mode2, breaks = c(seq(0,10,0.25),100), add = TRUE, col = rgb(1,0,0,0.3), freq = TRUE)
In 1963 Robert Hoffman proposed a simple graphical approach to this problem and use of his method is alive and well—see here for example. The method assumes that both modes are Gaussian and that if one eye-balls (yes…the paper says “eye-fit”) the first linear-looking portion of the cumulative probability distribution (CDF) function as plotted on normal probability paper and finds its intersection with the lines y = 0.025 and y = 0.975, one can impute the normal range.
It is very important to understand that the use normal probability paper, as Hoffman described, was mandatory because it produces a normal probablity plot. As he says,
“This special graph paper serves the useful purpose of 'straightening out' a cumulative gaussian distribution. It forms a straight line.”
A CDF plotted on linear scale is sigmoidal. This is not what we want. We want a normal probability plot which is just a special case of the QQ-plot where the comparator distribution is the normal distribution. Inadvertently plotting a plain old CDF will not produce correct estimates of the lower and upper limits of normal (ie \(\mu \pm 1.96\sigma\)). The reason I emphasize this is that I have seen this error made in a number of reference interval papers—but not the one I cited above—it is correct. The importance of the distinction becomes not-very-subtle when you apply the Hoffman approach to a pure Gaussian distribution. In short, use of the CDF in linear space generates erroneous results as we will show later on.
Here is the standard r-base normal QQ-plot of our mock data set:
qqnorm(d, type = "l")
To prevent reader confusion, I am going present the plots the way Hoffman originally showed them. So I will put the patient data on the x-axis. It doesn't change anything. You can do it as you like.
my.qq <- qqnorm(d, datax = TRUE, type = "l", ylab = "Patient Results", xlab = "Quantiles of the Normal Distribution")
From this you can see that there is obviously linear section between about x = 0 to x = 2 (and with the eye of faith, there is a second after x = 6). This is what Hoffman calls the “eye-fit”. Since the first linear section is attributable to the first of the two normal distributions which form the overall distribution, we can use the it to determine properties of the first distribution. If I look only at the data between x = 0 and x = 2, I am sort-of guaranteed to be in the first linear section. You don't have to kill yourself to correctly identify where the linearity ends because the density of the points should be highest near the middle of the linear section and this will weight the regression for you.
Next if I extend this line to find its intersection with y = -1.96 and y = 1.96 (ie the z-scores corresponding the limits of normal, namely the 2.5th and 97.5th centiles), I can estimate the reference interval, by dropping perpendicular lines from the two respective intersections. Here is what I get:
#get regression line - it's linear from about from 0 to 2 my.qq <- as.data.frame(my.qq) linear.bit <- subset(my.qq, x <= 2) #get the regression line of the linear section reg <- lm(y ~ x, data = linear.bit) plot(y ~ x, data = my.qq, type = "l", ylab = "Quantiles of the Normal Distribution", xlab = "Patient Results") abline(reg, col = "red") abline(h = c(-1.96,1.96), lty = 2) uln.hoff <- unname((1.96 - coef(reg)[1])/coef(reg)[2]) lln.hoff <- unname((-1.96 - coef(reg)[1])/coef(reg)[2]) abline(v = c(lln.hoff,uln.hoff), lty = 2)
lln.hoff
## [1] 1.105849
uln.hoff
## [1] 3.699254
So the Hoffman reference interval becomes 1.11 to 3.70 which you can compare to the expected values of about 1 and 3 based on the random data. Not the greatest but not bad.
Let's apply the correct approach to the Hoffman method (QQ-Plot) and incorrect approach (CDF on a linear scale) to a pseudorandom sampling (n=10,000) of the standard normal distribution, which has a mean of 0 and a standard deviation of 1. Therefore the central 95% or “normal range” for this distribution will be -1.96 to 1.96. I will plot regression lines through the linear part of each curve and find the respective intersections with the appropriate horizontal lines.
# QQ-Norm plot of standard normal distribution z <- sort(rnorm(10000,0,1)) par(mfrow = c(1,2)) my.qq <- qqnorm(z, type = "l", datax = TRUE, plot.it = FALSE) plot(my.qq, ylim = c(-2.1,2.1), type = "l", main = "Normal QQ-plot", ylab = "Quantiles of the Normal Distribution", xlab = "Sample Quantiles") qqline(z, col = "blue") abline(h=c(1.96,-1.96), col = "blue") abline(v = c(1.96,-1.96), col = "blue", lty = 2) #lower and upper limits are -2 and 2. # CDF of standard normal distribution my.ecdf <- ecdf(z) df <- data.frame(z = z, y = my.ecdf(z)) plot(y ~ z, data = df, type = "l", main = "Cumulative Normal Distribution") abline(v = c(1.96,-1.96), col = "blue", lty = 2) abline(h = c(0.025,0.975), col = "blue") linear.bit <- subset(df, z > -.5 & z < 0.5) abline(lm(y ~ z, data = linear.bit), col = "blue") abline(v = c(-(0.5 - 0.05/2)*sqrt(2*pi),(0.5 - 0.05/2)*sqrt(2*pi)), col = rgb(1,0,0,0.5) )
The QQ-plot generates estimates the limits of normal, \(\mu \pm 1.96\sigma\), as about \(\pm 1.96\) as it should. You can easily show that the same procedure on the CDF intersects the lines \(y = \alpha /2\) and \(y = 1 – \alpha/2\) at values of \(\pm (1 – \alpha) \sqrt{\pi/2} \sigma\) which is about \(\pm 1.19\) for \(\sigma = 1\) and \(\alpha = 0.05\). This erroneous estimate is shown with the pink vertical lines. So the Hoffman method does not work if one attempts to extend the linear portion of the CDF if it is plotted in linear space and it will produce estimates of \(\sigma\) that are 40% too low in this case. If you're puting this all together, this because the CDF is well away from its linear portion when the cumulative proportions are 0.025 and 0.975—not so for a QQ-plot. If you see a “Hoffman plot” constructed from a sigmoidal CDF plotted on a linear scale, something is wrong.
This method is based on a much more highly cited paper in Biometrics published in 1967 by C.G. Bhattacharya. Loosely speaking, the method of Bhattacharya determines the parameter estimates of \(\mu_i\) and \(\sigma_i\) from the slope of the log of the distribution function. It was originally intended as a graphical method and so it also involves some human eye-balling.
We will need the log of the counts from the histogram. When we store the results of a histogram in R, we have the counts automatically.
str(dhist)
## List of 6 ## $ breaks : num [1:82] 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 ... ## $ counts : int [1:81] 2 4 5 19 49 100 140 210 174 180 ... ## $ density : num [1:81] 0.004 0.008 0.01 0.038 0.098 ... ## $ mids : num [1:81] 0.125 0.375 0.625 0.875 1.125 ... ## $ xname : chr "d" ## $ equidist: logi FALSE ## - attr(*, "class")= chr "histogram"
We can now calculate the log of the counts (denoted \(y\)) and \(\Delta log(y)\) from bin to bin. We put these in a dataframe along with the counts and the midpoints of the bins. The bin width, which is chosen to be constant \(h\), is the distance between the midpoints of each bin.
#alter the number of breaks to make the linear sections more obvious. dhist <- hist(d, breaks = 30, plot = FALSE) ly <- log(dhist$counts) dly <- diff(ly) df <- data.frame(xm = dhist$mids[-length(dhist$mids)], ly = dly, counts = dhist$counts[-length(dhist$mids)]) h <- diff(df$xm)[1]
Now let's plot \(\Delta log(y)\) as a function of the midpoints of the bins. I also number all the points to facilitate the next step.
plot(ly ~ xm, data = df, xlim = c(0,10)) abline(h = 0) abline(v = df$xm, lty = 2, col = "#00000080") #number all the points library(calibrate) num.df <- na.omit(df) textxy(num.df$xm, num.df$ly, 1:nrow(num.df), row.names(num.df), cex = 0.8, offset = 1, col = "blue")
We can see from the figure that there are two sections where the plot shows a downsloping line: one between points 2 to 6 and another between points 10 to 21. How straight these lines appear is affected by how wide your bins are so if you get lines that are hard to discern, you can try making fewer bins.
In any case, using Bhattacharya's notation, the next step in the procedure is to draw regression lines through the \(r_{th}\) linear section and determine the intercept \(\hat{\lambda}_r\) with the x-axis. Bhattacharya intended this as a graphical procedure and advises,
“While matching the straight line it is better to fit closely to the points where the frequency is large even if the apparent discrepancy becomes somewhat large where the frequency is small.”
Since we are doing this by calculation, we can take his advice by weighting the linear regressions according to the counts. This allows the determination of the \(\hat{\mu}_r\) by:
\[\hat{\mu}_r = \hat{\lambda}_r + h/2\]
and also the determination of \(\hat{\sigma}_r\) by:
\[\hat{\sigma}^2_r = -h/ \text{slope}_r – h^2/12\]
#linear section 1 linear.bit1 <- subset(df[2:6,]) lm1 <- lm(ly ~ xm, data = linear.bit1, weights = linear.bit1$counts) lambda1 <- -coef(lm1)[1]/coef(lm1)[2] mu1 <- lambda1 + h/2 sigma1 <- sqrt(-h/coef(lm1)[2] - h^2/12) #linear section 2 linear.bit2 <- subset(df[10:21,]) lm2 <- lm(ly ~ xm, data = linear.bit2, weights = linear.bit2$counts) lambda2 <- -coef(lm2)[1]/coef(lm2)[2] mu2 <- lambda2 + h/2 sigma2 <- sqrt(-h/coef(lm2)[2] - h^2/12) #normal range limits lln.bhat <- qnorm(0.025,mu1, sigma1) uln.bhat <- qnorm(0.975,mu1, sigma1)
And here are the results we get:
mu Values | sigma Values | Normal Range Limits |
---|---|---|
2.06 | 0.59 | 0.90 |
6.25 | 1.83 | 3.21 |
And here is what it all looks like
plot(ly ~ xm, data = df, xlim = c(0,10)) abline(h = 0) abline(v = df$xm, lty = 2, col = "#00000080") abline(lm1, col = "green") abline(lm2, col = "red")
In this demonstration, there are only two Gaussian distributions to resolve, but the method is not limited to the resolution of two Gaussian curves at all. If there are more, there will be more downsloping lines crossing the x-axis. So we get normal range estimates of 0.90 and 3.21 which compare much better with the expected values of about 1 and 3. We also get good estimates of \(\mu_2=\) 6.3 and \(\sigma_2=\) 1.8 which are about 6 and 2 respectively in our data set.
Bhattacharya also provides a means of calculating the mixing proportion of the two distributions—that is, the proportions of patients in the sick and abnormal populations. We don't need that here so I omit it.
In R there are a lot of ways to approach the separation of mixtures of distributions using maximum likelihood. Here I am using a function from the mixtools package that is particularly easy to use. The concept of using maximum likelihood for mining your reference interval is not new (see this paper) but many would be intimidated by the math required to do it from scratch.
With R, this is pretty easy but please be cautioned that real data does not play as nice as the data in this demonstration (even moreso for Hoffman and Bhattacharya) and it is unlikely that you will get smashing results unless your data fits the assumptions of the model.
In any case,
#Gaussian Mixed Model - the right way to do this library(mixtools) fit <- normalmixEM(d, k = 2) #try to fit two Gaussians
## number of iterations= 28
summary(fit)
## summary of normalmixEM object: ## comp 1 comp 2 ## lambda 0.519121 0.480879 ## mu 2.014404 6.186571 ## sigma 0.518210 1.966858 ## loglik at estimate: -3961.014
which gives very good parameter estimates indeed! Estimates of \(\mu_1\) and \(\mu_2\) are 2.01 and 6.19 respectively and estimates of \(\sigma_1\) and \(\sigma_2\) are 0.52 and 1.97 respectively.
Looking at this graphically:
hist(d, freq = FALSE, breaks = 50, main = "Histogram of Patient Results") #show the respective curves lines(d,fit$lambda[1]*dnorm(d,fit$mu[1],fit$sigma[1]), col = "green") lines(d,fit$lambda[2]*dnorm(d,fit$mu[2],fit$sigma[2]), col = "red")
#find the 2.5th 97.5th percentile from the mixed model fit lln.EM <- qnorm(0.025,fit$mu[1], fit$sigma[1]) lln.EM
## [1] 0.9987315
uln.EM <- qnorm(0.975,fit$mu[1], fit$sigma[1]) uln.EM
## [1] 3.030077
So the normal range estimate from EM method is 1.00 to 3.03 which is pretty fantastic.
LLN | ULN | |
---|---|---|
Raw Random Data | 1.03 | 2.98 |
Hoffman | 1.11 | 3.70 |
Bhattacharya | 0.90 | 3.21 |
mixtools EM – winner! | 1.00 | 3.03 |
It's not too hard to figure out which one of these approaches works best. But what do you do if your patient data distribution is obviously not a mixture of Gaussians (ie when the distributions look skewed)? There are ways to do this in R for this but I will cover that another time–maybe in a paper.
normalmixEM()
function from the mixtools package performs very well without any human oversight.Parting Thought
Please don't fall on the wrong side of God's mixture separation procedures for wheat and chaff.
Said, John the Baptist, “But after me comes one who is more powerful than I, whose sandals I am not worthy to carry. He will baptize you with the Holy Spirit and fire. His winnowing fork is in his hand, and he will clear his threshing floor, gathering his wheat into the barn and burning up the chaff with unquenchable fire.”
Matt 3:11–12
]]>At the AACC meeting recently, there was an enthusiastic discussion of standardization of reporting for serum protein electrophoresis (SPEP) presented by a working group headed up by Dr. Chris McCudden and Dr. Ron Booth, both of the University of Ottawa. One of the discussions pertained to how monoclonal bands, especially small ones, should be integrated. While many use the default manual vertical gating or “drop” method offered by Sebia's Phoresis software, Dr. David Keren was discussing the value of tangent skimming as a more repeatable and effective means of monoclonal protein quantitation. He was also discussing some biochemical approaches distinguishing monoclonal proteins from the background gamma proteins.
The drop method is essentially an eye-ball approach to where the peak starts and ends and is represented by the vertical lines and the enclosed shaded area.
The tangent skimming approach is easier to make reproducible. In the mass spectrometry world it is a well-developed approach with a long history and multiple algorithms in use. This is apparently the book. However, when tangent skimming is employed in SPEP, unless I am mistaken, it seems to be done by eye. The integration would look like this:
During the discussion it was point out that peak deconvolution of the monoclonal protein from the background gamma might be preferable to either of the two described procedures. By this I mean integration as follows:
There was discussion this procedure is challenging for number of reasons. Further, it should be noted that there will only likely be any clinical value in a deconvolution approach when the concentration of the monoclonal protein is low enough that manual integration will show poor repeatability, say < 5 g/L = 0.5 g/dL.
Fitting samples with larger monoclonal peaks is fairly easy. Fitting tends to converge nicely and produce something meaningful. For example, using the approach I am about to show below, an electropherogram like this:
with a gamma region looking like this:
can be deconvoluted with straightforward non-linear regression (and no baseline subtraction) to yield this:
and the area of the green monoclonal peak is found to be 5.3%.
What is more challenging is the problem of small monoclonals buried in normal \(\gamma\)-globulins. These could be difficult to integrate using a tangent skimming approach, particularly without image magnification. For the remainder of this post we will use a gel with a small monoclonal in the fast gamma region shown at the arrow.
EP data can be extracted from the PDF output from any electrophoresis software. This is not complicated and can be accomplished with pdf2svg or Inkscape and some Linux bash scripting. I'm sure we can get it straight from the instrument but it is not obvious to me how to do this. One could also rescan a gel and use ImageJ to produce a densitometry scan which is discussed in the ImageJ documentation and on YouTube. ImageJ also has a macro language for situations where the same kind of processing is done repeatedly.
The data has 10284 pairs of (x,y) data. But if you blow up on it and look carefully you find that it is a series of staircases.
plot(y~x, data = head(ep.data,100), type = "o", cex = 0.5)
It turns out that this jaggedness significantly impairs attempts to numerically identify the peaks and valleys. So, I smoothed it a little using the handy rle()
function to identify the midpoint of each step. This keeps the total area as close to its original value as possible–though this probably does not matter too much.
ep.rle <- rle(ep.data$y) stair.midpoints <- cumsum(ep.rle$lengths) - floor(ep.rle$lengths/2) ep.data.sm <- ep.data[stair.midpoints,] plot(y~x, data = head(ep.data,300), type = "o", cex = 0.5) points(y~x, data = head(ep.data.sm,300), type = "o", cex = 0.5, col = "red")
Now that we are satisfied that the new data is OK, I will overwrite the original dataframe.
ep.data <- ep.data.sm
The units on the x and y-axes are arbitrary and come from page coordinates of the PDF. We can normalize the scan by making the x-axis go from 0 to 1 and by making the total area 1.
library(Bolstad) #A package containing a function for Simpon's Rule integration ep.data$x <- ep.data$x/max(ep.data$x) A.tot <- sintegral(ep.data$x,ep.data$y)$value ep.data$y <- ep.data$y/A.tot #sanity check sintegral(ep.data$x,ep.data$y)$value
## [1] 1
plot(y~x, data = ep.data, type = "l")
Using the findPeaks
function from the quantmod package we can find the minima and maxima:
library(quantmod) ep.max <- findPeaks(ep.data$y) plot(y~x, data = ep.data, type = "l", main = "Maxima") abline(v = ep.data$x[ep.max], col = "red", lty = 2)
ep.min <- findValleys(ep.data$y) plot(y~x, data = ep.data, type = "l", main = "Minima") abline(v = ep.data$x[ep.min], col = "blue", lty = 2)
Not surprisingly, there are some extraneous local extrema that we do not want. I simply manually removed them. Generally, this kind of thing could be tackled with more smoothing of the data prior to analysis.
ep.max <- ep.max[-1] ep.min <- ep.min[-c(1,length(ep.min))]
Now it's possible with the nls()
function to fit the entire SPEP with a series of Gaussian curves simultaneously. It works just fine (provided you have decent initial estimates of \(\mu_i\) and \(\sigma_i\)) but there is no particular clinical value to fitting the albumin, \(\alpha_1\), \(\alpha_2\), \(\beta_1\) and \(\beta_2\) domains with Gaussians. What is of interest is separately quantifying the two peaks in \(\gamma\) with two separate Gaussians so let's isolate the \(\gamma\) region based on the location of the minimum between \(\beta_2\) and \(\gamma\).
gamma.ind <- max(ep.min):nrow(ep.data) gamma.data <- data.frame(x = ep.data$x[gamma.ind], y = ep.data$y[gamma.ind]) plot(y ~ x, gamma.data, type = "l")
At first I thought I could just throw two normal distributions at this and it would work. However, it does not work well at all and this kind of not-so-helpful fit turns out to happen a fair bit. I use the nls()
function here which is easy to call. It requires a functional form which I set to be:
\[y = C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big) + C_2 \exp \Big({-\frac{(x-\mu_2)^2}{2\sigma_2^2}}\Big)\]
where \(\mu_1\) is the \(x\) location of the first peak in \(\gamma\) and \(\mu_2\) is the \(x\) location of the second peak in \(\gamma\). The estimates of \(\sigma_1\) and \(\sigma_2\) can be obtained by trying to estimate the full-width-half-maximum (FWHM) of the peaks, which is related to \(\sigma\) by
\[FWHM_i = 2 \sqrt{2\ln2} \times \sigma_i = 2.355 \times \sigma_i\]
I had to first make a little function that returns the respective half-widths at half-maximum and then uses them to estimate the \(FWHM\). Because the peaks are poorly resolved, it also tries to get the smallest possible estimate returning this as FWHM2
.
FWHM.finder <- function(ep.data, mu.index){ peak.height <- ep.data$y[mu.index] fxn.for.roots <- ep.data$y - peak.height/2 indices <- 1:nrow(ep.data) root.indices <- which(diff(sign(fxn.for.roots))!=0) tmp <- c(root.indices,mu.index) %>% sort tmp2 <- which(tmp == mu.index) first.root <- root.indices[tmp2 -1] second.root <- root.indices[tmp2] HWHM1 <- ep.data$x[mu.index] - ep.data$x[first.root] HWHM2 <- ep.data$x[second.root] - ep.data$x[mu.index] FWHM <- HWHM2 + HWHM1 FWHM2 = 2*min(c(HWHM1,HWHM2)) return(list(HWHM1 = HWHM1,HWHM2 = HWHM2,FWHM = FWHM,FWHM2 = FWHM2)) }
The peak in the \(\gamma\) region was obtained previously:
plot(y ~ x, gamma.data, type = "l") gamma.max <- findPeaks(gamma.data$y) abline(v = gamma.data$x[gamma.max])
and from them \(\mu_1\) is determined to be 0.7. We have to guess where the second peak is, which is at about \(x=0.75\) and has an index of 252 in the gamma.data
dataframe.
gamma.data[252,]
## x y ## 252 0.7487757 0.6381026
#append the second peak gamma.max <- c(gamma.max,252) gamma.mu <- gamma.data$x[gamma.max] gamma.mu
## [1] 0.6983350 0.7487757
plot(y ~ x, gamma.data, type = "l") abline(v = gamma.data$x[gamma.max])
Now we can find the estimates of the standard deviations:
#find the FWHM estimates of sigma_1 and sigma_2: FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.data) gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355
The estimates of \(\sigma_1\) and \(\sigma_2\) are now obtained. The estimates of \(C_1\) and \(C_2\) are just the peak heights.
peak.heights <- gamma.data$y[gamma.max]
We can now use nls()
to determine the fit.
fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) + C2*exp(-(x-mean2)**2/(2 * sigma2**2))), data = gamma.data, start = list(mean1 = gamma.mu[1], mean2 = gamma.mu[2], sigma1 = gamma.sigma[1], sigma2 = gamma.sigma[2], C1 = peak.heights[1], C2 = peak.heights[2]), algorithm = "port")
Determining the fitted values of our unknown coefficients:
dffit <- data.frame(x=seq(0, 1 , 0.001)) dffit$y <- predict(fit, newdata=dffit) fit.sum <- summary(fit) fit.sum #show the fitted coefficients
## ## Formula: y ~ (C1 * exp(-(x - mean1)^2/(2 * sigma1^2)) + C2 * exp(-(x - ## mean2)^2/(2 * sigma2^2))) ## ## Parameters: ## Estimate Std. Error t value Pr(>|t|) ## mean1 0.7094793 0.0003312 2142.23 <2e-16 *** ## mean2 0.7813900 0.0007213 1083.24 <2e-16 *** ## sigma1 0.0731113 0.0002382 306.94 <2e-16 *** ## sigma2 0.0250850 0.0011115 22.57 <2e-16 *** ## C1 0.6983921 0.0018462 378.29 <2e-16 *** ## C2 0.0819704 0.0032625 25.12 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.01291 on 611 degrees of freedom ## ## Algorithm "port", convergence message: both X-convergence and relative convergence (5)
coef.fit <- fit.sum$coefficients[,1] mu.fit <- coef.fit[1:2] sigma.fit <- coef.fit[3:4] C.fit <- coef.fit[5:6]
And now we can plot the fitted results against the original results:
#original plot(y ~ x, data = gamma.data, type = "l", main = "This is Garbage") #overall fit lines(y ~ x, data = dffit, col ="red", cex = 0.2) legend("topright", lty = c(1,1,1), col = c("black", "green", "blue","red"), c("Scan", "Monoclonal", "Gamma", "Sum")) #components of the fit for(i in 1:2){ x <- dffit$x y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2)) lines(x,y, col = i + 2) }
And this is garbage. The green curve is supposed to be the monoclonal peak, the blue curve is supposed to be the \(\gamma\) background, and the red curve is their sum, the overall fit. This is a horrible failure.
Subsequently, I tried fixing the locations of \(\mu_1\) and \(\mu_2\) but this also yielded similar nonsensical fitting. So, with a lot of messing around trying different functions like the lognormal distribution, the Bi-Gaussian distribution and the Exponentially Modified Gaussian distribution, and applying various arbitrary weighting functions, and simultaneously fitting the other regions of the SPEP, I concluded that nothing could predictably produce results that represented the clinical reality.
I thought maybe the challenge to obtain a reasonable fit related to the sloping baseline, so I though I would try to remove it. I will model the baseline in the most simplistic manner possible: as a sloped line.
I will arbitrarily define the tail of the \(\gamma\) region to be those values having \(y \leq 0.02\). Then I will connect the first (x,y) point from the \(\gamma\) region and connect it to the tail.
gamma.tail <- filter(gamma.data, y <= 0.02) baseline.data <- rbind(gamma.data[1,],gamma.tail) names(baseline.data) <- c("x","y") baseline.fun <- approxfun(baseline.data) plot(y~x, data = gamma.data, type = "l") lines(baseline.data$x,baseline.fun(baseline.data$x), col = "blue")
Now we can define a new dataframe gamma.no.base
that has the baseline removed:
gamma.no.base <- data.frame(x = gamma.data$x, y = gamma.data$y - baseline.fun(gamma.data$x)) plot(y~x, data = gamma.data, type = "l") lines(y ~ x, data = gamma.no.base, lty = 2) gamma.max <- findPeaks(gamma.no.base$y)[1:2] #rejects a number of extraneous peaks abline(v = gamma.no.base$x[gamma.max])
The black is the original \(\gamma\) and the dashed has the baseline removed. This becomes and easy fit.
#Estimate the Ci peak.heights <- gamma.no.base$y[gamma.max] #Estimate the mu_i gamma.mu <- gamma.no.base$x[gamma.max] #the same values as before #Estimate the sigma_i from the FWHM FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.no.base) gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355 #Perform the fit fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) + C2*exp(-(x-mean2)**2/(2 * sigma2**2))), data = gamma.no.base, start = list(mean1 = gamma.mu[1], mean2 = gamma.mu[2], sigma1 = gamma.sigma[1], sigma2 = gamma.sigma[2], C1 = peak.heights[1], C2 = peak.heights[2]), algorithm = "port") #Plot the fit dffit <- data.frame(x=seq(0, 1 , 0.001)) dffit$y <- predict(fit, newdata=dffit) fit.sum <- summary(fit) coef.fit <- fit.sum$coefficients[,1] mu.fit <- coef.fit[1:2] sigma.fit <- coef.fit[3:4] C.fit <- coef.fit[5:6] plot(y ~ x, data = gamma.no.base, type = "l") legend("topright", lty = c(1,1,1), col = c("black", "green", "blue","red"), c("Scan", "Monoclonal", "Gamma", "Sum")) lines(y ~ x, data = dffit, col ="red", cex = 0.2) for(i in 1:2){ x <- dffit$x y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2)) lines(x,y, col = i + 2) }
Lo and behold…something that is not completely insane. The green is the monoclonal, the blue is the \(\gamma\) background and the red is their sum, that is, the overall fit. A better fit could now we sought with weighting or with a more flexible distribution shape. In any case, the green peak is now easily determined. Since
\[\int_{-\infty}^{\infty} C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big)dx = \sqrt{2\pi}\sigma C_1\]
A.mono <- sqrt(2*pi)*sigma.fit[1]*C.fit[1] %>% unname() A.mono <- round(A.mono,3) A.mono
## sigma1 ## 0.024
So this peak is 2.4% of the total area. Now, of course, this assumes that nothing under the baseline is attributable to the monoclonal peak and all belongs to normal \(\gamma\)-globulins, which is very unlikely to be true. However, the drop and tangent skimming methods also make assumptions about how the area under the curve contributes to the monoclonal protein. The point is to try to do something that will produce consistent results that can be followed over time. Obviously, if you thought there were three peaks in the \(\gamma\)-region, you'd have to set up your model accordingly.
There are obviously better ways to model the baseline because this approach of a linear baseline is not going to work in situations where, for example, there is a small monoclonal in fast \(\gamma\) dwarfed by normal \(\gamma\)-globulins. That is, like this:
Something curvilinear or piecewise continuous and flexible enough for more circumstances is generally required.
There is also no guarantee that baseline removal, whatever the approach, is going to be a good solution in other circumstances. Given the diversity of monoclonal peak locations, sizes and shapes, I suspect one would need a few different approaches for different circumstances.
The data in the PDFs generated by EP software are processed (probably with splining or similar) followed by the stair-stepping seen above. It would be better to work with raw data from the scanner.
nls()
because nls()
does not play nice with data having no noise (“Do not use nls on artificial 'zero-residual' data”)Integrating monoclonal peaks under the \(\gamma\) baseline (or \(\beta\)) is unlikely to be a one-size-fits all approach and may require application of a number of strategies to get meaningful results.
Peak integration will require human adjudication.
While most monoclonal peaks show little skewing, better fitting is likely to be obtained with distributions that afford some skewing.
MASSFIX may soon make this entire discussion irrelevant.
Parting Thought
On the matter of fitting
In bringing many sons and daughters to glory, it was fitting that God, for whom and through whom everything exists, should make the pioneer of their salvation perfect through what he suffered.
Heb 2:10
]]>Sometimes we might want to compare three or four tube types for a particular analyte on a group of patients or we might want to see if a particular analyte is stable over time in aliqioted samples. In these experiments are essentially doing the multivariable analogue of the paired t-test. In the tube-type experiment, the factor that is differing between the ('paired') groups is the container: serum separator tubes (SST), EDTA plasma tubes, plasma separator tubes (PST) etc. In a stability experiment, the factor that is differing is storage duration.
Since this is a fairly common clinical lab experiment, I thought I would just jot down how this is accomplished in R – though I must confess I know just about \(\lim_{x\to0}x\) about statistics. In any case, the statistical test is a repeated-measures ANOVA and this is one way to do it (there are many) including an approach to the post-hoc testing.
I'm going to make some fake data. I tried to dig up the data from an experiment I did as a resident but alas, I think the raw data died on an old laptop. But fake data will do for demonstration purposes. Let's suppose we are looking at parathyroid hormone (PTH) in three different vacutainer tubes: SST, EDTA and PST. For the sake of argument, let's say that we collect samples from 20 patients simultaneously and we anlayze them all as per our usual process. This means that each patient has three samples of material that should be otherwise identical outside of the effects of the collection contained.
library(magrittr) set.seed(100) #to force the same pseudo-random each time #data in pmol/L #induce some heteroscedastic error SST <- runif(20,3,50) PST <- 1.03*SST + rnorm(20,0,0.1)*SST #set the data up to show no difference EDTA <- 1.15*SST + rnorm(20,0,0.1)*SST #set the data up to show a difference tube.data <- data.frame(SST,PST,EDTA) %>% round(.,1) tube.data <- data.frame(Subject = factor(1:20), tube.data)
This is the way we usually express (and receive) data like this in an Excel spreadsheet:
Subject | SST | PST | EDTA |
---|---|---|---|
1 | 17.5 | 18.1 | 19.9 |
2 | 15.1 | 15.7 | 20.0 |
3 | 29.0 | 29.2 | 32.9 |
4 | 5.7 | 6.2 | 6.4 |
5 | 25.0 | 26.1 | 27.0 |
6 | 25.7 | 26.4 | 29.0 |
7 | 41.2 | 40.8 | 48.1 |
8 | 20.4 | 22.1 | 24.3 |
9 | 28.7 | 26.9 | 36.0 |
10 | 11.0 | 13.9 | 13.7 |
11 | 32.4 | 31.9 | 36.9 |
12 | 44.5 | 49.2 | 57.4 |
13 | 16.2 | 17.1 | 15.7 |
14 | 21.7 | 24.1 | 26.3 |
15 | 38.8 | 36.8 | 42.6 |
16 | 34.4 | 34.0 | 44.2 |
17 | 12.6 | 12.1 | 14.1 |
18 | 19.8 | 20.9 | 25.4 |
19 | 19.9 | 18.2 | 23.0 |
20 | 35.4 | 37.4 | 34.1 |
This Excel-ish way of storing the data is referred to as the “datawide” format for obvious reasons.
As it turns out this is not the way that we want to store data to do the statistical analyses of interest. What we want to do is have the tube type in a single column because this is the factor that is different within the subjects. We want to gather()
or melt()
the data (depending on your package of choice) to be like so:
library(tidyr) tube.data.2 <- gather(tube.data, key = "Subject") tube.data.2 %>% kable()
Subject | Subject | value |
---|---|---|
1 | SST | 17.5 |
2 | SST | 15.1 |
3 | SST | 29.0 |
4 | SST | 5.7 |
5 | SST | 25.0 |
6 | SST | 25.7 |
7 | SST | 41.2 |
8 | SST | 20.4 |
9 | SST | 28.7 |
10 | SST | 11.0 |
11 | SST | 32.4 |
12 | SST | 44.5 |
13 | SST | 16.2 |
14 | SST | 21.7 |
15 | SST | 38.8 |
16 | SST | 34.4 |
17 | SST | 12.6 |
18 | SST | 19.8 |
19 | SST | 19.9 |
20 | SST | 35.4 |
1 | PST | 18.1 |
2 | PST | 15.7 |
3 | PST | 29.2 |
4 | PST | 6.2 |
5 | PST | 26.1 |
6 | PST | 26.4 |
7 | PST | 40.8 |
8 | PST | 22.1 |
9 | PST | 26.9 |
10 | PST | 13.9 |
11 | PST | 31.9 |
12 | PST | 49.2 |
13 | PST | 17.1 |
14 | PST | 24.1 |
15 | PST | 36.8 |
16 | PST | 34.0 |
17 | PST | 12.1 |
18 | PST | 20.9 |
19 | PST | 18.2 |
20 | PST | 37.4 |
1 | EDTA | 19.9 |
2 | EDTA | 20.0 |
3 | EDTA | 32.9 |
4 | EDTA | 6.4 |
5 | EDTA | 27.0 |
6 | EDTA | 29.0 |
7 | EDTA | 48.1 |
8 | EDTA | 24.3 |
9 | EDTA | 36.0 |
10 | EDTA | 13.7 |
11 | EDTA | 36.9 |
12 | EDTA | 57.4 |
13 | EDTA | 15.7 |
14 | EDTA | 26.3 |
15 | EDTA | 42.6 |
16 | EDTA | 44.2 |
17 | EDTA | 14.1 |
18 | EDTA | 25.4 |
19 | EDTA | 23.0 |
20 | EDTA | 34.1 |
Now we see that there is a column for tube-type and a column for the PTH results which we can name accordingly. You can see why this called the “datalong” format.
names(tube.data.2) <- c("Subject", "Tube.Type", "PTH") tube.data.2$Tube.Type <- as.factor(tube.data.2$Tube.Type) #turns tube type into factor
Summarize the data:
summary(tube.data)
## Subject SST PST EDTA ## 1 : 1 Min. : 5.70 Min. : 6.20 Min. : 6.40 ## 2 : 1 1st Qu.:17.18 1st Qu.:17.85 1st Qu.:19.98 ## 3 : 1 Median :23.35 Median :25.10 Median :26.65 ## 4 : 1 Mean :24.75 Mean :25.36 Mean :28.85 ## 5 : 1 3rd Qu.:32.90 3rd Qu.:32.42 3rd Qu.:36.23 ## 6 : 1 Max. :44.50 Max. :49.20 Max. :57.40 ## (Other):14
Let's just have a quick look graphically:
library(mcr) plot(mcreg(SST, EDTA, method.reg = "PaBa", mref.name = "SST", mtest.name = "EDTA"))
plot(mcreg(SST, PST, method.reg = "PaBa", mref.name = "SST", mtest.name = "PST"))
And as a boxplot with the points overtop:
boxplot(PTH ~ Tube.Type, data = tube.data.2, col = c("purple", "lightgreen", "gold")) stripchart(PTH ~ Tube.Type, vertical = TRUE, data = tube.data.2, method = "jitter", add = TRUE, pch = 20, col = rgb(0,0,0,0.5))
Now we want to make comparisons to see if these are different. To accomplish this, we will use the aov()
function. This requires us to have data formatted “datalong” as it is in the tube.data.2
dataframe.
fit <- aov(PTH ~ Tube.Type + Error(Subject/Tube.Type), data=tube.data.2)
If you are like me, this syntax is confusing. But it goes like this. PTH
is a function of Tube.Type
which is straight forward–hence the PTH ~ Tube.Type
bit. The error term has the Subject
in front of the /
and the factor that is different within the subjects (Tube.Type
) after the /
. That's my grade 2 explanation from reading this and this and this.
summary(fit)
## ## Error: Subject ## Df Sum Sq Mean Sq F value Pr(>F) ## Residuals 19 7307 384.6 ## ## Error: Subject:Tube.Type ## Df Sum Sq Mean Sq F value Pr(>F) ## Tube.Type 2 195.9 97.97 22.47 3.63e-07 *** ## Residuals 38 165.7 4.36 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This tells us that there is a difference between the groups but it does not specify where the difference is.
Sorry – I just had to make a pop-culture reference to this. We want to be specific about where the differences are without making a Type I error which might arise if we blindly charge ahead and do multiple paired t-tests. One easy way to accomplish this is to use the pairwise.t.test()
function which does corrections for multiple comparisons. You can choose from a number of approaches for adjustment for pairwise comparison. This requires the “response vector” which is PTH and the “grouping factor” which is the tube type.
# choices for p.adjust.method are: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none") pwt <- pairwise.t.test(tube.data.2$PTH, tube.data.2$Tube.Type, p.adj = "bonferroni", paired = TRUE) pwt
## ## Pairwise comparisons using paired t tests ## ## data: tube.data.2$PTH and tube.data.2$Tube.Type ## ## EDTA PST ## PST 0.00083 - ## SST 7.9e-05 0.35033 ## ## P value adjustment method: bonferroni
This is pretty easy to understand. There are statistically significant differences found between the EDTA and PST (p = 0.00083) and the EDTA and PST (p = 0.00008) but none between SST and PST (p = 0.35033).
Non-statistician's approach to tube-type comparisons, which is also applicable to analyte stability studies. This is a one-way repeated measures ANOVA with one within-subjects factor. There is a great deal more to say on the matter by people who know much more in the citations in the links provided above.
God probably uses datawide format
All the nations will be gathered before him, and he will separate the people one from another as a shepherd separates the sheep from the goats. He will put the sheep on his right and the goats on his left.
(Matt 25:32-33)
]]>From time to time I have wanted to bring an online table into an R dataframe. While in principle, the data can be cut and paste into Excel, sometimes the table is very large and sometimes the columns get goofed up in the process. Fortunately, there are a number of R tools for accomplishing this. I am just going to show one approach using the rvest package. The rvest package also makes it possible to interact with forms on webpages to request specific material which can then be scraped. I think you will see the potential if you look here.
In our (simple) case, we will apply this process to Westgard's desirable assay specifications as shown on his website. The goal is to parse out the biological variation tables, get them into a dataframe and the write to csv or xlsx.
The first thing to do is to load the rvest
and httr
packages and define an html session with the html_session()
function.
library(rvest) library(httr) wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("LabRtorian"))
Now looking at the webpage, you can see that there are 8 columns in the tables of interest. So, we will define an empty dataframe with 8 columns.
#define empty table to hold all the content biotable = data.frame(matrix(NA,0, 8))
We need to know which part of the document to scrape. This is a little obscure, but following the instructions in this post, we can determine that the xpaths we need are:
/html/body/div[1]/div[3]/div/main/article/div/table[1]
/html/body/div[1]/div[3]/div/main/article/div/table[2]
/html/body/div[1]/div[3]/div/main/article/div/table[3]
…
etc.
There are 8 such tables in the whole webpage. We can define a character vector for these as such:
xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")
Now we make a loop to scrape the 8 tables and with each iteration of the loop, append the scraped subtable
to the main dataframe called biotable
using the rbind()
function. We have to use the parameter fill = TRUE
in the html_table()
function because the table does not happen to always a uniform number of columns.
for (j in 1:8){ subtable <- wg %>% read_html() %>% html_nodes(xpath = xpaths[j]) %>% html_table(., fill = TRUE) subtable <- subtable[[1]] biotable <- rbind(biotable,subtable) }
Now that we have the raw data out, we can have a quick look at it:
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
---|---|---|---|---|---|---|---|
Analyte | Number of Papers | Biological Variation | Biological Variation | Desirable specification | Desirable specification | Desirable specification | |
Analyte | Number of Papers | CVI | CVg | I(%) | B(%) | TE(%) | |
S- | 11-Desoxycortisol | 2 | 21.3 | 31.5 | 10.7 | 9.5 | 27.1 |
S- | 17-Hydroxyprogesterone | 2 | 19.6 | 50.4 | 9.8 | 13.5 | 29.7 |
U- | 4-hydroxy-3-methoximandelate (VMA) | 1 | 22.2 | 47.0 | 11.1 | 13.0 | 31.3 |
S- | 5' Nucleotidase | 2 | 23.2 | 19.9 | 11.6 | 7.6 | 26.8 |
U- | 5'-Hydroxyindolacetate, concentration | 1 | 20.3 | 33.2 | 10.2 | 9.7 | 26.5 |
S- | α1-Acid Glycoprotein | 3 | 11.3 | 24.9 | 5.7 | 6.8 | 16.2 |
S- | α1-Antichymotrypsin | 1 | 13.5 | 18.3 | 6.8 | 5.7 | 16.8 |
S- | α1-Antitrypsin | 3 | 5.9 | 16.3 | 3.0 | 4.3 | 9.2 |
We can see that we need define column names and we need to get rid of some rows containing extraneous column header information. There are actually 8 such sets of headers to remove.
table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE") names(biotable) <- table.header
Let's now find rows we don't want and remove them.
for.removal <- grep("Analyte", biotable$Analyte) biotable <- biotable[-for.removal,]
You will find that the table has missing data which is written as “- – -”. This should be now replaced by NA
and the column names should be assigned to sequential integers. Also, we will remove all the minus signs after the specimen type. I'm not sure what they add.
biotable[biotable == "---"] <- NA row.names(biotable) <- 1:nrow(biotable) biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)
Just having another look at the first 10 rows:
Sample | Analyte | NumPapers | CVI | CVG | I | B | TE |
---|---|---|---|---|---|---|---|
S | 11-Desoxycortisol | 2 | 21.3 | 31.5 | 10.7 | 9.5 | 27.1 |
S | 17-Hydroxyprogesterone | 2 | 19.6 | 50.4 | 9.8 | 13.5 | 29.7 |
U | 4-hydroxy-3-methoximandelate (VMA) | 1 | 22.2 | 47.0 | 11.1 | 13.0 | 31.3 |
S | 5' Nucleotidase | 2 | 23.2 | 19.9 | 11.6 | 7.6 | 26.8 |
U | 5'-Hydroxyindolacetate, concentration | 1 | 20.3 | 33.2 | 10.2 | 9.7 | 26.5 |
S | α1-Acid Glycoprotein | 3 | 11.3 | 24.9 | 5.7 | 6.8 | 16.2 |
S | α1-Antichymotrypsin | 1 | 13.5 | 18.3 | 6.8 | 5.7 | 16.8 |
S | α1-Antitrypsin | 3 | 5.9 | 16.3 | 3.0 | 4.3 | 9.2 |
S | α1-Globulins | 2 | 11.4 | 22.6 | 5.7 | 6.3 | 15.7 |
U | α1-Microglobulin, concentration, first morning | 1 | 33.0 | 58.0 | 16.5 | 16.7 | 43.9 |
Now examining the structure:
str(biotable)
## 'data.frame': 370 obs. of 8 variables: ## $ Sample : chr "S" "S" "U" "S" ... ## $ Analyte : chr "11-Desoxycortisol" "17-Hydroxyprogesterone" "4-hydroxy-3-methoximandelate (VMA)" "5' Nucleotidase" ... ## $ NumPapers: chr "2" "2" "1" "2" ... ## $ CVI : chr "21.3" "19.6" "22.2" "23.2" ... ## $ CVG : chr "31.5" "50.4" "47.0" "19.9" ... ## $ I : chr "10.7" "9.8" "11.1" "11.6" ... ## $ B : chr "9.5" "13.5" "13.0" "7.6" ... ## $ TE : chr "27.1" "29.7" "31.3" "26.8" ...
It's kind-of undesirable to have numbers as characters so…
#convert appropriate columns to numeric biotable[,3:8] <- lapply(biotable[3:8], as.numeric)
Using the xlsx
package, you can output the table to an Excel file in the current working directory.
library(xlsx) write.xlsx(biotable, file = "Westgard_Biological_Variation.xlsx", row.names = FALSE)
If you are having trouble getting xlsx
to install, then just write as csv.
write.csv(biotable, file = "Westgard_Biological_Variation.csv", row.names = FALSE)
You can now use the same general approach to parse any table you have web access to, no mater how small or big it is. Here is a complete script in one place:
library(httr) library(rvest) library(xlsx) wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("yournamehere")) xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]") #define empty dataframe biotable = data.frame(matrix(NA,0, 8)) #loop over the 8 html tables for (j in 1:8){ subtable <- wg %>% read_html() %>% html_nodes(xpath = xpaths[j] ) %>% html_table(., fill = TRUE) subtable <- subtable[[1]] biotable <- rbind(biotable,subtable) } table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE") names(biotable) <- table.header #remove extraneous rows for.removal <- grep("Analyte", biotable$Analyte) biotable <- biotable[-for.removal,] #make missing data into NA biotable[ biotable == "---" ] <- NA row.names(biotable) <- 1:nrow(biotable) #convert appropriate columns to numeric biotable[,3:8] <- lapply(biotable[3:8], as.numeric) #get rid of minus signs in column 1 biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE) write.xlsx(biotable, file = "Westgard_Biological_Variation.xlsx", row.names = FALSE) write.csv(biotable, file = "Westgard_Biological_Variation.csv", row.names = FALSE)
Parting Thought on Tables
You prepare a table before me in the presence of my enemies. You anoint my head with oil; my cup overflows.
(Psalm 23:5)
]]>At the AACC meeting last week, some of my friends were bugging me that I had not made a blog post in 10 months. Without getting into it too much, let's just say I can blame Cerner. Thanks also to a prod from a friend, here is an approach to a fairly common problem.
We all report calculated quantities out of our laboratories–quantities such as LDL cholesterol, non-HDL cholesterol, aldosterone:renin ratio, free testosterone, eGFR etc. How does one determine the precision (i.e. imprecision) of a calculated quantity. While earlier in my life, I might go to the trouble of trying to do such calculations analytically using the rules of error propagation, in my later years, I am more pragmatic and I'm happy to use a computational approach.
In this example, we will model the precision in calculated bioavailable testosterone (CBAT). Without explanation, I provide an R function for CBAT (and free testosterone) where testosterone is reported in nmol/L, sex hormone binding globulin (SHBG) is reported in nmol/L, and albumin is reported in g/L. Using the Vermeulen Equation as discussed in this publication, you can calculate CBAT as follows:
cbat <- function(TT,SHBG,ALB = 43){ Kalb <- 3.6*10^4 Kshbg <- 10^9 N <- 1 + Kalb*ALB/69000 a <- N*Kshbg b <- N + Kshbg*(SHBG - TT)/10^9 c <- -TT/10^9 FT <- (-b + sqrt(b^2 - 4*a*c))/(2*a)*10^9 cbat <- N*FT return(list(free.T = FT, cbat = cbat)) }
To sanity-check this, we can use this online calculator. Taking a typical male testosterone of 20 nmol/L, an SHBG of 50 nmol/L and an albumin of 43 g/L, we get the following:
cbat(20,50)
## $free.T ## [1] 0.3273049 ## ## $cbat ## [1] 7.670319
which is confirmed by the online calculator. Because the function is vectorized, we an submit a vector of testosterone results and SHBG results and get a vector of CBAT results.
cbat(c(10,20,30), c(40,50,60))
## $free.T ## [1] 0.1738837 0.3273049 0.4661380 ## ## $cbat ## [1] 4.074926 7.670319 10.923842
We now need some precision data for the three components. However, in our lab, we just substitute 43 g/L for the albumin, so we will leave that term out of the analysis and limit our precision calculation to testosterone and SHBG. This will allow us to present the precision as surface plots as a function of total testosterone and SHBG.
We do testosterone by LC-MS/MS using Deborah French's method. In the last three months, the precision has been 3.9% at 0.78 nmol/L, 5.5% at 6.7 nmol/L, 5.2% at 18.0 nmol/L, and 6.0% at 28.2 nmol/L. We are using the Roche Cobas e601 SHBG method which, according to the package insert, has precision of 1.8% at 14.9 nmol/L, 2.1 % at 45.7 nmol/L, and 4.0% at 219 nmol/L.
cv.tt <- c(3.9, 5.5, 5.2, 6.0) conc.tt <- c(0.78, 6.7, 18.0, 28.2) tt.df <- data.frame(conc.tt,cv.tt) plot(cv.tt ~ conc.tt, data = tt.df, main = "Precision Profile of Testosterone", xlab = "Testosterone (nmol/L)", ylab = "CV Testosterone (%)", ylim = c(0,8), type = "o")
cv.shbg <- c(1.8, 2.1, 4.0) conc.shbg <- c(14.9,45.7,219) shbg.df <- data.frame(cv.shbg, conc.shbg) plot(cv.shbg ~ conc.shbg, data = shbg.df, main = "Precision Profile of SHBG", xlab = "SHBG (nmol/L)", ylab = "CV SHGB (%)", ylim = c(0,5), type = "o")
We will want to generate linear interpolations of these precision profiles. Generally, we might watnt to use non-linear regression to do this but I will just linearly interpolate with the approxfun()
function. This will allow us to just call a function to get the approximate CV at concentrations other than those for which we have data.
tt.fun <- approxfun(x = tt.df$conc.tt, y = tt.df$cv.tt) shbg.fun <- approxfun(x = shbg.df$conc.shbg, y = shbg.df$cv.shbg)
Now, if we want to know the precision of SHBG at, say, 100 nmol/L, we can just write,
shbg.fun(100)
## [1] 2.695326
to obtain our precision result.
Now let's build a grid of SHBG and total testosterone (TT) values at which we will calculate the precision for CBAT.
shbg <- seq(from = 15, to = 200, by = 5) tt <- seq(from = 1, to = 28, by = 1)
At each point on the grid, we will have to generate, say, 100000 random TT values and 100000 random SHBG values with the appropriate precision and then calculate the expected precision of CBAT at those concentrations.
Let's do this for a single pair of concentrations by way of example modelling the random analytical error as Gaussian using the rnorm()
function.
# [SHBG] = 15 nmol/L # [TT] = 5.0 nmol/L set.seed(100) #just to get consistent results rng.tt <- rnorm(100000, mean = 5.0, sd = tt.fun(5.0)/100*5.0) rng.shbg <- rnorm(100000, mean = 15, sd = shbg.fun(15)/100*15) rng.cbat <- cbat(rng.tt, rng.shbg) cv.cbat <- sd(rng.cbat$cbat)/mean(rng.cbat$cbat)*100 cv.cbat
## [1] 5.30598
So, we can build the process of calculating the CV of CBAT into a function as follows:
cbat.cv <- function(TT, SHBG, N = 100000){ rng.tt <- rnorm(N, mean = TT, sd = tt.fun(TT)/100*TT) rng.shbg <- rnorm(N, mean = SHBG, sd = shbg.fun(SHBG)/100*SHBG) rng.cbat <- cbat(rng.tt, rng.shbg) cv <- sd(rng.cbat$cbat)/mean(rng.cbat$cbat)*100 return(cv) }
Now, we can make a matrix of the data for presenting a plot, calculating the CV and appending it to the dataframe.
cv.grid <- expand.grid(tt, shbg) names(cv.grid) <- c("tt", "shbg") cv.grid$cv.cbat <- mapply(cbat.cv, cv.grid$tt, cv.grid$shbg)
Now make plot using the wireframe()
function.
library(lattice) wireframe(cv.cbat ~ tt*shbg, data = cv.grid, xlab = "Testo \n (nmol/L)", ylab = "SHBG \n (nmol/L)", zlab = "CV \n (%)", drape = TRUE, colorkey = TRUE, col.regions = colorRampPalette(c("blue", "red", "yellow"))(100), scales = list(arrows=FALSE,cex=.5,tick.number = 10) )
This shows us that the CV of CBAT ranges from about 4–8% over the TT and SHBG ranges we have looked at.
We have determined the CV of calculated bioavailable testosterone using random number simulations using empirical CV data and produced a surface plot of CV. This allows us to comment on the CV of this lab reportable as a function of the two variables by which it is determined.
Parting Thought on Monte Carlo Simulations
The die is cast into the lap, but its every decision is from the LORD.
(Prov 16:33)
]]>There are a few ways to approach the problem of a conditionally formatted table in R. You can use the ReporteRs package's FlexTable()
function, the formattable package, or the condformat package. These allow you to produce a conditionally formatted tables in HTML. You can also use xtable package and essentially program what you want in LaTeX via the xtable()
function.
In my desire for something simple-ish, I am going do this graphically using the image()
function as suggested here. The benefit is that I can then push the table into an RMarkdown generated PDF document easily.
Suppose that you want to prepare a summary of how resident and medical student orders are placed on various wards. You obtain data that is formatted in the following manner.
head(orders,10)
## ward order.type cosigned ## 1 Med CPOE TRUE ## 2 Med Written FALSE ## 3 Med CPOE TRUE ## 4 Med Written TRUE ## 5 Med Written TRUE ## 6 Med Written TRUE ## 7 Med CPOE TRUE ## 8 Med CPOE FALSE ## 9 Med CPOE TRUE ## 10 Med CPOE TRUE
There are 4 wards: medicine, surgery, ER and orthopedics. Orders can come in as computerized physician order entry (CPOE), verbal or written. The orders have to be cosigned by staff and this is recorded as TRUE/FALSE because staff are not always compliant in logging on to the EMR to cosign the trainee orders.
str(orders)
## 'data.frame': 550 obs. of 3 variables: ## $ ward : Factor w/ 4 levels "Med","Surg","ER",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ order.type: Factor w/ 3 levels "CPOE","Verbal",..: 1 3 1 3 3 3 1 1 1 1 ... ## $ cosigned : logi TRUE FALSE TRUE TRUE TRUE TRUE ...
summary(orders)
## ward order.type cosigned ## Med :150 CPOE :291 Mode :logical ## Surg:100 Verbal : 54 FALSE:195 ## ER :200 Written:205 TRUE :355 ## Orth:100 NA's :0
Let's start with the assumption that we want to apply the same conditional formatting to all data in the table. That is, we want to color code all results with the same algorithm. We can used the image()
function to get this done. Let's display the rates at which different order types (CPOE, verbal,or written) from the four wards. We can generate the proportions table in percent very easily with the prop.table()
and table()
functions operating on the first two columns of our orders
data:
my.data <- round(prop.table(table(orders[,1:2]),1)*100,1) my.data
## order.type ## ward CPOE Verbal Written ## Med 49.3 8.7 42.0 ## Surg 30.0 4.0 66.0 ## ER 89.5 7.0 3.5 ## Orth 8.0 23.0 69.0
The image()
function produces a tile plot based on matrix of z values, where z = f(x,y) using colours we can define and thresholds for switching from one colour to the next based on a breaks
parameter. In our case, we will say that if the result is less than equal to 25%, we will colour the tile blue, if it is greater than 25% but less than or equal to 50%, we will colour it red, and if it greater than 50%, it will be yellow.
You will note that we have to transpose the data with the t()
function because the image function plots the rows on the x axis on the columns on the y axis. You will also notice that we need to plot y descending on the y-axis to account for the fact that our tabular data has increasing index going down but the tile plot will default to have increasing y going up. We can also need to suppress the axes and their labels. The reader can comment out the lines xaxt = 'n'
and yaxt = 'n'
to see what is going on in terms of x and y values.
x = 1:ncol(my.data) y = 1:nrow(my.data) centers <- expand.grid(y,x) #make the plot margins a little bigger par(mar = c(2,7,4,2)) image(x, y, t(my.data), col = c(rgb(0,0,1,0.3),rgb(1,0,0,0.3), rgb(1,1,0,0.3)), breaks = c(0, 25, 50, 100), xaxt = 'n', yaxt = 'n', xlab = '', ylab = '', ylim = c(max(y) + 0.5, min(y) - 0.5) )
Now we can write our values over top with the text()
function.
text(centers[,2], centers[,1], c(my.data), col= "black")
And then we can write the variable names (which we yank from the attributes of the table) into the figure margin and draw some lines to make it look pretty. It was necessary to use the adj
and padj
parameters to make it look a little cleaner.
#add margin text mtext(paste(attributes(my.data)$dimnames[[2]],"(%)"), at=1:ncol(my.data), padj = -1) mtext(attributes(my.data)$dimnames[[1]], at=1:nrow(my.data), side = 2, las = 1, adj = 1.2) #add black lines abline(h=y + 0.5) abline(v=x + 0.5)
Now, if you want to make the text colour match the background colour, we will need a little function.
color.picker <- function(z){ if(z <= 25){return("blue")} else if( z > 25 & z <= 50){return("red")} else {return("darkorange4")} }
and then apply it over the values of the matrix:
text.cols <- sapply(c(my.data), color.picker) text(centers[,2], centers[,1], c(my.data), col= text.cols)
Now suppose you wanted different conditional formatting for each column. This is kind of a pain because you will need to provide the image()
function a matrix to generate an appropriate fill-colour and a different matrix for the data to be written in each cell. Let's imagine for example that we want to include the compliance rate for co-signing in a fourth column and this is the only column we want coloured. To this column we want a colour scheme applied wherein if compliance is less than or equal to 20%, the colour is red, between 20% and 80%, it is yellow, and above 80% it is green.
We can calculate a proportions table based on columns 1 and 3 of the orders
dataframe and then we can define a matrix fill.data
that has NA
on all the rates we calculated above.
my.data <- cbind(my.data,Cosigned = round(prop.table(table(orders[,c(1,3)]),1)*100,1)[,2]) fill.data <- my.data fill.data[,1:3] <- matrix(NA, nrow = nrow(my.data), ncol = ncol(my.data) - 1)
Now the proportions matrix is as follows:
my.data
## CPOE Verbal Written Cosigned ## Med 49.3 8.7 42.0 75.3 ## Surg 30.0 4.0 66.0 52.0 ## ER 89.5 7.0 3.5 88.5 ## Orth 8.0 23.0 69.0 13.0
and the fill data is:
fill.data
## CPOE Verbal Written Cosigned ## Med NA NA NA 75.3 ## Surg NA NA NA 52.0 ## ER NA NA NA 88.5 ## Orth NA NA NA 13.0
Now we can apply the image()
function to the fill.data
matrix. When it comes to writing the data in the cells, we will use the original my.data
matrix and we will adjust out color.picker()
function.
color.picker <- function(z){ if(is.na(z)){return("black")} else if(z <= 20){return("red")} else if( z > 20 & z <= 80){return("darkorange4")} else {return("darkgreen")} } x = 1:ncol(my.data) y = 1:nrow(my.data) centers <- expand.grid(y,x) par(mar = c(2,7,4,2)) image(x, y, t(fill.data), col = c(rgb(1,0,0,0.3),rgb(1,1,0,0.3), rgb(0,1,0,0.3)), breaks = c(0, 20, 80, 100), xaxt='n', yaxt='n', xlab='', ylab='', ylim = c(max(y) + 0.5, min(y) - 0.5) ) #write in values text.cols <- sapply(c(fill.data), color.picker) text(centers[,2], centers[,1], format(c(my.data),nsmall = 1), col= text.cols) #add margin text mtext(paste(attributes(my.data)$dimnames[[2]],"(%)"), at=1:ncol(my.data), padj = -1) mtext(attributes(my.data)$dimnames[[1]], at=1:nrow(my.data), side = 2, las = 1, adj = 1.2) #add black lines abline(h=y + 0.5) abline(v=x + 0.5)
So, it looks like this could become super–awkward if we had elaborate conditions to apply. This is where a packages like condformat and formattable come in handy. If you use the condformat package, you can include the table in an RMarkdown generated PDF or HTML document. However, the formattable()
function, though capable of much prettier output, does not work with PDFs generated using RMarkdown.
First, here is a condformat example. Suppose we wanted to colourized CPOE in shades of green because CPOE is more operationally desirable and verbal/written orders in shades of red because they are less operationally desirable. We also want the red/yellow/green formatting in the Cosigned column. Using condformat we could do the following:
library(condformat) my.data <- as.data.frame(my.data) color.picker <- function(z){ if(is.na(z)){return(0)} else if(z <= 20){return(1)} else if( z > 20 & z <= 80){return(2)} else {return(3)} } condformat(my.data) + rule_fill_gradient(CPOE, low = rgb(1,1,1), high = rgb(0,1,0)) + rule_fill_gradient(Verbal, low = rgb(1,1,1), high = rgb(1,0,0)) + rule_fill_gradient(Written, low = rgb(1,1,1), high = rgb(1,0,0)) + rule_fill_discrete(Cosigned, expression = sapply(Cosigned, color.picker),colours=c("0" = "white", "1" = "red", "2" = "yellow", "3" = "lightgreen"))
CPOE | Verbal | Written | Cosigned | |
---|---|---|---|---|
1 | 49.3 | 8.7 | 42.0 | 75.3 |
2 | 30.0 | 4.0 | 66.0 | 52.0 |
3 | 89.5 | 7.0 | 3.5 | 88.5 |
4 | 8.0 | 23.0 | 69.0 | 13.0 |
You can see that the rownames are suppressed with condformat()
. You could circumvent this by putting the rownames into their own column. This package is pretty easy to use and with PDF rendering (shown below) it produces something more LaTeX-ish than what is shown above which was generated straight to HTML.
For something more attractive looking, here is an example of something similar using the formattable package (borrowing heavily from the code author's examples ):
library(formattable) color.picker <- function(z){ if(is.na(z)){return("black")} else if(z <= 20){return("red")} else if( z > 20 & z <= 80){return("darkorange")} else {return("darkgreen")} } bg.picker <- function(z){ if(is.na(z)){return("black")} else if(z <= 20){return("pink")} else if( z > 20 & z <= 80){return("yellow")} else {return("lightgreen")} } my.data <- as.data.frame(my.data) formattable(my.data, list( CPOE = color_tile("white", "green"), Verbal = color_tile("white", "red"), Written = color_tile("white", "red"), Cosigned = formatter("span", style = x ~ style(display = "block", "border-radius" = "4px", "padding-right" = "4px", color = sapply(x,color.picker), "background-color" = sapply(x,bg.picker)), x ~ sprintf("%.2f (rank: %02d)", x, rank(-x))) ))
CPOE | Verbal | Written | Cosigned | |
---|---|---|---|---|
Med | 49.3 | 8.7 | 42.0 | 75.30 (rank: 02) |
Surg | 30.0 | 4.0 | 66.0 | 52.00 (rank: 03) |
ER | 89.5 | 7.0 | 3.5 | 88.50 (rank: 01) |
Orth | 8.0 | 23.0 | 69.0 | 13.00 (rank: 04) |
I hope that this points you in the right direction.
And as for conditions:
“If you declare with your mouth, “Jesus is Lord,” and believe in your heart that God raised him from the dead, you will be saved.”
Romans 10:9
]]>In two previous posts, I discussed visualizing your turnaround times (TATs). These posts are here and here. One other nice way to visualize your TAT is by means of a heatmap. In particular, we would like to look at the TAT for every hour of the week in a single figure. This manner of dataviz bling seems to be particularly attractive to managers because it costs you $0 to do this with R, but with commercial tools like Tableau, you'd have to pay a fortune and, as with Excel, your report would not be readily reproducible. Further, to make it autogenerate a PDF would mean you had to fork out more money for a report-generation module. Pffft.
We're going to read in a year's worth of order times and result times for a stat immunoassay test offered to a particular ward. The data, as I've formatted it, has two columns, ord and res.
test.data <- read.csv("test_data.csv") head(test.data)
## ord res ## 1 2015-01-01 13:24:00 2015-01-01 14:29:00 ## 2 2015-01-01 06:16:00 2015-01-01 07:43:00 ## 3 2015-01-01 06:32:00 2015-01-01 07:43:00 ## 4 2015-01-01 06:32:00 2015-01-01 07:43:00 ## 5 2015-01-01 12:12:00 2015-01-01 13:13:00 ## 6 2015-01-01 12:12:00 2015-01-01 13:13:00
Now, of course, we want to look at data collected from a long period of time so that we can be sure that the observations we are not simply an artifact of recent instrument downtime, maintenance, or whoever happened to be running the instrument. This is why I chose a year's worth of data. We are going to visualize the median order-to-file TAT for this test.
To calculate the hourly medians, we'll need to be able to label every TAT with the day it was run and the hour in the day it was run. This is pretty easy with the lubridate package. We'll do three things:
difftime()
function to calculate the TATswday()
function to determine which day of the week the specimen was run onformat()
function.library("dplyr") library("lubridate") library("fields") library("magrittr") test.data$ord <- ymd_hms(test.data$ord) test.data$res <- ymd_hms(test.data$res) test.data <- mutate(test.data,otf = difftime(res,ord,units="min")) test.data <- mutate(test.data,dow = wday(ord)) test.data <- mutate(test.data,hod = as.numeric(format(test.data$ord, "%H")))
And now the data will look like this:
head(test.data)
## ord res otf dow hod ## 1 2015-01-01 13:24:00 2015-01-01 14:29:00 65 mins 5 13 ## 2 2015-01-01 06:16:00 2015-01-01 07:43:00 87 mins 5 6 ## 3 2015-01-01 06:32:00 2015-01-01 07:43:00 71 mins 5 6 ## 4 2015-01-01 06:32:00 2015-01-01 07:43:00 71 mins 5 6 ## 5 2015-01-01 12:12:00 2015-01-01 13:13:00 61 mins 5 12 ## 6 2015-01-01 12:12:00 2015-01-01 13:13:00 61 mins 5 12
where the order-to-file TAT is in the otf column, the day-of-week is in the dow column and the hour-of-day is in the hod column. Now we can cycle though the days of the week and the hours of the day and calculate the year's median TAT for each hour, storing it in a matrix:
#prepare an empty matrix heat.data <- matrix(rep(NA,7*24),nrow = 7, ncol = 24) #loop over the days and hours and calculate the median TAT for(i in 1:7){ for(j in 0:23){ heat.data[i,j+1] <- subset(test.data, test.data$dow==i & test.data$hod==j)$otf %>% median } }
There are many ways to make the heatmap but I am particularly fond of the appearance of surface plots made with the fields package.
image.plot(1:7,seq(from=0.5, to=23.5, by = 1),heat.data,axes=FALSE, xlab = "Day of Week", ylab = "Hour of Day", ylim=c(0,24)) # the following pointless command is necessary to make the custom axis labels non-transparent # google revealed this among a number of other workarounds. points(0,0) # now these will display properly axis(side=1, at=1:7, labels=as.character(wday(1:7, label=TRUE)), las=2, cex.axis = 0.8) axis(side=2, at= 0:24, labels=0:24, las=1, cex.axis=0.8)
We can see that there is a morning slowdown that is particularly bad on Saturday. But what if we wanted to know the exact value for these eye-catching problem times? We'd have trouble, unless we overlaid some text.
It turns out that if you use white printing, you can't read the numbers when the background colour is yellow and green. There is a 64 colour gradient used in the image.plot()
function, so I calculated which integers in 0–64 were the problem and found the TATs that would correspond. It turned out that colours 20–45 out of the 64 colours in the gradient are the problem. By this means, I can make the printing black over the yellows and greens but white everywhere else:
image.plot(1:7,seq(from=0.5, to=23.5, by = 1),heat.data,axes=FALSE, xlab = "Day of Week", ylab = "Hour of Day", ylim=c(0,24)) points(0,0) #random command that resets par axis(side=1, at=1:7, labels=as.character(wday(1:7, label=TRUE)), las=2, cex.axis = 0.8) axis(side=2, at= 0:24, labels=0:24, las=1, cex.axis=0.8) # calculate the lowest and highest TAT min.z <- min(heat.data) max.z <- max(heat.data) # determine which TAT's will have yellow to green shading z.yellows <- min.z + (max.z - min.z)/64*c(20,45) # print the labels for(i in 1:7){ for(j in 1:24){ if((heat.data[i,j] > z.yellows[1])&(heat.data[i,j] < z.yellows[2])){ text(i,j-0.5,heat.data[i,j], col="black", cex = 0.8) }else{ text(i,j-0.5,heat.data[i,j], col="white", cex = 0.8) } } }
So, that is not too bad, and if you wanted to look at the 75th percentile instead you would only have to adjust the heat.data calculation as follows:
#prepare an empty matrix heat.data <- matrix(rep(NA,7*24),nrow = 7, ncol = 24) #loop over the days and hours and calculate the median TAT for(i in 1:7){ for(j in 0:23){ heat.data[i,j+1] <- subset(test.data, test.data$dow==i & test.data$hod==j)$otf %>% quantile(.,probs=0.75) } }
And this is what you will get.
Hmmm…we'd better look at Saturday morning, 6 am. I hope you have found this helpful.
And as for heat
“He will sit as a refiner and purifier of silver”
Malachi 3:3
]]>As you know in Clinical Chemistry, we are not always writing a major paper but sometimes just preparing a short-report to answer a technical question that we've encounted at work. For shorter papers, journals often have more stringent rules about how many figures you can submit and even sometimes forbid multipanelled figures. In these situations, we might want to cram a little more into your figure than we might otherwise. In a recent submission, I wanted to produce a difference plot of immunoassay results before and after storage but I also wanted to show the distribution of the results using a histogram–but this would have counted as two separate figures.
However, thanks to some fine work by Dean Attali of UBC Department of Statistics where he works with R-legend Jenny Bryan, it is quite easy to add marginal histograms to a Bland Altman (or any other scatter) plot using the ggExtra package.
Let's make some fake data for a Bland Altman plot. Let's pretend that we are measuring the same quantity by immunoassay at baseline and after 1 year of storage at -80 degrees. We'll add some heteroscedastic error and create some apparent degradation of about 20%:
set.seed(10) #make predictable random data baseline <- rlnorm(100, 0, 1) post <- 0.8*baseline + rnorm(100, 0, 0.10*baseline) plot(baseline,post) abline(lm(post ~ baseline)) abline(0, 1, col="red", lty = 2)
Or if we plot this in the ggplot()
paradigm
library(ggplot2) my.data <- data.frame(baseline, post) ggplot(my.data, aes(x=baseline, y=post)) + theme_bw() + geom_point(shape=1) + # Use hollow circles geom_smooth(method=lm) + # Add linear regression line geom_abline(slope = 1, intercept = 0, linetype = 2, colour = "red")
Now we will prepare the difference data:
diff <- (post - baseline) diffp <- (post - baseline)/baseline*100 sd.diff <- sd(diff) sd.diffp <- sd(diffp) my.data <- data.frame(baseline, post, diff, diffp)
In standard Bland Altman plots, one plots the difference between methods against the average of the methods, but in this case, the x-axis should be the baseline result, because that is the closest thing we have to the truth.
library(ggExtra) diffplot <- ggplot(my.data, aes(baseline, diff)) + geom_point(size=2, colour = rgb(0,0,0, alpha = 0.5)) + theme_bw() + #when the +/- 2SD lines will fall outside the default plot limits #they need to be pre-stated explicitly to make the histogram line up properly. #Thanks to commenter for noticing this. ylim(mean(my.data$diff) - 3*sd.diff, mean(my.data$diff) + 3*sd.diff) + geom_hline(yintercept = 0, linetype = 3) + geom_hline(yintercept = mean(my.data$diff)) + geom_hline(yintercept = mean(my.data$diff) + 2*sd.diff, linetype = 2) + geom_hline(yintercept = mean(my.data$diff) - 2*sd.diff, linetype = 2) + ylab("Difference pre and post Storage (mg/L)") + xlab("Baseline Concentration (mg/L)") #And now for the magic - we'll use 25 bins ggMarginal(diffplot, type="histogram", bins = 25)
So that is the difference plot for the absolute difference. We can also obviously do the percent difference.
diffplotp <- ggplot(my.data, aes(baseline, diffp)) + geom_point(size=2, colour = rgb(0,0,0, alpha = 0.5)) + theme_bw() + geom_hline(yintercept = 0, linetype = 3) + geom_hline(yintercept = mean(my.data$diffp)) + geom_hline(yintercept = mean(my.data$diffp) + 2*sd.diffp, linetype = 2) + geom_hline(yintercept = mean(my.data$diffp) - 2*sd.diffp, linetype = 2) + ylab("Difference pre and post Storage (%)") + xlab("Baseline Concentration (mg/L)") ggMarginal(diffplotp, type="histogram", bins = 25)
You can also do this in a non-ggplot()
paradigm using base plotting utilities as described in this R-bloggers post.
And that, friends, is a way of squishing in a histogram of your sample concentrations into your difference plot which allows you to graphically display your sampling distribution and justify whether you would use parametric or non-parametric statistics to assess the extent of loss of immunoreactivity from storage.
And speaking of scatterplots
“…then the Lord your God will restore your fortunes and have compassion on you and gather you again from all the nations where he scattered you.”
Deut 30:3
Back in 2011 I was not aware of any tool in R for Passing Bablok (PB) regression, a form of robust regression described in a series of three papers in Clinical Chemistry and Laboratory Medicine (then J Clin Chem and Biochem) available here, here and here. For reasons that are not entirely clear to me, this regression methodology is favoured by clinical chemists but seems largely ignored by other disciplines. However since reviewers clinical chemistry journals will demand the use of PB regression, it seemed expeditious to me to code it in R. This is what spawned a small project for a piece of software to do PB (and Deming and ordinary least squares) regression using a self-contained executable that could be downloaded, unzipped on a Windows Desktop and just ran. You can download here and instructions for installation and use are here and here respectively. The calculations are all done in R, the GUI is built with Python and Py-Qt4 and the executable with cx_freeze. I made it run without an installer because hospital IT often refuse to install software that has not been officially vetted and purchased. The tool was a lot more popular than I anticipated now having about 2000 downloads. In any case, maintenance, upgrades, bug fixing and dealing with operating system updates that break things (like OSX El Capitan's security policies) are no-fun so a Shiny based solution to the same problem makes a lot of sense.
Since 2011, statisticians at Roche Diagnostics programmed the mcr package for PB and Deming regression. Additionally, there is also the MethComp package and the deming package from the Mayo Clinic which both offer PB regression.
Enter Burak Bahar, a like-minded Clinical Pathologist who is currently doing a fellowship at Yale. He liked my cp-R program but he saw the need for a web-based equivalent.
Burak and his wife Ayse, also a physician, have coded a Shiny App for doing Deming, PB and least squares regression in R which is capable of producing publication quality figures and provides all the regression statistics you would need for method-validation or publication. It can also produce a regression report in PDF, Word or HTML. The dynamic duo of the Bahar-MDs deserve all credit here as my only contribution related to suggestions related to usability. This project was presented at the 2016 American Association of Clinical Chemistry meeting in Philadelphia.
The app URL is bahar.shinyapps.io/method_compare. Go to the data tab on the left and then cut and paste your data from an spreadsheet program. Shortcuts CTRL-C (copy) and CTRL-V (paste) work natively in the table. The table is pre-populated with some random data for demonstration purposes. Once your data is pasted in, click on the Plots tab and choose the Bland-Altman or Scatter Plot.
Here is an image generated with the Bahar Shiny app using method comparison data obtained from St. Paul's Hospital Laboratory in migrating from Siemens Immulite 2000 XPi to Roche Cobas e601 for Calcitonin determination. Don't worry, we did more than 33 comparison–I am just showing the low end.
Try adjusting some of the plot parameters. The figures will update in real time. Thanks to Burak and Ayse Bahar for your work!
(Dan's) Parting Thought
There are straight lines that matter a lot more than regression.
I will make justice the measuring line and righteousness the plumb line
(Isa 28:17)
As Clinical Pathologists we work hard to create laboratory developed tests (LDTs) using liquid chromatography and tandem mass spectrometry (LC-MS/MS) that are robust, repeatable, accurate and have a wider dynamic range than commercial immunoassays. In our experience, properly developed LC-MS/MS assays are much less expensive and outperform their commercial immunoassay counterparts from an analytical standpoint.
However, despite mass spectrometry's communal obsession with analytical performance of our LDTs, sometimes we overlook the matter of handling the data we generate. Unlike traditional diagnostic companies (e.g. Siemens, Roche) who take care of upload and download of patient data and results via HL7 streams to the laboratory information system (LIS), mass spectrometry companies have not yet made this a priority. This leaves us either paying out a lot of money for custom middleware solutions or manually transcribing our LC-MS/MS results.
We might naively think, “How bad can the transcription be?” but over time, it becomes painfully evident that manual transcription of result is tedious, error–prone and inefficient use of tech–time.
Many LIS vendors offer what is called a “flat-file interface”. In this case, there is no HL7 stream generated using a communication socket between instrument and LIS. Rather, the results are saved in an ASCII text file with a pre-defined format and then transferred to the LIS via a secure shell (SSH) connection.
For this post, we are going to take some sample flat files from a SCIEX API5000 triple quadrupole mass spectrometer and prepare a flat file for the SunQuest LIS. Please note that this code is provided to you as is under the GNU Public Licence and without any guarantee. You know how all the LC-MS/MS vendors say their instruments are for “research use only”? –yeah, I'm giving this to you in the same spirit. If you use or modify it, you do so at your own risk. Any changes to how your flatfile is generated by your mass spectrometer or any upgrades to your LC-MS/MS software could make this code malfunction. You have been warned.
SunQuest requires the output file to be a comma separated values (CSV) file with a unique specimen or internal QC result in each row. The first column is the instrument ID, the second columns is the specimen container ID (an E followed by a 10–digit integer), the third is testcode and the fourth is the result. The file itself is required to have a time–stamp so that it has a traceable name and should have no header. For an instrument named PAPI (short for Providence API 5000) and a testcode TES (for testosterone), the file might look like this:
PAPI,E2324434511,TES,3.12 PAPI,E2324434542,TES,8.75 PAPI,E2324434565,TES,25.34 ...
After we have completed an analytical run and reviewed all peaks to generate our fileable results, we can export the quatified sample batch to an ASCII text file. The file contains a whole lot of diagnostic information about the run like which multiple reaction monitoring (MRM) transitions we used, what the internal standard (IS) counts were, results from the quantifier and qualifier ion, fitted values for the calibrators etc. There are more than 80 columns in a typical file and we could talk about all the things we might do with this data but in this case, we are concerned with extracting and preparing the results file.
If we are actually going to make an R script usable by a human, it would be good to be able to choose which file we want to process and what test we want to extract using a simple graphical user interface (GUI). There are a number of tools one can use to build GUIs in R but the most rudimentary is TclTk. I have to confess that I find the language constructs for GUI creation both non–intuitive and boring. For this reason, I present without discussion, a modification of a recipe for creating a box with radio–buttons. We are going to choose which of three analytes (you can increase this number as you please) for which we wish to process a flat–file. These are: aldosterone, cortisol and testosterone. Please note that if you execute this code on a Mac, you will have to install XQuartz because Macs don't have native X-windows support despite the BSD Linux heritage of OSX.
library(tcltk2) #make a radiobutton widget #source for tk widget modifed from http://www.sciviews.org/recipes/tcltk/TclTk-radiobuttons/ #accessed Feb 10, 2016 win1 <- tktoplevel() win1$env$rb1 <- tk2radiobutton(win1) win1$env$rb2 <- tk2radiobutton(win1) win1$env$rb3 <- tk2radiobutton(win1) rbValue <- tclVar("Aldosterone") tkconfigure(win1$env$rb1, variable = rbValue, value = "Aldosterone") tkconfigure(win1$env$rb2, variable = rbValue, value = "Cortisol") tkconfigure(win1$env$rb3, variable = rbValue, value = "Testosterone") tkgrid(tk2label(win1, text = "Which analyte are you processing?"), columnspan = 2, padx = 10, pady = c(15, 5)) tkgrid(tk2label(win1, text = "Aldosterone"), win1$env$rb1, padx = 10, pady = c(0, 5)) tkgrid(tk2label(win1,text = "Cortisol"), win1$env$rb2, padx = 10, pady = c(0, 5)) tkgrid(tk2label(win1,text = "Testosterone"), win1$env$rb3, padx = 10, pady = c(0, 15)) onOK <- function() { rbVal <- as.character(tclvalue(rbValue)) tkdestroy(win1) } win1$env$butOK <- tk2button(win1, text = "OK", width = -6, command = onOK) tkgrid(win1$env$butOK, columnspan = 2, padx = 10, pady = c(5, 15)) tkfocus(win1) #this final line is necessary to prevent to the program from proceeding until this radio button widget has closed tkwait.window(win1)
This will give us the following pop-up window with radiobuttons in which I have selected testosterone.
You will notice that Tk windows do not appear native to the operating system. We can live with this because we are not shallow.
After you hit the OK button, the Tk widget then puts the chosen value into an Tk variable called rbValue
. We can determine the value using the command tclvalue(rbValue)
. The reason we need to know which analyte we are working with is because the name of the MRM we want to pull out of the flat file is dependent on the analyte of course. We will also need to replace results below the limit of quantitation (LoQ) with “< x”, whatever x happens to be, which will be a different threshold for each analyte.
In our case, the testcodes for aldosterone, cortisol and testosterone are ALD,CORT and TES respectively, the LoQs are 50 pmol/L, 1 nmol/L and 0.05 nmol/L and the MRM names are “Aldo 1”, “Aldo 2”, “Cortisol 1”, “Cortisol 2” and “Testo 1” and “Testo 2” as we defined them within SCIEX Analyst Software. We will use the switch()
function to define three variables (test.code
, LoQ
, and MRM.names
) which we will use later to process the flat–file. We will also define the name of the worksheet in a variable called worksheet
. These are the parameters you would have to change in order to modify the code for your purposes.
#set the testcode by test test.code <- switch(tclvalue(rbValue), "Aldosterone" = "ALD", "Testosterone" = "TES", "Cortisol" = "CORT" ) #set the LoQ by test LoQ <- switch(tclvalue(rbValue), "Aldosterone" = "<50", "Testosterone" = "<0.05", "Cortisol" = "<1" ) #set the MRM names by test MRM.names <- switch(tclvalue(rbValue), "Aldosterone" = c("Aldo 1", "Aldo 2"), "Testosterone" = c("Testo 1", "Testo 2"), "Cortisol" = c("Cortisol 1", "Cortisol 2") ) #set the worksheet name for your site worksheet <- "PAPI"
Now we will prompt the user to tell them that they are to choose an instrument flat–file and we will determine the path of the chosen file. We will need the path to both read in the appropriate file but also to write the output later.
#choose the flat file to process tkmessageBox(message="You are about to choose a flat file to process.") flat.file.path <- tk_choose.files(default = "", caption = "Select File", multi = FALSE, filters = NULL, index = 1) #determine the directory name of the chosen file flat.file.dir <- dirname(flat.file.path) #determine the file name of the chosen file flat.file.name <- basename(flat.file.path)
This code will create this message box:
and this file choice dialogue box:
and after a file is selected and the Open is pressed, the path to the flat–file is stored in the variable flat.file.path
.
So we chosen the file we want to read in but what does this file look like? To just get a gander at it, we could open it with Excel and see how it is laid out. But since we have broken up with Excel, we won't do this. SCIEX Analyst exports tab (not comma) delimited files. R has a built in function read.delim()
for reading these files but we will quickly discover that read.delim()
assumes the files have a rectangular structure, having the same number of columns in each row. R will make assumptions about the shape of the data file based on the first few rows and then try to read it in. In this case, it will fail and you will get gibberish. To get this to work for us we will need to tell R how many rows to skip before the real data starts or we will need to tell R the number of columns the file has (which is not guaranteed to be consistent between versions of vendor software). There are lots of ways to do this but I think the simplest is to use grep()
.
I did this by reading the file in with no parsing of the tabs using the readLines()
function. This function creates a vector for which each successive value is the entire content of the row of the file. I display the first 30 lines of the file. Suppose that we chose a testosterone flat file.
x <- readLines(flat.file.path) x[1:30]
## [1] "Peak Name: Testo-d3 2" ## [2] "Use as Internal Standard" ## [3] "Q1/Q3 Masses: 292.50/97.20 Da" ## [4] "" ## [5] "Peak Name: Testo 1" ## [6] "Internal Standard: Testo-d3 2" ## [7] "Q1/Q3 Masses: 289.50/97.20 Da" ## [8] "" ## [9] "Fit\tQuadratic\tWeighting\t1 / x\tIterate\tNo" ## [10] "a0\t0.00658" ## [11] "a1\t0.2" ## [12] "a2\t-0.000443" ## [13] "Correlation coefficient\t0.9999" ## [14] "Use Area" ## [15] "" ## [16] "Peak Name: Testo 2" ## [17] "Internal Standard: Testo-d3 2" ## [18] "Q1/Q3 Masses: 289.50/109.10 Da" ## [19] "" ## [20] "Fit\tQuadratic\tWeighting\t1 / x\tIterate\tNo" ## [21] "a0\t0.00359" ## [22] "a1\t0.17" ## [23] "a2\t-0.000313" ## [24] "Correlation coefficient\t0.9999" ## [25] "Use Area" ## [26] "" ## [27] "" ## [28] "" ## [29] "Sample Name\tSample ID\tSample Type\tSample Comment\tSet Number\tAcquisition Method\tAcquisition Date\tRack Type\tRack Position\tVial Position\tPlate Type\tPlate Position\tFile Name\tDilution Factor\tWeight To Volume Ratio\tSample Annotation\tDisposition\tAnalyte Peak Name\tAnalyte Units\tAnalyte Peak Area (counts)\tAnalyte Peak Area for DAD (mAU x min)\tAnalyte Peak Height (cps)\tAnalyte Peak Height for DAD (mAU)\tAnalyte Concentration (nmol/L)\tAnalyte Retention Time (min)\tAnalyte Expected RT (min)\tAnalyte RT Window (sec)\tAnalyte Centroid Location (min)\tAnalyte Start Scan\tAnalyte Start Time (min)\tAnalyte Stop Scan\tAnalyte Stop Time (min)\tAnalyte Integration Type\tAnalyte Signal To Noise\tAnalyte Peak Width (min)\tStandard Query Status\tAnalyte Mass Ranges (Da)\tAnalyte Wavelength Ranges (nm)\tArea Ratio\tHeight Ratio\tAnalyte Annotation\tAnalyte Channel\tAnalyte Peak Width at 50% Height (min)\tAnalyte Slope of Baseline (%/min)\tAnalyte Processing Alg.\tAnalyte Peak Asymmetry\tAnalyte Integration Quality\tIS Peak Name\tIS Units\tIS Peak Area (counts)\tIS Peak Area for DAD (mAU x min)\tIS Peak Height (cps)\tIS Peak Height for DAD (mAU)\tIS Concentration (nmol/L)\tIS Retention Time (min)\tIS Expected RT (min)\tIS RT Window (sec)\tIS Centroid Location (min)\tIS Start Scan\tIS Start Time (min)\tIS Stop Scan\tIS Stop Time (min)\tIS Integration Type\tIS Signal To Noise\tIS Peak Width (min)\tIS Mass Ranges (Da)\tIS Wavelength Ranges (nm)\tIS Channel\tIS Peak Width at 50% Height (min)\tIS Slope of Baseline (%/min)\tIS Processing Alg.\tIS Peak Asymmetry\tIS Integration Quality\tUse Record\tRecord Modified\tCalculated Concentration (nmol/L)\tCalculated Concentration for DAD (nmol/L)\tRelative Retention Time\tAccuracy (%)\tResponse Factor\tAcq. Start Time (min)\tInjection Volume used\t" ## [30] "Blank\t\tBlank\tBlankMPX_SAMPLE_ID:189001;Stream Number:2;Plate Code:Deep Well MTP 96 Cooled;Injection Volume:25;\t0\tTesto_DWP_S2.dam\t2/11/2013 5:06:45 PM\tDeep Well MTP 96 Cooled\t1\t1\tDeep Well MTP 96 Cooled\t2\t140305B1.wiff\t1.00\t0.00\t\t\tTesto 1\tnmol/L\t0\tN/A\t0.00e+000\tN/A\t0.00\t0.00\t1.29\t30.0\t0.00\t0\t0.00\t0\t0.00\tNo Peak\tN/A\t0.00\tN/A\t289.500/97.200 Da\tN/A\t0.00e+000\t0.00e+000\t\tN/A\t0.00\t0.00e+000\tSpecify Parameters - MQIII\t0.00\t0.00\tTesto-d3 2\tnmol/L\t158416\tN/A\t5.58e+004\tN/A\t1.00\t1.27\t1.29\t20.0\t1.28\t115\t1.18\t139\t1.43\tBase To Base\tN/A\t0.248\t292.500/97.200 Da\tN/A\tN/A\t4.36e-002\t1.06e+000\tSpecify Parameters - MQIII\t1.60\t0.956\t\t0\tN/A\tN/A\t0.00\tN/A\tN/A\t2.85\t25\t"
All of the \t
's that you see are the tabs in the file which are has read in literally when we use readLines()
. We can see that in this file nothing of use happens until line 29 but this is not consistent from file to file so we should not just assume that 29 is always the magic number where the good stuff begins. We can see that the line starting “Sample Name \t Sample ID” is the real starting point so we can determine how many lines to skip by using grep()
and prepare for some error–handling with a variable called problem
by which we can deal with the circumstance that no approriate starting row is identified.
skip.val <- grep("Sample Name\tSample ID", x, fixed = TRUE) - 1 #if no such row is found, then the wrong file has been chosen if (length(skip.val)==0){ problem <- TRUE } else { problem <- FALSE } skip.val
## [1] 28
Now that we know how many lines to skip we can read in the data:
my.data <- read.delim(flat.file.path, sep = "\t", strip.white = TRUE, skip = skip.val, header = TRUE, stringsAsFactors = FALSE)
We can have a look at the structure of this file
str(my.data)
## 'data.frame': 196 obs. of 83 variables: ## $ Sample.Name : chr "Blank" "Blank" "STD1" "STD1" ... ## $ Sample.ID : logi NA NA NA NA NA NA ... ## $ Sample.Type : chr "Blank" "Blank" "Standard" "Standard" ... ## $ Sample.Comment : chr "BlankMPX_SAMPLE_ID:189001;Stream Number:2;Plate Code:Deep Well MTP 96 Cooled;Injection Volume:25;" "BlankMPX_SAMPLE_ID:189001;Stream Number:2;Plate Code:Deep Well MTP 96 Cooled;Injection Volume:25;" "StandardMPX_SAMPLE_ID:189002;Stream Number:2;Plate Code:Deep Well MTP 96 Cooled;Injection Volume:25;" "StandardMPX_SAMPLE_ID:189002;Stream Number:2;Plate Code:Deep Well MTP 96 Cooled;Injection Volume:25;" ... ## $ Set.Number : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Acquisition.Method : chr "Testo_DWP_S2.dam" "Testo_DWP_S2.dam" "Testo_DWP_S2.dam" "Testo_DWP_S2.dam" ... ## $ Acquisition.Date : chr "2/11/2013 5:06:45 PM" "2/11/2013 5:06:45 PM" "2/11/2013 5:13:14 PM" "2/11/2013 5:13:14 PM" ... ## $ Rack.Type : chr "Deep Well MTP 96 Cooled" "Deep Well MTP 96 Cooled" "Deep Well MTP 96 Cooled" "Deep Well MTP 96 Cooled" ... ## $ Rack.Position : int 1 1 1 1 1 1 1 1 1 1 ... ## $ Vial.Position : int 1 1 13 13 25 25 37 37 49 49 ... ## $ Plate.Type : chr "Deep Well MTP 96 Cooled" "Deep Well MTP 96 Cooled" "Deep Well MTP 96 Cooled" "Deep Well MTP 96 Cooled" ... ## $ Plate.Position : int 2 2 2 2 2 2 2 2 2 2 ... ## $ File.Name : chr "140305B1.wiff" "140305B1.wiff" "140305B1.wiff" "140305B1.wiff" ... ## $ Dilution.Factor : num 1 1 1 1 1 1 1 1 1 1 ... ## $ Weight.To.Volume.Ratio : num 0 0 0 0 0 0 0 0 0 0 ... ## $ Sample.Annotation : logi NA NA NA NA NA NA ... ## $ Disposition : logi NA NA NA NA NA NA ... ## $ Analyte.Peak.Name : chr "Testo 1" "Testo 2" "Testo 1" "Testo 2" ... ## $ Analyte.Units : chr "nmol/L" "nmol/L" "nmol/L" "nmol/L" ... ## $ Analyte.Peak.Area..counts. : int 0 0 5273 3464 19412 16195 37994 32722 87815 74821 ... ## $ Analyte.Peak.Area.for.DAD..mAU.x.min. : chr "N/A" "N/A" "N/A" "N/A" ... ## $ Analyte.Peak.Height..cps. : num 0 0 1830 1300 6620 5700 13600 11400 30900 26100 ... ## $ Analyte.Peak.Height.for.DAD..mAU. : chr "N/A" "N/A" "N/A" "N/A" ... ## $ Analyte.Concentration..nmol.L. : chr "0.00" "0.00" "0.108" "0.108" ... ## $ Analyte.Retention.Time..min. : num 0 0 1.29 1.29 1.29 1.28 1.29 1.29 1.29 1.29 ... ## $ Analyte.Expected.RT..min. : num 1.29 1.29 1.29 1.29 1.29 1.29 1.29 1.29 1.29 1.29 ... ## $ Analyte.RT.Window..sec. : num 30 30 30 30 30 30 30 30 30 30 ... ## $ Analyte.Centroid.Location..min. : num 0 0 1.3 1.29 1.29 1.29 1.3 1.3 1.3 1.29 ... ## $ Analyte.Start.Scan : int 0 0 119 120 119 118 120 120 119 119 ... ## $ Analyte.Start.Time..min. : num 0 0 1.22 1.23 1.22 1.21 1.23 1.23 1.22 1.22 ... ## $ Analyte.Stop.Scan : int 0 0 137 130 137 135 135 138 141 139 ... ## $ Analyte.Stop.Time..min. : num 0 0 1.41 1.33 1.41 1.39 1.39 1.42 1.45 1.43 ... ## $ Analyte.Integration.Type : chr "No Peak" "No Peak" "Base To Base" "Base To Base" ... ## $ Analyte.Signal.To.Noise : chr "N/A" "N/A" "N/A" "N/A" ... ## $ Analyte.Peak.Width..min. : num 0 0 0.186 0.103 0.186 0.176 0.155 0.186 0.227 0.207 ... ## $ Standard.Query.Status : chr "N/A" "N/A" "N/A" "N/A" ... ## $ Analyte.Mass.Ranges..Da. : chr "289.500/97.200 Da" "289.500/109.100 Da" "289.500/97.200 Da" "289.500/109.100 Da" ... ## $ Analyte.Wavelength.Ranges..nm. : chr "N/A" "N/A" "N/A" "N/A" ... ## $ Area.Ratio : num 0 0 0.0304 0.02 0.113 0.094 0.219 0.188 0.468 0.398 ... ## $ Height.Ratio : num 0 0 0.0304 0.0216 0.108 0.0933 0.225 0.189 0.471 0.398 ... ## $ Analyte.Annotation : logi NA NA NA NA NA NA ... ## $ Analyte.Channel : chr "N/A" "N/A" "N/A" "N/A" ... ## $ Analyte.Peak.Width.at.50..Height..min. : num 0 0 0.0447 0.0447 0.046 0.0432 0.0439 0.0454 0.0438 0.045 ... ## $ Analyte.Slope.of.Baseline....min. : num 0 0 16.4 23.2 2.89 7.54 4.65 3.46 0.631 3 ... ## $ Analyte.Processing.Alg. : chr "Specify Parameters - MQIII" "Specify Parameters - MQIII" "Specify Parameters - MQIII" "Specify Parameters - MQIII" ... ## $ Analyte.Peak.Asymmetry : num 0 0 1.8 0.874 1.83 1.41 1.51 2.18 2.18 2 ... ## $ Analyte.Integration.Quality : num 0 0 0.379 0.233 0.697 0.62 0.794 0.765 0.907 0.875 ... ## $ IS.Peak.Name : chr "Testo-d3 2" "Testo-d3 2" "Testo-d3 2" "Testo-d3 2" ... ## $ IS.Units : chr "nmol/L" "nmol/L" "nmol/L" "nmol/L" ... ## $ IS.Peak.Area..counts. : int 158416 158416 173383 173383 172263 172263 173811 173811 187783 187783 ... ## $ IS.Peak.Area.for.DAD..mAU.x.min. : chr "N/A" "N/A" "N/A" "N/A" ... ## $ IS.Peak.Height..cps. : num 55800 55800 60100 60100 61200 61200 60300 60300 65700 65700 ... ## $ IS.Peak.Height.for.DAD..mAU. : chr "N/A" "N/A" "N/A" "N/A" ... ## $ IS.Concentration..nmol.L. : num 1 1 1 1 1 1 1 1 1 1 ... ## $ IS.Retention.Time..min. : num 1.27 1.27 1.27 1.27 1.27 1.27 1.28 1.28 1.28 1.28 ... ## $ IS.Expected.RT..min. : num 1.29 1.29 1.29 1.29 1.29 1.29 1.29 1.29 1.29 1.29 ... ## $ IS.RT.Window..sec. : num 20 20 20 20 20 20 20 20 20 20 ... ## $ IS.Centroid.Location..min. : num 1.28 1.28 1.28 1.28 1.28 1.28 1.28 1.28 1.28 1.28 ... ## $ IS.Start.Scan : int 115 115 117 117 115 115 117 117 118 118 ... ## $ IS.Start.Time..min. : num 1.18 1.18 1.2 1.2 1.18 1.18 1.2 1.2 1.21 1.21 ... ## $ IS.Stop.Scan : int 139 139 140 140 140 140 139 139 139 139 ... ## $ IS.Stop.Time..min. : num 1.43 1.43 1.44 1.44 1.44 1.44 1.43 1.43 1.43 1.43 ... ## $ IS.Integration.Type : chr "Base To Base" "Base To Base" "Base To Base" "Base To Base" ... ## $ IS.Signal.To.Noise : chr "N/A" "N/A" "N/A" "N/A" ... ## $ IS.Peak.Width..min. : num 0.248 0.248 0.238 0.238 0.258 0.258 0.227 0.227 0.217 0.217 ... ## $ IS.Mass.Ranges..Da. : chr "292.500/97.200 Da" "292.500/97.200 Da" "292.500/97.200 Da" "292.500/97.200 Da" ... ## $ IS.Wavelength.Ranges..nm. : chr "N/A" "N/A" "N/A" "N/A" ... ## $ IS.Channel : chr "N/A" "N/A" "N/A" "N/A" ... ## $ IS.Peak.Width.at.50..Height..min. : num 0.0436 0.0436 0.0445 0.0445 0.0435 0.0435 0.0455 0.0455 0.0451 0.0451 ... ## $ IS.Slope.of.Baseline....min. : num 1.06 1.06 1.17 1.17 1.39 1.39 1.51 1.51 1.93 1.93 ... ## $ IS.Processing.Alg. : chr "Specify Parameters - MQIII" "Specify Parameters - MQIII" "Specify Parameters - MQIII" "Specify Parameters - MQIII" ... ## $ IS.Peak.Asymmetry : num 1.6 1.6 2.19 2.19 1.77 1.77 1.88 1.88 2.18 2.18 ... ## $ IS.Integration.Quality : num 0.956 0.956 0.971 0.971 0.968 0.968 0.969 0.969 0.97 0.97 ... ## $ Use.Record : int NA NA 1 1 1 1 1 1 1 1 ... ## $ Record.Modified : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Calculated.Concentration..nmol.L. : chr "N/A" "N/A" "0.119" "0.0962" ... ## $ Calculated.Concentration.for.DAD..nmol.L.: chr "N/A" "N/A" "N/A" "N/A" ... ## $ Relative.Retention.Time : num 0 0 1.01 1.01 1.01 1.01 1.01 1.01 1.01 1.01 ... ## $ Accuracy.... : chr "N/A" "N/A" "111." "89.1" ... ## $ Response.Factor : chr "N/A" "N/A" "0.282" "0.185" ... ## $ Acq..Start.Time..min. : num 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 ... ## $ Injection.Volume.used : int 25 25 25 25 25 25 25 25 25 25 ... ## $ X : logi NA NA NA NA NA NA ...
And we see that there is lots of stuff we don't need. What we do need are the columns titled “Sample.Name” (which is the specimen container ID in this case), the “Analyte.Peak.Name” (which is the MRM, either quantifier or qualifier), and the one whose name starts with “Calculated.Concentration..”. The last of these also contains the units of measure which is analyte–dependent. To get rid of this analyte–dependence of the column name, we can find out which column this is and rename it:
conc.col.num <- grep("Calculated.Concentration..",names(my.data), fixed = TRUE) names(my.data)[conc.col.num]<- "Calculated.Concentration"
Now we can pull out the three columns of interest and put them into a dataframe named results.
#pull out the columns of interest results <- my.data[,c("Sample.Name", "Analyte.Peak.Name","Calculated.Concentration")] names(results) <- c("sampleID", "mrm", "conc")
Now we only need the quantifier ion results which we were defined by the user with Tk GUI, so we can pull them out with grep. I will pull out the qualifiers also but we do not need them unless we wanted to compute ion-ratios, for example.
#handle non-numeric results quantifiers <- results[grep(MRM.names[1], results$mrm),] quantifiers$conc <- as.numeric(quantifiers$conc)
## Warning: NAs introduced by coercion
qualifiers <- results[grep(MRM.names[2], results$mrm),] qualifiers$conc <- as.numeric(qualifiers$conc)
## Warning: NAs introduced by coercion
Having pulled out the MRM of interest, we can define which rows correspond to standards, QC and patients by appropriate use of grep()
. It happens that the CIDs all start with E followed by a 10 digit number so we can search for this pattern with a simple regular expression. Since we only need the QCs and patient data, the variable standards
is calculated only as a matter of completeness.
#separate out sample types standards <- grep("Blank|STD",quantifiers$sampleID) qc <- grep("C-", quantifiers$sampleID) #create a regular expression to identify samples (E followed by 10 digits) regexp<-"(^E)([[:digit:]]{10})" patients <-grep(pattern=regexp,quantifiers$sampleID) output.data <- quantifiers[c(qc,patients),]
Now we can prepare to write a dataframe corresponding to the required format of the output file. To do so, we'll need to find out how many rows we are writing and then prepare a vector of the same length repeating the name of the worksheet and testcode:
#prepare the final data num.rows <- length(output.data$sampleID) final.output.data <- data.frame(rep(worksheet,num.rows), output.data$sampleID, rep(test.code, num.rows), output.data$conc) names(final.output.data) <- c("worksheet","sample","test","conc")
Now we can replace all the NA values that replaced “No Peak” with the correct LoQ according to which analyte we are looking at.
#to put LOQs in, we need to convert to character #this assumes that all non numeric results are undetectable final.output.data$conc <- as.character(final.output.data$conc) final.output.data$conc[is.na(final.output.data$conc)] <- LoQ
Our final.output.data
dataframe looks like it behaved properly.
head(final.output.data,10)
## worksheet sample test conc ## 1 PAPI C-LY1LR TES 0.557 ## 2 PAPI C-LY1 TES 5.65 ## 3 PAPI C-LY2 TES 20.6 ## 4 PAPI C-LY3 TES 28.1 ## 5 PAPI C-PTES TES 0.737 ## 6 PAPI E1234083035 TES 1.04 ## 7 PAPI E1234109065 TES 14.1 ## 8 PAPI E1234086634 TES 19.2 ## 9 PAPI E1234107491 TES 13 ## 10 PAPI E1234114052 TES 18.6
And finally, we create directories to archive our data (if those directories do not exist) and write the files with an appropriate timestamp determined using Sys.time()
. Since colons (i.e : ) don't play nice in all operating systems as filenames, we can use gsub()
to get rid of them. We also pass along error messages or confirmation messages to the user as appropriate.
#If the data file happens to be empty because you selected the wrong file, abort if(nrow(final.output.data)==0){ tkmessageBox(message="Your flat file contained no patient data. Aborting file output") } else if (nrow(final.output.data)>0) { #create the output directory if it does not exist if(!dir.exists(file.path(flat.file.dir, "Processed"))){ dir.create(file.path(flat.file.dir, "Processed")) } if(!dir.exists(file.path(flat.file.dir, "Raw"))){ dir.create(file.path(flat.file.dir, "Raw")) } #create a ISO 8601 compliant timestamp - get rid of spaces and colons time.stamp <- gsub(":","", Sys.time(), fixed = TRUE) time.stamp <- gsub(" ","T", time.stamp, fixed = TRUE) #save a copy of the input file flat.file.copy.name <- paste(test.code,"_",time.stamp, "_Raw.txt", sep="") file.copy(flat.file.path, file.path(flat.file.dir,"Raw", flat.file.copy.name )) #write the final output file final.output.name <- paste(test.code,"_",time.stamp, ".txt", sep="") final.output.path <- file.path(flat.file.dir,"Processed" ,final.output.name) write.table(file = final.output.path, final.output.data, quote = FALSE, row.names = FALSE, col.names = FALSE, sep = ",") #check that the file was created as expected if(file.exists(final.output.path)){ tkmessageBox(message="Data successfully processed \n Check Processed directory") } else { tkmessageBox(message="Your file was not created. There was a problem") } }
Finally, we would wrap all of the directory–creation and file–operation in an if statement tied to the variable called problem
we created previously. You will see this in the final source–code linked below.
Now, you can easily modify this to deal with multiple anlytes that are always on the same run, such as Vitamin D2 and Vitamin D3. If you wanted to suppress results failing ion ratio criteria (which could be concentration–dependent of course) or if you had specimens unexpectedly low IS counts, you could easily censor them to prevent their upload and then review them manually. You could also append canned comments to your results with a dash between your result and the comment. In fact, you could theoretically develop very elaborate middleware for QC evaluation and interpretation. You could also use RMarkdown to generate PDF reports for the run which could include calibration curve plots, plots of quantifier results vs qualifier results, and results that fail various criteria.
You can download the source code and three example flat files here. Setting the source–code up as a “clickable” script is somewhat dependent on the operating system you are working on. Since most of you will be on a windows system you can follow this tutorial. You can also use a windows batch file to call your script.
Now that your file is generated, it is read to upload via ssh. This is usually performed manually but could be automated. Don't implement this code into routine use unless you know what you are doing and you have tested it extensively. By using and/or modifying it, you become entirely responsible for its correct operation. Excel is like a butter knife and R is like Swiss Army Knife. You must be careful with it because…
From everyone who has been given much, much will be demanded; and from the one who has been entrusted with much, much more will be asked.
Luke 12:48
]]>