August 2017 – The Lab-R-torian

Non-Linear Regression: Application to Monoclonal Peak Integration in Serum Protein Electrophoresis

August 28, 2017 dtholmes@mail.ubc.ca

Background

At the AACC meeting recently, there was an enthusiastic discussion of standardization of reporting for serum protein electrophoresis (SPEP) presented by a working group headed up by Dr. Chris McCudden and Dr. Ron Booth, both of the University of Ottawa. One of the discussions pertained to how monoclonal bands, especially small ones, should be integrated. While many use the default manual vertical gating or “drop” method offered by Sebia's Phoresis software, Dr. David Keren was discussing the value of tangent skimming as a more repeatable and effective means of monoclonal protein quantitation. He was also discussing some biochemical approaches to distinguishing monoclonal proteins from the background gamma proteins.

The drop method is essentially an eye-ball approach to where the peak starts and ends and is represented by the vertical lines and the enclosed shaded area.

plot of chunk unnamed-chunk-1

The tangent skimming approach is easier to make reproducible. In the mass spectrometry world it is a well-developed approach with a long history and multiple algorithms in use. This is apparently the book. However, when tangent skimming is employed in SPEP, unless I am mistaken, it seems to be done by eye. The integration would look like this:

plot of chunk unnamed-chunk-2
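To make the difference between the two manual methods concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the curve shapes, the gate positions, the helper name `trapz`, and the use of a straight chord between the gates as a stand-in for a true tangent line.

```r
# Synthetic gamma region: broad polyclonal background plus a narrow monoclonal spike
x <- seq(0, 1, length.out = 1001)
background <- 0.6 * exp(-(x - 0.55)^2 / (2 * 0.15^2))
spike      <- 0.3 * exp(-(x - 0.40)^2 / (2 * 0.015^2))
y <- background + spike

# Trapezoid-rule area under sampled points (hypothetical helper)
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# Hypothetical, manually chosen gates on either side of the spike
gate <- x >= 0.35 & x <= 0.45
xg <- x[gate]
yg <- y[gate]

# Drop method: everything between the gates, down to zero
area.drop <- trapz(xg, yg)

# Skim: only the area above the straight line joining the curve at the two gates
skim.line <- approx(range(xg), c(yg[1], yg[length(yg)]), xout = xg)$y
area.skim <- trapz(xg, yg - skim.line)

# Known area of the synthetic spike, for comparison
area.true <- sqrt(2 * pi) * 0.015 * 0.3
round(c(drop = area.drop, skim = area.skim, true = area.true), 4)
```

In this toy case the drop area includes a large slice of background, while the skimmed area lands close to the area of the spike itself. Both numbers still depend on where the gates are placed.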

During the discussion it was pointed out that peak deconvolution of the monoclonal protein from the background gamma might be preferable to either of the two described procedures. By this I mean integration as follows:

plot of chunk unnamed-chunk-3

There was discussion that this procedure is challenging for a number of reasons. Further, it should be noted that there will only likely be any clinical value in a deconvolution approach when the concentration of the monoclonal protein is low enough that manual integration will show poor repeatability, say < 5 g/L = 0.5 g/dL.

Easy Peaks

Fitting samples with larger monoclonal peaks is fairly easy. Fitting tends to converge nicely and produce something meaningful. For example, using the approach I am about to show below, an electropherogram like this:

plot of chunk unnamed-chunk-4

with a gamma region looking like this:

plot of chunk unnamed-chunk-5

can be deconvoluted with straightforward non-linear regression (and no baseline subtraction) to yield this:

plot of chunk unnamed-chunk-6

and the area of the green monoclonal peak is found to be 5.3%.

More Difficult Peaks

What is more challenging is the problem of small monoclonals buried in normal $\gamma$-globulins. These could be difficult to integrate using a tangent skimming approach, particularly without image magnification. For the remainder of this post we will use a gel with a small monoclonal in the fast gamma region shown at the arrow.

plot of chunk unnamed-chunk-7

Getting the Data

EP data can be extracted from the PDF output from any electrophoresis software. This is not complicated and can be accomplished with pdf2svg or Inkscape and some Linux bash scripting. I'm sure we can get it straight from the instrument but it is not obvious to me how to do this. One could also rescan a gel and use ImageJ to produce a densitometry scan which is discussed in the ImageJ documentation and on YouTube. ImageJ also has a macro language for situations where the same kind of processing is done repeatedly.

Smoothing

The scan consists of 10284 (x,y) pairs. But if you zoom in and look carefully, you find that it is a series of staircases.

plot(y~x, data = head(ep.data,100), type = "o", cex = 0.5)

plot of chunk unnamed-chunk-8

It turns out that this jaggedness significantly impairs attempts to numerically identify the peaks and valleys. So, I smoothed it a little using the handy rle() function to identify the midpoint of each step. This keeps the total area as close to its original value as possible–though this probably does not matter too much.

ep.rle <- rle(ep.data$y)
stair.midpoints <- cumsum(ep.rle$lengths) - floor(ep.rle$lengths/2)
ep.data.sm <- ep.data[stair.midpoints,]
plot(y~x, data = head(ep.data,300), type = "o", cex = 0.5)
points(y~x, data = head(ep.data.sm,300), type = "o", cex = 0.5, col = "red")

plot of chunk unnamed-chunk-9
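The midpoint arithmetic is easier to see on a tiny staircase. This is a sketch on made-up data (the `toy` names are hypothetical): `cumsum()` of the run lengths gives the last index of each step, and subtracting half the run length moves that index back to the middle of the step.

```r
# Toy staircase: each plateau repeats the same y value several times
toy <- data.frame(x = 1:10, y = c(1, 1, 1, 2, 2, 5, 5, 5, 5, 3))

toy.rle <- rle(toy$y)                    # run lengths are 3, 2, 4, 1
toy.mid <- cumsum(toy.rle$lengths) -     # last index of each run ...
  floor(toy.rle$lengths / 2)             # ... moved back to the run's middle
toy.sm <- toy[toy.mid, ]                 # one (x, y) point per plateau
toy.sm
```

Each plateau collapses to a single point, so the smoothed data has one row per step rather than one row per pixel.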

Now that we are satisfied that the new data is OK, I will overwrite the original dataframe.

ep.data <- ep.data.sm

Transformation

The units on the x and y-axes are arbitrary and come from page coordinates of the PDF. We can normalize the scan by making the x-axis go from 0 to 1 and by making the total area 1.

library(Bolstad) #A package containing a function for Simpson's Rule integration
ep.data$x <- ep.data$x/max(ep.data$x)
A.tot <- sintegral(ep.data$x,ep.data$y)$value
ep.data$y <- ep.data$y/A.tot

#sanity check
sintegral(ep.data$x,ep.data$y)$value

## [1] 1

plot(y~x, data = ep.data, type = "l")

plot of chunk unnamed-chunk-11
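If the Bolstad package is not available, the same normalization can be sketched in base R. Note that this swaps in the trapezoid rule for Simpson's Rule, so the areas will differ very slightly from those given by sintegral(). The helper `trapz` and the toy scan are assumptions for illustration only.

```r
# Trapezoid-rule area under sampled (x, y) points (hypothetical helper)
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# Toy scan in arbitrary page coordinates
toy <- data.frame(x = seq(50, 550, by = 2))
toy$y <- 300 * exp(-(toy$x - 200)^2 / (2 * 40^2)) + 20

# Same two steps as above: rescale x by its maximum, then divide y by the total area
toy$x <- toy$x / max(toy$x)
toy$y <- toy$y / trapz(toy$x, toy$y)

#sanity check: the total area should now be 1
trapz(toy$x, toy$y)
```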

Find Extrema

Using the findPeaks function from the quantmod package we can find the minima and maxima:

library(quantmod)
ep.max <- findPeaks(ep.data$y)
plot(y~x, data = ep.data, type = "l", main = "Maxima")
abline(v = ep.data$x[ep.max], col = "red", lty = 2)

plot of chunk unnamed-chunk-12

ep.min <- findValleys(ep.data$y)
plot(y~x, data = ep.data, type = "l", main = "Minima")
abline(v = ep.data$x[ep.min], col = "blue", lty = 2)

plot of chunk unnamed-chunk-12

Not surprisingly, there are some extraneous local extrema that we do not want. I simply manually removed them. Generally, this kind of thing could be tackled with more smoothing of the data prior to analysis.

ep.max <- ep.max[-1]
ep.min <- ep.min[-c(1,length(ep.min))]
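For readers without quantmod, the extrema search can be sketched in base R from sign changes of the first difference. The function names here are hypothetical. As far as I can tell from the quantmod documentation, findPeaks() and findValleys() report the index one step after each turning point, whereas this sketch returns the index of the turning point itself. On a scan with hundreds of points per region, that one-sample shift should be negligible.

```r
# Indices where the slope changes from rising to falling (local maxima)
local.maxima <- function(y) which(diff(sign(diff(y))) < 0) + 1
# Indices where the slope changes from falling to rising (local minima)
local.minima <- function(y) which(diff(sign(diff(y))) > 0) + 1

toy.y <- c(0, 1, 3, 1, 0, 2, 5, 2, 0)
local.maxima(toy.y)  # positions of the values 3 and 5
local.minima(toy.y)  # position of the interior 0
```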

Fitting

Now it's possible with the nls() function to fit the entire SPEP with a series of Gaussian curves simultaneously. It works just fine (provided you have decent initial estimates of $\mu_i$ and $\sigma_i$) but there is no particular clinical value to fitting the albumin, $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ domains with Gaussians. What is of interest is separately quantifying the two peaks in $\gamma$ with two separate Gaussians so let's isolate the $\gamma$ region based on the location of the minimum between $\beta_2$ and $\gamma$.

Isolate the $\gamma$ Region

gamma.ind <- max(ep.min):nrow(ep.data)
gamma.data <- data.frame(x = ep.data$x[gamma.ind], y = ep.data$y[gamma.ind])
plot(y ~ x, gamma.data, type  = "l")

plot of chunk unnamed-chunk-14

Attempt Something that Ultimately Does Not Work

At first I thought I could just throw two normal distributions at this and it would work. However, it does not work well at all and this kind of not-so-helpful fit turns out to happen a fair bit. I use the nls() function here which is easy to call. It requires a functional form which I set to be:

\[y = C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big) + C_2 \exp \Big({-\frac{(x-\mu_2)^2}{2\sigma_2^2}}\Big)\]

where $\mu_1$ is the $x$ location of the first peak in $\gamma$ and $\mu_2$ is the $x$ location of the second peak in $\gamma$. The estimates of $\sigma_1$ and $\sigma_2$ can be obtained by trying to estimate the full-width-half-maximum (FWHM) of the peaks, which is related to $\sigma$ by

\[FWHM_i = 2 \sqrt{2\ln2} \times \sigma_i = 2.355 \times \sigma_i\]
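The FWHM relation can be checked numerically on a clean, isolated Gaussian. This is a sketch with a made-up peak ($\sigma = 0.05$); it measures the width of the region at or above half maximum and converts it back to $\sigma$.

```r
# The conversion factor between FWHM and sigma
fwhm.factor <- 2 * sqrt(2 * log(2))
fwhm.factor  # approximately 2.355

# Synthetic, well-resolved Gaussian with known sigma
true.sigma <- 0.05
x <- seq(0, 1, by = 0.0005)
y <- exp(-(x - 0.5)^2 / (2 * true.sigma^2))

# Width of the region at or above half the peak height
above.half <- x[y >= max(y) / 2]
fwhm.est <- diff(range(above.half))

# Recover sigma from the measured FWHM
sigma.est <- fwhm.est / fwhm.factor
sigma.est
```

On real, poorly resolved peaks the half-maximum crossing on one side may be missing or distorted by the neighbouring peak, which is why a more defensive estimate is needed.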

I had to first make a little function that returns the respective half-widths at half-maximum and then uses them to estimate the $FWHM$. Because the peaks are poorly resolved, it also tries to get the smallest possible estimate returning this as FWHM2.

FWHM.finder <- function(ep.data, mu.index){
  #half-maximum crossings are the sign changes of (y - half the peak height)
  peak.height <- ep.data$y[mu.index]
  fxn.for.roots <- ep.data$y - peak.height/2
  root.indices <- which(diff(sign(fxn.for.roots))!=0)
  #the position of the peak among the sorted crossings locates the nearest crossing on each side
  tmp <- sort(c(root.indices,mu.index))
  tmp2 <- which(tmp == mu.index)
  first.root <- root.indices[tmp2 -1]
  second.root <- root.indices[tmp2]
  HWHM1 <- ep.data$x[mu.index] - ep.data$x[first.root]
  HWHM2 <- ep.data$x[second.root] - ep.data$x[mu.index]
  FWHM <- HWHM2 + HWHM1
  #for poorly resolved peaks, use twice the smaller half-width
  FWHM2 <- 2*min(c(HWHM1,HWHM2))
  return(list(HWHM1 = HWHM1,HWHM2 = HWHM2,FWHM = FWHM,FWHM2 = FWHM2))
}

The peak in the $\gamma$ region was obtained previously:

plot(y ~ x, gamma.data, type  = "l")
gamma.max <- findPeaks(gamma.data$y)
abline(v = gamma.data$x[gamma.max])

plot of chunk unnamed-chunk-16

and from them $\mu_1$ is determined to be 0.7. We have to guess where the second peak is, which is at about $x=0.75$ and has an index of 252 in the gamma.data dataframe.

gamma.data[252,]

##             x         y
## 252 0.7487757 0.6381026

#append the second peak
gamma.max <- c(gamma.max,252)
gamma.mu <- gamma.data$x[gamma.max]
gamma.mu

## [1] 0.6983350 0.7487757

plot(y ~ x, gamma.data, type  = "l")
abline(v = gamma.data$x[gamma.max])

plot of chunk unnamed-chunk-17

Now we can find the estimates of the standard deviations:

#find the FWHM estimates of sigma_1 and sigma_2:
FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.data)
gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355

The estimates of $\sigma_1$ and $\sigma_2$ are now obtained. The estimates of $C_1$ and $C_2$ are just the peak heights.

peak.heights <- gamma.data$y[gamma.max]

We can now use nls() to determine the fit.

fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) +
                  C2*exp(-(x-mean2)**2/(2 * sigma2**2))),
           data = gamma.data,
           start = list(mean1 = gamma.mu[1],
                        mean2 = gamma.mu[2],
                        sigma1 = gamma.sigma[1],
                        sigma2 = gamma.sigma[2],
                        C1 = peak.heights[1],
                        C2 = peak.heights[2]),
           algorithm = "port")
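For anyone who wants to try the call without the SPEP data, here is a self-contained sketch of the same two-Gaussian nls() fit on synthetic data. The peak parameters, the noise level and the `toy` names are all made up, and the starting values are deliberately close to the truth, which is the easy case. A little noise is added because nls() is documented not to behave well on zero-residual data.

```r
set.seed(1)
gauss <- function(x, C, mu, sigma) C * exp(-(x - mu)^2 / (2 * sigma^2))

# Broad component plus a narrow component, with a little measurement noise
toy <- data.frame(x = seq(0, 1, length.out = 400))
toy$y <- gauss(toy$x, 0.6, 0.45, 0.08) +
  gauss(toy$x, 0.3, 0.60, 0.02) +
  rnorm(nrow(toy), sd = 0.005)

toy.fit <- nls(y ~ C1 * exp(-(x - mean1)^2 / (2 * sigma1^2)) +
                 C2 * exp(-(x - mean2)^2 / (2 * sigma2^2)),
               data = toy,
               start = list(mean1 = 0.44, mean2 = 0.61,
                            sigma1 = 0.07, sigma2 = 0.025,
                            C1 = 0.55, C2 = 0.35),
               algorithm = "port")

# Fitted coefficients, to compare against the values used to simulate the data
toy.est <- coef(toy.fit)
round(toy.est, 3)
```

When the components are this distinct and the starting values this good, the fit recovers the simulated parameters. The real difficulty, as the next sections show, is a small peak on a sloping background.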

Determining the fitted values of our unknown coefficients:

dffit <- data.frame(x=seq(0, 1 , 0.001))
dffit$y <- predict(fit, newdata=dffit)

fit.sum <- summary(fit)
fit.sum #show the fitted coefficients

## 
## Formula: y ~ (C1 * exp(-(x - mean1)^2/(2 * sigma1^2)) + C2 * exp(-(x - 
##     mean2)^2/(2 * sigma2^2)))
## 
## Parameters:
##         Estimate Std. Error t value Pr(>|t|)    
## mean1  0.7094793  0.0003312 2142.23   <2e-16 ***
## mean2  0.7813900  0.0007213 1083.24   <2e-16 ***
## sigma1 0.0731113  0.0002382  306.94   <2e-16 ***
## sigma2 0.0250850  0.0011115   22.57   <2e-16 ***
## C1     0.6983921  0.0018462  378.29   <2e-16 ***
## C2     0.0819704  0.0032625   25.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01291 on 611 degrees of freedom
## 
## Algorithm "port", convergence message: both X-convergence and relative convergence (5)

coef.fit <- fit.sum$coefficients[,1]
mu.fit <- coef.fit[1:2]
sigma.fit <- coef.fit[3:4]
C.fit <- coef.fit[5:6]

And now we can plot the fitted results against the original results:

#original
plot(y ~ x, data = gamma.data, type = "l", main = "This is Garbage") 
#overall fit
lines(y ~ x, data = dffit, col ="red", cex = 0.2) 
legend("topright", lty = c(1,1,1), col = c("black", "green", "blue","red"), c("Scan", "Monoclonal", "Gamma", "Sum"))
#components of the fit
for(i in 1:2){
  x <- dffit$x
  y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2))
  lines(x,y, col = i + 2)
}

plot of chunk unnamed-chunk-22

And this is garbage. The green curve is supposed to be the monoclonal peak, the blue curve is supposed to be the $\gamma$ background, and the red curve is their sum, the overall fit. This is a horrible failure.

Subsequently, I tried fixing the locations of $\mu_1$ and $\mu_2$ but this also yielded similar nonsensical fitting. So, with a lot of messing around trying different functions like the lognormal distribution, the Bi-Gaussian distribution and the Exponentially Modified Gaussian distribution, and applying various arbitrary weighting functions, and simultaneously fitting the other regions of the SPEP, I concluded that nothing could predictably produce results that represented the clinical reality.

I thought maybe the difficulty in obtaining a reasonable fit was related to the sloping baseline, so I thought I would try to remove it. I will model the baseline in the most simplistic manner possible: as a sloped line.

Baseline Removal

I will arbitrarily define the tail of the $\gamma$ region to be those values having $y \leq 0.02$. Then I will connect the first (x,y) point of the $\gamma$ region to the tail with a straight line.

library(dplyr) #provides filter()
gamma.tail <- filter(gamma.data, y <= 0.02) 
baseline.data <- rbind(gamma.data[1,],gamma.tail)
names(baseline.data) <- c("x","y")
baseline.fun <- approxfun(baseline.data)
plot(y~x, data = gamma.data, type = "l")
lines(baseline.data$x,baseline.fun(baseline.data$x), col = "blue")

plot of chunk unnamed-chunk-24
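The same baseline construction can be tried on synthetic data where the true baseline is known. This is a base-R sketch under stated assumptions: the toy curve is a single Gaussian sitting on a perfectly linear slope, and all `toy` names are hypothetical. It uses the same idea as above: anchor a piecewise-linear baseline at the first point and at the low tail, then subtract it.

```r
# Toy gamma region: a Gaussian peak sitting on a linearly falling baseline
x <- seq(0.6, 1, length.out = 401)
toy.base <- 0.25 * (1 - x) / 0.4
toy.peak <- 0.5 * exp(-(x - 0.75)^2 / (2 * 0.03^2))
toy <- data.frame(x = x, y = toy.base + toy.peak)

# Anchor points: the first point of the region plus every point in the low tail
toy.tail <- toy[toy$y <= 0.02, ]
toy.anchors <- rbind(toy[1, ], toy.tail)

# Piecewise-linear baseline through the anchors, then subtract it
toy.base.fun <- approxfun(toy.anchors$x, toy.anchors$y)
toy$y.nobase <- toy$y - toy.base.fun(toy$x)

# Largest deviation of the baseline-subtracted curve from the known peak
max(abs(toy$y.nobase - toy.peak))
```

Here the subtraction recovers the peak almost exactly because the true baseline really is a straight line. On real scans that assumption is the weak point.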

Now we can define a new dataframe gamma.no.base that has the baseline removed:

gamma.no.base <- data.frame(x = gamma.data$x, y = gamma.data$y - baseline.fun(gamma.data$x))
plot(y~x, data = gamma.data, type = "l")
lines(y ~ x, data = gamma.no.base, lty = 2)
gamma.max <- findPeaks(gamma.no.base$y)[1:2] #rejects a number of extraneous peaks
abline(v = gamma.no.base$x[gamma.max])

plot of chunk unnamed-chunk-25

The solid black line is the original $\gamma$ region and the dashed line has the baseline removed. This becomes an easy fit.

#Estimate the Ci
peak.heights <- gamma.no.base$y[gamma.max]
#Estimate the mu_i
gamma.mu <- gamma.no.base$x[gamma.max] #the same values as before
#Estimate the sigma_i from the FWHM
FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.no.base)
gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355

#Perform the fit
fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) +
                  C2*exp(-(x-mean2)**2/(2 * sigma2**2))),
           data = gamma.no.base,
           start = list(mean1 = gamma.mu[1],
                        mean2 = gamma.mu[2],
                        sigma1 = gamma.sigma[1],
                        sigma2 = gamma.sigma[2],
                        C1 = peak.heights[1],
                        C2 = peak.heights[2]),
           algorithm = "port")

#Plot the fit
dffit <- data.frame(x=seq(0, 1 , 0.001))
dffit$y <- predict(fit, newdata=dffit)
fit.sum <- summary(fit)
coef.fit <- fit.sum$coefficients[,1]
mu.fit <- coef.fit[1:2]
sigma.fit <- coef.fit[3:4]
C.fit <- coef.fit[5:6]

plot(y ~ x, data = gamma.no.base, type = "l")
legend("topright", lty = c(1,1,1), col = c("black", "green", "blue","red"), c("Scan", "Monoclonal", "Gamma", "Sum"))
lines(y ~ x, data = dffit, col ="red", cex = 0.2)
for(i in 1:2){
  x <- dffit$x
  y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2))
  lines(x,y, col = i + 2)
}

plot of chunk unnamed-chunk-26

Lo and behold…something that is not completely insane. The green is the monoclonal, the blue is the $\gamma$ background and the red is their sum, that is, the overall fit. A better fit could now be sought with weighting or with a more flexible distribution shape. In any case, the area of the green peak is now easily determined, since

\[\int_{-\infty}^{\infty} C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big)dx = \sqrt{2\pi}\sigma_1 C_1\]

#unname() the whole product; piping only C.fit[1] into unname() leaves the "sigma1" label on the result
A.mono <- unname(sqrt(2*pi)*sigma.fit[1]*C.fit[1])
A.mono <- round(A.mono,3)
A.mono

## [1] 0.024

So this peak is 2.4% of the total area. Now, of course, this assumes that nothing under the baseline is attributable to the monoclonal peak and all belongs to normal $\gamma$-globulins, which is very unlikely to be true. However, the drop and tangent skimming methods also make assumptions about how the area under the curve contributes to the monoclonal protein. The point is to try to do something that will produce consistent results that can be followed over time. Obviously, if you thought there were three peaks in the $\gamma$-region, you'd have to set up your model accordingly.
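The closed-form Gaussian area used above can be confirmed by numerical integration. This is a sketch with made-up parameter values. The integration limits are set to $\mu \pm 10\sigma$ rather than $\pm\infty$ because integrate() can miss a narrow peak on an infinite range.

```r
# Hypothetical fitted parameters for a single Gaussian component
C.demo <- 0.3
mu.demo <- 0.7
sigma.demo <- 0.02

# Closed form: sqrt(2 * pi) * sigma * C
area.closed <- sqrt(2 * pi) * sigma.demo * C.demo

# Numerical integration over mu +/- 10 sigma
area.numeric <- integrate(
  function(x) C.demo * exp(-(x - mu.demo)^2 / (2 * sigma.demo^2)),
  lower = mu.demo - 10 * sigma.demo,
  upper = mu.demo + 10 * sigma.demo
)$value

c(closed = area.closed, numeric = area.numeric)
```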

All about that Base(line)

There are obviously better ways to model the baseline because this approach of a linear baseline is not going to work in situations where, for example, there is a small monoclonal in fast $\gamma$ dwarfed by normal $\gamma$-globulins. That is, like this:

plot of chunk unnamed-chunk-28

Something curvilinear or piecewise continuous and flexible enough for more circumstances is generally required.

There is also no guarantee that baseline removal, whatever the approach, is going to be a good solution in other circumstances. Given the diversity of monoclonal peak locations, sizes and shapes, I suspect one would need a few different approaches for different circumstances.

Conclusions

- The data in the PDFs generated by EP software are processed (probably with splining or similar) followed by the stair-stepping seen above. It would be better to work with raw data from the scanner.
  - This is particularly important if you are using nls() because nls() does not play nicely with data having no noise (“Do not use nls on artificial 'zero-residual' data”).
- Integrating monoclonal peaks under the $\gamma$ baseline (or $\beta$) is unlikely to be a one-size-fits-all problem and may require application of a number of strategies to get meaningful results.
  - Baseline removal might be helpful at times.
- Peak integration will require human adjudication.
- While most monoclonal peaks show little skewing, better fitting is likely to be obtained with distributions that afford some skewing.
- MASSFIX may soon make this entire discussion irrelevant.

Parting Thought

On the matter of fitting

In bringing many sons and daughters to glory, it was fitting that God, for whom and through whom everything exists, should make the pioneer of their salvation perfect through what he suffered.

Heb 2:10

Compare Tube Types with R – Repeated Measures ANOVA

August 21, 2017 / February 23, 2019 dtholmes@mail.ubc.ca

Background

Sometimes we might want to compare three or four tube types for a particular analyte on a group of patients, or we might want to see if a particular analyte is stable over time in aliquoted samples. In these experiments we are essentially doing the multivariable analogue of the paired t-test. In the tube-type experiment, the factor that is differing between the (‘paired’) groups is the container: serum separator tubes (SST), EDTA plasma tubes, plasma separator tubes (PST) etc. In a stability experiment, the factor that is differing is storage duration.

Since this is a fairly common clinical lab experiment, I thought I would just jot down how this is accomplished in R – though I must confess I know just about $\lim_{x\to0}x$ about statistics. In any case, the statistical test is a repeated-measures ANOVA and this is one way to do it (there are many) including an approach to the post-hoc testing.

Some Fake Data to Work With

I’m going to make some fake data. I tried to dig up the data from an experiment I did as a resident but alas, I think the raw data died on an old laptop. But fake data will do for demonstration purposes. Let’s suppose we are looking at parathyroid hormone (PTH) in three different blood collection tubes: SST, EDTA and PST. For the sake of argument, let’s say that we collect samples from 20 patients simultaneously and we analyze them all as per our usual process. This means that each patient has three samples of material that should be otherwise identical outside of the effects of the collection container.

library(magrittr)
set.seed(100) #to force the same pseudo-random each time
#data in pmol/L
#induce some heteroscedastic error
SST <- runif(20,3,50)  
PST <- 1.03*SST + rnorm(20,0,0.1)*SST #set the data up to show no difference
EDTA <- 1.15*SST + rnorm(20,0,0.1)*SST  #set the data up to show a difference
tube.data <- data.frame(SST,PST,EDTA) %>% round(.,1)
tube.data <- data.frame(Subject = factor(1:20), tube.data)

This is the way we usually express (and receive) data like this in an Excel spreadsheet:

Subject	SST	PST	EDTA
1	17.5	18.1	19.9
2	15.1	15.7	20.0
3	29.0	29.2	32.9
4	5.7	6.2	6.4
5	25.0	26.1	27.0
6	25.7	26.4	29.0
7	41.2	40.8	48.1
8	20.4	22.1	24.3
9	28.7	26.9	36.0
10	11.0	13.9	13.7
11	32.4	31.9	36.9
12	44.5	49.2	57.4
13	16.2	17.1	15.7
14	21.7	24.1	26.3
15	38.8	36.8	42.6
16	34.4	34.0	44.2
17	12.6	12.1	14.1
18	19.8	20.9	25.4
19	19.9	18.2	23.0
20	35.4	37.4	34.1

This Excel-ish way of storing the data is referred to as the “datawide” format for obvious reasons.

Gather the Grain

As it turns out this is not the way that we want to store data to do the statistical analyses of interest. What we want to do is have the tube type in a single column because this is the factor that is different within the subjects. We want to gather() or melt() the data (depending on your package of choice) to be like so:

library(tidyr)
library(knitr) #provides kable()
tube.data.2 <- gather(tube.data, key = "Subject")
tube.data.2 %>% kable()

Subject	Subject	value
1	SST	17.5
2	SST	15.1
3	SST	29.0
4	SST	5.7
5	SST	25.0
6	SST	25.7
7	SST	41.2
8	SST	20.4
9	SST	28.7
10	SST	11.0
11	SST	32.4
12	SST	44.5
13	SST	16.2
14	SST	21.7
15	SST	38.8
16	SST	34.4
17	SST	12.6
18	SST	19.8
19	SST	19.9
20	SST	35.4
1	PST	18.1
2	PST	15.7
3	PST	29.2
4	PST	6.2
5	PST	26.1
6	PST	26.4
7	PST	40.8
8	PST	22.1
9	PST	26.9
10	PST	13.9
11	PST	31.9
12	PST	49.2
13	PST	17.1
14	PST	24.1
15	PST	36.8
16	PST	34.0
17	PST	12.1
18	PST	20.9
19	PST	18.2
20	PST	37.4
1	EDTA	19.9
2	EDTA	20.0
3	EDTA	32.9
4	EDTA	6.4
5	EDTA	27.0
6	EDTA	29.0
7	EDTA	48.1
8	EDTA	24.3
9	EDTA	36.0
10	EDTA	13.7
11	EDTA	36.9
12	EDTA	57.4
13	EDTA	15.7
14	EDTA	26.3
15	EDTA	42.6
16	EDTA	44.2
17	EDTA	14.1
18	EDTA	25.4
19	EDTA	23.0
20	EDTA	34.1

Now we see that there is a column for tube-type and a column for the PTH results which we can name accordingly. You can see why this is called the “datalong” format.

names(tube.data.2) <- c("Subject", "Tube.Type", "PTH")
tube.data.2$Tube.Type <- as.factor(tube.data.2$Tube.Type) #turns tube type into factor
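The wide-to-long step can also be sketched in base R with reshape(), which names the tube-type and result columns in one call. This uses the first three subjects from the table above; the `toy` names are hypothetical. In current tidyr, gather() is superseded by pivot_longer(), which takes a similarly explicit specification.

```r
# First three subjects from the wide table
toy.wide <- data.frame(Subject = factor(1:3),
                       SST  = c(17.5, 15.1, 29.0),
                       PST  = c(18.1, 15.7, 29.2),
                       EDTA = c(19.9, 20.0, 32.9))

tubes <- c("SST", "PST", "EDTA")

# Stack the three tube columns into one PTH column, labelled by Tube.Type
toy.long <- reshape(toy.wide, direction = "long",
                    varying = tubes, v.names = "PTH",
                    timevar = "Tube.Type", times = tubes,
                    idvar = "Subject")
toy.long$Tube.Type <- factor(toy.long$Tube.Type)
rownames(toy.long) <- NULL
toy.long
```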

Visualize

Summarize the data:

summary(tube.data)

##     Subject        SST             PST             EDTA      
##  1      : 1   Min.   : 5.70   Min.   : 6.20   Min.   : 6.40  
##  2      : 1   1st Qu.:17.18   1st Qu.:17.85   1st Qu.:19.98  
##  3      : 1   Median :23.35   Median :25.10   Median :26.65  
##  4      : 1   Mean   :24.75   Mean   :25.36   Mean   :28.85  
##  5      : 1   3rd Qu.:32.90   3rd Qu.:32.42   3rd Qu.:36.23  
##  6      : 1   Max.   :44.50   Max.   :49.20   Max.   :57.40  
##  (Other):14

Let’s just have a quick look graphically:

library(mcr)
plot(mcreg(SST, EDTA,
           method.reg = "PaBa",
           mref.name = "SST",
           mtest.name = "EDTA"))

plot of chunk unnamed-chunk-6

plot(mcreg(SST, PST,
           method.reg = "PaBa",
           mref.name = "SST",
           mtest.name = "PST"))

plot of chunk unnamed-chunk-6

And as a boxplot with the points overtop:

boxplot(PTH ~ Tube.Type,
        data = tube.data.2,
        col = c("purple", "lightgreen", "gold"))
stripchart(PTH ~ Tube.Type,
           vertical = TRUE,
           data = tube.data.2, 
           method = "jitter",
           add = TRUE,
           pch = 20,
           col = rgb(0,0,0,0.5))

plot of chunk unnamed-chunk-7

Separate the Wheat from the Chaff

Now we want to make comparisons to see if these are different. To accomplish this, we will use the aov() function. This requires us to have data formatted “datalong” as it is in the tube.data.2 dataframe.

fit <- aov(PTH ~ Tube.Type + Error(Subject/Tube.Type), data=tube.data.2)

If you are like me, this syntax is confusing. But it goes like this. PTH is a function of Tube.Type, which is straightforward–hence the PTH ~ Tube.Type bit. The error term has the Subject in front of the / and the factor that is different within the subjects (Tube.Type) after the /. That’s my grade 2 explanation from reading this and this and this.

summary(fit)

## 
## Error: Subject
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 19   7307   384.6               
## 
## Error: Subject:Tube.Type
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## Tube.Type  2  195.9   97.97   22.47 3.63e-07 ***
## Residuals 38  165.7    4.36                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This tells us that there is a difference between the groups but it does not specify where the difference is.
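One way to demystify the Error() term: for a balanced design like this one, with every subject measured once in every tube, the within-subject F test should match what you get by treating Subject as an ordinary blocking factor. The sketch below regenerates the fake data under hypothetical lower-case names and compares the two F statistics. It is a check on intuition, not a replacement for the Error() syntax.

```r
set.seed(100)
sst  <- runif(20, 3, 50)
pst  <- 1.03 * sst + rnorm(20, 0, 0.1) * sst
edta <- 1.15 * sst + rnorm(20, 0, 0.1) * sst

demo.long <- data.frame(
  Subject   = factor(rep(1:20, times = 3)),
  Tube.Type = factor(rep(c("SST", "PST", "EDTA"), each = 20)),
  PTH       = round(c(sst, pst, edta), 1)
)

# Repeated-measures formulation with an error stratum per subject
fit.strata <- aov(PTH ~ Tube.Type + Error(Subject/Tube.Type), data = demo.long)
# Equivalent blocking formulation for this balanced design
fit.block <- aov(PTH ~ Tube.Type + Subject, data = demo.long)

# F statistic for Tube.Type from each formulation
F.strata <- summary(fit.strata)[[2]][[1]][["F value"]][1]
F.block  <- summary(fit.block)[[1]][["F value"]][1]
c(strata = F.strata, block = F.block)
```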

I can’t see the difference. Can you see the difference?

Sorry – I just had to make a pop-culture reference to this. We want to be specific about where the differences are without making a Type I error which might arise if we blindly charge ahead and do multiple paired t-tests. One easy way to accomplish this is to use the pairwise.t.test() function which does corrections for multiple comparisons. You can choose from a number of approaches for adjustment for pairwise comparison. This requires the “response vector” which is PTH and the “grouping factor” which is the tube type.

# choices for p.adjust.method are: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
pwt <- pairwise.t.test(tube.data.2$PTH, tube.data.2$Tube.Type, p.adj = "bonferroni", paired = TRUE)
pwt

## 
##  Pairwise comparisons using paired t tests 
## 
## data:  tube.data.2$PTH and tube.data.2$Tube.Type 
## 
##     EDTA    PST    
## PST 0.00083 -      
## SST 7.9e-05 0.35033
## 
## P value adjustment method: bonferroni

This is pretty easy to understand. There are statistically significant differences between EDTA and PST (p = 0.00083) and between EDTA and SST (p = 0.00008), but none between SST and PST (p = 0.35033).
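To see what the Bonferroni adjustment is doing, here is a small sketch on simulated data (the values and the names sim.pwt, raw.p and by.hand are made up for illustration). It reproduces one cell of the pairwise.t.test() output by running a single paired t-test and multiplying its p-value by the number of comparisons, capped at 1.

```r
# Simulated (made-up) paired data: 20 subjects, three tube types
set.seed(42)
n <- 20
baseline <- rnorm(n, mean = 50, sd = 20)
PTH <- c(baseline + rnorm(n, sd = 2),         # EDTA
         baseline - 3 + rnorm(n, sd = 2),     # PST
         baseline - 3.5 + rnorm(n, sd = 2))   # SST
Tube.Type <- factor(rep(c("EDTA", "PST", "SST"), each = n))

sim.pwt <- pairwise.t.test(PTH, Tube.Type,
                           p.adjust.method = "bonferroni", paired = TRUE)

# By hand: one paired t-test, then multiply by the 3 comparisons and cap at 1
raw.p <- t.test(PTH[Tube.Type == "PST"], PTH[Tube.Type == "EDTA"],
                paired = TRUE)$p.value
by.hand <- min(1, 3 * raw.p)

# Compare with the corresponding cell of the pairwise table
all.equal(by.hand, sim.pwt$p.value["PST", "EDTA"])
```

Note that the paired test relies on the observations being in the same subject order within each tube type.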

Conclusion

This is a non-statistician’s approach to tube-type comparisons, which is also applicable to analyte stability studies. It is a one-way repeated measures ANOVA with one within-subjects factor. There is a great deal more to say on the matter, by people who know much more, in the citations linked above.

God probably uses wide-format data

All the nations will be gathered before him, and he will separate the people one from another as a shepherd separates the sheep from the goats. He will put the sheep on his right and the goats on his left.

(Matt 25:32-33)

Parse an Online Table into an R Dataframe – Westgard’s Biological Variation Database

August 14, 2017 dtholmes@mail.ubc.ca

Background

From time to time I have wanted to bring an online table into an R dataframe. While in principle the data can be cut and pasted into Excel, sometimes the table is very large and sometimes the columns get goofed up in the process. Fortunately, there are a number of R tools for accomplishing this. I am just going to show one approach using the rvest package. The rvest package also makes it possible to interact with forms on webpages to request specific material which can then be scraped. I think you will see the potential if you look here.

In our (simple) case, we will apply this process to Westgard's desirable assay specifications as shown on his website. The goal is to parse out the biological variation tables, get them into a dataframe and then write to csv or xlsx.

Reading in the Data

The first thing to do is to load the rvest and httr packages and define an html session with the html_session() function. (In rvest 1.0.0 and later, html_session() has been renamed session().)

library(rvest)
library(httr)
wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("LabRtorian"))

Now looking at the webpage, you can see that there are 8 columns in the tables of interest. So, we will define an empty dataframe with 8 columns.

#define empty table to hold all the content
biotable = data.frame(matrix(NA,0, 8))

We need to know which part of the document to scrape. This is a little obscure, but following the instructions in this post, we can determine that the xpaths we need are:

/html/body/div[1]/div[3]/div/main/article/div/table[1]

/html/body/div[1]/div[3]/div/main/article/div/table[2]

/html/body/div[1]/div[3]/div/main/article/div/table[3]

…

etc.

There are 8 such tables in the whole webpage. We can define a character vector for these as such:

xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

Now we make a loop to scrape the 8 tables and, with each iteration of the loop, append the scraped subtable to the main dataframe called biotable using the rbind() function. We have to use the parameter fill = TRUE in the html_table() function because the table does not always have a uniform number of columns.

for (j in 1:8){                
  subtable <- wg %>%
  read_html() %>%
  html_nodes(xpath =  xpaths[j]) %>%
  html_table(., fill = TRUE) 
  subtable <- subtable[[1]]
  biotable <- rbind(biotable,subtable)
}
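The same loop can be tried offline on a tiny inline page. This is only a sketch: the HTML string, the analyte names A, B and C, their values and the names toy.page, toy.xpaths and toy.table are placeholders and not taken from Westgard's site, and the xpaths are shorter because the toy page has no nested div elements.

```r
library(rvest)

# A tiny stand-in page with two tables (names and values are placeholders)
toy.html <- "<html><body>
<table><tr><th>Analyte</th><th>CVI</th></tr>
       <tr><td>A</td><td>1.5</td></tr>
       <tr><td>B</td><td>2.5</td></tr></table>
<table><tr><th>Analyte</th><th>CVI</th></tr>
       <tr><td>C</td><td>3.5</td></tr></table>
</body></html>"
toy.page <- read_html(toy.html)

# one xpath per table, built the same way as above
toy.xpaths <- paste0("/html/body/table[", 1:2, "]")

# scrape each table and stack the results
toy.table <- data.frame()
for (j in seq_along(toy.xpaths)) {
  subtable <- toy.page %>%
    html_nodes(xpath = toy.xpaths[j]) %>%
    html_table()
  toy.table <- rbind(toy.table, as.data.frame(subtable[[1]]))
}
toy.table
```

Because html_table() is applied to a set of nodes, it returns a list, which is why the first element is extracted with [[1]] before binding.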

Clean Up

Now that we have the raw data out, we can have a quick look at it:

X1	X2	X3	X4	X5	X6	X7	X8
	Analyte	Number of Papers	Biological Variation	Biological Variation	Desirable specification	Desirable specification	Desirable specification
	Analyte	Number of Papers	CVI	CVg	I(%)	B(%)	TE(%)
S-	11-Desoxycortisol	2	21.3	31.5	10.7	9.5	27.1
S-	17-Hydroxyprogesterone	2	19.6	50.4	9.8	13.5	29.7
U-	4-hydroxy-3-methoximandelate (VMA)	1	22.2	47.0	11.1	13.0	31.3
S-	5' Nucleotidase	2	23.2	19.9	11.6	7.6	26.8
U-	5'-Hydroxyindolacetate, concentration	1	20.3	33.2	10.2	9.7	26.5
S-	α1-Acid Glycoprotein	3	11.3	24.9	5.7	6.8	16.2
S-	α1-Antichymotrypsin	1	13.5	18.3	6.8	5.7	16.8
S-	α1-Antitrypsin	3	5.9	16.3	3.0	4.3	9.2

We can see that we need to define column names and we need to get rid of some rows containing extraneous column header information. There are actually 8 such sets of headers to remove.

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")
names(biotable) <- table.header

Let's now find rows we don't want and remove them.

for.removal <- grep("Analyte", biotable$Analyte)
biotable <- biotable[-for.removal,]

You will find that the table has missing data, which is written as “---”. This should now be replaced by NA, and the row names should be reset to sequential integers. Also, we will remove all the minus signs after the specimen type. I'm not sure what they add.

biotable[biotable == "---"] <- NA
row.names(biotable) <- 1:nrow(biotable)
biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

Check it Out

Just having another look at the first 10 rows:

Sample	Analyte	NumPapers	CVI	CVG	I	B	TE
S	11-Desoxycortisol	2	21.3	31.5	10.7	9.5	27.1
S	17-Hydroxyprogesterone	2	19.6	50.4	9.8	13.5	29.7
U	4-hydroxy-3-methoximandelate (VMA)	1	22.2	47.0	11.1	13.0	31.3
S	5' Nucleotidase	2	23.2	19.9	11.6	7.6	26.8
U	5'-Hydroxyindolacetate, concentration	1	20.3	33.2	10.2	9.7	26.5
S	α1-Acid Glycoprotein	3	11.3	24.9	5.7	6.8	16.2
S	α1-Antichymotrypsin	1	13.5	18.3	6.8	5.7	16.8
S	α1-Antitrypsin	3	5.9	16.3	3.0	4.3	9.2
S	α1-Globulins	2	11.4	22.6	5.7	6.3	15.7
U	α1-Microglobulin, concentration, first morning	1	33.0	58.0	16.5	16.7	43.9

Now examining the structure:

str(biotable)

## 'data.frame':    370 obs. of  8 variables:
##  $ Sample   : chr  "S" "S" "U" "S" ...
##  $ Analyte  : chr  "11-Desoxycortisol" "17-Hydroxyprogesterone" "4-hydroxy-3-methoximandelate (VMA)" "5' Nucleotidase" ...
##  $ NumPapers: chr  "2" "2" "1" "2" ...
##  $ CVI      : chr  "21.3" "19.6" "22.2" "23.2" ...
##  $ CVG      : chr  "31.5" "50.4" "47.0" "19.9" ...
##  $ I        : chr  "10.7" "9.8" "11.1" "11.6" ...
##  $ B        : chr  "9.5" "13.5" "13.0" "7.6" ...
##  $ TE       : chr  "27.1" "29.7" "31.3" "26.8" ...

It's kind-of undesirable to have numbers as characters so…

#convert appropriate columns to numeric
biotable[,3:8] <- lapply(biotable[3:8], as.numeric)
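The clean-up steps so far can be rehearsed on a toy dataframe without touching the website. The dataframe toy and its contents below are placeholders invented for illustration; the operations are the same ones applied to biotable above.

```r
# Toy stand-in for the scraped table (all values are placeholders)
toy <- data.frame(X1 = c("", "S-", "U-", "", "S-"),
                  X2 = c("Analyte", "A", "B", "Analyte", "C"),
                  X3 = c("CVI", "1.5", "---", "CVI", "3.5"),
                  stringsAsFactors = FALSE)
names(toy) <- c("Sample", "Analyte", "CVI")

# drop the repeated header rows
toy <- toy[-grep("Analyte", toy$Analyte), ]

# recode missing values, renumber the rows, strip the "-" from the sample type
toy[toy == "---"] <- NA
row.names(toy) <- 1:nrow(toy)
toy$Sample <- gsub("-", "", toy$Sample, fixed = TRUE)

# convert the numeric column from character to numeric
toy$CVI <- as.numeric(toy$CVI)
str(toy)
```

One caveat with the negative-index idiom: if grep() finds no matches it returns an empty vector, and indexing with that drops every row. Filtering with !grepl(...) instead avoids this if you reuse the approach on a table that may not have repeated headers.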

Write the Data

Using the xlsx package, you can output the table to an Excel file in the current working directory.

library(xlsx)
write.xlsx(biotable,
            file = "Westgard_Biological_Variation.xlsx",
            row.names = FALSE)

If you are having trouble getting xlsx to install, then just write as csv.

write.csv(biotable,
            file = "Westgard_Biological_Variation.csv",
            row.names = FALSE)

Conclusion

You can now use the same general approach to parse any table you have web access to, no matter how small or big it is. Here is a complete script in one place:

library(httr)
library(rvest)
library(xlsx)

wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("yournamehere"))
xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

#define empty dataframe
biotable = data.frame(matrix(NA,0, 8))

#loop over the 8 html tables
for (j in 1:8){                
  subtable <- wg %>%
  read_html() %>%
  html_nodes(xpath =  xpaths[j] ) %>%
  html_table(., fill = TRUE) 
  subtable <- subtable[[1]]
  biotable <- rbind(biotable,subtable)
}

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")
names(biotable) <- table.header

#remove extraneous rows
for.removal <- grep("Analyte", biotable$Analyte)
biotable <- biotable[-for.removal,]

#make missing data into NA
biotable[ biotable == "---" ] <- NA
row.names(biotable) <- 1:nrow(biotable)

#convert appropriate columns to numeric
biotable[,3:8] <- lapply(biotable[3:8], as.numeric)

#get rid of minus signs in column 1
biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

write.xlsx(biotable,
            file = "Westgard_Biological_Variation.xlsx",
            row.names = FALSE)

write.csv(biotable,
            file = "Westgard_Biological_Variation.csv",
            row.names = FALSE)

Parting Thought on Tables

You prepare a table before me in the presence of my enemies. You anoint my head with oil; my cup overflows.

(Psalm 23:5)

Determine the CV of a Calculated Lab Reportable – Bioavailable Testosterone

August 7, 2017 dtholmes@mail.ubc.ca

Background

At the AACC meeting last week, some of my friends were bugging me that I had not made a blog post in 10 months. Without getting into it too much, let's just say I can blame Cerner. Thanks also to a prod from a friend, here is an approach to a fairly common problem.

We all report calculated quantities out of our laboratories: quantities such as LDL cholesterol, non-HDL cholesterol, aldosterone:renin ratio, free testosterone, eGFR etc. How does one determine the precision (i.e. imprecision) of a calculated quantity? While earlier in my life I might have gone to the trouble of doing such calculations analytically using the rules of error propagation, in my later years I am more pragmatic and I'm happy to use a computational approach.
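For reference, the analytical route usually means first-order error propagation. As a sketch, for a generic calculated quantity f(x, y) whose inputs have independent errors (an assumption; correlated inputs need an extra covariance term), the standard deviation and CV of f are approximately:

```latex
\sigma_f \approx \sqrt{\left(\frac{\partial f}{\partial x}\right)^{2}\sigma_x^{2}
                     + \left(\frac{\partial f}{\partial y}\right)^{2}\sigma_y^{2}},
\qquad
\mathrm{CV}_f = 100 \times \frac{\sigma_f}{f(x, y)}
```

The symbols f, x and y are generic placeholders. The partial derivatives get tedious for anything involving a quadratic root, which is what makes simulation attractive.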

In this example, we will model the precision in calculated bioavailable testosterone (CBAT). Without explanation, I provide an R function for CBAT (and free testosterone) where testosterone is reported in nmol/L, sex hormone binding globulin (SHBG) is reported in nmol/L, and albumin is reported in g/L. Using the Vermeulen Equation as discussed in this publication, you can calculate CBAT as follows:

cbat <- function(TT,SHBG,ALB = 43){
    Kalb <- 3.6*10^4
    Kshbg <- 10^9
    N <- 1 + Kalb*ALB/69000
    a <- N*Kshbg
    b <- N + Kshbg*(SHBG - TT)/10^9
    c <- -TT/10^9
    FT <- (-b + sqrt(b^2 - 4*a*c))/(2*a)*10^9
    cbat <- N*FT
    return(list(free.T = FT, cbat = cbat))
}
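Reading the algebra back out of the function (a reconstruction from the code rather than a quotation from the publication), and assuming a single testosterone binding site on SHBG: with all concentrations in mol/L, free testosterone is the positive root of the quadratic that follows from the mass balance of free, albumin-bound and SHBG-bound testosterone.

```latex
N = 1 + K_{alb}\,[\mathrm{Alb}], \qquad [\mathrm{Alb}] = \frac{\mathrm{ALB\ (g/L)}}{69000}

[TT] = N\,[FT] + \frac{K_{shbg}\,[SHBG]\,[FT]}{1 + K_{shbg}\,[FT]}

\Rightarrow\quad
N K_{shbg}\,[FT]^{2} + \bigl(N + K_{shbg}\,([SHBG] - [TT])\bigr)\,[FT] - [TT] = 0

[FT] = \frac{-b + \sqrt{b^{2} - 4ac}}{2a},
\qquad a = N K_{shbg},\quad b = N + K_{shbg}\,([SHBG] - [TT]),\quad c = -[TT]

\mathrm{CBAT} = N\,[FT]
```

Here K_alb = 3.6 × 10^4 L/mol and K_shbg = 10^9 L/mol are the constants hard-coded in the function, and the factors of 10^9 in the code convert between nmol/L and mol/L.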

To sanity-check this, we can use this online calculator. Taking a typical male testosterone of 20 nmol/L, an SHBG of 50 nmol/L and an albumin of 43 g/L, we get the following:

cbat(20,50)

## $free.T
## [1] 0.3273049
## 
## $cbat
## [1] 7.670319

which is confirmed by the online calculator. Because the function is vectorized, we can submit a vector of testosterone results and SHBG results and get a vector of CBAT results.

cbat(c(10,20,30), c(40,50,60))

## $free.T
## [1] 0.1738837 0.3273049 0.4661380
## 
## $cbat
## [1]  4.074926  7.670319 10.923842

Precision of Components

We now need some precision data for the three components. However, in our lab, we just substitute 43 g/L for the albumin, so we will leave that term out of the analysis and limit our precision calculation to testosterone and SHBG. This will allow us to present the precision as surface plots as a function of total testosterone and SHBG.

We do testosterone by LC-MS/MS using Deborah French's method. In the last three months, the precision has been 3.9% at 0.78 nmol/L, 5.5% at 6.7 nmol/L, 5.2% at 18.0 nmol/L, and 6.0% at 28.2 nmol/L. We are using the Roche Cobas e601 SHBG method which, according to the package insert, has precision of 1.8% at 14.9 nmol/L, 2.1% at 45.7 nmol/L, and 4.0% at 219 nmol/L.

cv.tt <- c(3.9, 5.5, 5.2, 6.0)
conc.tt <- c(0.78, 6.7, 18.0, 28.2)
tt.df <- data.frame(conc.tt,cv.tt)

plot(cv.tt ~ conc.tt, data = tt.df,
                    main = "Precision Profile of Testosterone",
                    xlab = "Testosterone (nmol/L)",
                    ylab = "CV Testosterone (%)",
                    ylim = c(0,8),
                    type = "o")

plot of chunk unnamed-chunk-4

cv.shbg <- c(1.8, 2.1, 4.0)
conc.shbg <- c(14.9,45.7,219)
shbg.df <- data.frame(cv.shbg, conc.shbg)
plot(cv.shbg ~ conc.shbg, data = shbg.df,
                    main = "Precision Profile of SHBG",
                    xlab = "SHBG (nmol/L)",
                    ylab = "CV SHBG (%)",
                    ylim = c(0,5),
                    type = "o")

plot of chunk unnamed-chunk-4

Build Approximation Functions

We will want to generate linear interpolations of these precision profiles. Generally, we might want to use non-linear regression to do this but I will just linearly interpolate with the approxfun() function. This will allow us to just call a function to get the approximate CV at concentrations other than those for which we have data.

tt.fun <- approxfun(x = tt.df$conc.tt, y = tt.df$cv.tt)
shbg.fun <- approxfun(x = shbg.df$conc.shbg, y = shbg.df$cv.shbg)

Now, if we want to know the precision of SHBG at, say, 100 nmol/L, we can just write,

shbg.fun(100)

## [1] 2.695326

to obtain our precision result.
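As a quick check of what approxfun() is doing, the same number can be worked out by hand from the two SHBG data points on either side of 100 nmol/L. This is a sketch; the names conc, cv, f and by.hand are introduced here only for illustration.

```r
# SHBG precision data quoted above
conc <- c(14.9, 45.7, 219)
cv   <- c(1.8, 2.1, 4.0)
f <- approxfun(x = conc, y = cv)

# 100 nmol/L lies between 45.7 and 219, so interpolate along that segment
by.hand <- 2.1 + (100 - 45.7) / (219 - 45.7) * (4.0 - 2.1)
all.equal(f(100), by.hand)

# outside the range of the data, approxfun() returns NA by default (rule = 1)
f(300)
```

The NA behaviour matters later: any grid of concentrations has to stay inside the range of the precision data (0.78 to 28.2 nmol/L for testosterone, 14.9 to 219 nmol/L for SHBG), or else approxfun() needs rule = 2 to hold the end values constant.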

Random Simulation

Now let's build a grid of SHBG and total testosterone (TT) values at which we will calculate the precision for CBAT.

shbg <- seq(from = 15, to = 200, by = 5)
tt <- seq(from = 1, to = 28, by = 1)

At each point on the grid, we will have to generate, say, 100000 random TT values and 100000 random SHBG values with the appropriate precision and then calculate the expected precision of CBAT at those concentrations.

Let's do this for a single pair of concentrations by way of example modelling the random analytical error as Gaussian using the rnorm() function.

# [SHBG] = 15 nmol/L
# [TT] = 5.0 nmol/L
set.seed(100) #just to get consistent results
rng.tt <- rnorm(100000, mean = 5.0, sd = tt.fun(5.0)/100*5.0)
rng.shbg <- rnorm(100000, mean = 15, sd = shbg.fun(15)/100*15)
rng.cbat <- cbat(rng.tt, rng.shbg)
cv.cbat <- sd(rng.cbat$cbat)/mean(rng.cbat$cbat)*100
cv.cbat

## [1] 5.30598

So, we can build the process of calculating the CV of CBAT into a function as follows:

cbat.cv <- function(TT, SHBG, N = 100000){
  rng.tt <- rnorm(N, mean = TT, sd = tt.fun(TT)/100*TT)
  rng.shbg <- rnorm(N, mean = SHBG, sd = shbg.fun(SHBG)/100*SHBG)
  rng.cbat <- cbat(rng.tt, rng.shbg)
  cv <- sd(rng.cbat$cbat)/mean(rng.cbat$cbat)*100
  return(cv)
}

Now we can build a grid of all the TT and SHBG combinations, calculate the CV of CBAT at each point, and append it to the dataframe as a new column for plotting.

cv.grid <- expand.grid(tt, shbg)
names(cv.grid) <- c("tt", "shbg")
cv.grid$cv.cbat <- mapply(cbat.cv, cv.grid$tt, cv.grid$shbg)

Now make the plot using the wireframe() function.

library(lattice)
wireframe(cv.cbat ~ tt*shbg, data = cv.grid,
          xlab = "Testo \n (nmol/L)",
          ylab = "SHBG \n (nmol/L)",
          zlab = "CV \n (%)",
          drape = TRUE,
          colorkey = TRUE,
          col.regions = colorRampPalette(c("blue", "red", "yellow"))(100),
          scales = list(arrows=FALSE,cex=.5,tick.number = 10)
          )

plot of chunk unnamed-chunk-11

This shows us that the CV of CBAT ranges from about 4–8% over the TT and SHBG ranges we have looked at.
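As a cross-check on the simulation, one point on the surface can be compared with first-order error propagation using numerical derivatives. This is a sketch under stated assumptions: the errors in TT and SHBG are treated as independent, the derivatives are approximated by central differences, and the names f, tt0, shbg0, d.tt, d.shbg and cv.delta are introduced only for this illustration. The cbat() function is repeated so that the chunk runs on its own.

```r
# CBAT function as defined earlier (TT and SHBG in nmol/L, albumin in g/L)
cbat <- function(TT, SHBG, ALB = 43){
    Kalb <- 3.6*10^4
    Kshbg <- 10^9
    N <- 1 + Kalb*ALB/69000
    a <- N*Kshbg
    b <- N + Kshbg*(SHBG - TT)/10^9
    c <- -TT/10^9
    FT <- (-b + sqrt(b^2 - 4*a*c))/(2*a)*10^9
    list(free.T = FT, cbat = N*FT)
}
f <- function(tt, shbg) cbat(tt, shbg)$cbat

# same point as the single-pair example: TT = 5.0 nmol/L, SHBG = 15 nmol/L
tt0 <- 5.0
shbg0 <- 15

# component CVs by linear interpolation of the precision data, then SDs
cv.tt0 <- approx(c(0.78, 6.7, 18.0, 28.2), c(3.9, 5.5, 5.2, 6.0), xout = tt0)$y
cv.shbg0 <- approx(c(14.9, 45.7, 219), c(1.8, 2.1, 4.0), xout = shbg0)$y
sd.tt <- cv.tt0/100*tt0
sd.shbg <- cv.shbg0/100*shbg0

# central-difference estimates of the partial derivatives of CBAT
h.tt <- 1e-4*tt0
h.shbg <- 1e-4*shbg0
d.tt <- (f(tt0 + h.tt, shbg0) - f(tt0 - h.tt, shbg0))/(2*h.tt)
d.shbg <- (f(tt0, shbg0 + h.shbg) - f(tt0, shbg0 - h.shbg))/(2*h.shbg)

# first-order propagated SD and CV of CBAT
sd.delta <- sqrt((d.tt*sd.tt)^2 + (d.shbg*sd.shbg)^2)
cv.delta <- sd.delta/f(tt0, shbg0)*100
cv.delta
```

At this point the first-order estimate should agree closely with the simulated CV of about 5.3%. The simulation remains the more general tool, since it does not depend on the function being nearly linear over the range of the errors.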

Conclusion

We have determined the CV of calculated bioavailable testosterone using random number simulations using empirical CV data and produced a surface plot of CV. This allows us to comment on the CV of this lab reportable as a function of the two variables by which it is determined.

Parting Thought on Monte Carlo Simulations

The die is cast into the lap, but its every decision is from the LORD.

(Prov 16:33)
