September 2015 – The Lab-R-torian

Deming and Passing Bablok Regression in R

September 14, 2015 | September 21, 2015 | dtholmes@mail.ubc.ca

Regression Methods

In this post we will be discussing how to perform Passing Bablok and Deming regression in R. Those who work in Clinical Chemistry know that these two approaches are required by the journals in the field. The idiosyncratic affection for these two forms of regression appears to be historical but this is something unlikely to change in my lifetime–hence the need to cover it here.

Along the way, we shall touch on the ways in which Deming and Passing Bablok differ from ordinary least squares (OLS) and from one another.

Creating some random data

Let's start by making some heteroscedastic random data that we can use for regression. We begin with set.seed() so that the reader can generate the same random data as the post. This function takes any integer you wish as its argument, and setting the same seed always yields the same random numbers. We will generate 100 random $x$ values from a uniform distribution and then an accompanying 100 random $y$ values with proportional bias, constant bias and random noise that increases with $x$. I have added a bit of non–linearity because we do see this a fair bit in our work.

set.seed(20)
x <- runif(100,0,100)
y <- 1.10*x - 0.001*x^2 + rnorm(100,0,1)*(2 + 0.05*x) + 15
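
As a quick check of the reproducibility claim, here is a minimal sketch (the names a, b and d are only for illustration) showing that resetting the same seed replays the same draws:

```r
# Resetting the seed replays the same pseudo-random stream.
set.seed(20)
a <- runif(5, 0, 100)
set.seed(20)
b <- runif(5, 0, 100)
identical(a, b)   # TRUE: same seed, same numbers

# A different seed gives a different sample.
set.seed(21)
d <- runif(5, 0, 100)
identical(a, d)   # FALSE
```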

The constants I chose are arbitrary. I chose them to produce something resembling a comparison of, say, two automated immunoassays.

Let's quickly produce a scatter plot to see what our data looks like:

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")

plot of chunk unnamed-chunk-2

Residuals in OLS

OLS regression minimizes the sum of squared residuals. In the case of OLS, the residual of a point is defined as the vertical distance from that point to the regression line. The regression line is chosen so that the sum of the squares of the residuals is minimal.

OLS regression assumes that there is no error in the $x$–axis values and that there is no heteroscedasticity, that is, the scatter of $y$ is constant. Neither of these assumptions is true in the case of bioanalytical method comparisons. In contrast, for calibration curves in mass–spectrometry, a linear response is plotted as a function of pre–defined calibrator concentration. This means that the $x$–axis has very little error and so OLS regression is an appropriate choice (though I doubt that the assumption about homoscedasticity is generally met).

OLS is part of R's base package. We can find the OLS regression line using lm() and we will store the results in the variable lin.reg.

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")
lin.reg <- lm(y~x)
abline(lin.reg, col="blue")

plot of chunk unnamed-chunk-3

Just to demonstrate the point about residuals graphically, the following shows them in vertical red lines.

plot of chunk unnamed-chunk-4
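
The post does not show the code behind this figure. A minimal sketch that should give something similar, assuming base graphics and segments() joining each point to its fitted value at the same $x$, might look like this:

```r
# Recreate the simulated data from the start of the post.
set.seed(20)
x <- runif(100, 0, 100)
y <- 1.10*x - 0.001*x^2 + rnorm(100, 0, 1)*(2 + 0.05*x) + 15

lin.reg <- lm(y ~ x)
y.hat <- fitted(lin.reg)          # fitted values on the OLS line

plot(x, y, main = "OLS Residuals", xlab = "Current Method", ylab = "New Method")
abline(lin.reg, col = "blue")
# Vertical segments from each observed point to the line at the same x.
segments(x, y, x, y.hat, col = "red")

# The residuals are exactly these signed vertical distances.
all.equal(unname(residuals(lin.reg)), unname(y - y.hat))
```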

Deming Regression

Deming regression differs from OLS regression in that it does not make the assumption that the $x$ values are free of error. It (more or less) defines the residual as the perpendicular distance from a point to its fitted value on the regression line.
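
For intuition, the Deming fit has a closed form. The following is a minimal sketch, not the mcr implementation: deming_sketch() is a hypothetical helper, and delta is defined here as the ratio of the $y$ error variance to the $x$ error variance, which may be the reciprocal of the ratio a given package expects. With delta = 1 the residuals are perpendicular to the line.

```r
# Closed-form Deming fit (hypothetical helper, not part of any package).
# delta = (error variance of y) / (error variance of x); delta = 1 gives
# orthogonal regression, i.e. perpendicular residuals.
deming_sketch <- function(x, y, delta = 1) {
  sxx <- var(x)
  syy <- var(y)
  sxy <- cov(x, y)
  slope <- (syy - delta*sxx + sqrt((syy - delta*sxx)^2 + 4*delta*sxy^2)) / (2*sxy)
  intercept <- mean(y) - slope*mean(x)
  c(Intercept = intercept, Slope = slope)
}

set.seed(20)
x <- runif(100, 0, 100)
y <- 1.10*x - 0.001*x^2 + rnorm(100, 0, 1)*(2 + 0.05*x) + 15

deming_sketch(x, y)              # x on the horizontal axis
deming_sketch(y, x)              # axes swapped

# Product of the two slopes: Deming first, then OLS.
deming_sketch(x, y)[["Slope"]] * deming_sketch(y, x)[["Slope"]]
coef(lm(y ~ x))[["x"]] * coef(lm(x ~ y))[["y"]]
```

With delta = 1 the product of the two Deming slopes is 1, so it does not matter which method goes on which axis. The product of the two OLS slopes equals $r^2$, which is below 1 for noisy data.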

Deming regression does not come as part of R's base package but can be performed using the MethComp and mcr packages. In this case, we will use the latter. If not already installed, you must install the mcr package with install.packages("mcr").

Then to perform Deming regression, we will load the mcr library and execute the following using the mcreg() command, storing the output in the variable dem.reg.

library(mcr)
dem.reg <- mcreg(x,y, method.reg = "Deming")

By performing the str() command on dem.reg, we can see that the regression parameters are stored in the slot @para. Because the authors have used an S4 object as the output of their function, we don't address output as we would in lists (with a $), but rather with an @.

str(dem.reg)

## Formal class 'MCResultResampling' [package "mcr"] with 21 slots
##   ..@ glob.coef  : num [1:2] 15.58 1.04
##   ..@ glob.sigma : num [1:2] 0.8165 0.0147
##   ..@ xmean      : num 46.8
##   ..@ nsamples   : int 999
##   ..@ nnested    : num 25
##   ..@ B0         : num [1:999] 15.9 15.4 16 16.1 15.6 ...
##   ..@ B1         : num [1:999] 1.01 1.04 1.02 1.04 1.03 ...
##   ..@ sigmaB0    : num [1:999] 0.794 0.766 0.846 0.815 0.737 ...
##   ..@ sigmaB1    : num [1:999] 0.0141 0.0142 0.0155 0.0141 0.0135 ...
##   ..@ MX         : num [1:999] 46.8 45.9 45.4 48.9 45.5 ...
##   ..@ bootcimeth : chr "quantile"
##   ..@ rng.seed   : num NA
##   ..@ rng.kind   : chr NA
##   ..@ data       :'data.frame':  100 obs. of  3 variables:
##   .. ..$ sid: Factor w/ 100 levels "S1","S10","S100",..: 1 13 24 35 46 57 68 79 90 2 ...
##   .. ..$ x  : num [1:100] 87.8 76.9 27.9 52.9 96.3 ...
##   .. ..$ y  : num [1:100] 110.8 93.5 45.6 76.6 116.6 ...
##   ..@ para       : num [1:2, 1:4] 15.58 1.04 NA NA 14.45 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "Intercept" "Slope"
##   .. .. ..$ : chr [1:4] "EST" "SE" "LCI" "UCI"
##   ..@ mnames     : chr [1:2] "Method1" "Method2"
##   ..@ regmeth    : chr "Deming"
##   ..@ cimeth     : chr "bootstrap"
##   ..@ error.ratio: num 1
##   ..@ alpha      : num 0.05
##   ..@ weight     : Named num [1:100] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:100] "S1" "S2" "S3" "S4" ...

dem.reg@para

##                EST SE       LCI       UCI
## Intercept 15.57790 NA 14.446677 16.810321
## Slope      1.03658 NA  1.006434  1.066066
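
To try the @ versus $ distinction without mcr installed, here is a minimal sketch using a toy S4 class. The class name ToyFit and the rounded numbers are made up for illustration; only the shape of the para slot mirrors the output above.

```r
library(methods)

# A toy S4 class (hypothetical, for illustration only) with a 'para' slot
# shaped like the one in the str() output above.
setClass("ToyFit", representation(para = "matrix", regmeth = "character"))

est <- matrix(c(15.58, 1.04, NA, NA, 14.45, 1.01, 16.81, 1.07), nrow = 2,
              dimnames = list(c("Intercept", "Slope"),
                              c("EST", "SE", "LCI", "UCI")))
fit <- new("ToyFit", para = est, regmeth = "Deming")

slotNames(fit)               # "para" "regmeth"
fit@para                     # slots use @, not $
slot(fit, "para")            # equivalent, handy when the slot name is a string

# A matrix is stored column by column, so elements 1 and 2 are the EST column.
fit@para[1:2]                # 15.58 1.04
fit@para["Slope", "EST"]     # 1.04, the same value addressed by name
```

Addressing the matrix by row and column name is longer to type than [1:2] but does not depend on the column order.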

The intercept and slope are stored in dem.reg@para[1] and dem.reg@para[2] respectively. Therefore, we can add the regression line as follows:

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")
abline(dem.reg@para[1:2], col = "blue")

plot of chunk unnamed-chunk-7

To emphasize how the residuals are different from OLS we can plot them as before:

plot of chunk unnamed-chunk-8

We present the figure above for instructional purposes only. The usual way to present a residuals plot is to show the same picture rotated until the line is horizontal–this is a slight simplification but is essentially what is happening:

plot of chunk unnamed-chunk-9

Ratio of Variances

It is important to mention that if one knows that the $x$–axis method is subject to a different amount of random analytical variability than the $y$–axis method, one should provide the ratio of the variances of the two methods to mcreg(). In general, this requires us to have “CV” data from precision studies already available. Another approach is to perform every analysis in duplicate by both methods and use the data to estimate this ratio.

If the methods happen to have similar CVs throughout the analytical range, the default value of 1 is appropriate. But suppose that the ratio of the variances of the $x$–axis method to the $y$–axis method was 1.2; we could provide this in the regression call by setting the error.ratio parameter. The resulting regression parameters will be slightly different.

mcreg(x,y, method.reg = "Deming", error.ratio = 1.2)@para

##                 EST SE      LCI       UCI
## Intercept 15.534921 NA 14.39904 16.777065
## Slope      1.037499 NA  1.00792  1.067316
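
The duplicate-measurement approach mentioned above can be sketched as follows. The data and the helper dup_var() are made up, and the sketch assumes the analytical SD of each method is roughly constant over the range. As I understand the mcr documentation, error.ratio is the ratio of squared measurement errors of the reference ($x$) method to the test ($y$) method, but that direction is worth confirming in help("mcreg").

```r
# Hypothetical duplicate measurements of four samples by each method.
x1 <- c(10, 20, 30, 40); x2 <- x1 + c(1, -1, 2, -2)
y1 <- c(12, 23, 33, 45); y2 <- y1 + c(2, -2, 4, -4)

# Within-method analytical variance from duplicates: sum(d^2) / (2n).
dup_var <- function(a, b) sum((a - b)^2) / (2 * length(a))

var.x <- dup_var(x1, x2)    # 1.25
var.y <- dup_var(y1, y2)    # 5
var.x / var.y               # 0.25
```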

Weighting

In the case of heteroscedastic data, it is customary to weight the regression; the mcr package weights by $1/x^2$. This means that having 0's in your $x$–data will cause the calculation to “crump”. In any case, if we wanted weighted regression parameters we would make the call:

w.dem.reg <- mcreg(x,y, method.reg = "WDeming")

## The global.sigma is calculated with Linnet's method

w.dem.reg@para

##                 EST SE       LCI       UCI
## Intercept 13.788450 NA 12.858803 14.861006
## Slope      1.088119 NA  1.058042  1.116879

And plotting both on the same figure:

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")
abline(dem.reg@para[1:2], col = "blue")
abline(w.dem.reg@para[1:2], col = "green")
legend("topleft", c("Deming","Weighted Deming"), lty=c(1,1), col = c("blue","green"))

plot of chunk unnamed-chunk-12

Passing Bablok

Passing Bablok regression is not performed by the minimization of residuals. Rather, all possible pairs of $x$–$y$ points are determined and slopes are calculated using each pair of points. Work–arounds are undertaken for pairs of points that generate infinite slopes and other peculiarities. In any case, the median of the $\frac{N(N-1)}{2}$ possible slopes becomes the final slope estimate and the corresponding intercept can be calculated. With regard to weighted Passing Bablok regression, I’d like to acknowledge commenter glen_b for bringing to my attention that there is a paradigm for calculating the weighted median of pairwise slopes. See the comment section for a discussion.
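
To make the pairwise-slope idea concrete, here is a simplified sketch of the point estimates. It reflects my understanding of the published procedure (undefined slopes and slopes of exactly -1 are dropped, and the median is shifted by the number of slopes below -1). pb_sketch() is a hypothetical helper with no confidence intervals, and it is not the mcr implementation.

```r
# Simplified Passing Bablok point estimates (hypothetical helper; no
# confidence intervals, and not the mcr implementation).
pb_sketch <- function(x, y) {
  idx <- combn(length(x), 2)               # all N(N-1)/2 index pairs
  dx <- x[idx[2, ]] - x[idx[1, ]]
  dy <- y[idx[2, ]] - y[idx[1, ]]
  s <- dy / dx                             # 0/0 gives NaN, a/0 gives +/-Inf
  s <- sort(s[!is.nan(s) & s != -1])       # drop undefined slopes and slopes of -1
  n <- length(s)
  k <- sum(s < -1)                         # offset that shifts the median
  b <- if (n %% 2 == 1) s[(n + 1)/2 + k] else mean(s[n/2 + k + 0:1])
  a <- median(y - b * x)                   # intercept from the shifted-median slope
  c(Intercept = a, Slope = b)
}

choose(100, 2)                 # 4950 pairwise slopes for N = 100

# Noise-free line y = x with one gross outlier at x = 20.
x <- 1:20
y <- c(1:19, 10)
pb_sketch(x, y)                # Intercept 0, Slope 1
coef(lm(y ~ x))                # OLS is pulled towards the outlier
```

Because 171 of the 189 retained slopes equal 1, the shifted median is still exactly 1, which is the outlier resistance the median buys.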

Passing Bablok regression takes a lot of computational time as the number of points grows, so expect some delays on data sets larger than $N=100$ if you are using an ordinary computer. To get the Passing Bablok regression equation, we just change the method.reg parameter:

PB.reg <- mcreg(x,y, method.reg = "PaBa")
PB.reg@para

##                 EST SE       LCI       UCI
## Intercept 14.684463 NA 13.648554 16.495846
## Slope      1.046021 NA  1.015893  1.075632

and the procedures to plot this regression are identical. The mcreg() function does have an option for Passing Bablok regression on large data sets (method.reg = "PaBaLarge"). See the instructions by typing help("mcreg") in the R terminal.

Outlier Effects

As a consequence of the means by which the slope is determined, the Passing Bablok method is relatively resistant to the effect of outlier(s) as compared to OLS and Deming. To demonstrate this, we can add an outlier to some data scattered about the line $y=x$ and show how all three methods are affected.

x <- 1:20
y <- c(1:19,10) + rnorm(20,0,0.5)

plot of chunk unnamed-chunk-15

Because of this outlier, the OLS slope drops to 0.84, the Deming slope to 0.91, while the Passing Bablok is much better off at 0.99.
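
The post does not show the fitting code for this comparison. A minimal sketch, assuming mcr is installed and using the method.reg values "LinReg", "Deming" and "PaBa" with analytical confidence intervals to keep it quick, could be the following. The seed is arbitrary, so the slopes will not reproduce 0.84, 0.91 and 0.99 exactly.

```r
library(mcr)   # assumes the mcr package is installed

set.seed(20)   # arbitrary seed, so the slopes need not match those quoted above
x <- 1:20
y <- c(1:19, 10) + rnorm(20, 0, 0.5)

# Same call, three values of method.reg; the slope sits in the para slot.
regs <- c(OLS = "LinReg", Deming = "Deming", PaBa = "PaBa")
slopes <- sapply(regs, function(m) {
  fit <- mcreg(x, y, method.reg = m, method.ci = "analytical")
  fit@para["Slope", "EST"]
})
round(slopes, 2)
```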

Generating a Pretty Plot

The code authors of the mcr package have created a feature such that if you put the regression model inside the plot function, you can quickly generate a figure for yourself that has all the required information on it. For example,

plot(PB.reg)

plot of chunk unnamed-chunk-16

But this out–of–the–box figure is not very customizable and you may want it to appear differently for your publication. Never fear. There is a solution. The MCResult.plot() function offers complete customization of the figure so that you can show it exactly as you wish for your publication.

MCResult.plot(PB.reg, equal.axis = TRUE, x.lab = "x method", y.lab = "y method", points.col = "#FF7F5060", points.pch = 19, ci.area = TRUE, ci.area.col = "#0000FF50", main = "My Passing Bablok Regression", sub = "", add.grid = FALSE, points.cex = 1)

custom mcr plot

In this example, I have created semi–transparent “darkorchid4” (hex = #68228B) points and a semi–transparent blue (hex = #0000FF) confidence band of the regression. Maybe darkorchid would not be my first choice for a publication after all, but it demonstrates the customization. Additionally, I have suppressed my least favourite features of the default plot method. Specifically, the sub="" term removes the sentence at the bottom margin and the add.grid = FALSE prevents the grid from being plotted. Enter help(MCResult.plot) for the complete low–down on customization.
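
The eight-digit hex strings used for semi-transparency can be decoded and built in base R. A small sketch using the colours named above:

```r
# "#RRGGBBAA": the last two hex digits are the alpha (opacity) channel.
col2rgb("#0000FF50", alpha = TRUE)      # red 0, green 0, blue 255, alpha 80

# Build the same colour from its components (0x50 is 80 out of 255).
rgb(0, 0, 255, alpha = 80, maxColorValue = 255)      # "#0000FF50"

# Look up a named colour, then add transparency to it.
col2rgb("darkorchid4")                  # 104, 34, 139, i.e. #68228B
rgb(104, 34, 139, alpha = 96, maxColorValue = 255)   # "#68228B60"
```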

Conclusion

We have seen how to perform Deming and Passing Bablok regression in the R programming language and have touched on how the methods differ “under the hood”. We have used the mcr package to perform the regressions and have shown how you can beautify your plot.

The reader should have a look at the rlm() function in the MASS package and the rq() function in the quantreg package to see other robust (outlier–resistant) regression approaches. A good tutorial can be found here

I hope that makes it easy for you.

-Dan

May all your paths (and regressions) be straight:

Trust in the Lord with all your heart
and lean not on your own understanding;
in all your ways submit to him,
and he will make your paths straight.

Proverbs 3:5-6

NA NA NA NA, Hey Hey Hey, Goodbye

September 5, 2015 | September 7, 2015 | dtholmes@mail.ubc.ca

Removing NA’s from a Data Frame in R

The Problem

Suppose you are doing a method comparison for which some results are above or below the linear range of your assay(s). Generally, these will appear in your spreadsheet (gasp!) program as $< x$ or $> y$ or, in the case of our mass spectrometer, “No Peak”. When you read these data into R using read.csv(), R will turn them into factors, which I personally find super–annoying and which inspired this conference badge (see bottom right) as I learned from University of British Columbia prof Jenny Bryan.

For this reason, when we read the data in, it is convenient to choose the option stringsAsFactors = FALSE. In doing so, the data will be treated as strings and be in the character class. But for regression comparison purposes, we need to make the data numeric and all of the $< x$ and $> y$ results will be converted to NA. In this post, we want to address a few questions that follow:

How do we find all the NA results?
How can we replace them with a numeric (like 0)?
How can we rid ourselves of rows containing NA?
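
The coercion described above can be tried on a small made-up example before turning to the real data. The inline CSV below is hypothetical. Note also that since R 4.0.0, stringsAsFactors = FALSE is the default for read.csv(), and that censored results such as "<5" would still need their own handling.

```r
# A made-up three-row file standing in for an instrument export.
csv <- "Sample.Num,Aldo.Us,Aldo.Them
1,462.3,457.2
2,No Peak,6.6
3,37.7,42.1"

d1 <- read.csv(text = csv, stringsAsFactors = FALSE)
class(d1$Aldo.Us)                              # "character"
num <- suppressWarnings(as.numeric(d1$Aldo.Us))
num                                            # 462.3 NA 37.7

# Alternative: declare the placeholder as missing at import time.
d2 <- read.csv(text = csv, na.strings = c("NA", "No Peak"))
class(d2$Aldo.Us)                              # "numeric"
```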

Finding NA's

Let's read in the data which comes from a method comparison of serum aldosterone between our laboratory and Russ Grant's laboratory (LabCorp) published here. I'll read in the data with stringsAsFactors = FALSE. These are aldosterone results in pmol/L. To convert to ng/dL, divide by 27.7.

myData<-read.csv("Comparison.csv", sep=",", stringsAsFactors = FALSE)
str(myData)

## 'data.frame':    96 obs. of  3 variables:
##  $ Sample.Num: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Aldo.Us   : chr  "462.3" "433.2" "37.7" "137.7" ...
##  $ Aldo.Them : num  457.2 418.1 42.1 133.9 27.4 ...

head(myData)

##   Sample.Num Aldo.Us Aldo.Them
## 1          1   462.3     457.2
## 2          2   433.2     418.1
## 3          3    37.7      42.1
## 4          4   137.7     133.9
## 5          5    29.4      27.4
## 6          6   552.1     639.7

You can see the problem immediately: our data (“Aldo.Us”) is a character vector. This is not good for regression. Why did this happen? We can find out:

myData$Aldo.Us

##  [1] "462.3"   "433.2"   "37.7"    "137.7"   "29.4"    "552.1"   "41.6"   
##  [8] "158.7"   "1198"    "478.4"   "160.7"   "167.9"   "211.6"   "493.3"  
## [15] "195.6"   "649.8"   "644"     "534.1"   "212.7"   "413.3"   "150.7"  
## [22] "451.2"   "25.8"    "118.8"   "496.1"   "486.1"   "846.8"   "139.9"  
## [29] "No Peak" "98.3"    "113.8"   "230.7"   "530.2"   "26.6"    "390.3"  
## [36] "782.8"   "886.7"   "83.4"    "44"      "71.2"    "657"     "321.6"  
## [43] "188.6"   "451.2"   "485.3"   "No Peak" "144.9"   "249.6"   "682"    
## [50] "601.9"   "330.5"   "216.6"   "500.3"   "20.5"    "271.5"   "196.7"  
## [57] "309.4"   "235.7"   "171.7"   "124.9"   "293.6"   "345.4"   "243.5"  
## [64] "75.1"    "508.3"   "442.4"   "531.3"   "317.4"   "647.9"   "562"    
## [71] "366.5"   "37.1"    "231.6"   "73.7"    "526.3"   "No Peak" "165.6"  
## [78] "105.8"   "77.8"    "211.6"   "125.8"   "76.5"    "58.2"    "111.9"  
## [85] "238.5"   "31.6"    "156.8"   "191.7"   "402.5"   "108.9"   "183.7"  
## [92] "314.4"   "90"      "98.9"    "144.9"   "971.4"

Ahhh…it's the dreaded “No Peak”. This is what the mass spectrometer has put in its data file. So, let's force everything to numeric:

myData$Aldo.Us <- as.numeric(myData$Aldo.Us)

## Warning: NAs introduced by coercion

We see the warnings about the introduction of NAs. And we get:

myData$Aldo.Us

##  [1]  462.3  433.2   37.7  137.7   29.4  552.1   41.6  158.7 1198.0  478.4
## [11]  160.7  167.9  211.6  493.3  195.6  649.8  644.0  534.1  212.7  413.3
## [21]  150.7  451.2   25.8  118.8  496.1  486.1  846.8  139.9     NA   98.3
## [31]  113.8  230.7  530.2   26.6  390.3  782.8  886.7   83.4   44.0   71.2
## [41]  657.0  321.6  188.6  451.2  485.3     NA  144.9  249.6  682.0  601.9
## [51]  330.5  216.6  500.3   20.5  271.5  196.7  309.4  235.7  171.7  124.9
## [61]  293.6  345.4  243.5   75.1  508.3  442.4  531.3  317.4  647.9  562.0
## [71]  366.5   37.1  231.6   73.7  526.3     NA  165.6  105.8   77.8  211.6
## [81]  125.8   76.5   58.2  111.9  238.5   31.6  156.8  191.7  402.5  108.9
## [91]  183.7  314.4   90.0   98.9  144.9  971.4

summary(myData$Aldo.Us)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    20.5   118.8   230.7   305.5   478.4  1198.0       3

Now we have 3 NAs. We want to find them and get rid of them. From the screen we could figure out where the NAs were and manually replace them. This is OK on such a small data set but when you start dealing with data sets having thousands or millions of rows, approaches like this are impractical. So, let's do it right.

If we naively try to use an equality we find out nothing.

which(myData$Aldo.Us==NA)

## integer(0)

Hunh? Whasgoinon?

This occurs because NA means “unknown”. Think about it this way. If one patient's result is NA and another patient's result is NA, then are the results equal? No, they are not (necessarily) equal, they are both unknown and so the comparison should be unknown also. This is why we do not get a result of TRUE when we ask the following question:

NA==NA

## [1] NA

So, when we ask R if unknown #1 is equal to unknown #2, it responds with “I dunno.”, or “NA”. So if we want to find the NAs, we should inquire as follows:

is.na(myData$Aldo.Us)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

or, for less verbose output:

which(is.na(myData$Aldo.Us))

## [1] 29 46 76
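
The same behaviour can be checked in isolation on a three-element vector (v is just an illustrative name):

```r
v <- c(5, NA, 7)

v == NA            # NA NA NA: every comparison with NA is itself unknown
which(v == NA)     # integer(0), because which() only reports TRUE
is.na(v)           # FALSE TRUE FALSE
which(is.na(v))    # 2
anyNA(v)           # TRUE: a quick yes/no check
sum(is.na(v))      # 1: count of missing values
```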

Hey Hey! Ho Ho! Those NAs have got to go!

Now we know where they are, in rows 29, 46, and 76. We can replace them with 0, which is OK but may pose problems if we use weighted regression (i.e. if we have a 0 in the x-data and we weight data by 1/x). Alternatively, we can delete the rows entirely.

To replace them with 0, we can write:

myData$Aldo.Us[which(is.na(myData$Aldo.Us))] <- 0

and this is equivalent:

myData$Aldo.Us[is.na(myData$Aldo.Us)] <- 0

To remove the whole corresponding row, we can write:

myDataBeGoneNA <- myData[-which(is.na(myData$Aldo.Us)),]

or:

myDataBeGoneNA <- myData[!is.na(myData$Aldo.Us),]
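
The two forms are not quite interchangeable. When a column contains no NAs at all, which() returns an empty vector, and negative indexing with an empty vector selects nothing rather than everything. A small sketch on a made-up data frame:

```r
clean <- data.frame(id = 1:3, value = c(1.5, 2.5, 3.5))   # no NAs at all

nrow(clean[-which(is.na(clean$value)), ])   # 0: every row is dropped
nrow(clean[!is.na(clean$value), ])          # 3: nothing is dropped
```

For that reason the logical form with !is.na() is the safer habit in scripts that run unattended.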

Complete Cases

What if there were NA's hiding all over the place in multiple columns and we wanted to banish any row containing one or more NA? In this case, the complete.cases() function is one way to go:

complete.cases(myData)

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [34]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [45]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [56]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [67]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [78]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [89]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

This function shows us which rows have no NAs (the ones with TRUE as the result) and which rows have NAs (the three with FALSE). We can banish all rows containing any NAs generally as follows:

myDataBeGoneNA <- myData[complete.cases(myData),]

This data set now has 93 rows:

nrow(myDataBeGoneNA)

## [1] 93

You could peruse the excluded data like this:

myData[!complete.cases(myData),]

##    Sample.Num Aldo.Us Aldo.Them
## 29         29      NA       6.6
## 46         46      NA       7.0
## 76         76      NA       5.7

na.omit()

Another way to remove incomplete cases is the na.omit() function (as Dr. Shannon Haymond pointed out to me). So this works too:

myDataBeGoneNA <- na.omit(myData)
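
A small sketch on a made-up data frame with NAs in more than one column shows that the two approaches keep the same rows, and that na.omit() additionally records what it removed:

```r
d <- data.frame(a = c(1, NA, 3, 4), b = c("w", "x", NA, "z"))

cc <- d[complete.cases(d), ]
no <- na.omit(d)

identical(rownames(cc), rownames(no))   # TRUE: rows "1" and "4" survive in both
attr(no, "na.action")                   # records the dropped rows (2 and 3)

# complete.cases() can also be restricted to the columns that matter.
d[complete.cases(d[, "a", drop = FALSE]), ]   # keeps row 3 despite the NA in b
```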

Row Numbers are Actually Names

In all of these approaches, you will notice something peculiar. Even though we have excluded the three rows, the row numbering still appears to imply that there are 96 rows:

tail(myDataBeGoneNA)

##    Sample.Num Aldo.Us Aldo.Them
## 91         91   183.7     170.4
## 92         92   314.4     307.6
## 93         93    90.0     214.0
## 94         94    98.9      75.1
## 95         95   144.9     129.3
## 96         96   971.4     807.7

but if you check the dimensions, there are 93 rows:

nrow(myDataBeGoneNA)

## [1] 93

Why? This is because the row numbers are not row numbers; they are numerical row names. When you exclude a row, none of the other row names change. This was bewildering to me in the beginning. I thought my exclusions had failed somehow.
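
If the gaps in the labels bother you, the row names can be reset. A minimal sketch on a made-up data frame:

```r
d <- data.frame(v = c(10, NA, 30, 40))
kept <- d[!is.na(d$v), , drop = FALSE]

rownames(kept)          # "1" "3" "4": the original labels survive
nrow(kept)              # 3

rownames(kept) <- NULL  # reset to consecutive labels
rownames(kept)          # "1" "2" "3"
```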

Now we can move on

Once this is done, you can go on and do your regression, which, in this case, looks like this.

Comparison of Serum Aldosterone

Finally, if you are ever wondering what fraction of your data is comprised of NA, rather than the absolute number, you can do this as follows:

mean(is.na(myData$Aldo.Us))

## [1] 0.03125

If you applied this to the whole dataframe, you get the fraction of NA's in the whole dataframe (again–thank you Shannon):

mean(is.na(myData))

## [1] 0.01041667
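
This works because TRUE counts as 1 and FALSE as 0, so the mean of a logical vector is the proportion of TRUEs. A sketch on a made-up data frame, with colMeans() added to give the fraction per column:

```r
d <- data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, 7, 8), c = 1:4)

mean(is.na(d$a))     # 0.25: one NA out of four values
mean(is.na(d))       # 0.25: three NAs out of twelve cells
colMeans(is.na(d))   # a 0.25, b 0.50, c 0.00: fraction missing per column

# The post's own numbers follow the same arithmetic.
3 / 96               # 0.03125
3 / (96 * 3)         # 0.01041667
```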

Final Thought:

is.na(newunderthesun)

## [1] TRUE

Ecclesiastes 1:9.

-Dan

Unit Converter

September 2, 2015 | September 5, 2015 | Stephen Master

Introduction

Dan continues to crank out book chapter-length posts, which probably means that I should jump in before getting further behind…so here we go.

In the next few posts, I’d like to cover some work to help you to process aggregated proficiency testing (PT) data. Interpreting PT data from groups such as the College of American Pathologists (CAP) is, of course, a fundamental task for lab management. Comparing your lab’s results to peer group data from other users of the same instrumentation helps to ensure that your patients receive consistent results, and it provides at least a crude measure to ensure that your instrument performance is “in the ballpark”. Of course, many assays show significant differences between instrument models and manufacturers that can lead to results that are not comparable as a patient moves from institution to institution (or when your own lab changes instruments!). There are a number of standardization and harmonization initiatives underway (see http://harmonization.net, for example) to address this, and understanding which assays show significant bias compared to benchmark studies or national guidelines is a critical task for laboratorians. All of this is further complicated by the fact that sample matrix can significantly affect assay results, and sample commutability is one important reason why we can’t just take, say, CAP PT survey results (not counting the accuracy-based surveys) and determine which assays aren’t harmonized.

However.

With all of those caveats, it can still be useful to look through PT data in a systematic way to compare instruments. Ideally, we’d like to have everything in an R-friendly format that would allow us to ask systematic questions about data (things like “for how many assays does instrument X differ from instrument Y by >30% using PT material?”, or “how many PT materials give good concordance across all manufacturers?”). If we have good, commutable, accuracy-based testing materials, we can do even better. The first task in all of this fun, however, is getting the data into a format that R is happy with; no one I know likes the idea of retyping numbers from paper reports. I’m hoping to talk more about this in a future post, as there are lots of fun R text processing issues lurking here. In the meantime, though, we have a much more modest preliminary task to tackle.

Simple unit conversion

I’m currently staring at a CAP PT booklet. It happens to be D-dimer, but you can pick your own favorite analyte (and PT provider, for that matter). Some of the results are in ng/mL, some are ug/mL, and one is in mg/L. Let’s create an R function that allows us to convert between sets of comparable units. Now, although I know that Dan is in love with SI units (#murica), we’ll start by simply converting molar→molar and gravimetric→gravimetric. Yes, we can add fancy analyte-by-analyte conversion tables in the future…but right now we just want to get things on the same scale. In the process, we’ll cover three useful R command families.

First of all, we should probably decide how we want the final function to look. I’m thinking of something like this:

results <- labunit.convert(2.3, "mg/dL", "g/L")
results

## [1] 0.023

…which converts 2.3 mg/dL to 0.023 g/L. We should also give ourselves bonus points if we can make it work with vectors. For example, we may have this data frame:

mydata

##   Value   Units Target.Units
## 1  2.30    g/dL         mg/L
## 2 47.00 nmol/mL      mmol/dL
## 3  0.19    IU/L        mIU/L

and we would like to be able to use our function like this:

labunit.convert(mydata$Value, mydata$Units, mydata$Target.Units)

## [1] 2.3e+04 4.7e-03 1.9e+02

We should also handle simpler cases:

labunit.convert(0.23, "g", "mg")

## [1] 230

Getting started

Now that we know where we’re going, let’s start by writing a function that just converts between two units and returns the log difference. We’ll call this function convert.one.unit(), and it will take two arguments:

convert.one.unit("mg", "ng")

## [1] 6

Basically, we want to take a character variable (like, say, “dL”) and break it into two pieces: the metric prefix (“d”) and the base unit (“L”). If it isn’t something we recognize, the function should quit and complain (you could also make it return ‘NA’ and just give a warning instead, but we’ll hold off on that for now). We’ll start with a list of things that we want to recognize.

convert.one.unit <- function (unitin, unitout) {
  metric.prefixes <- c("y", "z", "a", "f", "p", "n", "u", "m", "c", "d", "", "da", "h", "k", "M", "G", "T", "P", "E", "Z", "Y")
  metric.logmultipliers <- c(-24, -21, -18, -15, -12, -9, -6, -3, -2, -1, 0, 1, 2, 3, 6, 9, 12, 15, 18, 21, 24)
  units.for.lab <- c("mol", "g", "L", "U", "IU")

Notice that the metric.prefixes variable contains the appropriate one- or two-character prefixes, and metric.logmultipliers has the corresponding log multiplier (for example, metric.prefixes[8] = “m”, and metric.logmultipliers[8] is -3). It’s also worth noting the "" (metric.prefixes[11]), which corresponds to a log multiplier of 0. The fact that "" is a zero-length string instead of a null means that we can search for it in a vector…which will be very handy!
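As a quick check of the lookup idea, here is a minimal sketch run at the top level (outside the function), with the two vectors copied from above. The names idx.milli, idx.none and log.diff are made up for illustration only.

```r
# The same two lookup vectors as in the function body above
metric.prefixes <- c("y", "z", "a", "f", "p", "n", "u", "m", "c", "d", "", "da", "h", "k", "M", "G", "T", "P", "E", "Z", "Y")
metric.logmultipliers <- c(-24, -21, -18, -15, -12, -9, -6, -3, -2, -1, 0, 1, 2, 3, 6, 9, 12, 15, 18, 21, 24)

# match() returns the position of the first exact (case-sensitive) match
idx.milli <- match("m", metric.prefixes)   # 8
idx.none <- match("", metric.prefixes)     # 11, the zero-length "no prefix" entry

metric.logmultipliers[idx.milli]   # -3
metric.logmultipliers[idx.none]    # 0

# log difference between milli and nano, i.e. the factor for mg -> ng
log.diff <- metric.logmultipliers[idx.milli] - metric.logmultipliers[match("n", metric.prefixes)]
10^log.diff   # 1e+06
```

Because match() is case-sensitive, "m" (milli) and "M" (mega) land on different entries, which is exactly what we need.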

And now for some regular expressions

This is the point where we tackle the first of the three command families that I told you about. If you’re not familiar with “regular expressions” in R or another language (Perl, Python, whatever), this is your entry point into some very useful text searching capabilities. Basically, a regular expression is a way of specifying a search for a matching text pattern, and it’s used with a number of R commands (grep(), grepl(), gsub(), regexpr(), regexec(), etc.). We’ll use gsub() as an example, since it’s one that many people are familiar with. Suppose that I have the character string “This is not a test”, and I want to change it to “This is a test”. I can feed gsub() a pattern that I want to recognize and some text that I want to use to replace the pattern. For example:

my.string <- "This is not a test"
my.altered.string <- gsub("not a ", "a ", my.string)   # replace "not a " with "a "
my.altered.string

## [1] "This is a test"

That’s fine as far as it goes, but we will drive ourselves crazy if we’re limited to explicit matches. What if, for example, we also want to recognize “This is not…a test”, or “This is not my kind of a test”? We could write three different gsub statements, but that would get old fairly quickly. Instead of exactly matching the text, we’ll use a pattern. A regular expression that will match all three of our input statements is "not.+a ", so we can do the following (replacing each match with "a " so that the article survives):

gsub("not.+a ", "a ", "This is not a test")

## [1] "This is a test"

gsub("not.+a ", "a ", "This is not my kind of a test")

## [1] "This is a test"

You can read the regular expression "not.+a " as “match the letters ‘not’, followed by one or more of any character (denoted by ‘.+’), followed by an ‘a’ and a space”. You can find some very nice tutorials on regular expressions through Google, but for the purposes of this brief lesson I’ll give you a mini-cheat sheet that probably handles 90% of the regular expressions that I have to write:

Special Character	Meaning
.	match any character
\d	match any digit
\D	match anything that isn’t a digit
\s	match white space
\S	match anything that isn’t white space
\t	match a tab (less important in R, since you usually already have things in a data frame)
^	match the beginning of the string (i.e. “^Bob” matches “Bob my uncle” but not “Uncle Bob”)
$	match the end of the string
*	match the previous thing when it occurs 0 or more times
+	match the previous thing when it occurs 1 or more times
?	match the previous thing when it occurs 0 or 1 times
( .. )	(parentheses) enclose a group of choices or a particular substring in the match
|	match this OR that (e.g. “(Bob|Pete)” matches “Dr. Bob Smith” or “Dr. Pete Jones” but not “Dr. Sam Jones”)

It’s also important to remember for things like "\d" that R uses backslashes as the escape character…so you actually have to write a double backslash, like this: "\\d". A regular expression to match one or more digits would be "\\d+".
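To make the cheat sheet a little more concrete, here is a small sketch that exercises a few of the entries with grepl() and gsub(). The example strings and variable names are invented for illustration.

```r
lot.labels <- c("lot 42", "lot A", "7 vials")

# "\\d+" : one or more digits anywhere in the string
has.digits <- grepl("\\d+", lot.labels)
has.digits   # TRUE FALSE TRUE

# "^Bob" : only matches when "Bob" is at the start of the string
starts.with.bob <- grepl("^Bob", c("Bob my uncle", "Uncle Bob"))
starts.with.bob   # TRUE FALSE

# "(Bob|Pete)" : either alternative will do
bob.or.pete <- grepl("(Bob|Pete)", c("Dr. Bob Smith", "Dr. Pete Jones", "Dr. Sam Jones"))
bob.or.pete   # TRUE TRUE FALSE

# gsub() replaces every match, not just the first one
masked <- gsub("\\d+", "#", "lot 42, vial 7")
masked   # "lot #, vial #"
```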

OK, back to work. Our next step is to remove all white space from the unit text (we want "dL" to be handled the same way as " dL" or "dL "), so we’ll add the following lines:

  unitin <- gsub("\\s", "", unitin)
  unitout <- gsub("\\s", "", unitout)

See what we’ve done? We asked gsub() to replace every instance of white space (the regular expression is "\\s") with "". Easy.
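If you want to convince yourself, here is a tiny sketch on a made-up vector of messy unit strings. It also contrasts gsub() with trimws(), which (assuming R 3.2.0 or later) only strips leading and trailing white space.

```r
messy.units <- c(" dL", "dL ", "d L")

# gsub() with "\\s" removes every white space character, wherever it sits
squeezed <- gsub("\\s", "", messy.units)
squeezed   # "dL" "dL" "dL"

# trimws() leaves interior white space alone
trimmed <- trimws(messy.units)
trimmed    # "dL" "dL" "d L"
```

For unit strings we want the more aggressive gsub() behaviour, since a stray space inside "d L" would otherwise make the pattern match fail.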

Paste, briefly

Next, we want to put together a regular expression that will detect any of our metric.prefixes or units.for.lab. To save typing, we’ll do it with paste(), the second of our three R command families for the day. You probably already know about paste(), but if not, it’s basically the way to join R character variables into one big string. paste("Hi", "there") gives “Hi there” (paste() defaults to joining things with a space), paste("Super", "cali", "fragi", "listic", sep="") changes the separator to "" and gives us “Supercalifragilistic”. paste0() does the same thing as paste(..., sep=""). The little nuance that it’s worth noting today is that we are going to join together elements from a single vector rather than a bunch of separate variables…so we need to use the collapse = "..." option, where we set collapse to whatever character we want. You remember from the last section that | (OR) lets us put a bunch of alternative matches into our regular expression, so we will join all of the prefixes like this:

  prefix.combo <- paste0(metric.prefixes, collapse = "|")
  prefix.combo

## [1] "y|z|a|f|p|n|u|m|c|d||da|h|k|M|G|T|P|E|Z|Y"
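The difference between sep and collapse trips a lot of people up, so here is a short sketch using a deliberately tiny prefix vector (toy.prefixes and the other names are hypothetical, not part of our function).

```r
toy.prefixes <- c("m", "u", "n")

# sep joins the arguments element by element: the result is still a vector of length 3
with.sep <- paste(toy.prefixes, "g", sep = "")
with.sep        # "mg" "ug" "ng"

# collapse joins the elements of one vector into a single string
with.collapse <- paste(toy.prefixes, collapse = "|")
with.collapse   # "m|u|n"

# the two can be nested to build up a small regular expression
toy.pattern <- paste0("^(", with.collapse, ")?g$")
toy.pattern     # "^(m|u|n)?g$"
```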

What we’re really after is a regular expression that matches the beginning of the string, followed by 0 or 1 matches to one of the prefixes, followed by a match to one of the units. Soooo…

  prefix.combo <- paste0(metric.prefixes, collapse = "|")
  unit.combo <- paste0(units.for.lab, collapse = "|")
  
  unit.search <- paste0("^(", prefix.combo, ")?(", unit.combo, ")$")

  unit.search

## [1] "^(y|z|a|f|p|n|u|m|c|d||da|h|k|M|G|T|P|E|Z|Y)?(mol|g|L|U|IU)$"

So much nicer than trying to type that by hand. Next we’ll do actual pattern matching using the regexec() command. regexec(), as the documentation so nicely states, returns a list of vectors of substring matches. This is useful, since it means that we’ll get one match for the prefix (in the first set of parentheses of our regular expression), and one match for the units (in the second set of parentheses of our regular expression). I don’t want to belabor the details of this, but if we feed the output of regexec() to the regmatches() command, we can pull out one string for our prefix and another for our units. Since these are returned as a list, we’ll also use unlist() to coerce our results into one nice vector. If the length of that vector is 0, indicating no match, an error is generated.

  match.unit.in <- unlist(regmatches(unitin, regexec(unit.search, unitin)))
  match.unit.out <- unlist(regmatches(unitout, regexec(unit.search, unitout)))
  
  if (length(match.unit.in) == 0) stop(paste0("Can't parse input units (", unitin, ")"))
  if (length(match.unit.out) == 0) stop(paste0("Can't parse output units (", unitout, ")"))
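To see what regexec() and regmatches() hand back, here is a sketch with a cut-down pattern (toy.search, a simplified stand-in for unit.search with only three prefixes and two units).

```r
toy.search <- "^(m|u|n)?(g|L)$"

# a successful match: full match first, then one entry per parenthesized group
hit <- unlist(regmatches("mg", regexec(toy.search, "mg")))
hit    # "mg" "m" "g"

# no match at all: regmatches() returns an empty vector, so the length is 0
miss <- unlist(regmatches("furlong", regexec(toy.search, "furlong")))
length(miss)   # 0
```

That zero-length result is what the two stop() checks above are testing for.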

If we were to take a closer look at match.unit.in, we would see that the first entry is the full match, the second entry is the prefix match, and the third entry is the unit match. To make sure that the units agree (i.e. that we’re not trying to convert grams into liters or something similar), we use:

  if (match.unit.in[3] != match.unit.out[3]) stop("Base units don't match")

…and then finish by using the match() command to find the index in the metric.prefixes vector corresponding to the correct prefix (note that if there’s no prefix matched, it matches the "" entry of the vector–very handy). That index allows us to pull out the corresponding log multiplier, and we then return the difference to get a conversion factor. Our final function looks like this [1]:

convert.one.unit <- function (unitin, unitout) {
  # the prefix codes for the metric system
  metric.prefixes <- c("y", "z", "a", "f", "p", "n", "u", "m", "c", "d", "", "da", "h", "k", "M", "G", "T", "P", "E", "Z", "Y")
  # ...and their corresponding log multipliers
  metric.logmultipliers <- c(-24, -21, -18, -15, -12, -9, -6, -3, -2, -1, 0, 1, 2, 3, 6, 9, 12, 15, 18, 21, 24)
  # The units that we'd like to detect.  I guess we could add distance, but that's not too relevant to most of the analytes that I can think of
  units.for.lab <- c("mol", "g", "L", "U", "IU")

  # remove white space
  unitin <- gsub("\\s", "", unitin)
  unitout <- gsub("\\s", "", unitout)
  
  # build the pieces of our regular expression...
  prefix.combo <- paste0(metric.prefixes, collapse = "|")
  unit.combo <- paste0(units.for.lab, collapse = "|")

  # ...and stitch it all together
  unit.search <- paste0("^(", prefix.combo, ")?(", unit.combo, ")$")

  # identify the matches
  match.unit.in <- unlist(regmatches(unitin, regexec(unit.search, unitin)))
  match.unit.out <- unlist(regmatches(unitout, regexec(unit.search, unitout)))
  
  if (length(match.unit.in) == 0) stop(paste0("Can't parse input units (", unitin, ")"))
  if (length(match.unit.out) == 0) stop(paste0("Can't parse output units (", unitout, ")"))
  
  if (match.unit.in[3] != match.unit.out[3]) stop("Base units don't match")
  
  # get the appropriate log multipliers
  logmult.in <- metric.logmultipliers[match(match.unit.in[2], metric.prefixes)]
  logmult.out <- metric.logmultipliers[match(match.unit.out[2], metric.prefixes)]
  
  # return the appropriate (log) conversion factor
  return(logmult.in - logmult.out)
}


# Try it out
convert.one.unit("mL","L")

## [1] -3

‘Apply’-ing yourself

We’re actually most of the way there now. The final family of commands that we’d like to use is apply(), with various flavors that allow you to repeatedly apply (no surprise) a function to many entries of a variable. Dan mentioned this in his last post. He also mentioned not understanding the bad press that for loops get when they’re small. I completely agree with him, but the issue tends to arise when you’re used to a language like C (yes, I know we’re talking about compiled vs. interpreted in that case), where your loops are blazingly fast. You come to R and try nested loops that run from 1:10000, and then you have to go for coffee. lapply(), sapply(), mapply(), apply(), etc. have advantages in the R world. Might as well go with the flow on this one.

We’re going to make a convert.multiple.units() function that takes unitsin and unitsout vectors, binds them together as two columns, and then runs apply() to feed them to convert.one.unit(). Because apply() lets us iterate a function over either dimension of a matrix, we can bind the two columns (a vector of original units and a vector of target units) and then iterate over each pair by rows (that’s what the 1 means as the second argument of apply(): it applies the function by row). If the anonymous function syntax throws you off…let us know in the comments, and we’ll cover it some time. For now, just understand that the last part of the line feeds values to the convert.one.unit() function.

convert.multiple.units <- function (unitsin, unitsout) {
  apply(cbind(unitsin, unitsout), 1, function (x) {convert.one.unit(x[1], x[2])})
}
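If the row-wise apply() is hard to picture, here is a stand-alone sketch that swaps convert.one.unit() for a harmless paste() call, so you can see exactly which pair of values reaches the anonymous function on each pass. The names units.in, units.out, by.row and by.pair are made up for this example.

```r
units.in <- c("mg", "dL")
units.out <- c("g", "L")

# cbind() gives a 2 x 2 character matrix: one row per (input, output) pair
unit.pairs <- cbind(units.in, units.out)

# MARGIN = 1 means "by row", so x is c(input, output) for one pair at a time
by.row <- apply(unit.pairs, 1, function (x) {paste(x[1], "->", x[2])})
by.row    # "mg -> g" "dL -> L"

# mapply() walks the two vectors in parallel and gets to the same place
by.pair <- mapply(function (a, b) {paste(a, "->", b)}, units.in, units.out, USE.NAMES = FALSE)
by.pair   # "mg -> g" "dL -> L"
```

Either approach would work here; apply() over a bound matrix is simply the one this post happens to use.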

Finally, we’ll go back to our original labunit.convert() function. Our overall plan is to split each unit by recognizing the “/” character using strsplit(). This returns a list of vectors of split groups (i.e. “mg/dL” becomes a list whose first element is the character vector (“mg”, “dL”)). We then make sure that the lengths match (i.e. if the input is “mg/dL” and the output is “g/mL” that’s OK, but if the output is “g” then that’s a problem), obtain all the multipliers, and then add them all up. We add because they’re logs…and actually we mostly subtract, because we’re dividing. For cuteness points, we return 2*x[1] - sum(x) (the numerator’s multiplier minus the multiplier of every denominator), which will accurately calculate not only conversions like mg→g and mg/dL→g/L, but will even do crazy stuff like U/g/L→mU/kg/dL. Don’t ask me why you’d want to do that, but it works. The final multiplier is used to convert the vector of values (good for you if you notice that we didn’t check to make sure that the length of the values vector matched the unitsin vector…but we can always recycle our values that way).

labunit.convert <- function (values, unitsin, unitsout) {
  insep <- strsplit(unitsin, "/")
  outsep <- strsplit(unitsout, "/")

  lengthsin <- sapply(insep, length)
  lengthsout <- sapply(outsep, length)
  
  if (!all(lengthsin == lengthsout)) stop("Input and output units can't be converted")

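  # keep one vector of log multipliers per unit string; SIMPLIFY = FALSE stops
  # mapply() from flattening them when every unit has only one part (e.g. "g")
  multipliers <- mapply(convert.multiple.units, insep, outsep, SIMPLIFY = FALSE)
  
  # numerator multiplier minus all of the denominator multipliers
  final.multiplier <- sapply(multipliers, function (x) {2*x[1] - sum(x)})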
  
  return(values * 10^final.multiplier)
}
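The 2*x[1] - sum(x) trick is easier to trust once you have worked it by hand. The sketch below does not call our functions at all; it just splits a couple of unit strings and then plugs in the log multipliers that convert.one.unit() would return for each part (those hand-entered numbers are the assumption here).

```r
# strsplit() returns one character vector per input string
unit.parts <- strsplit(c("mg/dL", "g"), "/")
n.parts <- sapply(unit.parts, length)
n.parts   # 2 1

# mg/dL -> g/L: mg -> g is -3, dL -> L is -1
x <- c(-3, -1)
log.factor <- 2 * x[1] - sum(x)   # same as x[1] - x[2]
log.factor                        # -2
2.3 * 10^log.factor               # 0.023

# U/g/L -> mU/kg/dL: U -> mU is 3, g -> kg is -3, L -> dL is 1
x3 <- c(3, -3, 1)
log.factor3 <- 2 * x3[1] - sum(x3)   # same as x3[1] - x3[2] - x3[3]
log.factor3                          # 5
```

In other words, the first element (the numerator) counts positively and everything after a “/” counts negatively, which is all that dividing by a unit means on a log scale.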

OK, enough. Back over to you, Dan. We now have a piece of code that we can use when we start comparing PT data from different instruments. That’s the immediate plan for future posts [2], and before long there may even be an entry with nice graphics like those of my Canadian colleague.

-SRM

[1] I received a request to convert “G/L” to “M/mL”, which was interpreted as converting billions/L to millions/mL. This requires changing our convert.one.unit() function to handle a “no units” case. Actually, it’s not as difficult as it sounds; if we just add an empty string (i.e. "") to the end of the units.for.lab vector, our regular expression does the right thing. Your edited line would read units.for.lab <- c("mol", "g", "L", "U", "IU", ""). The reason this works, incidentally, is that there’s no overlap (except "") between the prefixes and the units, so the pattern match doesn’t have a chance to be confused.
[2] Following Dan’s lead, I should point out that a major caveat to any such plans is James 4:13-15. Double extra credit if you are interested enough to look it up.

Trust in the Lord with all your heart and lean not on your own understanding; in all your ways submit to him, and he will make your paths straight. Proverbs 3:5-6
