dtholmes@mail.ubc.ca

Deming and Passing Bablok Regression in R

September 14, 2015September 21, 2015 dtholmes@mail.ubc.ca

Regression Methods

In this post we will be discussing how to perform Passing Bablok and Deming regression in R. Those who work in Clinical Chemistry know that these two approaches are required by the journals in the field. The idiosyncratic affection for these two forms of regression appears to be historical but this is something unlikely to change in my lifetime–hence the need to cover it here.

Along the way, we shall touch on the ways in which Deming and Passing Bablok differ from ordinary least squares (OLS) and from one another.

Creating some random data

Let's start by making some heteroscedastic random data that we can use for regression. We will use the command set.seed() to begin with because by this means, the reader can generate the same random data as the post. This function takes any number you wish as its argument, but if you set the same seed, you will get the same random numbers. We will generate 100 random $x$ values in the uniform distribution and then an accompanying 100 random $y$ values with proportional bias, constant bias and random noise that increases with $x$. I have added a bit of non–linearity because we do see this a fair bit in our work.

set.seed(20)
x <- runif(100,0,100)
y <- 1.10*x - 0.001*x^2 + rnorm(100,0,1)*(2 + 0.05*x) + 15

set.seed(20)

x <- runif(100,0,100)

y <- 1.10*x - 0.001*x^2 + rnorm(100,0,1)*(2 + 0.05*x) + 15

The constants I chose are arbitrary. I chose them to produce something resembling a comparison of, say, two automated immunoassays.

Let's quickly produce a scatter plot to see what our data looks like:

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")

1	plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")

plot of chunk unnamed-chunk-2

Residuals in OLS

OLS regression minimizes the sum of squared residuals. In the case of OLS, the residual of a point is defined as the vertical distance from that point to the regression line. The regression line is chosen so that the sum of the squares of the residuals in minimal.

OLS regression assumes that there is no error in the $x$–axis values and that there is no heteroscedasticity, that is, the scatter of $y$ is constant. Neither of these assumptions is true in the case of bioanaytical method comparisons. In contrast, for calibration curves in mass–spectrometry, a linear response is plotted as a function of pre–defined calibrator concentration. This means that the $x$–axis has very little error and so OLS regression is an appropriate choice (though I doubt that the assumption about homoscedasticity is generally met).

OLS is part of R's base package. We can find the OLS regression line using lm() and we will store the results in the variable lin.reg.

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")
lin.reg <- lm(y~x)
abline(lin.reg, col="blue")

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")

lin.reg <- lm(y~x)

abline(lin.reg, col="blue")

plot of chunk unnamed-chunk-3

Just to demonstrate the point about residuals graphically, the following shows them in vertical red lines.

plot of chunk unnamed-chunk-4

Deming Regression

Deming regression differs from OLS regression in that it does not make the assumption that the $x$ values are free of error. It (more or less) defines the residual as the perpendicular distance from a point to its fitted value on the regression line.

Deming regression does not come as part of R's base package but can be performed using the MethComp and mcr packages. In this case, we will use the latter. If not already installed, you must install the mcr package with install.packages("mcr").

Then to perform Deming regression, we will load the mcr library and execute the following using the mcreg() command, storing the output in the variable dem.reg.

library(mcr)
dem.reg <- mcreg(x,y, method.reg = "Deming")

1 2	library(mcr) dem.reg <- mcreg(x,y, method.reg = "Deming")

By performing the str() command on dem.reg, we can see that the regression parameters are stored in the slot @para. Because the authors have used an S4 object as the output of their function, we don't address output as we would in lists (with a $), but rather with an @.

str(dem.reg)

1	str(dem.reg)

## Formal class 'MCResultResampling' [package "mcr"] with 21 slots
##   ..@ glob.coef  : num [1:2] 15.58 1.04
##   ..@ glob.sigma : num [1:2] 0.8165 0.0147
##   ..@ xmean      : num 46.8
##   ..@ nsamples   : int 999
##   ..@ nnested    : num 25
##   ..@ B0         : num [1:999] 15.9 15.4 16 16.1 15.6 ...
##   ..@ B1         : num [1:999] 1.01 1.04 1.02 1.04 1.03 ...
##   ..@ sigmaB0    : num [1:999] 0.794 0.766 0.846 0.815 0.737 ...
##   ..@ sigmaB1    : num [1:999] 0.0141 0.0142 0.0155 0.0141 0.0135 ...
##   ..@ MX         : num [1:999] 46.8 45.9 45.4 48.9 45.5 ...
##   ..@ bootcimeth : chr "quantile"
##   ..@ rng.seed   : num NA
##   ..@ rng.kind   : chr NA
##   ..@ data       :'data.frame':  100 obs. of  3 variables:
##   .. ..$ sid: Factor w/ 100 levels "S1","S10","S100",..: 1 13 24 35 46 57 68 79 90 2 ...
##   .. ..$ x  : num [1:100] 87.8 76.9 27.9 52.9 96.3 ...
##   .. ..$ y  : num [1:100] 110.8 93.5 45.6 76.6 116.6 ...
##   ..@ para       : num [1:2, 1:4] 15.58 1.04 NA NA 14.45 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "Intercept" "Slope"
##   .. .. ..$ : chr [1:4] "EST" "SE" "LCI" "UCI"
##   ..@ mnames     : chr [1:2] "Method1" "Method2"
##   ..@ regmeth    : chr "Deming"
##   ..@ cimeth     : chr "bootstrap"
##   ..@ error.ratio: num 1
##   ..@ alpha      : num 0.05
##   ..@ weight     : Named num [1:100] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:100] "S1" "S2" "S3" "S4" ...

## Formal class 'MCResultResampling' [package "mcr"] with 21 slots

## ..@ glob.coef : num [1:2] 15.58 1.04

## ..@ glob.sigma : num [1:2] 0.8165 0.0147

## ..@ xmean : num 46.8

## ..@ nsamples : int 999

## ..@ nnested : num 25

## ..@ B0 : num [1:999] 15.9 15.4 16 16.1 15.6 ...

## ..@ B1 : num [1:999] 1.01 1.04 1.02 1.04 1.03 ...

## ..@ sigmaB0 : num [1:999] 0.794 0.766 0.846 0.815 0.737 ...

## ..@ sigmaB1 : num [1:999] 0.0141 0.0142 0.0155 0.0141 0.0135 ...

## ..@ MX : num [1:999] 46.8 45.9 45.4 48.9 45.5 ...

## ..@ bootcimeth : chr "quantile"

## ..@ rng.seed : num NA

## ..@ rng.kind : chr NA

## ..@ data :'data.frame': 100 obs. of 3 variables:

## .. ..$ sid: Factor w/ 100 levels "S1","S10","S100",..: 1 13 24 35 46 57 68 79 90 2 ...

## .. ..$ x : num [1:100] 87.8 76.9 27.9 52.9 96.3 ...

## .. ..$ y : num [1:100] 110.8 93.5 45.6 76.6 116.6 ...

## ..@ para : num [1:2, 1:4] 15.58 1.04 NA NA 14.45 ...

## .. ..- attr(*, "dimnames")=List of 2

## .. .. ..$ : chr [1:2] "Intercept" "Slope"

## .. .. ..$ : chr [1:4] "EST" "SE" "LCI" "UCI"

## ..@ mnames : chr [1:2] "Method1" "Method2"

## ..@ regmeth : chr "Deming"

## ..@ cimeth : chr "bootstrap"

## ..@ error.ratio: num 1

## ..@ alpha : num 0.05

## ..@ weight : Named num [1:100] 1 1 1 1 1 1 1 1 1 1 ...

## .. ..- attr(*, "names")= chr [1:100] "S1" "S2" "S3" "S4" ...

dem.reg@para

1	dem.reg@para

##                EST SE       LCI       UCI
## Intercept 15.57790 NA 14.446677 16.810321
## Slope      1.03658 NA  1.006434  1.066066

## EST SE LCI UCI

## Intercept 15.57790 NA 14.446677 16.810321

## Slope 1.03658 NA 1.006434 1.066066

The intercept and slope are stored in demreg@para[1] and dem.reg@para[2] respectively. Therefore, we can add the regression line as follows:

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")
abline(dem.reg@para[1:2], col = "blue")

1 2	plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method") abline(dem.reg@para[1:2], col = "blue")

plot of chunk unnamed-chunk-7

To emphasize how the residuals are different from OLS we can plot them as before:

plot of chunk unnamed-chunk-8

We present the figure above for instructional purposes only. The usual way to present a residuals plot is to show the same picture rotated until the line is horizontal–this is a slight simplification but is essentially what is happening:

plot of chunk unnamed-chunk-9

Ratio of Variances

It is important to mention that if one knows that the $x$–axis method is subject to a different amount of random analytical variability than the $y$–axis method, one should provide the ratio of the variances of the two methods to mcreg(). In general, this requires us to have “CV” data from precision studies already available. Another approach is to perform every analysis in duplicate by both methods and use the data to estimate this ratio.

If the methods happen to have similar CVs throughout the analytical range, the default value of 1 is assumed. But suppose that the ratio of the CVs of the $x$ axis method to the $y$–axis method was 1.2, we could provide this in the regression call by setting the error.ratio parameter. The resulting regression parameters will be slightly different.

mcreg(x,y, method.reg = "Deming", error.ratio = 1.2)@para

1	mcreg(x,y, method.reg = "Deming", error.ratio = 1.2)@para

##                 EST SE      LCI       UCI
## Intercept 15.534921 NA 14.39904 16.777065
## Slope      1.037499 NA  1.00792  1.067316

## EST SE LCI UCI

## Intercept 15.534921 NA 14.39904 16.777065

## Slope 1.037499 NA 1.00792 1.067316

Weighting

In the case of heteroscedastic data, it would be customary to weight the regression which in the case of the mcr package is weighted as $1/x^2$. This means that having 0's in your $x$–data will cause the calculation to “crump”. In any case, if we wanted weighted regression parameters we would make the call:

w.dem.reg <- mcreg(x,y, method.reg = "WDeming")

1	w.dem.reg <- mcreg(x,y, method.reg = "WDeming")

## The global.sigma is calculated with Linnet's method

1	## The global.sigma is calculated with Linnet's method

w.dem.reg@para

1	w.dem.reg@para

##                 EST SE       LCI       UCI
## Intercept 13.788450 NA 12.858803 14.861006
## Slope      1.088119 NA  1.058042  1.116879

## EST SE LCI UCI

## Intercept 13.788450 NA 12.858803 14.861006

## Slope 1.088119 NA 1.058042 1.116879

And plotting both on the same figure:

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")
abline(dem.reg@para[1:2], col = "blue")
abline(w.dem.reg@para[1:2], col = "green")
legend("topleft", c("Deming","Weighted Deming"), lty=c(1,1), col = c("blue","green"))

plot(x,y, main = "Regression Comparison", xlab = "Current Method", ylab = "New Method")

abline(dem.reg@para[1:2], col = "blue")

abline(w.dem.reg@para[1:2], col = "green")

legend("topleft", c("Deming","Weighted Deming"), lty=c(1,1), col = c("blue","green"))

plot of chunk unnamed-chunk-12

Passing Bablok

Passing Bablok regression is not performed by the minimization of residuals. Rather, all possible pairs of $x$–$y$ points are determined and slopes are calculated using each pair of points. Work–arounds are undertaken for pairs of points that generate infinite slopes and other peculiarities. In any case, the median of the $\frac{N(N-1)}{2!}$ possible slopes becomes the final slope estimate and the corresponding intercept can be calculated. With regards to weighted Passing Bablok regression, I’d like to acknowledge commenter glen_b for bringing to my attention that there is a paradigm for calculating the weighted median of pairwise slopes. See the comment section for a discussion.

Passing Bablok regression takes a lot of computational time as the number of points grows, so expect some delays on data sets larger than $N=100$ if you are using an ordinary computer. To get the Passing Bablok regression equation, we just change the method.reg parameter:

PB.reg <- mcreg(x,y, method.reg = "PaBa")
PB.reg@para

1 2	PB.reg <- mcreg(x,y, method.reg = "PaBa") PB.reg@para

##                 EST SE       LCI       UCI
## Intercept 14.684463 NA 13.648554 16.495846
## Slope      1.046021 NA  1.015893  1.075632

## EST SE LCI UCI

## Intercept 14.684463 NA 13.648554 16.495846

## Slope 1.046021 NA 1.015893 1.075632

and the procedures to plot this regression are identical. The mcreg() function does have an option for Passing Bablok regression on large data sets. See the instructions by typing help("mcreg") in the R terminal.

Outlier Effects

As a consequence of the means by which the slope is determined, the Passing Bablok method is relatively resistant to the effect of outlier(s) as compared to OLS and Deming. To demonstrate this, we can add on outlier to some data scattered about the line $y=x$ and show how all three methods are affected.

x <- 1:20
y <- c(1:19,10) + rnorm(20,0,0.5)

1 2	x <- 1:20 y <- c(1:19,10) + rnorm(20,0,0.5)

plot of chunk unnamed-chunk-15

Because of this outlier, the OLS slope drops to 0.84, the Deming slope to 0.91, while the Passing Bablok is much better off at 0.99.

Generating a Pretty Plot

The code authors of the mcr package have created a feature such that if you put the regression model inside the plot function, you can quickly generate a figure for yourself that has all the required information on it. For example,

plot(PB.reg)

1	plot(PB.reg)

plot of chunk unnamed-chunk-16

But this method of out–of–the–box figure is not very customizable and you may want it to appear differently for your publication. Never fear. There is a solution. The MCResult.plot() function offers complete customization of the figure so that you can show it exactly as you wish for your publication.

MCResult.plot(PB.reg, equal.axis = TRUE, x.lab = "x method", y.lab = "y method", points.col = "#FF7F5060", points.pch = 19, ci.area = TRUE, ci.area.col = "#0000FF50", main = "My Passing Bablok Regression", sub = "", add.grid = FALSE, points.cex = 1)

1	MCResult.plot(PB.reg, equal.axis = TRUE, x.lab = "x method", y.lab = "y method", points.col = "#FF7F5060", points.pch = 19, ci.area = TRUE, ci.area.col = "#0000FF50", main = "My Passing Bablok Regression", sub = "", add.grid = FALSE, points.cex = 1)

custom mcr plot

In this example, I have created semi–transparent “darkorchid4” (hex = #68228B) points and a semi–transparent blue (hex = #0000FF) confidence band of the regression. Maybe darkorchid would not be my first choice for a publication after all, but it demonstrates the customization. Additionally, I have suppressed my least favourite features of the default plot method. Specifically, the sub="" term removes the sentence at the bottom margin and the add.grid = FALSE prevents the grid from being plotted. Enter help(MCResult.plot) for the complete low–down on customization.

Conclusion

We have seen how to perform Deming and Passing Bablok regression in the R programming language and have touched on how the methods differ “under the hood”. We have used the mcr to perform the regressions and have shown how you can beautify your plot.

The reader should have a look at the rlm() function in the MASS package and the rq() function in the quantreg package to see other robust (outlier–resistant) regression approaches. A good tutorial can be found here

I hope that makes it easy for you.

-Dan

May all your paths (and regressions) be straight:

Trust in the Lord with all your heart
and lean not on your own understanding;
in all your ways submit to him,
and he will make your paths straight.

Proverbs 3:5-6

NA NA NA NA, Hey Hey Hey, Goodbye

September 5, 2015September 7, 2015 dtholmes@mail.ubc.ca

Removing NA’s from a Data Frame in R

The Problem

Suppose you are doing a method comparison for which some results are above or below the linear range of your assay(s). Generally, these will appear in your spreadsheet (gasp!) program as $< x$ or $> y$ or, in the case of our mass spectrometer, “No Peak”. When you read these data into R using read.csv(), R will turn then into factors, which I personally find super–annoying and which inspired this conference badge (see bottom right) as I learned from University of British Columbia prof Jenny Bryan.

For this reason, when we read the data in, it is convenient to choose the option stringsAsFactors = FALSE. In doing so, the data will be treated as strings and be in the character class. But for regression comparison purposes, we need to make the data numeric and all of the $< x$ and $> y$ results will be converted to NA. In this post, we want to address a few questions that follow:

How do we find all the NA results?
How can we replace them with a numeric (like 0)?
How can we rid ourselves of rows containing NA?

Finding NA's

Let's read in the data which comes from a method comparison of serum aldosterone between our laboratory and Russ Grant's laboratory (LabCorp) published here. I'll read in the data with stringsAsFactors = FALSE. These are aldosterone results in pmol/L. To convert to ng/dL, divide by 27.7.

myData<-read.csv("Comparison.csv", sep=",", stringsAsFactors = FALSE)
str(myData)

1 2	myData<-read.csv("Comparison.csv", sep=",", stringsAsFactors = FALSE) str(myData)

## 'data.frame':    96 obs. of  3 variables:
##  $ Sample.Num: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Aldo.Us   : chr  "462.3" "433.2" "37.7" "137.7" ...
##  $ Aldo.Them : num  457.2 418.1 42.1 133.9 27.4 ...

## 'data.frame': 96 obs. of 3 variables:

## $ Sample.Num: int 1 2 3 4 5 6 7 8 9 10 ...

## $ Aldo.Us : chr "462.3" "433.2" "37.7" "137.7" ...

## $ Aldo.Them : num 457.2 418.1 42.1 133.9 27.4 ...

head(myData)

1	head(myData)

##   Sample.Num Aldo.Us Aldo.Them
## 1          1   462.3     457.2
## 2          2   433.2     418.1
## 3          3    37.7      42.1
## 4          4   137.7     133.9
## 5          5    29.4      27.4
## 6          6   552.1     639.7

## Sample.Num Aldo.Us Aldo.Them

## 1 1 462.3 457.2

## 2 2 433.2 418.1

## 3 3 37.7 42.1

## 4 4 137.7 133.9

## 5 5 29.4 27.4

## 6 6 552.1 639.7

You can see the problem immediately, our data (“Aldo.Us”) is a character vector. This is not good for regression. Why did this happen? We can find out:

myData$Aldo.Us

1	myData$Aldo.Us

##  [1] "462.3"   "433.2"   "37.7"    "137.7"   "29.4"    "552.1"   "41.6"   
##  [8] "158.7"   "1198"    "478.4"   "160.7"   "167.9"   "211.6"   "493.3"  
## [15] "195.6"   "649.8"   "644"     "534.1"   "212.7"   "413.3"   "150.7"  
## [22] "451.2"   "25.8"    "118.8"   "496.1"   "486.1"   "846.8"   "139.9"  
## [29] "No Peak" "98.3"    "113.8"   "230.7"   "530.2"   "26.6"    "390.3"  
## [36] "782.8"   "886.7"   "83.4"    "44"      "71.2"    "657"     "321.6"  
## [43] "188.6"   "451.2"   "485.3"   "No Peak" "144.9"   "249.6"   "682"    
## [50] "601.9"   "330.5"   "216.6"   "500.3"   "20.5"    "271.5"   "196.7"  
## [57] "309.4"   "235.7"   "171.7"   "124.9"   "293.6"   "345.4"   "243.5"  
## [64] "75.1"    "508.3"   "442.4"   "531.3"   "317.4"   "647.9"   "562"    
## [71] "366.5"   "37.1"    "231.6"   "73.7"    "526.3"   "No Peak" "165.6"  
## [78] "105.8"   "77.8"    "211.6"   "125.8"   "76.5"    "58.2"    "111.9"  
## [85] "238.5"   "31.6"    "156.8"   "191.7"   "402.5"   "108.9"   "183.7"  
## [92] "314.4"   "90"      "98.9"    "144.9"   "971.4"

## [1] "462.3" "433.2" "37.7" "137.7" "29.4" "552.1" "41.6"

## [8] "158.7" "1198" "478.4" "160.7" "167.9" "211.6" "493.3"

## [15] "195.6" "649.8" "644" "534.1" "212.7" "413.3" "150.7"

## [22] "451.2" "25.8" "118.8" "496.1" "486.1" "846.8" "139.9"

## [29] "No Peak" "98.3" "113.8" "230.7" "530.2" "26.6" "390.3"

## [36] "782.8" "886.7" "83.4" "44" "71.2" "657" "321.6"

## [43] "188.6" "451.2" "485.3" "No Peak" "144.9" "249.6" "682"

## [50] "601.9" "330.5" "216.6" "500.3" "20.5" "271.5" "196.7"

## [57] "309.4" "235.7" "171.7" "124.9" "293.6" "345.4" "243.5"

## [64] "75.1" "508.3" "442.4" "531.3" "317.4" "647.9" "562"

## [71] "366.5" "37.1" "231.6" "73.7" "526.3" "No Peak" "165.6"

## [78] "105.8" "77.8" "211.6" "125.8" "76.5" "58.2" "111.9"

## [85] "238.5" "31.6" "156.8" "191.7" "402.5" "108.9" "183.7"

## [92] "314.4" "90" "98.9" "144.9" "971.4"

Ahhh…it's the dreaded “No Peak”. This is what the mass spectrometer has put in its data file. So, let's force everything to numeric:

myData$Aldo.Us <- as.numeric(myData$Aldo.Us)

1	myData$Aldo.Us <- as.numeric(myData$Aldo.Us)

## Warning: NAs introduced by coercion

1	## Warning: NAs introduced by coercion

We see the warnings about the introduction of NAs. And we get:

myData$Aldo.Us

1	myData$Aldo.Us

##  [1]  462.3  433.2   37.7  137.7   29.4  552.1   41.6  158.7 1198.0  478.4
## [11]  160.7  167.9  211.6  493.3  195.6  649.8  644.0  534.1  212.7  413.3
## [21]  150.7  451.2   25.8  118.8  496.1  486.1  846.8  139.9     NA   98.3
## [31]  113.8  230.7  530.2   26.6  390.3  782.8  886.7   83.4   44.0   71.2
## [41]  657.0  321.6  188.6  451.2  485.3     NA  144.9  249.6  682.0  601.9
## [51]  330.5  216.6  500.3   20.5  271.5  196.7  309.4  235.7  171.7  124.9
## [61]  293.6  345.4  243.5   75.1  508.3  442.4  531.3  317.4  647.9  562.0
## [71]  366.5   37.1  231.6   73.7  526.3     NA  165.6  105.8   77.8  211.6
## [81]  125.8   76.5   58.2  111.9  238.5   31.6  156.8  191.7  402.5  108.9
## [91]  183.7  314.4   90.0   98.9  144.9  971.4

## [1] 462.3 433.2 37.7 137.7 29.4 552.1 41.6 158.7 1198.0 478.4

## [11] 160.7 167.9 211.6 493.3 195.6 649.8 644.0 534.1 212.7 413.3

## [21] 150.7 451.2 25.8 118.8 496.1 486.1 846.8 139.9 NA 98.3

## [31] 113.8 230.7 530.2 26.6 390.3 782.8 886.7 83.4 44.0 71.2

## [41] 657.0 321.6 188.6 451.2 485.3 NA 144.9 249.6 682.0 601.9

## [51] 330.5 216.6 500.3 20.5 271.5 196.7 309.4 235.7 171.7 124.9

## [61] 293.6 345.4 243.5 75.1 508.3 442.4 531.3 317.4 647.9 562.0

## [71] 366.5 37.1 231.6 73.7 526.3 NA 165.6 105.8 77.8 211.6

## [81] 125.8 76.5 58.2 111.9 238.5 31.6 156.8 191.7 402.5 108.9

## [91] 183.7 314.4 90.0 98.9 144.9 971.4

summary(myData$Aldo.Us)

1	summary(myData$Aldo.Us)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    20.5   118.8   230.7   305.5   478.4  1198.0       3

1 2	## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 20.5 118.8 230.7 305.5 478.4 1198.0 3

Now we have 3 NAs. We want to find them and get rid of them. From the screen we could figure out where the NAs were and manually replace them. This is OK on such a small data set but when you start dealing with data sets having thousands or millions of rows, approaches like this are impractical. So, let's do it right.

If we naively try to use an equality we find out nothing.

which(myData$Aldo.Us==NA)

1	which(myData$Aldo.Us==NA)

## integer(0)

1	## integer(0)

Hunh? Whasgoinon?

This occurs because NA means “unknown”. Think about it this way. If one patient's result is NA and another patient's result is NA, then are the results equal? No, they are not (necessarily) equal, they are both unknown and so the comparison should be unknown also. This is why we do not get a result of TRUE when we ask the following question:

NA==NA

NA==NA

## [1] NA

## [1] NA

So, when we ask R if unknown #1 is equal to unknown #2, it responds with “I dunno.”, or “NA”. So if we want to find the NAs, we should inquire as follows:

is.na(myData$Aldo.Us)

1	is.na(myData$Aldo.Us)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [23] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE

## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [45] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE

## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

or, for less verbose output:

which(is.na(myData$Aldo.Us))

1	which(is.na(myData$Aldo.Us))

## [1] 29 46 76

1	## [1] 29 46 76

Hey Hey! Ho Ho! Those NAs have got to go!

Now we know where they are, in rows 29, 46, and 76. We can replace them with 0, which is OK but may pose problems if we use weighted regression (i.e. if we have a 0 in the x-data and we weight data by 1/x). Alternatively, we can delete the rows entirely.

To replace them with 0, we can write:

myData$Aldo.Us[which(is.na(myData$Aldo.Us))] <- 0

1	myData$Aldo.Us[which(is.na(myData$Aldo.Us))] <- 0

and this is equivalent:

myData$Aldo.Us[is.na(myData$Aldo.Us)] <- 0

1	myData$Aldo.Us[is.na(myData$Aldo.Us)] <- 0

To remove the whole corresponding row, we can write:

myDataBeGoneNA <- myData[-which(is.na(myData$Aldo.Us)),]

1	myDataBeGoneNA <- myData[-which(is.na(myData$Aldo.Us)),]

or:

myDataBeGoneNA <- myData[!is.na(myData$Aldo.Us),]

1	myDataBeGoneNA <- myData[!is.na(myData$Aldo.Us),]

Complete Cases

What if there were NA's hiding all over the place in multiple columns and we wanted to banish any row containing one or more NA? In this case, the complete.cases() function is one way to go:

complete.cases(myData)

1	complete.cases(myData)

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [34]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [45]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [56]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [67]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [78]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [89]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

## [23] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE

## [34] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

## [45] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

## [56] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

## [67] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE

## [78] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

## [89] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

This function shows us which rows have no NAs (the ones with TRUE as the result) and which rows have NAs (the three with FALSE). We can banish all rows containing any NAs generally as follows:

myDataBeGoneNA <- myData[complete.cases(myData),]

1	myDataBeGoneNA <- myData[complete.cases(myData),]

This data set now has 93 rows:

nrow(myDataBeGoneNA)

1	nrow(myDataBeGoneNA)

## [1] 93

## [1] 93

You could peruse the excluded data like this:

myData[!complete.cases(myData),]

1	myData[!complete.cases(myData),]

##    Sample.Num Aldo.Us Aldo.Them
## 29         29      NA       6.6
## 46         46      NA       7.0
## 76         76      NA       5.7

## Sample.Num Aldo.Us Aldo.Them

## 29 29 NA 6.6

## 46 46 NA 7.0

## 76 76 NA 5.7

na.omit()

Another way to remove incomplete cases is the na.omit() function (as Dr. Shannon Haymond pointed out to me). So this works too:

myDataBeGoneNA <- na.omit(myData)

1	myDataBeGoneNA <- na.omit(myData)

Row Numbers are Actually Names

In all of these approaches, you will notice something peculiar. Even though we have excluded the three rows, the row numbering still appears to imply that there are 96 rows:

tail(myDataBeGoneNA)

1	tail(myDataBeGoneNA)

##    Sample.Num Aldo.Us Aldo.Them
## 91         91   183.7     170.4
## 92         92   314.4     307.6
## 93         93    90.0     214.0
## 94         94    98.9      75.1
## 95         95   144.9     129.3
## 96         96   971.4     807.7

## Sample.Num Aldo.Us Aldo.Them

## 91 91 183.7 170.4

## 92 92 314.4 307.6

## 93 93 90.0 214.0

## 94 94 98.9 75.1

## 95 95 144.9 129.3

## 96 96 971.4 807.7

but if you check the dimensions, there are 93 rows:

nrow(myDataBeGoneNA)

1	nrow(myDataBeGoneNA)

## [1] 93

## [1] 93

Why? This is because the row numbers are not row numbers; they are numerical row names. When you exclude a row, none of the other row names change. This was bewildering to me in the beginning. I thought my exclusions had failed somehow.

Now we can move on

Once this is done, you can go on and do your regression, which, in this case, looks like this.

Comparison of Serum Aldosterone

Finally, if you are ever wondering what fraction of your data is comprised of NA, rather than the absolute number, you can do this as follows:

mean(is.na(myData$Aldo.Us))

1	mean(is.na(myData$Aldo.Us))

## [1] 0.03125

1	## [1] 0.03125

If you applied this to the whole dataframe, you get the fraction of NA's in the whole dataframe (again–thank you Shannon):

mean(is.na(myData))

1	mean(is.na(myData))

## [1] 0.01041667

1	## [1] 0.01041667

Final Thought:

is.na(newunderthesun)

1	is.na(newunderthesun)

## [1] TRUE

1	## [1] TRUE

Ecclesiastes 1:9.

-Dan

A Closer Look at TAT Time Dependence

August 28, 2015September 2, 2015 dtholmes@mail.ubc.ca

The Problem

We want to have a closer look at the time–dependence of turn around times (TATs). In particular, we would like to see if there is a significant trend in TAT over time (improvement or deterioration) and we would like the data to inform us of slowdowns and potentially unexpected problems that occur throughout each week. This should allow us to identify areas of the pre-analytical and/or analytical process phlebotomists that require attention.

My interest in this topic (which in past seemed entirely banal) came from the frustration of receiving monthly TAT reports showing spaghetti plots produced in Excel. In examining these figures is was entirely unclear to me whether any observed changes in the median (the only measure of central tendency provided) represented stochastic behaviour or a real problem. Utlimately, we want to be able to identify real problems in the preanalytical and analytical process but to do this, we need to visualize the data in a more sophisticated manner.

To do this, we are going to look at order–to–file times for a whole year for a nameless test X. You should be able to modify this approach to the manner in which your data is provided to you.

The real data was a little dirty but I have pre–cleaned it—this will have to be the topic of another post. In short, I purged the cancelled tests, removed duplicate records and limited my analysis to stat tests based on a stat flag that is stored in the laboratory information system (LIS). I won’t discuss this process here. The buffed–up file is named “2014_and_All_Clean.txt”. This happens to be a tab–delimited txt file. For this reason, I used read.delim() rather than read.csv(). These are basically the same function with different defaults for the seperator–one uses a comma and the other uses a tab. Please see our first post on TAT to understand how we are using the lubridate function ymd_hm().

Loading the Data

library(lubridate)
myData <- read.delim(file = "2014_and_All_clean.txt")
myData$ordered <- ymd_hm(myData$ordered)
myData$collected <- ymd_hm(myData$collected)
myData$received <- ymd_hm(myData$received)
myData$resulted <- ymd_hm(myData$resulted)
#confirm success
head(myData)

library(lubridate)

myData <- read.delim(file = "2014_and_All_clean.txt")

myData$ordered <- ymd_hm(myData$ordered)

myData$collected <- ymd_hm(myData$collected)

myData$received <- ymd_hm(myData$received)

myData$resulted <- ymd_hm(myData$resulted)

#confirm success

head(myData)

##               ordered           collected            received
## 1 2014-01-01 17:53:00 2014-01-01 17:54:00 2014-01-01 18:08:00
## 2 2014-01-01 15:10:00 2014-01-01 15:19:00 2014-01-01 15:21:00
## 3 2014-01-01 17:07:00 2014-01-01 17:15:00 2014-01-01 17:17:00
## 4 2014-01-01 18:20:00 2014-01-01 18:30:00 2014-01-01 18:35:00
## 5 2014-01-01 11:19:00 2014-01-01 11:25:00 2014-01-01 11:29:00
## 6 2014-01-01 11:00:00 2014-01-01 11:08:00 2014-01-01 11:11:00
##              resulted
## 1 2014-01-01 18:45:00
## 2 2014-01-01 16:33:00
## 3 2014-01-01 18:09:00
## 4 2014-01-01 19:45:00
## 5 2014-01-01 13:33:00
## 6 2014-01-01 11:47:00

## ordered collected received

## 1 2014-01-01 17:53:00 2014-01-01 17:54:00 2014-01-01 18:08:00

## 2 2014-01-01 15:10:00 2014-01-01 15:19:00 2014-01-01 15:21:00

## 3 2014-01-01 17:07:00 2014-01-01 17:15:00 2014-01-01 17:17:00

## 4 2014-01-01 18:20:00 2014-01-01 18:30:00 2014-01-01 18:35:00

## 5 2014-01-01 11:19:00 2014-01-01 11:25:00 2014-01-01 11:29:00

## 6 2014-01-01 11:00:00 2014-01-01 11:08:00 2014-01-01 11:11:00

## resulted

## 1 2014-01-01 18:45:00

## 2 2014-01-01 16:33:00

## 3 2014-01-01 18:09:00

## 4 2014-01-01 19:45:00

## 5 2014-01-01 13:33:00

## 6 2014-01-01 11:47:00

Now we want to look at a TAT. As in our first post on this topic, we will look at the order–to–file time.

otf <- difftime(myData$resulted, myData$ordered,units = "min")
myData <- cbind(myData,otf)

1 2	otf <- difftime(myData$resulted, myData$ordered,units = "min") myData <- cbind(myData,otf)

Sanity Check

Let’s just have a quick look at this to make sure nothing crazy is happening.

hist(as.numeric(myData$otf),xlim = c(0,200),breaks = 150, col = "orange", xlab = "TAT for X (min)", main = "Histogram of TAT for X")

1	hist(as.numeric(myData$otf),xlim = c(0,200),breaks = 150, col = "orange", xlab = "TAT for X (min)", main = "Histogram of TAT for X")

summary(as.numeric(myData$otf))

1	summary(as.numeric(myData$otf))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   51.00   62.00   70.85   77.00 1656.00

1 2	## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 3.00 51.00 62.00 70.85 77.00 1656.00

Some Nutty Stuff

We do note one thing—there is a sample with a TAT of 1656 min. This is a little crazy so we could investigate those samples to see if this is real (because of a lost sample) or an artifact of an add–on analysis being misidentified as a stat or some other similar nonsensical event.

If you wanted to list all of these extreme outliers for the year, you could do so like this:

library("dplyr")

1	library("dplyr")

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Attaching package: 'dplyr'

## The following objects are masked from 'package:lubridate':

## intersect, setdiff, union

## The following objects are masked from 'package:stats':

## filter, lag

## The following objects are masked from 'package:base':

## intersect, setdiff, setequal, union

head(arrange(myData,desc(otf)),10)

1	head(arrange(myData,desc(otf)),10)

##                ordered           collected            received
## 1  2014-09-21 21:50:00 2014-09-21 21:58:00 2014-09-21 22:04:00
## 2  2014-09-07 14:59:00 2014-09-08 12:00:00 2014-09-08 12:04:00
## 3  2014-03-18 13:45:00 2014-03-18 14:10:00 2014-03-18 14:29:00
## 4  2014-04-01 01:48:00 2014-04-01 02:03:00 2014-04-01 02:06:00
## 5  2014-04-01 02:21:00 2014-04-01 02:28:00 2014-04-01 02:31:00
## 6  2014-09-04 21:35:00 2014-09-04 21:45:00 2014-09-04 22:08:00
## 7  2014-10-08 10:38:00 2014-10-08 10:42:00 2014-10-08 10:46:00
## 8  2014-05-25 13:23:00 2014-05-25 13:35:00 2014-05-25 13:45:00
## 9  2014-08-06 16:50:00 2014-08-06 17:09:00 2014-08-06 17:15:00
## 10 2014-08-02 11:52:00 2014-08-02 21:30:00 2014-08-02 21:44:00
##               resulted       otf
## 1  2014-09-23 01:26:00 1656 mins
## 2  2014-09-08 13:26:00 1347 mins
## 3  2014-03-19 11:17:00 1292 mins
## 4  2014-04-01 17:29:00  941 mins
## 5  2014-04-01 16:20:00  839 mins
## 6  2014-09-05 11:07:00  812 mins
## 7  2014-10-08 22:41:00  723 mins
## 8  2014-05-26 01:20:00  717 mins
## 9  2014-08-07 03:46:00  656 mins
## 10 2014-08-02 22:44:00  652 mins

## ordered collected received

## 1 2014-09-21 21:50:00 2014-09-21 21:58:00 2014-09-21 22:04:00

## 2 2014-09-07 14:59:00 2014-09-08 12:00:00 2014-09-08 12:04:00

## 3 2014-03-18 13:45:00 2014-03-18 14:10:00 2014-03-18 14:29:00

## 4 2014-04-01 01:48:00 2014-04-01 02:03:00 2014-04-01 02:06:00

## 5 2014-04-01 02:21:00 2014-04-01 02:28:00 2014-04-01 02:31:00

## 6 2014-09-04 21:35:00 2014-09-04 21:45:00 2014-09-04 22:08:00

## 7 2014-10-08 10:38:00 2014-10-08 10:42:00 2014-10-08 10:46:00

## 8 2014-05-25 13:23:00 2014-05-25 13:35:00 2014-05-25 13:45:00

## 9 2014-08-06 16:50:00 2014-08-06 17:09:00 2014-08-06 17:15:00

## 10 2014-08-02 11:52:00 2014-08-02 21:30:00 2014-08-02 21:44:00

## resulted otf

## 1 2014-09-23 01:26:00 1656 mins

## 2 2014-09-08 13:26:00 1347 mins

## 3 2014-03-19 11:17:00 1292 mins

## 4 2014-04-01 17:29:00 941 mins

## 5 2014-04-01 16:20:00 839 mins

## 6 2014-09-05 11:07:00 812 mins

## 7 2014-10-08 22:41:00 723 mins

## 8 2014-05-26 01:20:00 717 mins

## 9 2014-08-07 03:46:00 656 mins

## 10 2014-08-02 22:44:00 652 mins

which gives you the TAT of the 10 (or whatever number you prefer) worst specimens for the year. Obviously when you do this kind of analysis on your own data, you will retain the specimen ID in the data set and you could explore what is going on here–whether these are add–ons etc. You discover interesting things when you dig into your data.

Time Dependence

But we are interested in time-dependence of the TAT, so let’s look at a scatterplot of the whole year.

start <- ceiling_date(min(myData$collected))
finish <- start+days(356)
plot(myData$collected,myData$otf, pch = 19, xlim = c(start,finish),col = "#00000020",cex = 0.5, ylim = c(0,200), ylab = "TAT of X (min)", xlab = "Date")

start <- ceiling_date(min(myData$collected))

finish <- start+days(356)

plot(myData$collected,myData$otf, pch = 19, xlim = c(start,finish),col = "#00000020",cex = 0.5, ylim = c(0,200), ylab = "TAT of X (min)", xlab = "Date")

So, that’s pretty hard to draw inferences from. We can see that there are some outliers with inconceivably low TAT. We will have to investigate what is going on with those collections but not right now. These outliers will not affect the non–parametric measures of central tendency.

Tunnelling Down

Let’s have a look at one week.

finish <- start + days(7)
ticks <- seq(from = start, to = finish, by = "days")
plot(myData$collected,myData$otf, pch = 19, xlim = c(start,finish), col = "#00000020", cex = 0.5, ylim = c(0,200), ylab = "TAT of X (min)", xlab = "", xaxt = "n")
axis.POSIXct(side = 1, ticks, at = ticks, las = 2, cex.axis = 0.6, col.axis = "gray30", format = "%b %d %Y")
mtext("Date of analysis", side = 1, line = 4)

finish <- start + days(7)

ticks <- seq(from = start, to = finish, by = "days")

plot(myData$collected,myData$otf, pch = 19, xlim = c(start,finish), col = "#00000020", cex = 0.5, ylim = c(0,200), ylab = "TAT of X (min)", xlab = "", xaxt = "n")

axis.POSIXct(side = 1, ticks, at = ticks, las = 2, cex.axis = 0.6, col.axis = "gray30", format = "%b %d %Y")

mtext("Date of analysis", side = 1, line = 4)

See the first post on this topic for more information about the plotting parameters.

We can see there is a definite (and unsurprising) periodicity in the number of tests per hour. We can look at “volumes” another time. What we want to do now is look for time–dependence in the TAT so we can ultimately investigate what days of the week and times of the day are worse. But we don’t want to do this for one week—we want to do this for all weeks in the year. It would be nice, for example to plot all the Sundays, Mondays, Tuesdays etc overlapping and then see if we can see day–of–week and time–of–day trends.

Some More Lubridate Magic

Therefore, we need to assign every point in our myData dataframe a day of the week. The lubridate function wday() does this for us.

start

start

## [1] "2014-01-01 02:39:00 UTC"

1	## [1] "2014-01-01 02:39:00 UTC"

wday(start, label = TRUE)

1	wday(start, label = TRUE)

## [1] Wed
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

1 2	## [1] Wed ## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

#if you want them as numbers, just leave the label option out.
wday(start)

1 2	#if you want them as numbers, just leave the label option out. wday(start)

## [1] 4

## [1] 4

So, January 1, 2014 was a Wednesday, which is the 4th day of the week. Let’s assign the day of the week for all our days and then bind this to our data.

weekday <- wday(myData$collected)
myData <- cbind(myData,weekday)
head(myData)

weekday <- wday(myData$collected)

myData <- cbind(myData,weekday)

head(myData)

##               ordered           collected            received
## 1 2014-01-01 17:53:00 2014-01-01 17:54:00 2014-01-01 18:08:00
## 2 2014-01-01 15:10:00 2014-01-01 15:19:00 2014-01-01 15:21:00
## 3 2014-01-01 17:07:00 2014-01-01 17:15:00 2014-01-01 17:17:00
## 4 2014-01-01 18:20:00 2014-01-01 18:30:00 2014-01-01 18:35:00
## 5 2014-01-01 11:19:00 2014-01-01 11:25:00 2014-01-01 11:29:00
## 6 2014-01-01 11:00:00 2014-01-01 11:08:00 2014-01-01 11:11:00
##              resulted      otf weekday
## 1 2014-01-01 18:45:00  52 mins       4
## 2 2014-01-01 16:33:00  83 mins       4
## 3 2014-01-01 18:09:00  62 mins       4
## 4 2014-01-01 19:45:00  85 mins       4
## 5 2014-01-01 13:33:00 134 mins       4
## 6 2014-01-01 11:47:00  47 mins       4

## ordered collected received

## 1 2014-01-01 17:53:00 2014-01-01 17:54:00 2014-01-01 18:08:00

## 2 2014-01-01 15:10:00 2014-01-01 15:19:00 2014-01-01 15:21:00

## 3 2014-01-01 17:07:00 2014-01-01 17:15:00 2014-01-01 17:17:00

## 4 2014-01-01 18:20:00 2014-01-01 18:30:00 2014-01-01 18:35:00

## 5 2014-01-01 11:19:00 2014-01-01 11:25:00 2014-01-01 11:29:00

## 6 2014-01-01 11:00:00 2014-01-01 11:08:00 2014-01-01 11:11:00

## resulted otf weekday

## 1 2014-01-01 18:45:00 52 mins 4

## 2 2014-01-01 16:33:00 83 mins 4

## 3 2014-01-01 18:09:00 62 mins 4

## 4 2014-01-01 19:45:00 85 mins 4

## 5 2014-01-01 13:33:00 134 mins 4

## 6 2014-01-01 11:47:00 47 mins 4

Now let’s plot all of the Monday–data for the whole year and look at the with–day–trends for Mondays. We are going to convert all of the TATs and times at which they are collected to decimal numbers so we don’t run into any hassles. (Yes, I ran into hassles when I did not do this.)

This little function accomplishes this for us:

#convert the time to decimal number of hours since midnight for simplicity in plotting
timeconvert = function(t){hour(t)+minute(t)/60}

1 2	#convert the time to decimal number of hours since midnight for simplicity in plotting timeconvert = function(t){hour(t)+minute(t)/60}

mondayData <- subset(myData,weekday == 2)
mondayTimes <- timeconvert(mondayData$collected)
mondayData <- cbind(mondayData,times = mondayTimes)
mondayData$otf <- as.numeric(mondayData$otf)
head(mondayData)

mondayData <- subset(myData,weekday == 2)

mondayTimes <- timeconvert(mondayData$collected)

mondayData <- cbind(mondayData,times = mondayTimes)

mondayData$otf <- as.numeric(mondayData$otf)

head(mondayData)

##                 ordered           collected            received
## 225 2014-01-06 23:35:00 2014-01-06 23:30:00 2014-01-07 00:09:00
## 226 2014-01-06 11:02:00 2014-01-06 11:10:00 2014-01-06 11:14:00
## 227 2014-01-06 14:08:00 2014-01-06 14:20:00 2014-01-06 14:24:00
## 228 2014-01-06 10:06:00 2014-01-06 10:16:00 2014-01-06 10:19:00
## 229 2014-01-06 14:42:00 2014-01-06 15:09:00 2014-01-06 15:16:00
## 230 2014-01-06 20:02:00 2014-01-06 20:17:00 2014-01-06 20:22:00
##                resulted otf weekday    times
## 225 2014-01-07 00:46:00  71       2 23.50000
## 226 2014-01-06 12:42:00 100       2 11.16667
## 227 2014-01-06 16:02:00 114       2 14.33333
## 228 2014-01-06 11:44:00  98       2 10.26667
## 229 2014-01-06 16:04:00  82       2 15.15000
## 230 2014-01-06 20:55:00  53       2 20.28333

## ordered collected received

## 225 2014-01-06 23:35:00 2014-01-06 23:30:00 2014-01-07 00:09:00

## 226 2014-01-06 11:02:00 2014-01-06 11:10:00 2014-01-06 11:14:00

## 227 2014-01-06 14:08:00 2014-01-06 14:20:00 2014-01-06 14:24:00

## 228 2014-01-06 10:06:00 2014-01-06 10:16:00 2014-01-06 10:19:00

## 229 2014-01-06 14:42:00 2014-01-06 15:09:00 2014-01-06 15:16:00

## 230 2014-01-06 20:02:00 2014-01-06 20:17:00 2014-01-06 20:22:00

## resulted otf weekday times

## 225 2014-01-07 00:46:00 71 2 23.50000

## 226 2014-01-06 12:42:00 100 2 11.16667

## 227 2014-01-06 16:02:00 114 2 14.33333

## 228 2014-01-06 11:44:00 98 2 10.26667

## 229 2014-01-06 16:04:00 82 2 15.15000

## 230 2014-01-06 20:55:00 53 2 20.28333

So this seems to have worked and now we can make a scatter plot.

Monday Monday, So Good to Me?

plot(mondayData$times,mondayData$otf, pch = 19, col = "#00000020", cex = 0.5, ylim = c(0,200), xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")
axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

1 2	plot(mondayData$times,mondayData$otf, pch = 19, col = "#00000020", cex = 0.5, ylim = c(0,200), xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)") axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

But now for the interesting part. We want to see how the median TAT is related to the time of day. We might want to look at, say, the running median over one–hour window all day long. Notice that I have made the times, t, go from 0.5 to 23.5 because these are the only times for which a 60 min moving median can be calculated. Otherwise we’d have this really annoying situation where we’d have to fetch data from the last half-hour of Sunday and the first half-hour of Tuesday. I don’t need that level of perfectionism at present.

plot(mondayData$times,mondayData$otf, pch = 19, col = "#00000020",cex = 0.5, ylim = c(30,100),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")
axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

#60 min moving median calculation:
#Create a point at every minute of the day
t <- seq(from = 0.5,to = 23.5, by = 1/60)
#create and empty vector to store the data
med.tats <- vector()

for(i in 1:length(t)){
  #get all points within an hour of the minute
  tats <- subset(mondayData,(mondayData$times >= (t[i]-0.5)) & (mondayData$times < (t[i]+0.5)))
  #alternatively using filter() from the dplyr package
  #tats <- filter(mondayData, times >= (t[i] - 0.5) & times < (t[i] + 0.5))
  med.tats[i] <- median(tats$otf)
}

lines(t,med.tats, col = "blue")
grid(NA,NULL)

plot(mondayData$times,mondayData$otf, pch = 19, col = "#00000020",cex = 0.5, ylim = c(30,100),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")

axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

#60 min moving median calculation:

#Create a point at every minute of the day

t <- seq(from = 0.5,to = 23.5, by = 1/60)

#create and empty vector to store the data

med.tats <- vector()

for(i in 1:length(t)){

#get all points within an hour of the minute

tats <- subset(mondayData,(mondayData$times >= (t[i]-0.5)) & (mondayData$times < (t[i]+0.5)))

#alternatively using filter() from the dplyr package

#tats <- filter(mondayData, times >= (t[i] - 0.5) & times < (t[i] + 0.5))

med.tats[i] <- median(tats$otf)

}

lines(t,med.tats, col = "blue")

grid(NA,NULL)

Removing the For–ness

Many R folks don’t like for–loops and would rather use the apply() family of functions. I’m not sure I always understand contempt towards loops for small simple tasks but if you wanted to accomplish the same looping task without using a for–loop, you could do as follows:

t <- seq(from = 0.5, to = 23.5, by = 1/60)
movingmed<-function(t,x){
  tats <- filter(x, times >= (t - 0.5) & times < (t + 0.5))
  median(tats$otf)
}
med.tats <- sapply(t, movingmed, x=mondayData)

t <- seq(from = 0.5, to = 23.5, by = 1/60)

movingmed<-function(t,x){

tats <- filter(x, times >= (t - 0.5) & times < (t + 0.5))

median(tats$otf)

}

med.tats <- sapply(t, movingmed, x=mondayData)

Smoothing

This approach is reasonable but the problem (I have found) is that it is computationally expensive on large data sets. For this reason, it is nice to use a canned smoothing algorithm like LOWESS which is much faster. The parameter f of the lowess function has a default of 2/3 which in our case results in a fit that is way–too smoothed. I played around with f until I got something that more or less tracked with the 60–min moving median. There are many approaches to smoothing–don’t get lost in the vortex.

Lowess Smoothing

plot(mondayData$times,mondayData$otf, pch = 19, col = "#00000020",cex = 0.5, ylim = c(30,100),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")
axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")
#running median
#create a point at every minute of the day
t <- seq(from = 0.5,to = 23.5, by = 1/60)
med.tats <- vector()
for(i in 1:length(t)){
  #get all points within an hour of the minute
  tats <- subset(mondayData,(mondayData$times >= (t[i] - 0.5)) & (mondayData$times < (t[i] + 0.5)))
  #alternatively using filter() from the dplyr package
  #tats <- filter(mondayData, times >= (t[i] - 0.5) & times < (t[i] + 0.5))
  med.tats[i] <- median(tats$otf)
}
lines(t,med.tats, col = "blue")
grid(col = "black")
mondayFit <- lowess(mondayData$times,mondayData$otf,f = 0.05)
lines(mondayFit,col = "red")

plot(mondayData$times,mondayData$otf, pch = 19, col = "#00000020",cex = 0.5, ylim = c(30,100),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")

axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

#running median

#create a point at every minute of the day

t <- seq(from = 0.5,to = 23.5, by = 1/60)

med.tats <- vector()

for(i in 1:length(t)){

#get all points within an hour of the minute

tats <- subset(mondayData,(mondayData$times >= (t[i] - 0.5)) & (mondayData$times < (t[i] + 0.5)))

#alternatively using filter() from the dplyr package

#tats <- filter(mondayData, times >= (t[i] - 0.5) & times < (t[i] + 0.5))

med.tats[i] <- median(tats$otf)

}

lines(t,med.tats, col = "blue")

grid(col = "black")

mondayFit <- lowess(mondayData$times,mondayData$otf,f = 0.05)

lines(mondayFit,col = "red")

So, that’s cool. Now lets loop over all the days of the week and make plots for each day.

#create a 2x4 plot window
par(mfrow = c(2,4))
#loop over days
for (i in 1:7){
  #make the lowess fit
  mydayData <- subset(myData,weekday == i)
  mydayTimes <- timeconvert(mydayData$collected)
  mydayData <- cbind(mydayData,times = mydayTimes)
  mydayFit <- lowess(mydayData$times,mydayData$otf,f = 0.05)
  plot(NA,NA,ylim = c(30,100), xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")
  axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")
  
  #running median
  #create a point at every minute of the day
  t <- seq(from = 0.5,to = 23.5, by = 1/60)
  med.tats <- vector()
  for(j in 1:length(t)){
    #get all points within an hour of the minute
    tats <- subset(mydayData,(mydayData$times >= (t[j] - 0.5)) & (mydayData$times < (t[j] + 0.5)))
    #alternatively using filter() from the dplyr package
    #tats <- filter(mydayData, times >= (t[j] - 0.5) & times < (t[j] + 0.5))
    med.tats[j] <- median(tats$otf)
  }
  lines(t,med.tats, col = "blue")
  grid(col = "black")
  mydayFit <- lowess(mydayData$times,mydayData$otf,f = 0.05)
  lines(mydayFit,col = "red")
  
  #put in horizontal gridding
  grid(NA,NULL)
  #you could add a legend on all the plots if you wanted
  #legend("bottomright", c("moving median","lowess"), lty = c(1,1), col = c("blue","red", byt = "n"))
}

#create a 2x4 plot window

par(mfrow = c(2,4))

#loop over days

for (i in 1:7){

#make the lowess fit

mydayData <- subset(myData,weekday == i)

mydayTimes <- timeconvert(mydayData$collected)

mydayData <- cbind(mydayData,times = mydayTimes)

mydayFit <- lowess(mydayData$times,mydayData$otf,f = 0.05)

plot(NA,NA,ylim = c(30,100), xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")

axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

#running median

#create a point at every minute of the day

t <- seq(from = 0.5,to = 23.5, by = 1/60)

med.tats <- vector()

for(j in 1:length(t)){

#get all points within an hour of the minute

tats <- subset(mydayData,(mydayData$times >= (t[j] - 0.5)) & (mydayData$times < (t[j] + 0.5)))

#alternatively using filter() from the dplyr package

#tats <- filter(mydayData, times >= (t[j] - 0.5) & times < (t[j] + 0.5))

med.tats[j] <- median(tats$otf)

}

lines(t,med.tats, col = "blue")

grid(col = "black")

mydayFit <- lowess(mydayData$times,mydayData$otf,f = 0.05)

lines(mydayFit,col = "red")

#put in horizontal gridding

grid(NA,NULL)

#you could add a legend on all the plots if you wanted

#legend("bottomright", c("moving median","lowess"), lty = c(1,1), col = c("blue","red", byt = "n"))

}

Now, let’s overplot all the lowess fits on a single graph and see what practical observations we can make. I have increased the lowess() smoothing to make things easier to look at.

par(mfrow = c(1,1))
plot(0,0,ylim = c(40,80),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")
axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")
for (i in 1:7){
  mydayData <- subset(myData,weekday == i)
  mydayTimes <- timeconvert(mydayData$collected)
  mydayData <- cbind(mydayData,times = mydayTimes)
  mydayFit <- lowess(mydayData$times,mydayData$otf,f = 0.1)
  lines(mydayFit, col = rainbow(7)[i], ylim = c(40,80),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)", main = paste("TAT for",wday(i,label = TRUE),"in 2014"))
}
legend("bottomright",as.character(wday(1:7,label = TRUE)), col = rainbow(7), lty = rep(1,7), cex = 0.5)

par(mfrow = c(1,1))

plot(0,0,ylim = c(40,80),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)")

axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

for (i in 1:7){

mydayData <- subset(myData,weekday == i)

mydayTimes <- timeconvert(mydayData$collected)

mydayData <- cbind(mydayData,times = mydayTimes)

mydayFit <- lowess(mydayData$times,mydayData$otf,f = 0.1)

lines(mydayFit, col = rainbow(7)[i], ylim = c(40,80),xlim = c(0,24), xaxt = "n", xlab = "Hour of Day", ylab = "Turnaround Time (min)", main = paste("TAT for",wday(i,label = TRUE),"in 2014"))

}

legend("bottomright",as.character(wday(1:7,label = TRUE)), col = rainbow(7), lty = rep(1,7), cex = 0.5)

Observations

We can immediately see some issues. Weekends in the early hours of the morning are bad. 8 am is bad across all days. Noon is generally problematic and particularly so on Saturdays. There is also a slowdown in mid–afternoon and in the early evenings. Saturday midnight is the most problematic time, although the endpoints of the figure have fewer local weighting points and their confidence intervals are wider. This is something we can cover another time.

Remember, also, this is only the median we have looked at. Other horrors may be lurking in the 90th percentile.

Next time what we will do is move all of this TAT visualization to a 3D representation so we can more easily spot the problematic times.

-Dan

The lot is cast into the lap, but its every decision is from the LORD.

Proverbs 16:33

Generating Meaningful Turaround Time Plots for Clinical Laboratory Medicine

August 11, 2015August 11, 2015 dtholmes@mail.ubc.ca

The Problem

It is standard practice in Clinical Laboratory Medicine to monitor turn around times (TATs) for high volume tests like potassium (K), Troponin (Tn) and Hemoglobin (Hb). The term TAT is typically understood to mean “the time elapsed from when the doctor orders the test to the time the result is available in the Laboratory Information System (LIS)”. This of course does not take into account the lag between the result availability and the time when the physician logs in to view it and respond, but let’s just say that we are not there yet.

Traditionally, some dedicated soul would take .csv extracts from the LIS and do laborious things in Excel to generate the median TAT for the month for each test and each lab location for which they were responsible. Not only is it impossible to automate such a process, it is entirely manual and produces fairly uninformative output since (at least at our site) only medians were generated.

What really frustrates physicians is not where the median goes each month, it is the behaviour of, say, the 90th percentile of TAT or the outliers. These are the ones they remember.

R allows us to produce a much more informative figure in an automatable fashion. I provide here an example of a TAT figure for Hb with some statistical metric included.

Look at the Data

Let’s start by reading in our data and looking at how it is structured.

myData<-read.csv(file = "Hb_TAT_data.csv",header = TRUE)
str(myData)
head(myData)

myData<-read.csv(file = "Hb_TAT_data.csv",header = TRUE)

str(myData)

head(myData)

## 'data.frame':    4497 obs. of  6 variables:
##  $ specimenID: int  4221 5281 5308 5320 5356 5374 5375 5376 5241 5270 ...
##  $ ordered   : Factor w/ 4244 levels "2015-06-30 23:15",..: 4 68 86 92 115 126 127 128 46 61 ...
##  $ collected : Factor w/ 4245 levels "2015-07-01 00:02",..: 2 72 88 95 114 126 128 129 46 65 ...
##  $ received  : Factor w/ 4098 levels "2015-07-01 00:07",..: 2 69 81 88 107 119 120 122 45 62 ...
##  $ resulted  : Factor w/ 3765 levels "2015-07-01 00:11",..: 2 70 79 86 104 113 114 116 42 63 ...
##  $ result    : int  126 110 134 117 135 113 109 111 106 129 ...

## specimenID ordered collected received
## 1 4221 2015-06-30 23:28 2015-07-01 00:08 2015-07-01 00:29
## 2 5281 2015-07-01 14:12 2015-07-01 14:25 2015-07-01 14:30
## 3 5308 2015-07-01 15:56 2015-07-01 16:03 2015-07-01 16:10
## 4 5320 2015-07-01 16:57 2015-07-01 17:12 2015-07-01 17:20
## 5 5356 2015-07-01 20:00 2015-07-01 20:07 2015-07-01 20:18
##6 5374 2015-07-01 21:37 2015-07-01 21:40 2015-07-01 21:44
 resulted result
1 2015-07-01 00:37 126
2 2015-07-01 14:50 110
3 2015-07-01 16:16 134
4 2015-07-01 17:23 117
5 2015-07-01 20:23 135
6 2015-07-01 21:53 113

## 'data.frame': 4497 obs. of 6 variables:

## $ specimenID: int 4221 5281 5308 5320 5356 5374 5375 5376 5241 5270 ...

## $ ordered : Factor w/ 4244 levels "2015-06-30 23:15",..: 4 68 86 92 115 126 127 128 46 61 ...

## $ collected : Factor w/ 4245 levels "2015-07-01 00:02",..: 2 72 88 95 114 126 128 129 46 65 ...

## $ received : Factor w/ 4098 levels "2015-07-01 00:07",..: 2 69 81 88 107 119 120 122 45 62 ...

## $ resulted : Factor w/ 3765 levels "2015-07-01 00:11",..: 2 70 79 86 104 113 114 116 42 63 ...

## $ result : int 126 110 134 117 135 113 109 111 106 129 ...

## specimenID ordered collected received

## 1 4221 2015-06-30 23:28 2015-07-01 00:08 2015-07-01 00:29

## 2 5281 2015-07-01 14:12 2015-07-01 14:25 2015-07-01 14:30

## 3 5308 2015-07-01 15:56 2015-07-01 16:03 2015-07-01 16:10

## 4 5320 2015-07-01 16:57 2015-07-01 17:12 2015-07-01 17:20

## 5 5356 2015-07-01 20:00 2015-07-01 20:07 2015-07-01 20:18

##6 5374 2015-07-01 21:37 2015-07-01 21:40 2015-07-01 21:44

resulted result

1 2015-07-01 00:37 126

2 2015-07-01 14:50 110

3 2015-07-01 16:16 134

4 2015-07-01 17:23 117

5 2015-07-01 20:23 135

6 2015-07-01 21:53 113

In this simplified anonymized data set we can see that we have 4497 observations with all of the necessary time points to calculate the turnaround times of the preanalytical and analytical processes. For the sake of this example, let’s focus on the order-to-file time.

We are going to need to handle the dates, for which there is only one package worth discussing, namely lubridate.

library(lubridate)

1	library(lubridate)

Basic Data Preparation

The first thing we need to do is to convert the order, collect, receive and result times to lubridate objects (i.e. time and date objects) so that we can do some algebra on them. We can see from the structure of myData that the order, collect, receive and result time points are in the format “YYYY-MM-DD HH:MM”. Therefore we can use the lubridate function ymd_hm() to perform the conversion.

myData$ordered<-ymd_hm(myData$ordered)
myData$collected<-ymd_hm(myData$collected)
myData$received<-ymd_hm(myData$received)
myData$resulted<-ymd_hm(myData$resulted)

myData$ordered<-ymd_hm(myData$ordered)

myData$collected<-ymd_hm(myData$collected)

myData$received<-ymd_hm(myData$received)

myData$resulted<-ymd_hm(myData$resulted)

Applying str() again to myData, you will see that the dates and times are now POSIXct, that is, they are now dates and times. This allows use to calculate the order-to-file TAT, we can do with the difftime() function exporting the result in minutes. We will also append the order-to-file (otf) TAT to the dataframe and do some quick sanity-checking.

Sanity Check

otf<-difftime(myData$resulted,myData$ordered,units = "min")
myData<-cbind(myData,otf)
summary(as.numeric(myData$otf))
hist(as.numeric(myData$otf), main = "Histogram of Hb TATs", breaks = 60, col = "darkred",xlim = c(0,200), xlab = "Order to File in Minutes")

otf<-difftime(myData$resulted,myData$ordered,units = "min")

myData<-cbind(myData,otf)

summary(as.numeric(myData$otf))

hist(as.numeric(myData$otf), main = "Histogram of Hb TATs", breaks = 60, col = "darkred",xlim = c(0,200), xlab = "Order to File in Minutes")

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   22.00   30.00   36.58   43.00  694.00

1 2	## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 3.00 22.00 30.00 36.58 43.00 694.00

This looks reasonable, so we can proceed with a TAT scatterplot.

Scatterplot

plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT",xlab = "Date",ylab = "TAT (min)")

1	plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT",xlab = "Date",ylab = "TAT (min)")

Beautifying

This is kind-of problematic because we really want to focus on results in the 0-200 minute range. There are some wild-outliers as occurs in real life because of instrument down-time, add-ons, etc. We can leave this matter for the present. Notice that I have displayed every day on the x-axis because this will allow us to investigate any problems we see. So we will adjust the ylim and we will also make the plot points semitransparent by using hexidecimal colour codes followed by a fractional transparency expressed in hexidecimal. Black is “#000000” and “20” is hexidecimal for 32 which is 32/256 or 12.5% opacity.

#make the points semistranparent and a little smaller
plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT",xlab = "Date of Analysis",ylab = "TAT (min)", ylim = c(0,200), col = "#00000020",cex = 0.5)

1 2	#make the points semistranparent and a little smaller plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT",xlab = "Date of Analysis",ylab = "TAT (min)", ylim = c(0,200), col = "#00000020",cex = 0.5)

We’ll accept the fact that we know that there are a number of outliers. We could easily have a plot that displayed them or a tabular summary of them.

Now we will need to prepare the vector of daily medians, 10th and 90th percentiles to plot. We will loop through each day of the month and then calculate the statistics for that day.

#calculate the start and end date
#fist collection in the month's reporting is always collected in the previous month so ceiling forces startDate to be first day of the month we are interested in. Same is true for endDate.
startDate <- ceiling_date(min(myData$ordered),"day")
endDate <- ceiling_date(max(myData$ordered),"day")
days <- seq (from = startDate, to = endDate, by = 'days')
#when we plot stastics, we want them at mid day
middays <- days + hours(12)
tenth<-vector()
fiftieth<-vector()
ninetieth<-vector()

for ( i in seq_along(days) ) {
  daysData<-subset(myData,myData$ordered >= days[i]&myData$ordered<days[i+1])
  tenth[i]<-quantile(daysData$otf,probs = 0.10)
  fiftieth[i]<-median(daysData$otf)
  ninetieth[i]<-quantile(daysData$otf,probs = 0.90)
}

quantileData<-data.frame(middays,tenth,fiftieth,ninetieth)

plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT",xlab = "Date of Analysis", ylab = "TAT (min)", ylim = c(0,200), col = "#00000020",cex = 0.5)
lines(quantileData$middays,quantileData$tenth, col = "red")
lines(quantileData$middays,quantileData$ninetieth, col = "red")
lines(quantileData$middays,quantileData$fiftieth, col = "blue")

#calculate the start and end date

#fist collection in the month's reporting is always collected in the previous month so ceiling forces startDate to be first day of the month we are interested in. Same is true for endDate.

startDate <- ceiling_date(min(myData$ordered),"day")

endDate <- ceiling_date(max(myData$ordered),"day")

days <- seq (from = startDate, to = endDate, by = 'days')

#when we plot stastics, we want them at mid day

middays <- days + hours(12)

tenth<-vector()

fiftieth<-vector()

ninetieth<-vector()

for ( i in seq_along(days) ) {

daysData<-subset(myData,myData$ordered >= days[i]&myData$ordered<days[i+1])

tenth[i]<-quantile(daysData$otf,probs = 0.10)

fiftieth[i]<-median(daysData$otf)

ninetieth[i]<-quantile(daysData$otf,probs = 0.90)

}

quantileData<-data.frame(middays,tenth,fiftieth,ninetieth)

plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT",xlab = "Date of Analysis", ylab = "TAT (min)", ylim = c(0,200), col = "#00000020",cex = 0.5)

lines(quantileData$middays,quantileData$tenth, col = "red")

lines(quantileData$middays,quantileData$ninetieth, col = "red")

lines(quantileData$middays,quantileData$fiftieth, col = "blue")

But this is not all that easy to look at. First, it’s kind-of ugly and second, if we find a problem date, we can’t read it from the figure. So let’s start by fixing the x-axis labels:

plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT", xlab = "", ylab = "TAT (min)", ylim = c(0,200), col = "#00000020",cex = 0.4,xaxt = "n")
#don't plot first day of next month on axis
axis.POSIXct(side = 1, quantileData$middays[1:length(quantileData$middays)-1],at = quantileData$middays[1:length(quantileData$middays)-1], las = 2, cex.axis = 0.6, col.axis = "gray30", format = "%b %d %Y")
#allows me to move the xlab down manually so as not to overwrite the dates.
mtext("Date of analysis", side = 1, line = 4)

plot(myData$ordered,myData$otf, pch = 19, main = "Hemoglobin TAT", xlab = "", ylab = "TAT (min)", ylim = c(0,200), col = "#00000020",cex = 0.4,xaxt = "n")

#don't plot first day of next month on axis

axis.POSIXct(side = 1, quantileData$middays[1:length(quantileData$middays)-1],at = quantileData$middays[1:length(quantileData$middays)-1], las = 2, cex.axis = 0.6, col.axis = "gray30", format = "%b %d %Y")

#allows me to move the xlab down manually so as not to overwrite the dates.

mtext("Date of analysis", side = 1, line = 4)

To paint the central 80% as a band, we will need to use the polygon() function. I am going to write a function to which and x-vector and two y-vectors is supplied which then fills the area between then with a supplied color. Naturally, the three vectors must have the same length.

#colours in the space between two curves.
fillitin<-function(x,ymin,ymax,colour){
  for (i in 1:length(x)){
    #define the x coordinates of the vertices of the polygon
    xvert<-c(x[i],x[i],x[i+1],x[i+1])
    #define the y coordinates of the vertices of the polygon
    yvert<-c(ymin[i],ymax[i],ymax[i+1],ymin[i+1])
    polygon(xvert,yvert, col = colour, border = NA)
  }
}

#now add these effects to the existing figure
fillitin(middays,tenth,ninetieth,"#FF000020")
lines(quantileData$middays,quantileData$fiftieth, col = "blue")
lines(quantileData$middays,quantileData$tenth, col = "red")
lines(quantileData$middays,quantileData$ninetieth, col = "red")

#colours in the space between two curves.

fillitin<-function(x,ymin,ymax,colour){

for (i in 1:length(x)){

#define the x coordinates of the vertices of the polygon

xvert<-c(x[i],x[i],x[i+1],x[i+1])

#define the y coordinates of the vertices of the polygon

yvert<-c(ymin[i],ymax[i],ymax[i+1],ymin[i+1])

polygon(xvert,yvert, col = colour, border = NA)

}

#now add these effects to the existing figure

fillitin(middays,tenth,ninetieth,"#FF000020")

lines(quantileData$middays,quantileData$fiftieth, col = "blue")

lines(quantileData$middays,quantileData$tenth, col = "red")

lines(quantileData$middays,quantileData$ninetieth, col = "red")

Final Product

Now we should just finish it off with a legend.

legend("topright",c("Median","Central 80%"),lty = c(1,1),col = c("blue","red"),inset = .05)

1	legend("topright",c("Median","Central 80%"),lty = c(1,1),col = c("blue","red"),inset = .05)

And that is a little more informative. There are many features you could add from this point – like smoothing, statistical analysis, outlier report. You could also loop over different tests, examine both the preanalytical and analytical processes at different locations, and produce a pdf report using MarkDown for all the institutions you look after.

-Dan

Regression Methods

Creating some random data

Residuals in OLS

Deming Regression

Ratio of Variances

Weighting

Passing Bablok

Outlier Effects

Generating a Pretty Plot

Conclusion

Trust in the Lord with all your heart and lean not on your own understanding; in all your ways submit to him, and he will make your paths straight. Proverbs 3:5-6

Removing NA’s from a Data Frame in R

The Problem

Finding NA's

Hey Hey! Ho Ho! Those NAs have got to go!

Complete Cases

na.omit()

Row Numbers are Actually Names

Now we can move on

Final Thought:

The Problem

Loading the Data

Sanity Check

Some Nutty Stuff

Time Dependence

Tunnelling Down

Some More Lubridate Magic

Monday Monday, So Good to Me?

Removing the For–ness

Smoothing

Lowess Smoothing

Observations

The lot is cast into the lap, but its every decision is from the LORD.

Proverbs 16:33

The Problem

Look at the Data

Basic Data Preparation

Sanity Check

Scatterplot

Beautifying

Final Product

“The LORD detests dishonest scales, but accurate weights find favor with him.”

Proverbs 11:1

Trust in the Lord with all your heart
and lean not on your own understanding;
in all your ways submit to him,
and he will make your paths straight.

Proverbs 3:5-6