bias – The Lab-R-torian

Parse an Online Table into an R Dataframe – Westgard’s Biological Variation Database

August 14, 2017August 14, 2017 dtholmes@mail.ubc.ca

Background

From time to time I have wanted to bring an online table into an R dataframe. While in principle, the data can be cut and paste into Excel, sometimes the table is very large and sometimes the columns get goofed up in the process. Fortunately, there are a number of R tools for accomplishing this. I am just going to show one approach using the rvest package. The rvest package also makes it possible to interact with forms on webpages to request specific material which can then be scraped. I think you will see the potential if you look here.

In our (simple) case, we will apply this process to Westgard's desirable assay specifications as shown on his website. The goal is to parse out the biological variation tables, get them into a dataframe and the write to csv or xlsx.

Reading in the Data

The first thing to do is to load the rvest and httr packages and define an html session with the html_session() function.

library(rvest)
library(httr)
wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("LabRtorian"))

library(rvest)

library(httr)

wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("LabRtorian"))

Now looking at the webpage, you can see that there are 8 columns in the tables of interest. So, we will define an empty dataframe with 8 columns.

#define empty table to hold all the content
biotable = data.frame(matrix(NA,0, 8))

#define empty table to hold all the content

biotable = data.frame(matrix(NA,0, 8))

We need to know which part of the document to scrape. This is a little obscure, but following the instructions in this post, we can determine that the xpaths we need are:

/html/body/div[1]/div[3]/div/main/article/div/table[1]

/html/body/div[1]/div[3]/div/main/article/div/table[2]

/html/body/div[1]/div[3]/div/main/article/div/table[3]

…

etc.

There are 8 such tables in the whole webpage. We can define a character vector for these as such:

xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

1 2	xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

Now we make a loop to scrape the 8 tables and with each iteration of the loop, append the scraped subtable to the main dataframe called biotable using the rbind() function. We have to use the parameter fill = TRUE in the html_table() function because the table does not happen to always a uniform number of columns.

for (j in 1:8){                
  subtable <- wg %>%
  read_html() %>%
  html_nodes(xpath =  xpaths[j]) %>%
  html_table(., fill = TRUE) 
  subtable <- subtable[[1]]
  biotable <- rbind(biotable,subtable)
}

for (j in 1:8){

subtable <- wg %>%

read_html() %>%

html_nodes(xpath = xpaths[j]) %>%

html_table(., fill = TRUE)

subtable <- subtable[[1]]

biotable <- rbind(biotable,subtable)

}

Clean Up

Now that we have the raw data out, we can have a quick look at it:

X1	X2	X3	X4	X5	X6	X7	X8
	Analyte	Number of Papers	Biological Variation	Biological Variation	Desirable specification	Desirable specification	Desirable specification
	Analyte	Number of Papers	CVI	CVg	I(%)	B(%)	TE(%)
S-	11-Desoxycortisol	2	21.3	31.5	10.7	9.5	27.1
S-	17-Hydroxyprogesterone	2	19.6	50.4	9.8	13.5	29.7
U-	4-hydroxy-3-methoximandelate (VMA)	1	22.2	47.0	11.1	13.0	31.3
S-	5' Nucleotidase	2	23.2	19.9	11.6	7.6	26.8
U-	5'-Hydroxyindolacetate, concentration	1	20.3	33.2	10.2	9.7	26.5
S-	α1-Acid Glycoprotein	3	11.3	24.9	5.7	6.8	16.2
S-	α1-Antichymotrypsin	1	13.5	18.3	6.8	5.7	16.8
S-	α1-Antitrypsin	3	5.9	16.3	3.0	4.3	9.2

We can see that we need define column names and we need to get rid of some rows containing extraneous column header information. There are actually 8 such sets of headers to remove.

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")
names(biotable) <- table.header

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")

names(biotable) <- table.header

Let's now find rows we don't want and remove them.

for.removal <- grep("Analyte", biotable$Analyte)
biotable <- biotable[-for.removal,]

for.removal <- grep("Analyte", biotable$Analyte)

biotable <- biotable[-for.removal,]

You will find that the table has missing data which is written as “- – -”. This should be now replaced by NA and the column names should be assigned to sequential integers. Also, we will remove all the minus signs after the specimen type. I'm not sure what they add.

biotable[biotable == "---"] <- NA
row.names(biotable) <- 1:nrow(biotable)
biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

biotable[biotable == "---"] <- NA

row.names(biotable) <- 1:nrow(biotable)

biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

Check it Out

Just having another look at the first 10 rows:

Sample	Analyte	NumPapers	CVI	CVG	I	B	TE
S	11-Desoxycortisol	2	21.3	31.5	10.7	9.5	27.1
S	17-Hydroxyprogesterone	2	19.6	50.4	9.8	13.5	29.7
U	4-hydroxy-3-methoximandelate (VMA)	1	22.2	47.0	11.1	13.0	31.3
S	5' Nucleotidase	2	23.2	19.9	11.6	7.6	26.8
U	5'-Hydroxyindolacetate, concentration	1	20.3	33.2	10.2	9.7	26.5
S	α1-Acid Glycoprotein	3	11.3	24.9	5.7	6.8	16.2
S	α1-Antichymotrypsin	1	13.5	18.3	6.8	5.7	16.8
S	α1-Antitrypsin	3	5.9	16.3	3.0	4.3	9.2
S	α1-Globulins	2	11.4	22.6	5.7	6.3	15.7
U	α1-Microglobulin, concentration, first morning	1	33.0	58.0	16.5	16.7	43.9

Now examining the structure:

str(biotable)

1 2	str(biotable)

## 'data.frame':    370 obs. of  8 variables:
##  $ Sample   : chr  "S" "S" "U" "S" ...
##  $ Analyte  : chr  "11-Desoxycortisol" "17-Hydroxyprogesterone" "4-hydroxy-3-methoximandelate (VMA)" "5' Nucleotidase" ...
##  $ NumPapers: chr  "2" "2" "1" "2" ...
##  $ CVI      : chr  "21.3" "19.6" "22.2" "23.2" ...
##  $ CVG      : chr  "31.5" "50.4" "47.0" "19.9" ...
##  $ I        : chr  "10.7" "9.8" "11.1" "11.6" ...
##  $ B        : chr  "9.5" "13.5" "13.0" "7.6" ...
##  $ TE       : chr  "27.1" "29.7" "31.3" "26.8" ...

## 'data.frame': 370 obs. of 8 variables:

## $ Sample : chr "S" "S" "U" "S" ...

## $ Analyte : chr "11-Desoxycortisol" "17-Hydroxyprogesterone" "4-hydroxy-3-methoximandelate (VMA)" "5' Nucleotidase" ...

## $ NumPapers: chr "2" "2" "1" "2" ...

## $ CVI : chr "21.3" "19.6" "22.2" "23.2" ...

## $ CVG : chr "31.5" "50.4" "47.0" "19.9" ...

## $ I : chr "10.7" "9.8" "11.1" "11.6" ...

## $ B : chr "9.5" "13.5" "13.0" "7.6" ...

## $ TE : chr "27.1" "29.7" "31.3" "26.8" ...

It's kind-of undesirable to have numbers as characters so…

#convert appropriate columns to numeric
biotable[,3:8] <- lapply(biotable[3:8], as.numeric)

#convert appropriate columns to numeric

biotable[,3:8] <- lapply(biotable[3:8], as.numeric)

Write the Data

Using the xlsx package, you can output the table to an Excel file in the current working directory.

library(xlsx)
write.xlsx(biotable,
            file = "Westgard_Biological_Variation.xlsx",
            row.names = FALSE)

library(xlsx)

write.xlsx(biotable,

file = "Westgard_Biological_Variation.xlsx",

row.names = FALSE)

If you are having trouble getting xlsx to install, then just write as csv.

write.csv(biotable,
            file = "Westgard_Biological_Variation.csv",
            row.names = FALSE)

write.csv(biotable,

file = "Westgard_Biological_Variation.csv",

row.names = FALSE)

Conclusion

You can now use the same general approach to parse any table you have web access to, no mater how small or big it is. Here is a complete script in one place:

library(httr)
library(rvest)
library(xlsx)

wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("yournamehere"))
xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

#define empty dataframe
biotable = data.frame(matrix(NA,0, 8))

#loop over the 8 html tables
for (j in 1:8){                
  subtable <- wg %>%
  read_html() %>%
  html_nodes(xpath =  xpaths[j] ) %>%
  html_table(., fill = TRUE) 
  subtable <- subtable[[1]]
  biotable <- rbind(biotable,subtable)
}

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")
names(biotable) <- table.header

#remove extraneous rows
for.removal <- grep("Analyte", biotable$Analyte)
biotable <- biotable[-for.removal,]

#make missing data into NA
biotable[ biotable == "---" ] <- NA
row.names(biotable) <- 1:nrow(biotable)

#convert appropriate columns to numeric
biotable[,3:8] <- lapply(biotable[3:8], as.numeric)

#get rid of minus signs in column 1
biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

write.xlsx(biotable,
            file = "Westgard_Biological_Variation.xlsx",
            row.names = FALSE)

write.csv(biotable,
            file = "Westgard_Biological_Variation.csv",
            row.names = FALSE)

library(httr)

library(rvest)

library(xlsx)

wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("yournamehere"))

xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

#define empty dataframe

biotable = data.frame(matrix(NA,0, 8))

#loop over the 8 html tables

for (j in 1:8){

subtable <- wg %>%

read_html() %>%

html_nodes(xpath = xpaths[j] ) %>%

html_table(., fill = TRUE)

subtable <- subtable[[1]]

biotable <- rbind(biotable,subtable)

}

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")

names(biotable) <- table.header

#remove extraneous rows

for.removal <- grep("Analyte", biotable$Analyte)

biotable <- biotable[-for.removal,]

#make missing data into NA

biotable[ biotable == "---" ] <- NA

row.names(biotable) <- 1:nrow(biotable)

#convert appropriate columns to numeric

biotable[,3:8] <- lapply(biotable[3:8], as.numeric)

#get rid of minus signs in column 1

biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

write.xlsx(biotable,

file = "Westgard_Biological_Variation.xlsx",

row.names = FALSE)

write.csv(biotable,

file = "Westgard_Biological_Variation.csv",

row.names = FALSE)

Parting Thought on Tables

You prepare a table before me in the presence of my enemies. You anoint my head with oil; my cup overflows.

(Psalm 23:5)

Making Youden Plots in R

October 15, 2015October 15, 2015 dtholmes@mail.ubc.ca

Background

I was honoured by a site visit by Drs. Yeo-Min Yun and Junghan Song of the Korean Society for Clinical Chemistry a few weeks ago. As both professors are on the organizing committee of the Cherry Blossom Symposium for Lab Automation in Seoul in Spring 2016, their primary motivation for visiting was to discuss mass spectrometry sample prep automation but later we got on to the topic of automated reporting for quality assurance schemes.

Naturally, I was promoting R, R-Markdown and knitr as a good pipeline for automated Quality Assurance (QA) report.

This brought to mind Youden Plots which are used by DGKL in their reports. I like DGKL reports for three reasons:

They are accuracy based against GC-MS when it comes to steroids.
I can see all the other LC-MS/MS methods immediately.
Youden plots look like a target and can be assessed rapidly.

The data for a Youden plot is generated by providing a number of laboratories aliquots from two separate unknown samples, which we will call A and B. Every lab analyzes both samples and a scatter plot of the A and B results are generated–the A results on the $x$–axis and the B results on the $y$–axis. Once this is completed, limits of acceptability are plotted and outliers can be identified.

In Youden's original formulation of the plot (see page 133-1 this online document) he required that the concentrations of the A and B samples be close to one another. As you might guess, in clinical medicine, this is not all that useful because we often want to test more than one part of the analytical range in an external quality assurance (EQA) scheme. One workaround for this is the make a Youden plot of the standard normal variates for the A and B samples, that is to plot $z_b = \frac{b_i-\bar{b}} {\sigma_b}$ vs $z_a = \frac{a_i-\bar{a}} {\sigma_a}$, where $a_i$ and $b_i$ are the individual values of the A and B samples from the $i$ labs. This has the disadvantage of representing the results in a manner that is not easily assessed from a clinical perspective.

While there are published approaches to coping with this problem, these are out of scope here but I will show you a couple of other ways I have seen Youden plots represented. If you want to see R code to generate the classic Youden Plot, it can be found in this this stackoverflow post and below.

Random Data

Let's start by generating some data. For the sake of argument, let's say we are looking at testosterone results in males and measured in nmol/L. Suppose that the A sample has a true concentration of 5.3 nmol/L and the B sample has a true concentration of 16.2 nmol/L. Let's also assume that they are all performed by the same analytical method. If you have looked at EQA reports, you will know that a scatter plot of results for the A and B samples does not typically look like this.

plot of chunk unnamed-chunk-1

The (mock) data abaove are bivariate Gaussian and uncorrelated. In reality we often see something that looks a little more like this:

plot of chunk unnamed-chunk-2

That is, the A and B values are usually correlated.

Rectangular Youden Plots

The most common manner in which you will see a Youden plot prepared is just a box with mean $\pm$ 2SD and $\pm$ 3SD limits.

plot(A,B, xlim = c(0,10), ylim = c(0,30), pch=19, col="#00000080")
grid()

A.mu <- mean(A)
A.sd <- sd(A)
B.mu <- mean(B)
B.sd <- sd(B)

#draw a box around the 2SD limit
rect(xleft = A.mu - 2*A.sd, ybottom = B.mu - 2*B.sd, xright = A.mu + 2*A.sd, ytop = B.mu + 2*B.sd, lwd = 2, border = "orange")
#draw a box around the 3SD limit
rect(xleft = A.mu - 3*A.sd, ybottom = B.mu - 3*B.sd, xright = A.mu + 3*A.sd, ytop = B.mu + 3*B.sd, lwd = 2, border = "red")


#draw a diagonal line - which is unnecessary but you will see people do it
lines(c(A.mu - 3*A.sd, A.mu + 3*A.sd),c(B.mu - 3*B.sd, B.mu + 3*B.sd))
#draw horizontal line
lines(c(A.mu - 3*A.sd, A.mu + 3*A.sd), c(B.mu, B.mu))
#draw vertical line
lines(c(A.mu, A.mu), c(B.mu - 3*B.sd, B.mu + 3*B.sd))

#add a legend
legend("topleft", c("2SD limit","3SD limit"), col=c("orange","red"), lty=c(1,1))

plot(A,B, xlim = c(0,10), ylim = c(0,30), pch=19, col="#00000080")

grid()

A.mu <- mean(A)

A.sd <- sd(A)

B.mu <- mean(B)

B.sd <- sd(B)

#draw a box around the 2SD limit

rect(xleft = A.mu - 2*A.sd, ybottom = B.mu - 2*B.sd, xright = A.mu + 2*A.sd, ytop = B.mu + 2*B.sd, lwd = 2, border = "orange")

#draw a box around the 3SD limit

rect(xleft = A.mu - 3*A.sd, ybottom = B.mu - 3*B.sd, xright = A.mu + 3*A.sd, ytop = B.mu + 3*B.sd, lwd = 2, border = "red")

#draw a diagonal line - which is unnecessary but you will see people do it

lines(c(A.mu - 3*A.sd, A.mu + 3*A.sd),c(B.mu - 3*B.sd, B.mu + 3*B.sd))

#draw horizontal line

lines(c(A.mu - 3*A.sd, A.mu + 3*A.sd), c(B.mu, B.mu))

#draw vertical line

lines(c(A.mu, A.mu), c(B.mu - 3*B.sd, B.mu + 3*B.sd))

#add a legend

legend("topleft", c("2SD limit","3SD limit"), col=c("orange","red"), lty=c(1,1))

plot of chunk unnamed-chunk-3

Non Parametric

Obviously, if you prefer, you could prepare this Youden plot in a non-parametric fashion by plotting the medians and the non-parametrically calculated 1st, 2.5th, 97.5th, and 99th percentiles. In this case, the code would be:

plot(A,B, xlim = c(0,10), ylim = c(0,30), pch=19, col="#00000080")
grid()

A.med <- median(A)
A.quants <- quantile(A, probs=c(0.01,0.025,0.975,0.99))
B.med <- median(B)
B.quants <- quantile(B, probs=c(0.01,0.025,0.975,0.99))

#draw a box around the central 95% limit
rect(xleft = A.quants[2], ybottom = B.quants[2], xright = A.quants[3], ytop = B.quants[3], lwd = 2, border = "orange")
#draw a box around the central 99% limit
rect(xleft = A.quants[1], ybottom = B.quants[1], xright = A.quants[4], ytop = B.quants[4], lwd = 2, border = "red")

#draw a vertical line
lines(c(A.quants[1],A.quants[4]),c(B.med,B.med))
#draw vertical line
lines(c(A.med,A.med),c(B.quants[1],B.quants[4]))

legend("topleft", c("Central 95%","Central 99%"), col=c("orange","red"), lty=c(1,1))

plot(A,B, xlim = c(0,10), ylim = c(0,30), pch=19, col="#00000080")

grid()

A.med <- median(A)

A.quants <- quantile(A, probs=c(0.01,0.025,0.975,0.99))

B.med <- median(B)

B.quants <- quantile(B, probs=c(0.01,0.025,0.975,0.99))

#draw a box around the central 95% limit

rect(xleft = A.quants[2], ybottom = B.quants[2], xright = A.quants[3], ytop = B.quants[3], lwd = 2, border = "orange")

#draw a box around the central 99% limit

rect(xleft = A.quants[1], ybottom = B.quants[1], xright = A.quants[4], ytop = B.quants[4], lwd = 2, border = "red")

#draw a vertical line

lines(c(A.quants[1],A.quants[4]),c(B.med,B.med))

#draw vertical line

lines(c(A.med,A.med),c(B.quants[1],B.quants[4]))

legend("topleft", c("Central 95%","Central 99%"), col=c("orange","red"), lty=c(1,1))

plot of chunk unnamed-chunk-4

Note that if you increase the number of points, this non-parametric plot will not quite converge to look like the parametric one shown above because $\mu \pm 2\sigma$ actually encompases 95.45% of the data in a univariate normal distribution, and $\mu \pm 3\sigma$ actually encompases 99.73% of the data.

FYI

Be ye not deceived. Even with the non-parametric approach or a parametric approach with the correct $z$-scores, the orange (“Central 95%” or “2SD”) boxes shown above do not house 95% of the data and the red (“Central 99%” or “3SD”) boxes do not house 99% of the data. You can see this pretty easily if you consider the case of uncorrelated data. Let's take 100000 random pairs of uncorrelated Gaussian data with $\mu = 10$ and $\sigma = 1$.

Five percent of data points are excluded by the vertical orange lines shown below at the $\mu_A \pm 1.96 \sigma_A$ and 5% of data are excluded by the horizontal orange lines positioned at $\mu_B \pm 1.96 \sigma_B$.

Points will fall into the one of the 4 areas shaded yellow 0.95 x 0.025 or 2.375 % of the time and points will fall into one of the 4 areas shaded purple 0.025 x 0.025 or 0.0625 % of the time. This means that the “2SD” box actually encloses 100 – 4×2.375 – 4×0.0625 % = 90.25 % of the data.

plot of chunk unnamed-chunk-5

Much more directly, the probability of both A and B falling inside the center square is 0.95×0.95 = 0.9025 = 90.25%.

You can do a random simulation to prove this to yourself:

#generate the data, n=10000
A <- rnorm(10^5,10,1)
B <- rnorm(10^5,10,1)
# convert results to z-scores
scale.df <- data.frame(scA = scale(A), scB = scale(B))
#calculate how many samples are inside the centre box
nrow(subset(scale.df, abs(scA)<1.96 & abs(scB)<1.96))

#generate the data, n=10000

A <- rnorm(10^5,10,1)

B <- rnorm(10^5,10,1)

# convert results to z-scores

scale.df <- data.frame(scA = scale(A), scB = scale(B))

#calculate how many samples are inside the centre box

nrow(subset(scale.df, abs(scA)<1.96 & abs(scB)<1.96))

## [1] 90245

1	## [1] 90245

This is pretty darn close to the 90.25% we were expecting.

Elliptical Youden Plots

The rectangular plot shown works (with caveats described) but there is something slightly undesirable about it because a point could be off in the corner, far away from the other data, but still inside the 3SD box. It seems much preferable to encircle the data with an ellipse. Fortunately, there is a built in function to achieve this in the car package which makes the code is very simple. The other nice thing is that the ellipses are actually calculated to house 95% and 99% of the data respectively.

library(car)
dataEllipse(A, B, levels=c(0.95, 0.99), fill = FALSE, xlim = c(0,10), ylim = c(0,30), pch = 19, col=c("#00000080", "red"), center.pch = FALSE)

library(car)

dataEllipse(A, B, levels=c(0.95, 0.99), fill = FALSE, xlim = c(0,10), ylim = c(0,30), pch = 19, col=c("#00000080", "red"), center.pch = FALSE)

plot of chunk unnamed-chunk-8

To generate a two–colour equivalent to what we have above we draw the Youden plot in two stages.

dataEllipse(A, B, levels = 0.95, fill = FALSE, xlim = c(0,10), ylim = c(0,30), pch = 19, col = c("#00000080", "orange"), center.pch = FALSE)

dataEllipse(A, B, levels = 0.99, fill = FALSE, col = c(NA, "red"), center.pch = FALSE, plot.points = FALSE)

legend("topleft", c("Central 95%","Central 99%"), col = c("orange","red"), lty=c(1,1))

dataEllipse(A, B, levels = 0.95, fill = FALSE, xlim = c(0,10), ylim = c(0,30), pch = 19, col = c("#00000080", "orange"), center.pch = FALSE)

dataEllipse(A, B, levels = 0.99, fill = FALSE, col = c(NA, "red"), center.pch = FALSE, plot.points = FALSE)

legend("topleft", c("Central 95%","Central 99%"), col = c("orange","red"), lty=c(1,1))

plot of chunk unnamed-chunk-9

Now we might want to add the horizontal, vertical lines.

#store the points of the outer ellipse in a matrix
outer.ellipse <- dataEllipse(A, B, levels = 0.99, xlim = c(0,10), ylim = c(0,30), pch = 19, col = c("#00000080", "red"), draw = TRUE, center.pch = FALSE)
#plot the innter ellipse
dataEllipse(A, B, levels = 0.95, fill = FALSE, xlim = c(0,10), ylim = c(0,30), pch = 19, col = c("#00000080", "orange"), center.pch = FALSE, plot.points = FALSE, add = TRUE)

#find the value of x that is closest to the mean value of A
vert.index <- which.min(abs(A.mu-outer.ellipse[,1]))
#calculate the vertical distance to centre
vert.dist <- outer.ellipse[,2][vert.index] - B.mu
#draw the line one end of ellipse to the other
lines(c(A.mu, A.mu), c(outer.ellipse[,2][vert.index], outer.ellipse[,2][vert.index] - 2*vert.dist))

#find the value of y that is closest to the mean valye of B
horiz.index <- which.min(abs(B.mu-outer.ellipse[,2]))
#calculate the horizonal distance to centre
horiz.dist <- outer.ellipse[,1][horiz.index] - A.mu
#draw the line one end of ellipse to the other
lines(c(outer.ellipse[,1][horiz.index], outer.ellipse[,1][horiz.index] - 2*horiz.dist), c(B.mu,B.mu))
#add a legend
legend("topleft", c("Central 95%","Central 99%"), col=c("orange","red"), lty=c(1,1))

#store the points of the outer ellipse in a matrix

outer.ellipse <- dataEllipse(A, B, levels = 0.99, xlim = c(0,10), ylim = c(0,30), pch = 19, col = c("#00000080", "red"), draw = TRUE, center.pch = FALSE)

#plot the innter ellipse

dataEllipse(A, B, levels = 0.95, fill = FALSE, xlim = c(0,10), ylim = c(0,30), pch = 19, col = c("#00000080", "orange"), center.pch = FALSE, plot.points = FALSE, add = TRUE)

#find the value of x that is closest to the mean value of A

vert.index <- which.min(abs(A.mu-outer.ellipse[,1]))

#calculate the vertical distance to centre

vert.dist <- outer.ellipse[,2][vert.index] - B.mu

#draw the line one end of ellipse to the other

lines(c(A.mu, A.mu), c(outer.ellipse[,2][vert.index], outer.ellipse[,2][vert.index] - 2*vert.dist))

#find the value of y that is closest to the mean valye of B

horiz.index <- which.min(abs(B.mu-outer.ellipse[,2]))

#calculate the horizonal distance to centre

horiz.dist <- outer.ellipse[,1][horiz.index] - A.mu

#draw the line one end of ellipse to the other

lines(c(outer.ellipse[,1][horiz.index], outer.ellipse[,1][horiz.index] - 2*horiz.dist), c(B.mu,B.mu))

#add a legend

legend("topleft", c("Central 95%","Central 99%"), col=c("orange","red"), lty=c(1,1))

plot of chunk unnamed-chunk-10

And if you wanted a little more groove you can add shading.

dataEllipse(A, B, levels = 0.99, xlim = c(0,10), ylim = c(0,30), pch = 19, col = c("#00000080", "red"), center.pch = FALSE, plot.points = TRUE, fill = TRUE, fill.alpha = 0.1)
dataEllipse(A, B, levels=0.95 ,xlim = c(0,10), ylim = c(0,30), pch = 19, col="orange", center.pch = FALSE, plot.points = FALSE, add = TRUE, fill = TRUE, fill.alpha = 0.3)
lines(c(A.mu, A.mu), c(outer.ellipse[,2][vert.index], outer.ellipse[,2][vert.index] - 2*vert.dist), col = "blue")
lines(c(outer.ellipse[,1][horiz.index], outer.ellipse[,1][horiz.index] - 2*horiz.dist), c(B.mu,B.mu), col = "blue")
legend("topleft", c("Central 95%","Central 99%"), col=c("orange","red"), lty=c(1,1))

dataEllipse(A, B, levels = 0.99, xlim = c(0,10), ylim = c(0,30), pch = 19, col = c("#00000080", "red"), center.pch = FALSE, plot.points = TRUE, fill = TRUE, fill.alpha = 0.1)

dataEllipse(A, B, levels=0.95 ,xlim = c(0,10), ylim = c(0,30), pch = 19, col="orange", center.pch = FALSE, plot.points = FALSE, add = TRUE, fill = TRUE, fill.alpha = 0.3)

lines(c(A.mu, A.mu), c(outer.ellipse[,2][vert.index], outer.ellipse[,2][vert.index] - 2*vert.dist), col = "blue")

lines(c(outer.ellipse[,1][horiz.index], outer.ellipse[,1][horiz.index] - 2*horiz.dist), c(B.mu,B.mu), col = "blue")

legend("topleft", c("Central 95%","Central 99%"), col=c("orange","red"), lty=c(1,1))

plot of chunk unnamed-chunk-11

Build a Function

We can also fold all of this into a function:

youden <- function(A,B,shape){

    A.mu <- mean(A)
    A.sd <- sd(A)
    B.mu <- mean(B)
    B.sd <- sd(B)

    plot(A,B, pch=19, col = "#00000080", xlim = c(A.mu - 5*A.sd,A.mu + 5*A.sd), ylim = c(B.mu - 5*B.sd,B.mu + 5*B.sd))
    grid()

  if (missing(shape)){
    shape <- "ellipse"
  }

  if (shape == "rectangle"){
    rect(xleft = A.mu - 2*A.sd, ybottom = B.mu - 2*B.sd, xright = A.mu + 2*A.sd, ytop = B.mu + 2*B.sd, lwd = 2, border = "orange")
    rect(xleft = A.mu - 3*A.sd, ybottom = B.mu - 3*B.sd, xright = A.mu + 3*A.sd, ytop = B.mu + 3*B.sd, lwd = 2, border = "red")
    lines(c(A.mu - 3*A.sd, A.mu + 3*A.sd),c(B.mu - 3*B.sd, B.mu + 3*B.sd))
    lines(c(A.mu - 3*A.sd, A.mu + 3*A.sd), c(B.mu, B.mu))
    lines(c(A.mu, A.mu), c(B.mu - 3*B.sd, B.mu + 3*B.sd))
    legend("topleft", c("2SD limit","3SD limit"), col=c("orange","red"), lty=c(1,1))

  } else if (shape=="ellipse") {
    outer.ellipse <- dataEllipse(A, B, levels = 0.99, pch = 19, col=c("#00000080", "red"), draw = TRUE, center.pch = FALSE, plot.points = FALSE)
    dataEllipse(A, B, levels = 0.95, fill = FALSE, pch = 19, col=c("#00000080", "orange"), center.pch = FALSE, plot.points = FALSE, add = TRUE)
    vert.index <- which.min(abs(A.mu-outer.ellipse[,1]))
    vert.dist <- outer.ellipse[,2][vert.index] - B.mu
    lines(c(A.mu, A.mu), c(outer.ellipse[,2][vert.index], outer.ellipse[,2][vert.index] - 2*vert.dist))
    horiz.index <- which.min(abs(B.mu-outer.ellipse[,2]))
    horiz.dist <- outer.ellipse[,1][horiz.index] - A.mu
    lines(c(outer.ellipse[,1][horiz.index], outer.ellipse[,1][horiz.index] - 2*horiz.dist), c(B.mu,B.mu))
    legend("topleft", c("Central 95%","Central 99%"), col=c("orange","red"), lty=c(1,1))
  }
}

#use the function
par(mfrow=c(1,2))
youden(A,B, shape = "rectangle")
youden(A,B, shape = "ellipse")

youden <- function(A,B,shape){

A.mu <- mean(A)

A.sd <- sd(A)

B.mu <- mean(B)

B.sd <- sd(B)

plot(A,B, pch=19, col = "#00000080", xlim = c(A.mu - 5*A.sd,A.mu + 5*A.sd), ylim = c(B.mu - 5*B.sd,B.mu + 5*B.sd))

grid()

if (missing(shape)){

shape <- "ellipse"

}

if (shape == "rectangle"){

rect(xleft = A.mu - 2*A.sd, ybottom = B.mu - 2*B.sd, xright = A.mu + 2*A.sd, ytop = B.mu + 2*B.sd, lwd = 2, border = "orange")

rect(xleft = A.mu - 3*A.sd, ybottom = B.mu - 3*B.sd, xright = A.mu + 3*A.sd, ytop = B.mu + 3*B.sd, lwd = 2, border = "red")

lines(c(A.mu - 3*A.sd, A.mu + 3*A.sd),c(B.mu - 3*B.sd, B.mu + 3*B.sd))

lines(c(A.mu - 3*A.sd, A.mu + 3*A.sd), c(B.mu, B.mu))

lines(c(A.mu, A.mu), c(B.mu - 3*B.sd, B.mu + 3*B.sd))

legend("topleft", c("2SD limit","3SD limit"), col=c("orange","red"), lty=c(1,1))

} else if (shape=="ellipse") {

outer.ellipse <- dataEllipse(A, B, levels = 0.99, pch = 19, col=c("#00000080", "red"), draw = TRUE, center.pch = FALSE, plot.points = FALSE)

dataEllipse(A, B, levels = 0.95, fill = FALSE, pch = 19, col=c("#00000080", "orange"), center.pch = FALSE, plot.points = FALSE, add = TRUE)

vert.index <- which.min(abs(A.mu-outer.ellipse[,1]))

vert.dist <- outer.ellipse[,2][vert.index] - B.mu

lines(c(A.mu, A.mu), c(outer.ellipse[,2][vert.index], outer.ellipse[,2][vert.index] - 2*vert.dist))

horiz.index <- which.min(abs(B.mu-outer.ellipse[,2]))

horiz.dist <- outer.ellipse[,1][horiz.index] - A.mu

lines(c(outer.ellipse[,1][horiz.index], outer.ellipse[,1][horiz.index] - 2*horiz.dist), c(B.mu,B.mu))

legend("topleft", c("Central 95%","Central 99%"), col=c("orange","red"), lty=c(1,1))

}

#use the function

par(mfrow=c(1,2))

youden(A,B, shape = "rectangle")

youden(A,B, shape = "ellipse")

plot of chunk unnamed-chunk-12

Comparison with the Classic Youden Plot

If the data happens to be uncorrelated, and has identical average A and B values, the elliptical approach will generate a (nearly) circular Youden plot according to the original description. The youden.classic() function code shown below is borrowed from the stackoverflow post mentioned above.

youden.classic <- function(A, B){
    plot(A,B,asp = 1, xlab = "A", ylab = "B", pch=".") 
    mB <- median(B)
    mA <- median(A)
    abline(h = mB, v = mA)
    curve(x-(mA-mB),add=TRUE)
    d <- mean(A-B)
    d_prime <- A-B-d
    r <- 2.45*mean(abs(d_prime))*sqrt(pi)/2
    t <- seq(0,2*pi,by=0.01)
    x <- r*cos(t)+mA
    y <- r*sin(t)+mB
    lines(x,y)
}

#generate 10000 pairs of bivariate data with a mean of 10 for both A and B and 15% CV
A <- rnorm(10000,10,1.5)
B <- rnorm(10000,10,1.5)
youden.classic(A,B)

#overlay ellipse
dataEllipse(A, B, levels = 0.95, fill = TRUE, fill.alpha = 0.3, col="orange", center.pch = FALSE, plot.points = FALSE, add = TRUE)

youden.classic <- function(A, B){

plot(A,B,asp = 1, xlab = "A", ylab = "B", pch=".")

mB <- median(B)

mA <- median(A)

abline(h = mB, v = mA)

curve(x-(mA-mB),add=TRUE)

d <- mean(A-B)

d_prime <- A-B-d

r <- 2.45*mean(abs(d_prime))*sqrt(pi)/2

t <- seq(0,2*pi,by=0.01)

x <- r*cos(t)+mA

y <- r*sin(t)+mB

lines(x,y)

}

#generate 10000 pairs of bivariate data with a mean of 10 for both A and B and 15% CV

A <- rnorm(10000,10,1.5)

B <- rnorm(10000,10,1.5)

youden.classic(A,B)

#overlay ellipse

dataEllipse(A, B, levels = 0.95, fill = TRUE, fill.alpha = 0.3, col="orange", center.pch = FALSE, plot.points = FALSE, add = TRUE)

plot of chunk unnamed-chunk-13

Conclusion

There you have it. With a Youden Plot it is easy to separate the sheep from the goats. There are lots of ways that you can dress up your plot to suit your needs. Of course, this could be embedded into an automated EQA report generated with R, Rmarkdown and knitr.

I hope that was helpful.

-Dan